* Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
@ 2016-06-14 23:49 linux
  2016-06-15  8:29 ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: linux @ 2016-06-14 23:49 UTC (permalink / raw)
  To: xen-devel

Hi,

Just tested the latest xen-unstable 4.8 (xen_changeset git:d337764),
but one of the latest commits seems to have broken booting of HVM guests
(using qemu-xen); the previous build with xen_changeset git:6e908ee
worked fine.

--
Sander

(XEN) [2016-06-14 22:47:36.827] HVM19 save: CPU
(XEN) [2016-06-14 22:47:36.827] HVM19 save: PIC
(XEN) [2016-06-14 22:47:36.827] HVM19 save: IOAPIC
(XEN) [2016-06-14 22:47:36.827] HVM19 save: LAPIC
(XEN) [2016-06-14 22:47:36.827] HVM19 save: LAPIC_REGS
(XEN) [2016-06-14 22:47:36.827] HVM19 save: PCI_IRQ
(XEN) [2016-06-14 22:47:36.827] HVM19 save: ISA_IRQ
(XEN) [2016-06-14 22:47:36.827] HVM19 save: PCI_LINK
(XEN) [2016-06-14 22:47:36.827] HVM19 save: PIT
(XEN) [2016-06-14 22:47:36.827] HVM19 save: RTC
(XEN) [2016-06-14 22:47:36.827] HVM19 save: HPET
(XEN) [2016-06-14 22:47:36.827] HVM19 save: PMTIMER
(XEN) [2016-06-14 22:47:36.827] HVM19 save: MTRR
(XEN) [2016-06-14 22:47:36.827] HVM19 save: VIRIDIAN_DOMAIN
(XEN) [2016-06-14 22:47:36.827] HVM19 save: CPU_XSAVE
(XEN) [2016-06-14 22:47:36.827] HVM19 save: VIRIDIAN_VCPU
(XEN) [2016-06-14 22:47:36.827] HVM19 save: VMCE_VCPU
(XEN) [2016-06-14 22:47:36.827] HVM19 save: TSC_ADJUST
(XEN) [2016-06-14 22:47:36.827] HVM19 restore: CPU 0
(d19) [2016-06-14 22:47:38.102] HVM Loader
(d19) [2016-06-14 22:47:38.102] Detected Xen v4.8-unstable
(d19) [2016-06-14 22:47:38.102] Xenbus rings @0xfeffc000, event channel 
1
(d19) [2016-06-14 22:47:38.102] System requested SeaBIOS
(d19) [2016-06-14 22:47:38.102] CPU speed is 3200 MHz
(d19) [2016-06-14 22:47:38.102] Relocating guest memory for lowmem MMIO 
space disabled
(XEN) [2016-06-14 22:47:38.102] irq.c:275: Dom19 PCI link 0 changed 0 -> 
5
(d19) [2016-06-14 22:47:38.103] PCI-ISA link 0 routed to IRQ5
(XEN) [2016-06-14 22:47:38.103] irq.c:275: Dom19 PCI link 1 changed 0 -> 
10
(d19) [2016-06-14 22:47:38.103] PCI-ISA link 1 routed to IRQ10
(XEN) [2016-06-14 22:47:38.103] irq.c:275: Dom19 PCI link 2 changed 0 -> 
11
(d19) [2016-06-14 22:47:38.103] PCI-ISA link 2 routed to IRQ11
(XEN) [2016-06-14 22:47:38.103] irq.c:275: Dom19 PCI link 3 changed 0 -> 
5
(d19) [2016-06-14 22:47:38.103] PCI-ISA link 3 routed to IRQ5
(d19) [2016-06-14 22:47:38.110] pci dev 01:3 INTA->IRQ10
(d19) [2016-06-14 22:47:38.112] pci dev 02:0 INTA->IRQ11
(d19) [2016-06-14 22:47:38.116] pci dev 04:0 INTA->IRQ5
(d19) [2016-06-14 22:47:38.127] No RAM in high memory; setting high_mem 
resource base to 100000000
(d19) [2016-06-14 22:47:38.127] pci dev 03:0 bar 10 size 002000000: 
0f0000008
(d19) [2016-06-14 22:47:38.128] pci dev 02:0 bar 14 size 001000000: 
0f2000008
(d19) [2016-06-14 22:47:38.128] pci dev 04:0 bar 30 size 000040000: 
0f3000000
(d19) [2016-06-14 22:47:38.129] pci dev 04:0 bar 10 size 000020000: 
0f3040000
(d19) [2016-06-14 22:47:38.129] pci dev 03:0 bar 30 size 000010000: 
0f3060000
(d19) [2016-06-14 22:47:38.130] pci dev 03:0 bar 14 size 000001000: 
0f3070000
(d19) [2016-06-14 22:47:38.130] pci dev 02:0 bar 10 size 000000100: 
00000c001
(d19) [2016-06-14 22:47:38.131] pci dev 04:0 bar 14 size 000000040: 
00000c101
(d19) [2016-06-14 22:47:38.132] pci dev 01:1 bar 20 size 000000010: 
00000c141
(d19) [2016-06-14 22:47:38.132] Multiprocessor initialisation:
(d19) [2016-06-14 22:47:38.132]  - CPU0 ... 48-bit phys ... fixed MTRRs 
... var MTRRs [1/8] ... done.
(d19) [2016-06-14 22:47:38.132]  - CPU1 ... 48-bit phys ... fixed MTRRs 
... var MTRRs [1/8] ... done.
(d19) [2016-06-14 22:47:38.133]  - CPU2 ... 48-bit phys ... fixed MTRRs 
... var MTRRs [1/8] ... done.
(d19) [2016-06-14 22:47:38.133]  - CPU3 ... 48-bit phys ... fixed MTRRs 
... var MTRRs [1/8] ... done.
(d19) [2016-06-14 22:47:38.133] Writing SMBIOS tables ...
(d19) [2016-06-14 22:47:38.134] Loading SeaBIOS ...
(d19) [2016-06-14 22:47:38.134] Creating MP tables ...
(d19) [2016-06-14 22:47:38.134] Loading ACPI ...
(d19) [2016-06-14 22:47:38.135] vm86 TSS at fc00a200
(d19) [2016-06-14 22:47:38.135] BIOS map:
(d19) [2016-06-14 22:47:38.135]  10000-100e3: Scratch space
(d19) [2016-06-14 22:47:38.135]  c0000-fffff: Main BIOS
(d19) [2016-06-14 22:47:38.135] E820 table:
(d19) [2016-06-14 22:47:38.135]  [00]: 00000000:00000000 - 
00000000:000a0000: RAM
(d19) [2016-06-14 22:47:38.135]  HOLE: 00000000:000a0000 - 
00000000:000c0000
(d19) [2016-06-14 22:47:38.135]  [01]: 00000000:000c0000 - 
00000000:00100000: RESERVED
(d19) [2016-06-14 22:47:38.135]  [02]: 00000000:00100000 - 
00000000:1f800000: RAM
(d19) [2016-06-14 22:47:38.135]  HOLE: 00000000:1f800000 - 
00000000:fc000000
(d19) [2016-06-14 22:47:38.135]  [03]: 00000000:fc000000 - 
00000001:00000000: RESERVED
(d19) [2016-06-14 22:47:38.136] Invoking SeaBIOS ...
(d19) [2016-06-14 22:47:38.137] SeaBIOS (version rel-1.9.2-0-gd2aeb7f)
(d19) [2016-06-14 22:47:38.137] BUILD: gcc: (Debian 4.9.2-10) 4.9.2 
binutils: (GNU Binutils for Debian) 2.25
(d19) [2016-06-14 22:47:38.137]
(d19) [2016-06-14 22:47:38.137] Found Xen hypervisor signature at 
40000000
(d19) [2016-06-14 22:47:38.138] Running on QEMU (i440fx)
(d19) [2016-06-14 22:47:38.138] xen: copy e820...
(d19) [2016-06-14 22:47:38.138] Relocating init from 0x000dcec0 to 
0x1f7ae250 (size 72992)
(d19) [2016-06-14 22:47:38.140] Found 7 PCI devices (max PCI bus is 00)
(d19) [2016-06-14 22:47:38.140] Allocated Xen hypercall page at 1f7ff000
(d19) [2016-06-14 22:47:38.140] Detected Xen v4.8-unstable
(d19) [2016-06-14 22:47:38.140] xen: copy BIOS tables...
(d19) [2016-06-14 22:47:38.140] Copying SMBIOS entry point from 
0x00010020 to 0x000f6d40
(d19) [2016-06-14 22:47:38.140] Copying MPTABLE from 0xfc0011b0/fc0011c0 
to 0x000f6c20
(d19) [2016-06-14 22:47:38.140] Copying PIR from 0x00010040 to 
0x000f6ba0
(d19) [2016-06-14 22:47:38.140] Copying ACPI RSDP from 0x000100c0 to 
0x000f6b70
(d19) [2016-06-14 22:47:38.140] Using pmtimer, ioport 0xb008
(d19) [2016-06-14 22:47:38.140] Scan for VGA option rom
(d19) [2016-06-14 22:47:38.148] Running option rom at c000:0003
(XEN) [2016-06-14 22:47:38.151] stdvga.c:174:d19v0 entering stdvga mode
(d19) [2016-06-14 22:47:38.160] pmm call arg1=0
(d19) [2016-06-14 22:47:38.161] Turning on vga text mode console
(XEN) [2016-06-14 22:47:38.178] domain_crash called from emulate.c:144
(XEN) [2016-06-14 22:47:38.178] Domain 19 (vcpu#0) crashed on cpu#5:
(XEN) [2016-06-14 22:47:38.178] ----[ Xen-4.8-unstable  x86_64  debug=y  Not tainted ]----
(XEN) [2016-06-14 22:47:38.178] CPU:    5
(XEN) [2016-06-14 22:47:38.178] RIP:    c000:[<000000000000336a>]
(XEN) [2016-06-14 22:47:38.178] RFLAGS: 0000000000000046   CONTEXT: hvm guest (d19v0)
(XEN) [2016-06-14 22:47:38.178] rax: 0000000000000720   rbx: 0000000000005d06   rcx: 0000000000004000
(XEN) [2016-06-14 22:47:38.178] rdx: 00000000000003c0   rsi: 000000000e0fff67   rdi: 0000000000000000
(XEN) [2016-06-14 22:47:38.178] rbp: 00000000000001c8   rsp: 00000000000001a8   r8:  0000000000000000
(XEN) [2016-06-14 22:47:38.178] r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) [2016-06-14 22:47:38.178] r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) [2016-06-14 22:47:38.178] r15: 0000000000000000   cr0: 0000000000000010   cr4: 0000000000000000
(XEN) [2016-06-14 22:47:38.178] cr3: 0000000000000000   cr2: 0000000000000000
(XEN) [2016-06-14 22:47:38.178] ds: ee16   es: b800   fs: 0000   gs: 0000   ss: ee16   cs: c000


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-14 23:49 Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>] linux
@ 2016-06-15  8:29 ` Jan Beulich
  2016-06-15  8:57   ` Sander Eikelenboom
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15  8:29 UTC (permalink / raw)
  To: linux; +Cc: xen-devel

>>> On 15.06.16 at 01:49, <linux@eikelenboom.it> wrote:
> Just tested latest xen-unstable 4.8 (xen_changeset git:d337764),
> but one of the latest commits seems to have broken boot of HVM guests
> (using qemu-xen) previous build with xen_changeset git:6e908ee worked 
> fine.

Primary suspects would seem to be 67fc274bbe and bfa84968b2,
but (obviously) I didn't see any issues with them in my own
testing, so could you
- instead of doing a full bisect, revert just those two
- clarify what specific things you're having your guest(s) do that
  I may not have done with just "ordinary" guests?

And then of course this domain_crash() could be accompanied by some
helpful printk() ...

Thanks, Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15  8:29 ` Jan Beulich
@ 2016-06-15  8:57   ` Sander Eikelenboom
  2016-06-15  9:38     ` Sander Eikelenboom
  0 siblings, 1 reply; 21+ messages in thread
From: Sander Eikelenboom @ 2016-06-15  8:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel


Wednesday, June 15, 2016, 10:29:37 AM, you wrote:

>>>> On 15.06.16 at 01:49, <linux@eikelenboom.it> wrote:
>> Just tested latest xen-unstable 4.8 (xen_changeset git:d337764),
>> but one of the latest commits seems to have broken boot of HVM guests
>> (using qemu-xen) previous build with xen_changeset git:6e908ee worked 
>> fine.

> Primary suspects would seem to be 67fc274bbe and bfa84968b2,
> but (obviously) I didn't see any issues with them in my own
> testing, so could you
> - instead of doing a full bisect, revert just those two

Will give reverting that a shot.

> - clarify what specific things you're having your guest(s) do that
>   I may not have done with just "ordinary" guests?

That was what I was also wondering about, but it is an ordinary guest, no PCI
passthrough or anything funky.
Perhaps an Intel vs AMD (this machine has an AMD Phenom X6) issue?

--
Sander

> And then of course this domain_crash() could of course be
> accompanied by some helpful printk() ...

> Thanks, Jan



* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15  8:57   ` Sander Eikelenboom
@ 2016-06-15  9:38     ` Sander Eikelenboom
  2016-06-15 10:12       ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Sander Eikelenboom @ 2016-06-15  9:38 UTC (permalink / raw)
  To: Jan Beulich, xen-devel

Wednesday, June 15, 2016, 10:57:03 AM, you wrote:

> Wednesday, June 15, 2016, 10:29:37 AM, you wrote:

>>>>> On 15.06.16 at 01:49, <linux@eikelenboom.it> wrote:
>>> Just tested latest xen-unstable 4.8 (xen_changeset git:d337764),
>>> but one of the latest commits seems to have broken boot of HVM guests
>>> (using qemu-xen) previous build with xen_changeset git:6e908ee worked 
>>> fine.

>> Primary suspects would seem to be 67fc274bbe and bfa84968b2,
>> but (obviously) I didn't see any issues with them in my own
>> testing, so could you
>> - instead of doing a full bisect, revert just those two

> Will give reverting that a shot.

Reverting bfa84968b2 is sufficient.

>> - clarify what specific things you're having your guest(s) do that
>>   I may not have done with just "ordinary" guests?

> That was what i was also wondering about, but it is an ordinary guest, no pci 
> passthrough or something funky. 
> Perhaps an Intel vs AMD (this machine has a AMD phenom X6) issue ?

> --
> Sander

>> And then of course this domain_crash() could of course be
>> accompanied by some helpful printk() ...

Do you have a debug patch for what you are interested in?

>> Thanks, Jan

--
Sander



* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15  9:38     ` Sander Eikelenboom
@ 2016-06-15 10:12       ` Jan Beulich
  2016-06-15 12:00         ` Sander Eikelenboom
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 10:12 UTC (permalink / raw)
  To: Sander Eikelenboom; +Cc: xen-devel

>>> On 15.06.16 at 11:38, <linux@eikelenboom.it> wrote:
> Wednesday, June 15, 2016, 10:57:03 AM, you wrote:
> 
>> Wednesday, June 15, 2016, 10:29:37 AM, you wrote:
> 
>>>>>> On 15.06.16 at 01:49, <linux@eikelenboom.it> wrote:
>>>> Just tested latest xen-unstable 4.8 (xen_changeset git:d337764),
>>>> but one of the latest commits seems to have broken boot of HVM guests
>>>> (using qemu-xen) previous build with xen_changeset git:6e908ee worked 
>>>> fine.
> 
>>> Primary suspects would seem to be 67fc274bbe and bfa84968b2,
>>> but (obviously) I didn't see any issues with them in my own
>>> testing, so could you
>>> - instead of doing a full bisect, revert just those two
> 
>> Will give reverting that a shot.
> 
> Reverting bfa84968b2 is sufficient.

Could you give this wild guess a try on top of the tree without the
revert?

--- unstable.orig/xen/arch/x86/hvm/emulate.c
+++ unstable/xen/arch/x86/hvm/emulate.c
@@ -1180,7 +1180,7 @@ static int hvmemul_rep_movs(
         pfec |= PFEC_user_mode;
 
     bytes = PAGE_SIZE - (saddr & ~PAGE_MASK);
-    if ( vio->mmio_access.read_access &&
+    if ( vio->mmio_access.read_access && !vio->mmio_access.write_access &&
          (vio->mmio_gla == (saddr & PAGE_MASK)) &&
          bytes >= bytes_per_rep )
     {


>>> And then of course this domain_crash() could of course be
>>> accompanied by some helpful printk() ...
> 
> Do you have a debug patch of what you are interested in ?

Not yet - basically we should log all of the variables involved in the
condition leading to the domain_crash().
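
Just as a sketch of what I mean (the exact variable and field names
would need double checking against the tree, so take this as an
illustration rather than a ready-made patch), something along these
lines right before the domain_crash():

    /* Sketch only: log what the re-issue check is comparing. */
    gdprintk(XENLOG_WARNING,
             "I/O re-issue mismatch: count %u vs reps %lu, "
             "size %u vs %u, dir %u vs %u\n",
             p.count, *reps, p.size, size,
             (unsigned int)p.dir, (unsigned int)dir);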

Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 10:12       ` Jan Beulich
@ 2016-06-15 12:00         ` Sander Eikelenboom
  2016-06-15 12:48           ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Sander Eikelenboom @ 2016-06-15 12:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel


Wednesday, June 15, 2016, 12:12:37 PM, you wrote:

>>>> On 15.06.16 at 11:38, <linux@eikelenboom.it> wrote:
>> Wednesday, June 15, 2016, 10:57:03 AM, you wrote:
>> 
>>> Wednesday, June 15, 2016, 10:29:37 AM, you wrote:
>> 
>>>>>>> On 15.06.16 at 01:49, <linux@eikelenboom.it> wrote:
>>>>> Just tested latest xen-unstable 4.8 (xen_changeset git:d337764),
>>>>> but one of the latest commits seems to have broken boot of HVM guests
>>>>> (using qemu-xen) previous build with xen_changeset git:6e908ee worked 
>>>>> fine.
>> 
>>>> Primary suspects would seem to be 67fc274bbe and bfa84968b2,
>>>> but (obviously) I didn't see any issues with them in my own
>>>> testing, so could you
>>>> - instead of doing a full bisect, revert just those two
>> 
>>> Will give reverting that a shot.
>> 
>> Reverting bfa84968b2 is sufficient.

> Could you give this wild guess a try on top of the tree without the
> revert?

> --- unstable.orig/xen/arch/x86/hvm/emulate.c
> +++ unstable/xen/arch/x86/hvm/emulate.c
> @@ -1180,7 +1180,7 @@ static int hvmemul_rep_movs(
>          pfec |= PFEC_user_mode;
>  
>      bytes = PAGE_SIZE - (saddr & ~PAGE_MASK);
> -    if ( vio->mmio_access.read_access &&
> +    if ( vio->mmio_access.read_access && !vio->mmio_access.write_access &&
>           (vio->mmio_gla == (saddr & PAGE_MASK)) &&
>           bytes >= bytes_per_rep )
>      {

Unfortunately still crashes.

--
Sander

>>>> And then of course this domain_crash() could of course be
>>>> accompanied by some helpful printk() ...
>> 
>> Do you have a debug patch of what you are interested in ?

> Not yet - basically we should log all of the variables involved in the
> condition leading to the domain_crash().

> Jan



* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 12:00         ` Sander Eikelenboom
@ 2016-06-15 12:48           ` Jan Beulich
  2016-06-15 13:58             ` Sander Eikelenboom
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 12:48 UTC (permalink / raw)
  To: Sander Eikelenboom; +Cc: xen-devel

>>> On 15.06.16 at 14:00, <linux@eikelenboom.it> wrote:
> Wednesday, June 15, 2016, 12:12:37 PM, you wrote:
>>>>> On 15.06.16 at 11:38, <linux@eikelenboom.it> wrote:
>>> Wednesday, June 15, 2016, 10:57:03 AM, you wrote:
>>>> Wednesday, June 15, 2016, 10:29:37 AM, you wrote:
>>>>>>>> On 15.06.16 at 01:49, <linux@eikelenboom.it> wrote:
>>>>>> Just tested latest xen-unstable 4.8 (xen_changeset git:d337764),
>>>>>> but one of the latest commits seems to have broken boot of HVM guests
>>>>>> (using qemu-xen) previous build with xen_changeset git:6e908ee worked 
>>>>>> fine.
>>> 
>>>>> Primary suspects would seem to be 67fc274bbe and bfa84968b2,
>>>>> but (obviously) I didn't see any issues with them in my own
>>>>> testing, so could you
>>>>> - instead of doing a full bisect, revert just those two
>>> 
>>>> Will give reverting that a shot.
>>> 
>>> Reverting bfa84968b2 is sufficient.
> 
>> Could you give this wild guess a try on top of the tree without the
>> revert?
> 
>> --- unstable.orig/xen/arch/x86/hvm/emulate.c
>> +++ unstable/xen/arch/x86/hvm/emulate.c
>> @@ -1180,7 +1180,7 @@ static int hvmemul_rep_movs(
>>          pfec |= PFEC_user_mode;
>>  
>>      bytes = PAGE_SIZE - (saddr & ~PAGE_MASK);
>> -    if ( vio->mmio_access.read_access &&
>> +    if ( vio->mmio_access.read_access && !vio->mmio_access.write_access &&
>>           (vio->mmio_gla == (saddr & PAGE_MASK)) &&
>>           bytes >= bytes_per_rep )
>>      {
> 
> Unfortunately still crashes.

Thanks for trying. Which basically just leaves the p.count > *reps
part in that domain_crash() condition, as that's the only other thing
involved in that check which said commit could have an effect on (as
far as I can tell at least). Would you be up for another experiment,
removing that one line? Other things to try (just to understand the
issue) would be to
- revert only each half of said commit individually (the two hunks
  really are independent),
- remove just the two latch_linear_to_phys() calls.
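
For reference, the check in question re-validates the request after
qemu has completed it, roughly along these lines (reconstructed from
memory rather than pasted from the tree, so treat the details as
approximate):

    /* Verify the emulation request has been correctly re-issued. */
    if ( (p.type != (is_mmio ? IOREQ_TYPE_COPY : IOREQ_TYPE_PIO)) ||
         (p.addr != addr) ||
         (p.size != size) ||
         (p.count > *reps) ||          /* <-- the part in question */
         (p.dir != dir) ||
         (p.df != df) ||
         (p.data_is_ptr != data_is_addr) )
        domain_crash(curr->domain);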

Apart from that, and just to see whether there are other differences
between your guest(s) and mine, could you post a guest config from
one that's affected?

Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 12:48           ` Jan Beulich
@ 2016-06-15 13:58             ` Sander Eikelenboom
  2016-06-15 14:07               ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Sander Eikelenboom @ 2016-06-15 13:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel


Wednesday, June 15, 2016, 2:48:55 PM, you wrote:

>>>> On 15.06.16 at 14:00, <linux@eikelenboom.it> wrote:
>> Wednesday, June 15, 2016, 12:12:37 PM, you wrote:
>>>>>> On 15.06.16 at 11:38, <linux@eikelenboom.it> wrote:
>>>> Wednesday, June 15, 2016, 10:57:03 AM, you wrote:
>>>>> Wednesday, June 15, 2016, 10:29:37 AM, you wrote:
>>>>>>>>> On 15.06.16 at 01:49, <linux@eikelenboom.it> wrote:
>>>>>>> Just tested latest xen-unstable 4.8 (xen_changeset git:d337764),
>>>>>>> but one of the latest commits seems to have broken boot of HVM guests
>>>>>>> (using qemu-xen) previous build with xen_changeset git:6e908ee worked 
>>>>>>> fine.
>>>> 
>>>>>> Primary suspects would seem to be 67fc274bbe and bfa84968b2,
>>>>>> but (obviously) I didn't see any issues with them in my own
>>>>>> testing, so could you
>>>>>> - instead of doing a full bisect, revert just those two
>>>> 
>>>>> Will give reverting that a shot.
>>>> 
>>>> Reverting bfa84968b2 is sufficient.
>> 
>>> Could you give this wild guess a try on top of the tree without the
>>> revert?
>> 
>>> --- unstable.orig/xen/arch/x86/hvm/emulate.c
>>> +++ unstable/xen/arch/x86/hvm/emulate.c
>>> @@ -1180,7 +1180,7 @@ static int hvmemul_rep_movs(
>>>          pfec |= PFEC_user_mode;
>>>  
>>>      bytes = PAGE_SIZE - (saddr & ~PAGE_MASK);
>>> -    if ( vio->mmio_access.read_access &&
>>> +    if ( vio->mmio_access.read_access && !vio->mmio_access.write_access &&
>>>           (vio->mmio_gla == (saddr & PAGE_MASK)) &&
>>>           bytes >= bytes_per_rep )
>>>      {
>> 
>> Unfortunately still crashes.

> Thanks for trying. Which basically just leaves the p.count > *reps
> part in that domain_crash() condition, as that's the only other thing
> involved in that check which said commit could have an effect on (as
> far as I can tell at least). Would you be up for another experiment,
> removing that one line? Other things to try (just to understand the
> issue) would be to
> - revert only each half of said commit individually (the two hunks
>   really are independent),
> - remove just the two latch_linear_to_phys() calls.

Will try some of that and let you know.

> Apart from that, and just to see whether there are other differences
> between your guest(s) and mine, could you post a guest config from
> one that's affected?

Hope you are not too disappointed it's rather sparse:

builder='hvm'
device_model_version = 'qemu-xen'
device_model_user = 'root'
memory = 512
name = 'test_guest'
vcpus = 4
cpu_weight = 768
vif = [ 'bridge=xen_bridge, ip=192.168.1.15, mac=00:16:3E:C4:72:83, model=e1000' ]
disk = [ 'phy:/dev/xen_vms/test_guest1,hda,w', 'phy:/dev/xen_vms/test_guest2,hdb,w' ]
on_crash = 'preserve'
boot='c'
vnc=0
serial='pty'

Both dom0 and the guest run Debian Jessie; as said, the platform is AMD,
running a 4.7-rc3-ish kernel.


> Jan



* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 13:58             ` Sander Eikelenboom
@ 2016-06-15 14:07               ` Jan Beulich
  2016-06-15 14:20                 ` Boris Ostrovsky
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 14:07 UTC (permalink / raw)
  To: Sander Eikelenboom; +Cc: xen-devel

>>> On 15.06.16 at 15:58, <linux@eikelenboom.it> wrote:
> Wednesday, June 15, 2016, 2:48:55 PM, you wrote:
>> Apart from that, and just to see whether there are other differences
>> between your guest(s) and mine, could you post a guest config from
>> one that's affected?
> 
> Hope you are not too disappointed it's rather sparse:

In no way.

> builder='hvm'
> device_model_version = 'qemu-xen'
> device_model_user = 'root'
> memory = 512
> name = 'test_guest'
> vcpus = 4
> cpu_weight = 768
> vif = [ 'bridge=xen_bridge, ip=192.168.1.15, mac=00:16:3E:C4:72:83, 
> model=e1000' ]
> disk = [ 'phy:/dev/xen_vms/test_guest1,hda,w', 
> 'phy:/dev/xen_vms/test_guest2,hdb,w' ]
> on_crash = 'preserve'
> boot='c'
> vnc=0
> serial='pty'

I wonder whether mine having

stdvga=0

matters. Albeit a quick test passing stdvga=1 works here. And I
don't think the vnc= setting should have an effect here.

Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 14:07               ` Jan Beulich
@ 2016-06-15 14:20                 ` Boris Ostrovsky
  2016-06-15 14:32                   ` Boris Ostrovsky
  2016-06-15 14:35                   ` Jan Beulich
  0 siblings, 2 replies; 21+ messages in thread
From: Boris Ostrovsky @ 2016-06-15 14:20 UTC (permalink / raw)
  To: Jan Beulich, Sander Eikelenboom; +Cc: xen-devel

On 06/15/2016 10:07 AM, Jan Beulich wrote:
>>>> On 15.06.16 at 15:58, <linux@eikelenboom.it> wrote:
>> Wednesday, June 15, 2016, 2:48:55 PM, you wrote:
>>> Apart from that, and just to see whether there are other differences
>>> between your guest(s) and mine, could you post a guest config from
>>> one that's affected?
>> Hope you are not too disappointed it's rather sparse:
> In no way.
>
>> builder='hvm'
>> device_model_version = 'qemu-xen'
>> device_model_user = 'root'
>> memory = 512
>> name = 'test_guest'
>> vcpus = 4
>> cpu_weight = 768
>> vif = [ 'bridge=xen_bridge, ip=192.168.1.15, mac=00:16:3E:C4:72:83, 
>> model=e1000' ]
>> disk = [ 'phy:/dev/xen_vms/test_guest1,hda,w', 
>> 'phy:/dev/xen_vms/test_guest2,hdb,w' ]
>> on_crash = 'preserve'
>> boot='c'
>> vnc=0
>> serial='pty'
> I wonder whether mine having
>
> stdvga=0
>
> matters. Albeit a quick test passing stdvga=1 works here. And I
> don't think the vnc= setting should have an effect here.

Our nightly picked up this crash as well on an AMD box (Intel passed).

I believe this is due to

+       if ( *reps * bytes_per_rep > bytes )
+            *reps = bytes / bytes_per_rep;

in hvmemul_rep_stos() and then, as you pointed out in another message,
we fail the p.count > *reps comparison.

-boris


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 14:20                 ` Boris Ostrovsky
@ 2016-06-15 14:32                   ` Boris Ostrovsky
  2016-06-15 14:39                     ` Jan Beulich
  2016-06-15 14:35                   ` Jan Beulich
  1 sibling, 1 reply; 21+ messages in thread
From: Boris Ostrovsky @ 2016-06-15 14:32 UTC (permalink / raw)
  To: Jan Beulich, Sander Eikelenboom; +Cc: xen-devel

On 06/15/2016 10:20 AM, Boris Ostrovsky wrote:
> On 06/15/2016 10:07 AM, Jan Beulich wrote:
>>>>> On 15.06.16 at 15:58, <linux@eikelenboom.it> wrote:
>>> Wednesday, June 15, 2016, 2:48:55 PM, you wrote:
>>>> Apart from that, and just to see whether there are other differences
>>>> between your guest(s) and mine, could you post a guest config from
>>>> one that's affected?
>>> Hope you are not too disappointed it's rather sparse:
>> In no way.
>>
>>> builder='hvm'
>>> device_model_version = 'qemu-xen'
>>> device_model_user = 'root'
>>> memory = 512
>>> name = 'test_guest'
>>> vcpus = 4
>>> cpu_weight = 768
>>> vif = [ 'bridge=xen_bridge, ip=192.168.1.15, mac=00:16:3E:C4:72:83, 
>>> model=e1000' ]
>>> disk = [ 'phy:/dev/xen_vms/test_guest1,hda,w', 
>>> 'phy:/dev/xen_vms/test_guest2,hdb,w' ]
>>> on_crash = 'preserve'
>>> boot='c'
>>> vnc=0
>>> serial='pty'
>> I wonder whether mine having
>>
>> stdvga=0
>>
>> matters. Albeit a quick test passing stdvga=1 works here. And I
>> don't think the vnc= setting should have an effect here.
> Our nightly picked up this crash as well on an AMD box (Intel passed).
>
> I believe this is due to
>
> +       if ( *reps * bytes_per_rep > bytes )
> +            *reps = bytes / bytes_per_rep;
>
> in hvmemul_rep_stos() and then, as you pointed out in another message,
> we fail p.count > *reps comparison.
>
> -boris
>
> -boris
> in hvmemul_rep_stos.
>


So perhaps we shouldn't latch data for anything over page size.
Something like this (it seems to work):

diff --git a/xen/arch/x86/hvm/emulate.c b/xen/arch/x86/hvm/emulate.c
index d164092..6fabb76 100644
--- a/xen/arch/x86/hvm/emulate.c
+++ b/xen/arch/x86/hvm/emulate.c
@@ -1195,7 +1195,8 @@ static int hvmemul_rep_movs(
         if ( rc != X86EMUL_OKAY )
             return rc;
 
-        latch_linear_to_phys(vio, saddr, sgpa, 0);
+        if ( *reps * bytes_per_rep <= PAGE_SIZE)
+            latch_linear_to_phys(vio, saddr, sgpa, 0);
     }
 
     bytes = PAGE_SIZE - (daddr & ~PAGE_MASK);
@@ -1214,7 +1215,8 @@ static int hvmemul_rep_movs(
         if ( rc != X86EMUL_OKAY )
             return rc;
 
-        latch_linear_to_phys(vio, daddr, dgpa, 1);
+        if ( *reps * bytes_per_rep <= PAGE_SIZE)
+            latch_linear_to_phys(vio, daddr, dgpa, 1);
     }
 
     /* Check for MMIO ops */
@@ -1339,7 +1341,8 @@ static int hvmemul_rep_stos(
         if ( rc != X86EMUL_OKAY )
             return rc;
 
-        latch_linear_to_phys(vio, addr, gpa, 1);
+        if ( *reps * bytes_per_rep <= PAGE_SIZE)
+            latch_linear_to_phys(vio, addr, gpa, 1);
     }
 
     /* Check for MMIO op */


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 14:20                 ` Boris Ostrovsky
  2016-06-15 14:32                   ` Boris Ostrovsky
@ 2016-06-15 14:35                   ` Jan Beulich
  1 sibling, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 14:35 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Sander Eikelenboom, xen-devel

>>> On 15.06.16 at 16:20, <boris.ostrovsky@oracle.com> wrote:
> On 06/15/2016 10:07 AM, Jan Beulich wrote:
>>>>> On 15.06.16 at 15:58, <linux@eikelenboom.it> wrote:
>>> Wednesday, June 15, 2016, 2:48:55 PM, you wrote:
>>>> Apart from that, and just to see whether there are other differences
>>>> between your guest(s) and mine, could you post a guest config from
>>>> one that's affected?
>>> Hope you are not too disappointed it's rather sparse:
>> In no way.
>>
>>> builder='hvm'
>>> device_model_version = 'qemu-xen'
>>> device_model_user = 'root'
>>> memory = 512
>>> name = 'test_guest'
>>> vcpus = 4
>>> cpu_weight = 768
>>> vif = [ 'bridge=xen_bridge, ip=192.168.1.15, mac=00:16:3E:C4:72:83, 
>>> model=e1000' ]
>>> disk = [ 'phy:/dev/xen_vms/test_guest1,hda,w', 
>>> 'phy:/dev/xen_vms/test_guest2,hdb,w' ]
>>> on_crash = 'preserve'
>>> boot='c'
>>> vnc=0
>>> serial='pty'
>> I wonder whether mine having
>>
>> stdvga=0
>>
>> matters. Albeit a quick test passing stdvga=1 works here. And I
>> don't think the vnc= setting should have an effect here.
> 
> Our nightly picked up this crash as well on an AMD box (Intel passed).
> 
> I believe this is due to
> 
> +       if ( *reps * bytes_per_rep > bytes )
> +            *reps = bytes / bytes_per_rep;
> 
> in hvmemul_rep_stos() and then, as you pointed out in another message,
> we fail p.count > *reps comparison.

But the really interesting thing then is - why only AMD?

Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 14:32                   ` Boris Ostrovsky
@ 2016-06-15 14:39                     ` Jan Beulich
  2016-06-15 14:56                       ` Boris Ostrovsky
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 14:39 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Sander Eikelenboom, xen-devel

>>> On 15.06.16 at 16:32, <boris.ostrovsky@oracle.com> wrote:
> So perhaps we shouldn't latch data for anything over page size.

But why? What we latch is the start of the accessed range, so
the repeat count shouldn't matter?

> Something like this (it seems to work):

I'm rather hesitant to take a change like this without understanding
why this helps nor whether this really deals with the problem in all
cases.

Jan

> --- a/xen/arch/x86/hvm/emulate.c
> +++ b/xen/arch/x86/hvm/emulate.c
> @@ -1195,7 +1195,8 @@ static int hvmemul_rep_movs(
>          if ( rc != X86EMUL_OKAY )
>              return rc;
>  
> -        latch_linear_to_phys(vio, saddr, sgpa, 0);
> +        if ( *reps * bytes_per_rep <= PAGE_SIZE)
> +            latch_linear_to_phys(vio, saddr, sgpa, 0);
>      }
>  
>      bytes = PAGE_SIZE - (daddr & ~PAGE_MASK);
> @@ -1214,7 +1215,8 @@ static int hvmemul_rep_movs(
>          if ( rc != X86EMUL_OKAY )
>              return rc;
>  
> -        latch_linear_to_phys(vio, daddr, dgpa, 1);
> +        if ( *reps * bytes_per_rep <= PAGE_SIZE)
> +            latch_linear_to_phys(vio, daddr, dgpa, 1);
>      }
>  
>      /* Check for MMIO ops */
> @@ -1339,7 +1341,8 @@ static int hvmemul_rep_stos(
>          if ( rc != X86EMUL_OKAY )
>              return rc;
>  
> -        latch_linear_to_phys(vio, addr, gpa, 1);
> +        if ( *reps * bytes_per_rep <= PAGE_SIZE)
> +            latch_linear_to_phys(vio, addr, gpa, 1);
>      }
>  
>      /* Check for MMIO op */




* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 14:39                     ` Jan Beulich
@ 2016-06-15 14:56                       ` Boris Ostrovsky
  2016-06-15 15:22                         ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Ostrovsky @ 2016-06-15 14:56 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Sander Eikelenboom, xen-devel

On 06/15/2016 10:39 AM, Jan Beulich wrote:
>>>> On 15.06.16 at 16:32, <boris.ostrovsky@oracle.com> wrote:
>> So perhaps we shouldn't latch data for anything over page size.
> But why? What we latch is the start of the accessed range, so
> the repeat count shouldn't matter?

Because otherwise we won't emulate full stos (or movs) --- we truncate
*reps to fit into a page, don't we? And then we fail the completion check.

And we should latch only when we don't cross a page boundary, not just
when we are under 4K. Or maybe it's not that we don't latch it. It's
that we don't use latched data if a page boundary is being crossed.
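
I.e. keep the latching, but only consume the latched translation when
the whole batch stays within one page, leaving the existing fallback
path alone. As a sketch for hvmemul_rep_stos() (field names as in the
hunks quoted earlier in this thread; mmio_gpfn and pfn_to_paddr are
assumed):

    if ( vio->mmio_access.write_access &&
         (vio->mmio_gla == (addr & PAGE_MASK)) &&
         /* only trust the latched data when not crossing a page */
         ((addr & ~PAGE_MASK) + *reps * bytes_per_rep) <= PAGE_SIZE )
        gpa = pfn_to_paddr(vio->mmio_gpfn) | (addr & ~PAGE_MASK);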


-boris


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 14:56                       ` Boris Ostrovsky
@ 2016-06-15 15:22                         ` Jan Beulich
  2016-06-15 15:29                           ` Paul Durrant
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 15:22 UTC (permalink / raw)
  To: Paul Durrant, Boris Ostrovsky; +Cc: Sander Eikelenboom, xen-devel

>>> On 15.06.16 at 16:56, <boris.ostrovsky@oracle.com> wrote:
> On 06/15/2016 10:39 AM, Jan Beulich wrote:
>>>>> On 15.06.16 at 16:32, <boris.ostrovsky@oracle.com> wrote:
>>> So perhaps we shouldn't latch data for anything over page size.
>> But why? What we latch is the start of the accessed range, so
>> the repeat count shouldn't matter?
> 
> Because otherwise we won't emulate full stos (or movs) --- we truncate
> *reps to fit into a page, don't we?

That merely causes the instruction to get restarted (with a smaller
rCX).

> And then we fail the completion check.
> 
> And we should latch only when we don't cross page boundary, not just
> when we are under 4K. Or maybe it's not that we don't latch it. It's
> that we don't use latched data if page boundary is being crossed.

Ah, I think that's it: When we hand a batch to qemu which crosses
a page boundary and latch the start address translation, upon
retry (after qemu did its job) we'd wrongly reduce the repeat count
because of finding the start address in the cache. So indeed I think
it should be the latter: Not using an available translation is likely
better than breaking up a large batch we hand to qemu. Paul, what
do you think?
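
Piecing together the hunks quoted earlier in this thread, the
hvmemul_rep_stos() path in question looks roughly like this
(illustration only, not pasted from the tree):

    bytes = PAGE_SIZE - (addr & ~PAGE_MASK);
    if ( vio->mmio_access.write_access &&
         (vio->mmio_gla == (addr & PAGE_MASK)) &&    /* latched earlier */
         bytes >= bytes_per_rep )
    {
        gpa = pfn_to_paddr(vio->mmio_gpfn) | (addr & ~PAGE_MASK);
        if ( *reps * bytes_per_rep > bytes )
            *reps = bytes / bytes_per_rep;           /* clipped on retry,  */
    }                                                /* so p.count > *reps */
    else
    {
        /* full translation, then latch_linear_to_phys(vio, addr, gpa, 1) */
    }

I.e. the clipping only kicks in once the start address is found
latched, which on the first pass it isn't.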

In any event I'll revert the patch from staging, until I can provide
a fixed one.

Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 15:22                         ` Jan Beulich
@ 2016-06-15 15:29                           ` Paul Durrant
  2016-06-15 15:43                             ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Durrant @ 2016-06-15 15:29 UTC (permalink / raw)
  To: Jan Beulich, Boris Ostrovsky; +Cc: Sander Eikelenboom, xen-devel

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 15 June 2016 16:22
> To: Paul Durrant; Boris Ostrovsky
> Cc: Sander Eikelenboom; xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] Xen-unstable 4.8: HVM domain_crash called from
> emulate.c:144 RIP: c000:[<000000000000336a>]
> 
> >>> On 15.06.16 at 16:56, <boris.ostrovsky@oracle.com> wrote:
> > On 06/15/2016 10:39 AM, Jan Beulich wrote:
> >>>>> On 15.06.16 at 16:32, <boris.ostrovsky@oracle.com> wrote:
> >>> So perhaps we shouldn't latch data for anything over page size.
> >> But why? What we latch is the start of the accessed range, so
> >> the repeat count shouldn't matter?
> >
> > Because otherwise we won't emulate full stos (or movs) --- we truncate
> > *reps to fit into a page, don't we?
> 
> That merely causes the instruction to get restarted (with a smaller
> rCX).
> 
> > And then we fail the completion check.
> >
> > And we should latch only when we don't cross page boundary, not just
> > when we are under 4K. Or maybe it's not that we don't latch it. It's
> > that we don't use latched data if page boundary is being crossed.
> 
> Ah, I think that's it: When we hand a batch to qemu which crosses
> a page boundary and latch the start address translation, upon
> retry (after qemu did its job) we'd wrongly reduce the repeat count
> because of finding the start address in the cache. So indeed I think
> it should be the latter: Not using an available translation is likely
> better than breaking up a large batch we hand to qemu. Paul, what
> do you think?

Presumably we can tell the difference because we have the vio ioreq state, which should tell us that we're waiting for I/O completion and so, in this case, you can avoid reducing the repeat count when retrying. You should still be able to use the latched translation though, shouldn't you?
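
I.e. something along these lines in the rep handlers, perhaps (just a sketch; the io_req state field name is an assumption on my part):

    /* Sketch: keep using the latched translation, but don't clip the
     * repeat count while we're completing a request that was already
     * sent to the device model. */
    if ( vio->io_req.state != STATE_IORESP_READY &&
         *reps * bytes_per_rep > bytes )
        *reps = bytes / bytes_per_rep;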

  Paul 

> 
> In any event I'll revert the patch from staging, until I can provide
> a fixed one.
> 
> Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 15:29                           ` Paul Durrant
@ 2016-06-15 15:43                             ` Jan Beulich
  2016-06-15 15:46                               ` Paul Durrant
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 15:43 UTC (permalink / raw)
  To: Paul Durrant; +Cc: Sander Eikelenboom, Boris Ostrovsky, xen-devel

>>> On 15.06.16 at 17:29, <Paul.Durrant@citrix.com> wrote:
>>  -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 15 June 2016 16:22
>> To: Paul Durrant; Boris Ostrovsky
>> Cc: Sander Eikelenboom; xen-devel@lists.xen.org 
>> Subject: Re: [Xen-devel] Xen-unstable 4.8: HVM domain_crash called from
>> emulate.c:144 RIP: c000:[<000000000000336a>]
>> 
>> >>> On 15.06.16 at 16:56, <boris.ostrovsky@oracle.com> wrote:
>> > On 06/15/2016 10:39 AM, Jan Beulich wrote:
>> >>>>> On 15.06.16 at 16:32, <boris.ostrovsky@oracle.com> wrote:
>> >>> So perhaps we shouldn't latch data for anything over page size.
>> >> But why? What we latch is the start of the accessed range, so
>> >> the repeat count shouldn't matter?
>> >
>> > Because otherwise we won't emulate full stos (or movs) --- we truncate
>> > *reps to fit into a page, don't we?
>> 
>> That merely causes the instruction to get restarted (with a smaller
>> rCX).
>> 
>> > And then we fail the completion check.
>> >
>> > And we should latch only when we don't cross page boundary, not just
>> > when we are under 4K. Or maybe it's not that we don't latch it. It's
>> > that we don't use latched data if page boundary is being crossed.
>> 
>> Ah, I think that's it: When we hand a batch to qemu which crosses
>> a page boundary and latch the start address translation, upon
>> retry (after qemu did its job) we'd wrongly reduce the repeat count
>> because of finding the start address in the cache. So indeed I think
>> it should be the latter: Not using an available translation is likely
>> better than breaking up a large batch we hand to qemu. Paul, what
>> do you think?
> 
> Presumably we can tell the difference because we have the vio ioreq state, 
> which should tell us that we're waiting for I/O completion and so, in this 
> case, you can avoid reducing the repeat count when retrying. You should still 
> be able to use the latched translation though, shouldn't you?

Would we want to rely on it despite crossing a page boundary?
Of course what was determined to be contiguous should
continue to be, so one might even say using the latched
translation in that case would provide more consistent results
(as we'd become independent of a guest page table change).
But then again a MOVS has two memory operands ...

Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 15:43                             ` Jan Beulich
@ 2016-06-15 15:46                               ` Paul Durrant
  2016-06-15 15:54                                 ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Durrant @ 2016-06-15 15:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Sander Eikelenboom, Boris Ostrovsky, xen-devel

> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 15 June 2016 16:43
> To: Paul Durrant
> Cc: Sander Eikelenboom; xen-devel@lists.xen.org; Boris Ostrovsky
> Subject: RE: [Xen-devel] Xen-unstable 4.8: HVM domain_crash called from
> emulate.c:144 RIP: c000:[<000000000000336a>]
> 
> >>> On 15.06.16 at 17:29, <Paul.Durrant@citrix.com> wrote:
> >>  -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: 15 June 2016 16:22
> >> To: Paul Durrant; Boris Ostrovsky
> >> Cc: Sander Eikelenboom; xen-devel@lists.xen.org
> >> Subject: Re: [Xen-devel] Xen-unstable 4.8: HVM domain_crash called
> from
> >> emulate.c:144 RIP: c000:[<000000000000336a>]
> >>
> >> >>> On 15.06.16 at 16:56, <boris.ostrovsky@oracle.com> wrote:
> >> > On 06/15/2016 10:39 AM, Jan Beulich wrote:
> >> >>>>> On 15.06.16 at 16:32, <boris.ostrovsky@oracle.com> wrote:
> >> >>> So perhaps we shouldn't latch data for anything over page size.
> >> >> But why? What we latch is the start of the accessed range, so
> >> >> the repeat count shouldn't matter?
> >> >
> >> > Because otherwise we won't emulate full stos (or movs) --- we truncate
> >> > *reps to fit into a page, don't we?
> >>
> >> That merely causes the instruction to get restarted (with a smaller
> >> rCX).
> >>
> >> > And then we fail the completion check.
> >> >
> >> > And we should latch only when we don't cross page boundary, not just
> >> > when we are under 4K. Or maybe it's not that we don't latch it. It's
> >> > that we don't use latched data if page boundary is being crossed.
> >>
> >> Ah, I think that's it: When we hand a batch to qemu which crosses
> >> a page boundary and latch the start address translation, upon
> >> retry (after qemu did its job) we'd wrongly reduce the repeat count
> >> because of finding the start address in the cache. So indeed I think
> >> it should be the latter: Not using an available translation is likely
> >> better than breaking up a large batch we hand to qemu. Paul, what
> >> do you think?
> >
> > Presumably we can tell the difference because we have the vio ioreq state,
> > which should tell us that we're waiting for I/O completion and so, in this
> > case, you can avoid reducing the repeat count when retrying. You should
> > still
> > be able to use the latched translation though, shouldn't you?
> 
> Would we want to rely on it despite crossing a page boundary?
> Of course what was determined to be contiguous should
> continue to be, so one might even say using the latched
> translation in that case would provide more consistent results
> (as we'd become independent of a guest page table change).

Yes, exactly.

> But then again a MOVS has two memory operands ...
> 

True... more of an argument for having two latched addresses though, right?

  Paul

> Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 15:46                               ` Paul Durrant
@ 2016-06-15 15:54                                 ` Jan Beulich
  2016-06-15 16:46                                   ` Boris Ostrovsky
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Beulich @ 2016-06-15 15:54 UTC (permalink / raw)
  To: Paul Durrant; +Cc: Sander Eikelenboom, Boris Ostrovsky, xen-devel

>>> On 15.06.16 at 17:46, <Paul.Durrant@citrix.com> wrote:
>>  -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: 15 June 2016 16:43
>> To: Paul Durrant
>> Cc: Sander Eikelenboom; xen-devel@lists.xen.org; Boris Ostrovsky
>> Subject: RE: [Xen-devel] Xen-unstable 4.8: HVM domain_crash called from
>> emulate.c:144 RIP: c000:[<000000000000336a>]
>> 
>> >>> On 15.06.16 at 17:29, <Paul.Durrant@citrix.com> wrote:
>> >>  -----Original Message-----
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: 15 June 2016 16:22
>> >> To: Paul Durrant; Boris Ostrovsky
>> >> Cc: Sander Eikelenboom; xen-devel@lists.xen.org 
>> >> Subject: Re: [Xen-devel] Xen-unstable 4.8: HVM domain_crash called
>> from
>> >> emulate.c:144 RIP: c000:[<000000000000336a>]
>> >>
>> >> >>> On 15.06.16 at 16:56, <boris.ostrovsky@oracle.com> wrote:
>> >> > On 06/15/2016 10:39 AM, Jan Beulich wrote:
>> >> >>>>> On 15.06.16 at 16:32, <boris.ostrovsky@oracle.com> wrote:
>> >> >>> So perhaps we shouldn't latch data for anything over page size.
>> >> >> But why? What we latch is the start of the accessed range, so
>> >> >> the repeat count shouldn't matter?
>> >> >
>> >> > Because otherwise we won't emulate full stos (or movs) --- we truncate
>> >> > *reps to fit into a page, don't we?
>> >>
>> >> That merely causes the instruction to get restarted (with a smaller
>> >> rCX).
>> >>
>> >> > And then we fail the completion check.
>> >> >
>> >> > And we should latch only when we don't cross page boundary, not just
>> >> > when we are under 4K. Or maybe it's not that we don't latch it. It's
>> >> > that we don't use latched data if page boundary is being crossed.
>> >>
>> >> Ah, I think that's it: When we hand a batch to qemu which crosses
>> >> a page boundary and latch the start address translation, upon
>> >> retry (after qemu did its job) we'd wrongly reduce the repeat count
>> >> because of finding the start address in the cache. So indeed I think
>> >> it should be the latter: Not using an available translation is likely
>> >> better than breaking up a large batch we hand to qemu. Paul, what
>> >> do you think?
>> >
>> > Presumably we can tell the difference because we have the vio ioreq state,
>> > which should tell us that we're waiting for I/O completion and so, in this
>> > case, you can avoid reducing the repeat count when retrying. You should
>> > still
>> > be able to use the latched translation though, shouldn't you?
>> 
>> Would we want to rely on it despite crossing a page boundary?
>> Of course what was determined to be contiguous should
>> continue to be, so one might even say using the latched
>> translation in that case would provide more consistent results
>> (as we'd become independent of a guest page table change).
> 
> Yes, exactly.

The only downside is that this will require peeking at
current->arch.hvm_vcpu.hvm_io.io_req.state, which would
(even if the source file is the same) be a mild layering violation.

>> But then again a MOVS has two memory operands ...
>> 
> 
> True... more of an argument for having two latched addresses though, right?

Yes, albeit two then isn't enough either if we want to fully address
the basic issue here: we'd have to latch as many translations as
there can possibly be pages involved in the execution of a single
instruction.

Jan


* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 15:54                                 ` Jan Beulich
@ 2016-06-15 16:46                                   ` Boris Ostrovsky
  2016-06-16  8:03                                     ` Jan Beulich
  0 siblings, 1 reply; 21+ messages in thread
From: Boris Ostrovsky @ 2016-06-15 16:46 UTC (permalink / raw)
  To: Jan Beulich, Paul Durrant; +Cc: Sander Eikelenboom, xen-devel

On 06/15/2016 11:54 AM, Jan Beulich wrote:
>
> Yes, albeit two then isn't enough either if we want to fully address
> the basic issue here: We'd have to latch as many translations as
> there are possibly pages involved in the execution of a single
> instruction.

Re: translations changing under us --- can't they change between the guest
issuing two STOS instructions and the emulator picking up the latched
one (from the first instruction) when emulating the second?

-boris



* Re: Xen-unstable 4.8: HVM domain_crash called from emulate.c:144 RIP: c000:[<000000000000336a>]
  2016-06-15 16:46                                   ` Boris Ostrovsky
@ 2016-06-16  8:03                                     ` Jan Beulich
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Beulich @ 2016-06-16  8:03 UTC (permalink / raw)
  To: Boris Ostrovsky; +Cc: Sander Eikelenboom, Paul Durrant, xen-devel

>>> On 15.06.16 at 18:46, <boris.ostrovsky@oracle.com> wrote:
> On 06/15/2016 11:54 AM, Jan Beulich wrote:
>>
>> Yes, albeit two then isn't enough either if we want to fully address
>> the basic issue here: We'd have to latch as many translations as
>> there are possibly pages involved in the execution of a single
>> instruction.
> 
> Re: translations changing under us --- can't they change between guest
> issuing two STOS instructions and the emulator picking up the latched
> one (from the first instruction) when emulating the second?

The latched translations get wiped as part of retiring an emulated
instruction (see handle_mmio()).

Jan

