* Possible issue with x86_emulate when writing results back to memory
@ 2014-01-09 15:39 Simon Graham
  2014-01-09 15:53 ` Andrew Cooper
  2014-01-09 16:00 ` Jan Beulich
  0 siblings, 2 replies; 16+ messages in thread
From: Simon Graham @ 2014-01-09 15:39 UTC (permalink / raw)
  To: xen-devel


We've been seeing very infrequent crashes in Windows VMs for a while where it appears that the top byte of a longword that is supposed to hold a pointer is being set to zero - for example:

BUG CHECK 0000000A (00B49A40 00000002 00000001 8261092E)

The first parameter to KeBugCheck is the faulting address - 00B49A40 in this case.

When we look at the dump, we are in a routine releasing a queued spinlock; the correct address that should have been used was 0xA3B49A40, and indeed memory contents in the Windows dump have this value. Looking around some more, we see that the failing processor is executing the release queued spinlock code and another processor is executing the code to acquire the same queued spinlock and has recently written the 0xA3B49A40 value to the location where the failing instruction stream read it from.

If we look at the disassembly for the two code paths, the writing code does:

	mov dword ptr [edx],eax

and the reading code does the following to read this same value:

	mov ebx,dword ptr [eax]

On a hunch that this might be a problem with the x86_emulate code, I took a look at how the mov instruction would be emulated - in both cases where emulation can be done (hvm/emulate.c and mm/shadow/multi.c), the routines that write instruction results back to memory use memcpy() to actually copy the data. Looking at the implementation of memcpy in Xen, I see that, in a 64-bit build as ours is, it will use 'rep movsq' to move the data in quadwords and then use 'rep movsb' to move the last 1-7 bytes -- so the instructions above will, I think, always use byte instructions for the four bytes.
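
For illustration, here is a rough C equivalent of the behaviour just described -- purely a sketch of the algorithm, not Xen's actual assembly implementation:

#include <stdint.h>
#include <stddef.h>

/*
 * Sketch of the 64-bit Xen memcpy behaviour described above: whole
 * quadwords first ('rep movsq'), then the trailing 1-7 bytes one at
 * a time ('rep movsb').
 */
static void *sketch_memcpy(void *dest, const void *src, size_t n)
{
    uint64_t *d8 = dest;
    const uint64_t *s8 = src;
    uint8_t *d1 = (uint8_t *)dest + (n & ~(size_t)7);
    const uint8_t *s1 = (const uint8_t *)src + (n & ~(size_t)7);
    size_t i;

    for ( i = 0; i < (n >> 3); i++ )    /* 'rep movsq' */
        d8[i] = s8[i];
    for ( i = 0; i < (n & 7); i++ )     /* 'rep movsb' */
        d1[i] = s1[i];
    return dest;
}

Note that for an emulated 4-byte mov, n >> 3 is zero: all four bytes go through the byte loop, so the store is never issued as a single 32-bit write.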

Now, according to the x86 architecture, 32-bit movs are supposed to be atomic, but based on the above they will not be and I am speculating that this is the cause of our occasional crash - the code path unlocking the spinlock on the other processor sees a partially written value in memory.

Does this seem a reasonable explanation of the issue? 

On the assumption that this is correct, I developed the attached patch (against 4.3.1) which updates all the code paths that are used to read and write back the results of instruction emulation to use a simple assignment if the length is 2 or 4 bytes -- this doesn't fix the general case where you have a length > 8 but it does handle emulation of MOV instructions. Unfortunately, the use of emulation in the HVM code uses a generic routine for copying memory to the guest so every single place that guest memory is read or written will pay the penalty of the extra check for length - not sure if that is terrible or not. Since doing this we have not seen a single instance of the crash - but it's only been a month!

The attached patch is for discussion purposes only - if it is deemed acceptable I'll resubmit a proper patch request against unstable.

Simon Graham
Citrix Systems, Inc

[-- Attachment #2: fix-memcpy-in-x86-emulate --]
[-- Type: application/octet-stream, Size: 2838 bytes --]

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 2dd9b7e..2ad603e 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2602,6 +2602,26 @@ void hvm_task_switch(
     hvm_unmap_entry(nptss_desc);
 }
 
+/*
+ * Routine to make __hvm_copy appropriate to use for copying the
+ * results of instruction emulation back to guest memory - these
+ * typically require 64-bit, 32-bit and 16-bit writes to be atomic
+ * whereas memcpy is only atomic for 64-bit writes. This is still
+ * not 100% correct since copies larger than 64 bits will not be
+ * atomic for the last 2-7 bytes but should be good enough for
+ * instruction emulation
+ */
+static inline void __hvm_atomic_copy(
+    void *to, void *from, int count)
+{
+    if (count == sizeof(uint32_t))
+        *(uint32_t *)to = *(uint32_t *)from;
+    else if (count == sizeof(uint16_t))
+        *(uint16_t *)to = *(uint16_t *)from;
+    else
+        memcpy(to, from, count);
+}
+
 #define HVMCOPY_from_guest (0u<<0)
 #define HVMCOPY_to_guest   (1u<<0)
 #define HVMCOPY_no_fault   (0u<<1)
@@ -2701,7 +2721,7 @@ static enum hvm_copy_result __hvm_copy(
             }
             else
             {
-                memcpy(p, buf, count);
+                __hvm_atomic_copy(p, buf, count);
                 paging_mark_dirty(curr->domain, page_to_mfn(page));
             }
         }
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index 3fed0b6..5e0da82 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -4762,6 +4762,26 @@ static void emulate_unmap_dest(struct vcpu *v,
     atomic_inc(&v->domain->arch.paging.shadow.gtable_dirty_version);
 }
 
+/*
+ * Routine to make sh_x86_emulate_write appropriate to use for copying the
+ * results of instruction emulation back to guest memory - these
+ * typically require 64-bit, 32-bit and 16-bit writes to be atomic
+ * whereas memcpy is only atomic for 64-bit writes. This is still
+ * not 100% correct since copies larger than 64 bits will not be
+ * atomic for the last 2-7 bytes but should be good enough for
+ * instruction emulation
+ */
+static inline void __sh_atomic_write(
+    void *to, void *from, int count)
+{
+    if (count == sizeof(uint32_t))
+        *(uint32_t *)to = *(uint32_t *)from;
+    else if (count == sizeof(uint16_t))
+        *(uint16_t *)to = *(uint16_t *)from;
+    else
+        memcpy(to, from, count);
+}
+
 static int
 sh_x86_emulate_write(struct vcpu *v, unsigned long vaddr, void *src,
                      u32 bytes, struct sh_emulate_ctxt *sh_ctxt)
@@ -4777,7 +4797,7 @@ sh_x86_emulate_write(struct vcpu *v, unsigned long vaddr, void *src,
         return (long)addr;
 
     paging_lock(v->domain);
-    memcpy(addr, src, bytes);
+    __sh_atomic_write(addr, src, bytes);
 
     if ( tb_init_done )
     {



* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 15:39 Possible issue with x86_emulate when writing results back to memory Simon Graham
@ 2014-01-09 15:53 ` Andrew Cooper
  2014-01-09 16:02   ` Simon Graham
                     ` (2 more replies)
  2014-01-09 16:00 ` Jan Beulich
  1 sibling, 3 replies; 16+ messages in thread
From: Andrew Cooper @ 2014-01-09 15:53 UTC (permalink / raw)
  To: Simon Graham; +Cc: xen-devel

On 09/01/14 15:39, Simon Graham wrote:
> We've been seeing very infrequent crashes in Windows VMs for a while where it appears that the top byte of a longword that is supposed to hold a pointer is being set to zero - for example:
>
> BUG CHECK 0000000A (00B49A40 00000002 00000001 8261092E)
>
> The first parameter to KeBugCheck is the faulting address - 00B49A40 in this case.
>
> When we look at the dump, we are in a routine releasing a queued spinlock; the correct address that should have been used was 0xA3B49A40, and indeed memory contents in the Windows dump have this value. Looking around some more, we see that the failing processor is executing the release queued spinlock code and another processor is executing the code to acquire the same queued spinlock and has recently written the 0xA3B49A40 value to the location where the failing instruction stream read it from.
>
> If we look at the disassembly for the two code paths, the writing code does:
>
> 	mov dword ptr [edx],eax
>
> and the reading code does the following to read this same value:
>
> 	mov ebx,dword ptr [eax]
>
> On a hunch that this might be a problem with the x86_emulate code, I took a look at how the mov instruction would be emulated - in both cases where emulation can be done (hvm/emulate.c and mm/shadow/multi.c), the routines that write instruction results back to memory use memcpy() to actually copy the data. Looking at the implementation of memcpy in Xen, I see that, in a 64-bit build as ours is, it will use 'rep movsq' to move the data in quadwords and then use 'rep movsb' to move the last 1-7 bytes -- so the instructions above will, I think, always use byte instructions for the four bytes.
>
> Now, according to the x86 architecture, 32-bit movs are supposed to be atomic, but based on the above they will not be and I am speculating that this is the cause of our occasional crash - the code path unlocking the spinlock on the other processor sees a partially written value in memory.
>
> Does this seem a reasonable explanation of the issue? 
>
> On the assumption that this is correct, I developed the attached patch (against 4.3.1) which updates all the code paths that are used to read and write back the results of instruction emulation to use a simple assignment if the length is 2 or 4 bytes -- this doesn't fix the general case where you have a length > 8 but it does handle emulation of MOV instructions. Unfortunately, the use of emulation in the HVM code uses a generic routine for copying memory to the guest so every single place that guest memory is read or written will pay the penalty of the extra check for length - not sure if that is terrible or not. Since doing this we have not seen a single instance of the crash - but it's only been a month!
>
> The attached patch is for discussion purposes only - if it is deemed acceptable I'll resubmit a proper patch request against unstable.

That seems like a plausible explanation.

The patch however needs some work.  As this function is identical in
both files, it should have a common implementation somewhere, possibly
as part of x86_emulate.h.

To better match real hardware, it might be appropriate for
"memcpy_atomic()" (name subject to improved suggestions) to use a while
loop and issue 8-byte writes at a time, falling down to 4, 2 then 1 when
reaching the end of the data to be copied.
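
As a completely untested sketch (deliberately ignoring alignment, and
with the name still up for debate), something like:

static inline void memcpy_atomic(void *to, const void *from, size_t count)
{
    uint8_t *d = to;
    const uint8_t *s = from;

    while ( count )
    {
        if ( count >= 8 )
        {
            *(uint64_t *)d = *(const uint64_t *)s;   /* single 64-bit write */
            d += 8; s += 8; count -= 8;
        }
        else if ( count >= 4 )
        {
            *(uint32_t *)d = *(const uint32_t *)s;   /* single 32-bit write */
            d += 4; s += 4; count -= 4;
        }
        else if ( count >= 2 )
        {
            *(uint16_t *)d = *(const uint16_t *)s;   /* single 16-bit write */
            d += 2; s += 2; count -= 2;
        }
        else
        {
            *d = *s;
            d++; s++; count--;
        }
    }
}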

~Andrew


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 15:39 Possible issue with x86_emulate when writing results back to memory Simon Graham
  2014-01-09 15:53 ` Andrew Cooper
@ 2014-01-09 16:00 ` Jan Beulich
  2014-01-09 16:07   ` Simon Graham
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2014-01-09 16:00 UTC (permalink / raw)
  To: Simon Graham; +Cc: xen-devel

>>> On 09.01.14 at 16:39, Simon Graham <simon.graham@citrix.com> wrote:
> We've been seeing very infrequent crashes in Windows VMs for a while where it 
> appears that the top byte of a longword that is supposed to hold a pointer is 
> being set to zero - for example:
> 
> BUG CHECK 0000000A (00B49A40 00000002 00000001 8261092E)
> 
> The first parameter to KeBugCheck is the faulting address - 00B49A40 in this 
> case.
> 
> When we look at the dump, we are in a routine releasing a queued spinlock; 
> the correct address that should have been used was 0xA3B49A40, and indeed 
> memory contents in the Windows dump have this value. Looking around some 
> more, we see that the failing processor is executing the release queued 
> spinlock code and another processor is executing the code to acquire the same 
> queued spinlock and has recently written the 0xA3B49A40 value to the location 
> where the failing instruction stream read it from.
> 
> If we look at the disassembly for the two code paths, the writing code does:
> 
> 	mov dword ptr [edx],eax
> 
> and the reading code does the following to read this same value:
> 
> 	mov ebx,dword ptr [eax]
> 
> On a hunch that this might be a problem with the x86_emulate code, I took a 
> look at how the mov instruction would be emulated - in both cases where 
> emulation can be done (hvm/emulate.c and mm/shadow/multi.c), the routines 
> that write instruction results back to memory use memcpy() to actually copy 
> the data. Looking at the implementation of memcpy in Xen, I see that, in a 
> 64-bit build as ours is, it will use 'rep movsq' to move the data in 
> quadwords and then use 'rep movsb' to move the last 1-7 bytes -- so the 
> instructions above will, I think, always use byte instructions for the four 
> bytes.
> 
> Now, according to the x86 architecture, 32-bit movs are supposed to be atomic, 
> but based on the above they will not be and I am speculating that this is the 
> cause of our occasional crash - the code path unlocking the spinlock on the 
> other processor sees a partially written value in memory.
> 
> Does this seem a reasonable explanation of the issue? 

Yes - as long as you can also explain why a spin lock operation
would make it into the emulation code in the first place.

> On the assumption that this is correct, I developed the attached patch 
> (against 4.3.1) which updates all the code paths that are used to read and 
> write back the results of instruction emulation to use a simple assignment if 
> the length is 2 or 4 bytes -- this doesn't fix the general case where you have 
> a length > 8 but it does handle emulation of MOV instructions. Unfortunately, 
> the use of emulation in the HVM code uses a generic routine for copying 
> memory to the guest so every single place that guest memory is read or 
> written will pay the penalty of the extra check for length - not sure if that 
> is terrible or not. Since doing this we have not seen a single instance of 
> the crash - but it's only been a month!
> 
> The attached patch is for discussion purposes only - if it is deemed 
> acceptable I'll resubmit a proper patch request against unstable.

I'd rather not add limited scope special casing like that, but instead
make the copying much more like real hardware (i.e. not just deal
with the 16- and 32-bit cases, and especially not rely on memcpy()
using 64-bit reads/writes when it can). IOW - don't use memcpy()
here at all (and have a single routine doing The Right Thing (tm)
rather than having two clones now, and perhaps more later on -
I'd in particular think that the read side in shadow code would also
need a similar adjustment).

On the mechanical side of things: Such a generic routine should
have proper parameter types: "const void *" for the source pointer
and "unsigned long" or "size_t" for the count.

Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 15:53 ` Andrew Cooper
@ 2014-01-09 16:02   ` Simon Graham
  2014-01-09 16:05   ` Jan Beulich
  2014-01-09 16:06   ` David Vrabel
  2 siblings, 0 replies; 16+ messages in thread
From: Simon Graham @ 2014-01-09 16:02 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel

Thanks,

> > The attached patch is for discussion purposes only - if it is deemed
> acceptable I'll resubmit a proper patch request against unstable.
> 
> That seems like a plausible explanation.
> 
> The patch however needs some work.  As this function is identical in
> both files, it should have a common implementation somewhere, possibly
> as part of x86_emulate.h.
> 

Not sure I want the generic HVM code to be dependent on x86_emulate... not sure it should be in hvm.c either.

> To better match real hardware, it might be appropriate for
> "memcpy_atomic()" (name subject to improved suggestions) to use a while
> loop and issue 8-byte writes at a time, falling down to 4, 2 then 1 when
> reaching the end of the data to be copied.
> 

My concern here would be that the generic HVM routine __hvm_copy needs to use this when emulating instructions but not the rest of the time, and I'd be concerned about the perf impact.

I'll noodle on a suitable single place to put this...

> ~Andrew


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 15:53 ` Andrew Cooper
  2014-01-09 16:02   ` Simon Graham
@ 2014-01-09 16:05   ` Jan Beulich
  2014-01-09 16:06   ` David Vrabel
  2 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2014-01-09 16:05 UTC (permalink / raw)
  To: Andrew Cooper, Simon Graham; +Cc: xen-devel

>>> On 09.01.14 at 16:53, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> To better match real hardware, it might be appropriate for
> "memcpy_atomic()" (name subject to improved suggestions) to use a while
> loop and issue 8-byte writes at a time, falling down to 4, 2 then 1 when
> reaching the end of the data to be copied.

Except that's not what real hardware does. You'd want to take
initial alignment into account here, shrinking the access width
right away to one suitable for the passed-in alignment.
Misaligned locked accesses may need extra consideration (but I'd
hope the emulator already does well enough when LOCK is in
use).
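
As a sketch of what I mean (illustrative only, hypothetical name),
pick the widest access that both the current destination alignment
and the remaining length allow:

static void copy_atomic(void *to, const void *from, size_t count)
{
    while ( count )
    {
        size_t chunk = 8;

        /* Halve the access width until it fits both the destination's
         * alignment and the remaining length. */
        while ( chunk > 1 &&
                (((unsigned long)to & (chunk - 1)) || count < chunk) )
            chunk >>= 1;

        switch ( chunk )
        {
        case 8: *(uint64_t *)to = *(const uint64_t *)from; break;
        case 4: *(uint32_t *)to = *(const uint32_t *)from; break;
        case 2: *(uint16_t *)to = *(const uint16_t *)from; break;
        case 1: *(uint8_t *)to = *(const uint8_t *)from; break;
        }
        to = (uint8_t *)to + chunk;
        from = (const uint8_t *)from + chunk;
        count -= chunk;
    }
}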

Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 15:53 ` Andrew Cooper
  2014-01-09 16:02   ` Simon Graham
  2014-01-09 16:05   ` Jan Beulich
@ 2014-01-09 16:06   ` David Vrabel
  2 siblings, 0 replies; 16+ messages in thread
From: David Vrabel @ 2014-01-09 16:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Simon Graham

On 09/01/14 15:53, Andrew Cooper wrote:
> On 09/01/14 15:39, Simon Graham wrote:
>> We've been seeing very infrequent crashes in Windows VMs for a while where it appears that the top byte of a longword that is supposed to hold a pointer is being set to zero - for example:
>>
>> BUG CHECK 0000000A (00B49A40 00000002 00000001 8261092E)
>>
>> The first parameter to KeBugCheck is the faulting address - 00B49A40 in this case.
>>
>> When we look at the dump, we are in a routine releasing a queued spinlock; the correct address that should have been used was 0xA3B49A40, and indeed memory contents in the Windows dump have this value. Looking around some more, we see that the failing processor is executing the release queued spinlock code and another processor is executing the code to acquire the same queued spinlock and has recently written the 0xA3B49A40 value to the location where the failing instruction stream read it from.
>>
>> If we look at the disassembly for the two code paths, the writing code does:
>>
>> 	mov dword ptr [edx],eax
>>
>> and the reading code does the following to read this same value:
>>
>> 	mov ebx,dword ptr [eax]
>>
>> On a hunch that this might be a problem with the x86_emulate code, I took a look at how the mov instruction would be emulated - in both cases where emulation can be done (hvm/emulate.c and mm/shadow/multi.c), the routines that write instruction results back to memory use memcpy() to actually copy the data. Looking at the implementation of memcpy in Xen, I see that, in a 64-bit build as ours is, it will use 'rep movsq' to move the data in quadwords and then use 'rep movsb' to move the last 1-7 bytes -- so the instructions above will, I think, always use byte instructions for the four bytes.
>>
>> Now, according to the x86 architecture, 32-bit movs are supposed to be atomic, but based on the above they will not be and I am speculating that this is the cause of our occasional crash - the code path unlocking the spinlock on the other processor sees a partially written value in memory.
>>
>> Does this seem a reasonable explanation of the issue? 
>>
>> On the assumption that this is correct, I developed the attached patch (against 4.3.1) which updates all the code paths that are used to read and write back the results of instruction emulation to use a simple assignment if the length is 2 or 4 bytes -- this doesn't fix the general case where you have a length > 8 but it does handle emulation of MOV instructions. Unfortunately, the use of emulation in the HVM code uses a generic routine for copying memory to the guest so every single place that guest memory is read or written will pay the penalty of the extra check for length - not sure if that is terrible or not. Since doing this we have not seen a single instance of the crash - but it's only been a month!
>>
>> The attached patch is for discussion purposes only - if it is deemed acceptable I'll resubmit a proper patch request against unstable.
> 
> That seems like a plausible explanation.
> 
> The patch however needs some work.  As this function is identical in
> both files, it should have a common implementation somewhere, possibly
> as part of x86_emulate.h.
> 
> To better match real hardware, it might be appropriate for
> "memcpy_atomic()" (name subject to improved suggestions) to use a while
> loop and issue 8-byte writes at a time, falling down to 4, 2 then 1 when
> reaching the end of the data to be copied.

Definitely not "memcpy_atomic()", as that suggests the whole copy is
atomic, which isn't the case.

David


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 16:00 ` Jan Beulich
@ 2014-01-09 16:07   ` Simon Graham
  2014-01-09 16:21     ` Andrew Cooper
  2014-01-09 16:23     ` Jan Beulich
  0 siblings, 2 replies; 16+ messages in thread
From: Simon Graham @ 2014-01-09 16:07 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

> > Does this seem a reasonable explanation of the issue?
> 
> Yes - as long as you can also explain why a spin lock operation
> would make it into the emulation code in the first place.
> 

Well, that one is tough and I don't have a good answer... the only thing I would say is that in our system we ALWAYS have the shadow memory tracking enabled (to track changes to framebuffers).

> > The attached patch is for discussion purposes only - if it is deemed
> > acceptable I'll resubmit a proper patch request against unstable.
> 
> I'd rather not add limited scope special casing like that, but instead
> make the copying much more like real hardware (i.e. not just deal
> with the 16- and 32-bit cases, and especially not rely on memcpy()
> using 64-bit reads/writes when it can). IOW - don't use memcpy()
> here at all (and have a single routine doing The Right Thing (tm)
> rather than having two clones now, and perhaps more later on -
> I'd in particular think that the read side in shadow code would also
> need a similar adjustment).

My concern was that memcpy is (I assume!) highly optimized - it certainly should be if it isn't and I would worry that a change to make it atomic for the purposes of instruction emulation would result in an across the board perf hit when in most cases it isn't necessary that it be atomic.

This would be fine for the writeback code in the shadow module BUT the __hvm_copy routine is used generically in situations where atomicity is not required...

> 
> On the mechanical side of things: Such a generic routine should
> have proper parameter types: "const void *" for the source pointer
> and "unsigned long" or "size_t" for the count.

Sure - thanks.

> 
> Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 16:07   ` Simon Graham
@ 2014-01-09 16:21     ` Andrew Cooper
  2014-01-09 17:33       ` Simon Graham
  2014-01-09 16:23     ` Jan Beulich
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Cooper @ 2014-01-09 16:21 UTC (permalink / raw)
  To: Simon Graham; +Cc: xen-devel, Jan Beulich

On 09/01/14 16:07, Simon Graham wrote:
>>> Does this seem a reasonable explanation of the issue?
>> Yes - as long as you can also explain why a spin lock operation
>> would make it into the emulation code in the first place.
>>
> Well, that one is tough and I don't have a good answer... the only thing I would say is that in our system we ALWAYS have the shadow memory tracking enabled (to track changes to framebuffers).

Full shadow, or just logdirty?

With logdirty, frequent pagefaults will occur (which are costly in terms
of vmexits), but I would not expect emulation to occur.

Even with full shadow, emulation only kicks in for non-standard RAM,
which is basically IO to qemu, and instructions trying to write to the
pagetables themselves, which are trapped and emulated for safety reasons.

>
>>> The attached patch is for discussion purposes only - if it is deemed
>>> acceptable I'll resubmit a proper patch request against unstable.
>> I'd rather not add limited scope special casing like that, but instead
>> make the copying much more like real hardware (i.e. not just deal
>> with the 16- and 32-bit cases, and especially not rely on memcpy()
>> using 64-bit reads/writes when it can). IOW - don't use memcpy()
>> here at all (and have a single routine doing The Right Thing (tm)
>> rather than having two clones now, and perhaps more later on -
>> I'd in particular think that the read side in shadow code would also
>> need a similar adjustment).
> My concern was that memcpy is (I assume!) highly optimized - it certainly should be if it isn't and I would worry that a change to make it atomic for the purposes of instruction emulation would result in an across the board perf hit when in most cases it isn't necessary that it be atomic.
>
> This would be fine for the writeback code in the shadow module BUT the __hvm_copy routine is used generically in situations where atomicity is not required...

__hvm_copy() is probably too low to be thinking about this.  There are
many things such as grant_copy() which do not want "hardware like" copy
properties, preferring instead to have less overhead.

~Andrew


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 16:07   ` Simon Graham
  2014-01-09 16:21     ` Andrew Cooper
@ 2014-01-09 16:23     ` Jan Beulich
  2014-01-09 16:30       ` Simon Graham
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2014-01-09 16:23 UTC (permalink / raw)
  To: Simon Graham; +Cc: xen-devel

>>> On 09.01.14 at 17:07, Simon Graham <simon.graham@citrix.com> wrote:
>> > The attached patch is for discussion purposes only - if it is deemed
>> > acceptable I'll resubmit a proper patch request against unstable.
>> 
>> I'd rather not add limited scope special casing like that, but instead
>> make the copying much more like real hardware (i.e. not just deal
>> with the 16- and 32-bit cases, and especially not rely on memcpy()
>> using 64-bit reads/writes when it can). IOW - don't use memcpy()
>> here at all (and have a single routine doing The Right Thing (tm)
>> rather than having two clones now, and perhaps more later on -
>> I'd in particular think that the read side in shadow code would also
>> need a similar adjustment).
> 
> My concern was that memcpy is (I assume!) highly optimized - it certainly 
> should be if it isn't and I would worry that a change to make it atomic for 
> the purposes of instruction emulation would result in an across the board 
> perf hit when in most cases it isn't necessary that it be atomic.

And I didn't mean to fiddle with memcpy(), but rather create a
specialized copying function just for the use in the context of
emulation.

Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 16:23     ` Jan Beulich
@ 2014-01-09 16:30       ` Simon Graham
  0 siblings, 0 replies; 16+ messages in thread
From: Simon Graham @ 2014-01-09 16:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

> > My concern was that memcpy is (I assume!) highly optimized - it certainly
> > should be if it isn't and I would worry that a change to make it atomic for
> > the purposes of instruction emulation would result in an across the board
> > perf hit when in most cases it isn't necessary that it be atomic.
> 
> And I didn't mean to fiddle with memcpy(), but rather create a
> specialized copying function just for the use in the context of
> emulation.
> 

-sigh- OK I'll look at that - might not be as bad as I originally thought anyway.

> Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 16:21     ` Andrew Cooper
@ 2014-01-09 17:33       ` Simon Graham
  2014-01-10  9:29         ` Jan Beulich
  2014-01-10 15:47         ` Jan Beulich
  0 siblings, 2 replies; 16+ messages in thread
From: Simon Graham @ 2014-01-09 17:33 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: xen-devel, Jan Beulich

> > Well, that one is tough and I don't have a good answer... the only thing I
> > would say is that in our system we ALWAYS have the shadow memory
> > tracking enabled (to track changes to framebuffers).
> 
> Full shadow, or just logdirty?
> 
> With logdirty, frequent pagefaults will occur (which are costly in terms
> of vmexits), but I would not expect emulation to occur.
> 
> Even with full shadow, emulation only kicks in for non-standard RAM,
> which is basically IO to qemu, and instructions trying to write to the
> pagetables themselves, which are trapped and emulated for safety reasons.
> 

logdirty (somewhat modified to enable multiple framebuffers to be tracked).

I agree that it _shouldn't_ end up emulating -- but the shadow page fault routine has a ton of code paths that I've never managed to fully grok.

(As an aside, I've previously looked at other cases where the shadow code ends up emulating unexpected instructions that cause VMs to hang because the shadow module doesn't have a proper implementation of the x86_emulate callbacks... e.g. if you try to run the old MS Virtual Server product inside a Xen VM that has logdirty enabled it _will_ hard hang).

> > My concern was that memcpy is (I assume!) highly optimized - it certainly
> > should be if it isn't and I would worry that a change to make it atomic for the
> > purposes of instruction emulation would result in an across the board perf hit
> > when in most cases it isn't necessary that it be atomic.
> >
> > This would be fine for the writeback code in the shadow module BUT the
> > __hvm_copy routine is used generically in situations where atomicity is not
> > required...
> 
> __hvm_copy() is probably too low to be thinking about this.  There are
> many things such as grant_copy() which do not want "hardware like" copy
> properties, preferring instead to have less overhead.
> 

Yeah... I'll rework the patch to do this...

> ~Andrew


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 17:33       ` Simon Graham
@ 2014-01-10  9:29         ` Jan Beulich
  2014-01-10 13:09           ` Simon Graham
  2014-01-10 15:47         ` Jan Beulich
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2014-01-10  9:29 UTC (permalink / raw)
  To: Andrew Cooper, Simon Graham; +Cc: xen-devel

>>> On 09.01.14 at 18:33, Simon Graham <simon.graham@citrix.com> wrote:
> I agree that it _shouldn't_ end up emulating -- but the shadow page fault 
> routine has a ton of code paths that I've never managed to fully grok.
> 
> (As an aside, I've previously looked at other cases where the shadow code 
> ends up emulating unexpected instructions that cause VMs to hang 
> because the shadow module doesn't have a proper implementation of the 
> x86_emulate callbacks... e.g. if you try to run the old MS Virtual Server 
> product inside a Xen VM that has logdirty enabled it _will_ hard hang).

Perhaps that's then what really needs fixing?

Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-10  9:29         ` Jan Beulich
@ 2014-01-10 13:09           ` Simon Graham
  0 siblings, 0 replies; 16+ messages in thread
From: Simon Graham @ 2014-01-10 13:09 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper; +Cc: xen-devel

> >>> On 09.01.14 at 18:33, Simon Graham <simon.graham@citrix.com> wrote:
> > I agree that it _shouldn't_ end up emulating -- but the shadow page fault
> > routine has a ton of code paths that I've never managed to fully grok.
> >
> > (As an aside, I've previously looked at other cases where the shadow code
> > ends up emulating unexpected instructions that cause VMs to hang
> > because the shadow module doesn't have a proper implementation of the
> > x86_emulate callbacks... e.g. if you try to run the old MS Virtual Server
> > product inside a Xen VM that has logdirty enabled it _will_ hard hang).
> 
> Perhaps that's then what really needs fixing?
> 

Well, I don't disagree but I also think the two problems are orthogonal -- the shadow use of x86_emulate is incomplete but every use of x86_emulate suffers from the problem that copies to and from memory are not following the definition of the x86 architecture.

I previously looked at fixing the shadow use of x86_emulate but it's a big job that I don't have the expertise or time to address.

Simon

> Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-09 17:33       ` Simon Graham
  2014-01-10  9:29         ` Jan Beulich
@ 2014-01-10 15:47         ` Jan Beulich
  2014-01-10 15:57           ` Simon Graham
  1 sibling, 1 reply; 16+ messages in thread
From: Jan Beulich @ 2014-01-10 15:47 UTC (permalink / raw)
  To: Simon Graham; +Cc: Andrew Cooper, xen-devel

>>> On 09.01.14 at 18:33, Simon Graham <simon.graham@citrix.com> wrote:
>> __hvm_copy() is probably too low to be thinking about this.  There are
>> many things such as grant_copy() which do not want "hardware like" copy
>> properties, preferring instead to have less overhead.
>> 
> 
> Yeah... I'll rework the patch to do this...

Looking a little more closely, hvm_copy_{to,from}_guest_virt()
are what you want to have the adjusted behavior. That way
you'd in particular not touch the behavior of the more generic
copying routines copy_{to,from}_user_hvm(). And adjusting
the behavior would seem to be doable cleanly by adding e.g.
HVMCOPY_atomic as a new flag, thus informing __hvm_copy()
to not use memcpy().
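
Roughly (an illustrative fragment only -- the flag's bit position and
the helper's name are made up here and would need to fit in with the
existing HVMCOPY_* definitions):

#define HVMCOPY_atomic     (1u<<3)    /* hypothetical bit position */

    /* ... and in __hvm_copy()'s copy-to-guest path: */
    if ( flags & HVMCOPY_atomic )
        hvm_atomic_copy(p, buf, count);  /* width-aware helper as above */
    else
        memcpy(p, buf, count);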

Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-10 15:47         ` Jan Beulich
@ 2014-01-10 15:57           ` Simon Graham
  2014-01-10 16:26             ` Jan Beulich
  0 siblings, 1 reply; 16+ messages in thread
From: Simon Graham @ 2014-01-10 15:57 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, xen-devel

> >>> On 09.01.14 at 18:33, Simon Graham <simon.graham@citrix.com> wrote:
> >> __hvm_copy() is probably too low to be thinking about this.  There are
> >> many things such as grant_copy() which do not want "hardware like" copy
> >> properties, preferring instead to have less overhead.
> >>
> >
> > Yeah... I'll rework the patch to do this...
> 
> Looking a little more closely, hvm_copy_{to,from}_guest_virt()
> are what you want to have the adjusted behavior. That way
> you'd in particular not touch the behavior of the more generic
> copying routines copy_{to,from}_user_hvm(). And adjusting
> the behavior would seem to be doable cleanly by adding e.g.
> HVMCOPY_atomic as a new flag, thus informing __hvm_copy()
> to not use memcpy().
> 

Thanks -- I was coming to the same conclusion slowly too! Although I think I might call it HVMCOPY_emulate rather than atomic since it's not the case that the entire copy is atomic...

Now I just have to figure out why the shadow code doesn't use the hvm copy_to_ routine...

> Jan


* Re: Possible issue with x86_emulate when writing results back to memory
  2014-01-10 15:57           ` Simon Graham
@ 2014-01-10 16:26             ` Jan Beulich
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Beulich @ 2014-01-10 16:26 UTC (permalink / raw)
  To: Simon Graham; +Cc: Andrew Cooper, Tim Deegan, xen-devel

>>> On 10.01.14 at 16:57, Simon Graham <simon.graham@citrix.com> wrote:
>> >>> On 09.01.14 at 18:33, Simon Graham <simon.graham@citrix.com> wrote:
>> >> __hvm_copy() is probably too low to be thinking about this.  There are
>> >> many things such as grant_copy() which do not want "hardware like" copy
>> >> properties, preferring instead to have less overhead.
>> >>
>> >
>> > Yeah... I'll rework the patch to do this...
>> 
>> Looking a little more closely, hvm_copy_{to,from}_guest_virt()
>> are what you want to have the adjusted behavior. That way
>> you'd in particular not touch the behavior of the more generic
>> copying routines copy_{to,from}_user_hvm(). And adjusting
>> the behavior would seem to be doable cleanly by adding e.g.
>> HVMCOPY_atomic as a new flag, thus informing __hvm_copy()
>> to not use memcpy().
>> 
> 
> Thanks -- I was coming to the same conclusion slowly too! Although I think I 
> might call it HVMCOPY_emulate rather than atomic since it's not the case that 
> the entire copy is atomic...

I'd read "atomic" here as "as atomic as possible". "emulate" is bad
imo because the function may be used for other purposes too.

> Now I just have to figure out why the shadow code doesn't use the hvm 
> copy_to_ routine...

Perhaps because it doesn't work on virtual addresses (page tables
always hold physical ones)? Maybe it could use
hvm_copy_{to,from}_guest_phys(), but I would assume those
routines didn't exist yet when the shadow code was written. Tim
may know further details...

Jan

