[RFC PATCH] Drop error return if size mismatch is due to xcr0 settings

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
@ 2014-10-08 18:29 Don Koch
  2014-10-08 18:31 ` Don Koch
  2014-10-09 15:45 ` Don Koch
  0 siblings, 2 replies; 8+ messages in thread
From: Don Koch @ 2014-10-08 18:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Keir Fraser, Don Koch, Jan Beulich


This prevents migration from 4.3 to 4.4 (or newer) xen on AMD machines, at
least.

Signed-off-by: Don Koch <dkoch@verizon.com>
---
The question I have is whether to drop just the error return, leaving
a warning which might be useful for debugging, or drop the entire test
and the assigment before it. I think it would be prudent to test for
desc->length > max_size (i.e. XSTATE_CPUID's ecx value).

Personally, I think the warning is useful and a check against the max
size test should print an error/warning message and return an error;
the latter would require a new function.

The change that introduced this was 4cc134444.

The following is the minimum change "fix."

 xen/arch/x86/hvm/hvm.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index f0e1edc..dfedfb3 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2041,7 +2041,6 @@ static int hvm_load_cpu_xsave_states(struct domain *d, hvm_domain_context_t *h)
         printk(XENLOG_G_WARNING
                "HVM%d.%d restore mismatch: xsave length %u > %u\n",
                d->domain_id, vcpuid, desc->length, size);
-        return -EOPNOTSUPP;
     }
     /* Checking finished */
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
  2014-10-08 18:29 [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings Don Koch
@ 2014-10-08 18:31 ` Don Koch
  2014-10-09 15:45 ` Don Koch
  1 sibling, 0 replies; 8+ messages in thread
From: Don Koch @ 2014-10-08 18:31 UTC (permalink / raw)
  To: Don Koch; +Cc: Keir Fraser, Jan Beulich, xen-devel

Ignore this version of the patch. Sent using wrong alias (i.e. it didn't go to
the list). Apologies for the noise.

-d

On Wed, 8 Oct 2014 14:29:36 -0400
Don Koch <dkoch@verizon.com> wrote:

> 
> This prevents migration from 4.3 to 4.4 (or newer) xen on AMD machines, at
> least.
> 
> Signed-off-by: Don Koch <dkoch@verizon.com>
> ---
> The question I have is whether to drop just the error return, leaving
> a warning which might be useful for debugging, or drop the entire test
> and the assigment before it. I think it would be prudent to test for
> desc->length > max_size (i.e. XSTATE_CPUID's ecx value).
> 
> Personally, I think the warning is useful and a check against the max
> size test should print an error/warning message and return an error;
> the latter would require a new function.
> 
> The change that introduced this was 4cc134444.
> 
> The following is the minimum change "fix."
> 
>  xen/arch/x86/hvm/hvm.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index f0e1edc..dfedfb3 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2041,7 +2041,6 @@ static int hvm_load_cpu_xsave_states(struct domain *d, hvm_domain_context_t *h)
>          printk(XENLOG_G_WARNING
>                 "HVM%d.%d restore mismatch: xsave length %u > %u\n",
>                 d->domain_id, vcpuid, desc->length, size);
> -        return -EOPNOTSUPP;
>      }
>      /* Checking finished */
>  
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
  2014-10-08 18:29 [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings Don Koch
  2014-10-08 18:31 ` Don Koch
@ 2014-10-09 15:45 ` Don Koch
  2014-10-09 15:56   ` Andrew Cooper
  1 sibling, 1 reply; 8+ messages in thread
From: Don Koch @ 2014-10-09 15:45 UTC (permalink / raw)
  To: Don Koch; +Cc: Keir Fraser, Jan Beulich, xen-devel

On Wed, 8 Oct 2014 14:29:36 -0400
Don Koch <dkoch@verizon.com> wrote:

> 
> This prevents migration from 4.3 to 4.4 (or newer) xen on AMD machines, at
> least.

A clarification: a previous change made migration from xen 4.3 to 4.4 on AMD
machine fail. This patch provides a (minimal) fix for the problem. IMO, it should
be targeted for 4.5 and 4.4.x (whatever the next 'x' is).

If it's decided to add the other changes I've suggested, those could be provided
in a separate patch, especially if we're time constrained (like for getting it
into 4.5).

-d

> Signed-off-by: Don Koch <dkoch@verizon.com>
> ---
> The question I have is whether to drop just the error return, leaving
> a warning which might be useful for debugging, or drop the entire test
> and the assigment before it. I think it would be prudent to test for
> desc->length > max_size (i.e. XSTATE_CPUID's ecx value).
> 
> Personally, I think the warning is useful and a check against the max
> size test should print an error/warning message and return an error;
> the latter would require a new function.
> 
> The change that introduced this was 4cc134444.
> 
> The following is the minimum change "fix."
> 
>  xen/arch/x86/hvm/hvm.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> index f0e1edc..dfedfb3 100644
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -2041,7 +2041,6 @@ static int hvm_load_cpu_xsave_states(struct domain *d, hvm_domain_context_t *h)
>          printk(XENLOG_G_WARNING
>                 "HVM%d.%d restore mismatch: xsave length %u > %u\n",
>                 d->domain_id, vcpuid, desc->length, size);
> -        return -EOPNOTSUPP;
>      }
>      /* Checking finished */
>  
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
  2014-10-09 15:45 ` Don Koch
@ 2014-10-09 15:56   ` Andrew Cooper
  2014-10-09 16:10     ` Don Koch
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2014-10-09 15:56 UTC (permalink / raw)
  To: Don Koch; +Cc: Keir Fraser, Jan Beulich, xen-devel

On 09/10/14 16:45, Don Koch wrote:
> On Wed, 8 Oct 2014 14:29:36 -0400
> Don Koch <dkoch@verizon.com> wrote:
>
>> This prevents migration from 4.3 to 4.4 (or newer) xen on AMD machines, at
>> least.
> A clarification: a previous change made migration from xen 4.3 to 4.4 on AMD
> machine fail. This patch provides a (minimal) fix for the problem. IMO, it should
> be targeted for 4.5 and 4.4.x (whatever the next 'x' is).
>
> If it's decided to add the other changes I've suggested, those could be provided
> in a separate patch, especially if we're time constrained (like for getting it
> into 4.5).
>
> -d

Can you explain what the bug is and why this is an appropriate fix?

What is happening here is that the migration stream is providing an
xsave area larger than the size it should be based on the xcr0 sent with it.

This means that either the sending Xen sent a bogus record, or the new
Xen has incorrectly calculated the size from xcr0, but *both* of these
cases are potential VM data corruption issues.

As this currently stands with no analysis of the problem and proof as to
why the change is safe, dropping the "return -EOPNOTSUPP;" is not valid IMO.

~Andrew

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
  2014-10-09 15:56   ` Andrew Cooper
@ 2014-10-09 16:10     ` Don Koch
  2014-10-09 16:20       ` Andrew Cooper
  0 siblings, 1 reply; 8+ messages in thread
From: Don Koch @ 2014-10-09 16:10 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Keir Fraser, Jan Beulich, xen-devel

On Thu, 9 Oct 2014 16:56:49 +0100
Andrew Cooper <andrew.cooper3@citrix.com> wrote:

> On 09/10/14 16:45, Don Koch wrote:
> > On Wed, 8 Oct 2014 14:29:36 -0400
> > Don Koch <dkoch@verizon.com> wrote:
> >
> >> This prevents migration from 4.3 to 4.4 (or newer) xen on AMD machines, at
> >> least.
> > A clarification: a previous change made migration from xen 4.3 to 4.4 on AMD
> > machine fail. This patch provides a (minimal) fix for the problem. IMO, it should
> > be targeted for 4.5 and 4.4.x (whatever the next 'x' is).
> >
> > If it's decided to add the other changes I've suggested, those could be provided
> > in a separate patch, especially if we're time constrained (like for getting it
> > into 4.5).
> >
> > -d
> 
> Can you explain what the bug is and why this is an appropriate fix?
> 
> What is happening here is that the migration stream is providing an
> xsave area larger than the size it should be based on the xcr0 sent with it.

The old 4.3 system is providing a maximum size xsave area. The 4.4 xen
is calculating a smaller area required for said xsave area. All this means
is that the overflow at the end is meaningless and can be ignored (i.e.
restoring it shouldn't hurt).  If the data sent was smaller than what was
expected, i.e. something is missing, that would be an error.

I consider leaving the check and warning message useful as it allows
some debugging info if there indeed was something wrong. I tested this
on an AMD migrating from 4.3 to 4.4 and checking various ymm registers
with no data lost.

> This means that either the sending Xen sent a bogus record, or the new
> Xen has incorrectly calculated the size from xcr0, but *both* of these
> cases are potential VM data corruption issues.
> 
> As this currently stands with no analysis of the problem and proof as to
> why the change is safe, dropping the "return -EOPNOTSUPP;" is not valid IMO.
> 
> ~Andrew

Not having some sort of fix for this problem means a flag day for anyone
running on AMD.

Thanks,
-d

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
  2014-10-09 16:10     ` Don Koch
@ 2014-10-09 16:20       ` Andrew Cooper
  2014-10-10  6:36         ` Jan Beulich
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2014-10-09 16:20 UTC (permalink / raw)
  To: Don Koch; +Cc: Keir Fraser, Jan Beulich, xen-devel

On 09/10/14 17:10, Don Koch wrote:
> On Thu, 9 Oct 2014 16:56:49 +0100
> Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
>> On 09/10/14 16:45, Don Koch wrote:
>>> On Wed, 8 Oct 2014 14:29:36 -0400
>>> Don Koch <dkoch@verizon.com> wrote:
>>>
>>>> This prevents migration from 4.3 to 4.4 (or newer) xen on AMD machines, at
>>>> least.
>>> A clarification: a previous change made migration from xen 4.3 to 4.4 on AMD
>>> machine fail. This patch provides a (minimal) fix for the problem. IMO, it should
>>> be targeted for 4.5 and 4.4.x (whatever the next 'x' is).
>>>
>>> If it's decided to add the other changes I've suggested, those could be provided
>>> in a separate patch, especially if we're time constrained (like for getting it
>>> into 4.5).
>>>
>>> -d
>> Can you explain what the bug is and why this is an appropriate fix?
>>
>> What is happening here is that the migration stream is providing an
>> xsave area larger than the size it should be based on the xcr0 sent with it.
> The old 4.3 system is providing a maximum size xsave area. The 4.4 xen
> is calculating a smaller area required for said xsave area. All this means
> is that the overflow at the end is meaningless and can be ignored (i.e.
> restoring it shouldn't hurt).  If the data sent was smaller than what was
> expected, i.e. something is missing, that would be an error.
>
> I consider leaving the check and warning message useful as it allows
> some debugging info if there indeed was something wrong. I tested this
> on an AMD migrating from 4.3 to 4.4 and checking various ymm registers
> with no data lost.

Right ok - given this info, the patch looks plausible, but these details
must be in the patch description.

Given this diagnosis, I think it is reasonable to not fail the hypercall
if we detect this condition and confirm that all unexpectedly-extra
bytes are 0.

In the case that there is a non-zero byte in there, we must fail the
hypercall to prevent VM data corruption.  The warning can be dropped for
the "fixing up from Xen 4.3" case, but a sentence of two comment in the
code will certainly be needed as justification.

~Andrew

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
  2014-10-09 16:20       ` Andrew Cooper
@ 2014-10-10  6:36         ` Jan Beulich
  2014-10-10 15:30           ` Don Koch
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2014-10-10  6:36 UTC (permalink / raw)
  To: Andrew Cooper, Don Koch; +Cc: Keir Fraser, xen-devel

>>> On 09.10.14 at 18:20, <andrew.cooper3@citrix.com> wrote:
> On 09/10/14 17:10, Don Koch wrote:
>> On Thu, 9 Oct 2014 16:56:49 +0100
>> Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> Can you explain what the bug is and why this is an appropriate fix?
>>>
>>> What is happening here is that the migration stream is providing an
>>> xsave area larger than the size it should be based on the xcr0 sent with it.
>> The old 4.3 system is providing a maximum size xsave area. The 4.4 xen
>> is calculating a smaller area required for said xsave area. All this means
>> is that the overflow at the end is meaningless and can be ignored (i.e.
>> restoring it shouldn't hurt).  If the data sent was smaller than what was
>> expected, i.e. something is missing, that would be an error.
>>
>> I consider leaving the check and warning message useful as it allows
>> some debugging info if there indeed was something wrong. I tested this
>> on an AMD migrating from 4.3 to 4.4 and checking various ymm registers
>> with no data lost.
> 
> Right ok - given this info, the patch looks plausible, but these details
> must be in the patch description.
> 
> Given this diagnosis, I think it is reasonable to not fail the hypercall
> if we detect this condition and confirm that all unexpectedly-extra
> bytes are 0.
> 
> In the case that there is a non-zero byte in there, we must fail the
> hypercall to prevent VM data corruption.  The warning can be dropped for
> the "fixing up from Xen 4.3" case, but a sentence of two comment in the
> code will certainly be needed as justification.

That's not going to be sufficient: The error return (not so sure about
the warning) should be dropped only in exactly the one case where
a fixup is really needed, i.e. not blindly for any length larger than that
needed to match xcr0_accum, but no larger than that needed to cover
all features (and as you say also not for any contents, i.e. a zero
check must also be added to guard the bypass of the error return).

Additionally in the to be extended patch description it should be made
clear that only 4.3.0 is affected (and then again early 4.2.x appear
to suffer the same issue, so just saying 4.3 is in any event misleading);
in fact with that I wonder whether fixing this is really worthwhile - it
should go almost without saying that stable releases (of which by now
we have seen three) get applied. Same goes for mentioning AMD here:
I can't see anything AMD specific in what is being done.

Jan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings
  2014-10-10  6:36         ` Jan Beulich
@ 2014-10-10 15:30           ` Don Koch
  0 siblings, 0 replies; 8+ messages in thread
From: Don Koch @ 2014-10-10 15:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Keir Fraser, xen-devel

On Fri, 10 Oct 2014 07:36:11 +0100
Jan Beulich <JBeulich@suse.com> wrote:

> >>> On 09.10.14 at 18:20, <andrew.cooper3@citrix.com> wrote:
> > On 09/10/14 17:10, Don Koch wrote:
> >> On Thu, 9 Oct 2014 16:56:49 +0100
> >> Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> >>> Can you explain what the bug is and why this is an appropriate fix?
> >>>
> >>> What is happening here is that the migration stream is providing an
> >>> xsave area larger than the size it should be based on the xcr0 sent with it.
> >> The old 4.3 system is providing a maximum size xsave area. The 4.4 xen
> >> is calculating a smaller area required for said xsave area. All this means
> >> is that the overflow at the end is meaningless and can be ignored (i.e.
> >> restoring it shouldn't hurt).  If the data sent was smaller than what was
> >> expected, i.e. something is missing, that would be an error.
> >>
> >> I consider leaving the check and warning message useful as it allows
> >> some debugging info if there indeed was something wrong. I tested this
> >> on an AMD migrating from 4.3 to 4.4 and checking various ymm registers
> >> with no data lost.
> > 
> > Right ok - given this info, the patch looks plausible, but these details
> > must be in the patch description.
> > 
> > Given this diagnosis, I think it is reasonable to not fail the hypercall
> > if we detect this condition and confirm that all unexpectedly-extra
> > bytes are 0.
> > 
> > In the case that there is a non-zero byte in there, we must fail the
> > hypercall to prevent VM data corruption.  The warning can be dropped for
> > the "fixing up from Xen 4.3" case, but a sentence of two comment in the
> > code will certainly be needed as justification.
> 
> That's not going to be sufficient: The error return (not so sure about
> the warning) should be dropped only in exactly the one case where
> a fixup is really needed, i.e. not blindly for any length larger than that
> needed to match xcr0_accum, but no larger than that needed to cover
> all features (and as you say also not for any contents, i.e. a zero
> check must also be added to guard the bypass of the error return).

I had suggested adding a check against the maximum size, albeit the type
of check was implied.

I'm currently testing a change where the area between xcr0-derived size
and maximum is checked for zero values.

> Additionally in the to be extended patch description it should be made
> clear that only 4.3.0 is affected (and then again early 4.2.x appear
> to suffer the same issue, so just saying 4.3 is in any event misleading);
> in fact with that I wonder whether fixing this is really worthwhile - it
> should go almost without saying that stable releases (of which by now
> we have seen three) get applied. Same goes for mentioning AMD here:
> I can't see anything AMD specific in what is being done.

Since we're talking about live migration, I would have figured that
people would have inferred that 4.3 implied any earlier rev.

The reason I mention AMD is because that is where the problem showed
up. AMD has an additional area that is not on Intel, Lightweight
Profiling, which isn't used that often and, therefore, that area in the
xsave buffer isn't used. It would probably happen on an Intel machine
if someone disabled the 256-bit SSE.

It seems to me that my proposed hack is no worse than the existing
4.3 to 4.3 migration code; i.e. we just copy it over regardless of
whether it's used or not. But I do agree that extra checks are a good
thing and am willing to put those in.

Unless I hear otherwise, either with other suggestions or "it's not
worth fixing," I'll update the patch to include the zero check and the
size check against the maximum size per xcr0.

Should I flag this as possible 4.5 fodder?

> Jan

Thanks,
-d

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2014-10-10 15:30 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-08 18:29 [RFC PATCH] Drop error return if size mismatch is due to xcr0 settings Don Koch
2014-10-08 18:31 ` Don Koch
2014-10-09 15:45 ` Don Koch
2014-10-09 15:56   ` Andrew Cooper
2014-10-09 16:10     ` Don Koch
2014-10-09 16:20       ` Andrew Cooper
2014-10-10  6:36         ` Jan Beulich
2014-10-10 15:30           ` Don Koch

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.