* [PATCH] xen: only clobber multicall elements without error
@ 2018-11-23 13:25 Juergen Gross
  2018-11-23 13:28 ` Andrew Cooper
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-23 13:25 UTC (permalink / raw)
  To: xen-devel
  Cc: Juergen Gross, sstabellini, wei.liu2, konrad.wilk, George.Dunlap,
	andrew.cooper3, ian.jackson, tim, julien.grall, jbeulich

In debug builds the hypervisor will deliberately clobber processed
elements of the multicall structure. In order to ease diagnostic data
printout in the affected guest only clobber elements which didn't
return an error.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 xen/common/multicall.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/common/multicall.c b/xen/common/multicall.c
index 5a199ebf8f..48622619ce 100644
--- a/xen/common/multicall.c
+++ b/xen/common/multicall.c
@@ -74,6 +74,7 @@ do_multicall(
         ASSERT_NOT_IN_ATOMIC();
 
 #ifndef NDEBUG
+        if ( (long)mcs->call.result >= 0 )
         {
             /*
              * Deliberately corrupt the contents of the multicall structure.
-- 
2.16.4
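
For illustration, a minimal sketch of how the clobbering block reads with
the check applied (a fragment only; the 0xAA fill pattern, the
copy_entry_to_guest() helper and the preservation of the result field are
assumptions made for the example, not the actual Xen code):

#ifndef NDEBUG
        /*
         * Debug builds deliberately corrupt the guest-visible entry once
         * it has been processed, so guests cannot grow to rely on the
         * arguments being preserved.  With the patch, an entry whose
         * handler returned an error (negative result) is left untouched
         * for guest-side diagnostics.
         */
        if ( (long)mcs->call.result >= 0 )
        {
            struct multicall_entry corrupt;

            memset(&corrupt, 0xAA, sizeof(corrupt));
            corrupt.result = mcs->call.result;        /* keep the status */

            copy_entry_to_guest(call_list, &corrupt); /* hypothetical helper */
        }
#endif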



* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-23 13:25 [PATCH] xen: only clobber multicall elements without error Juergen Gross
@ 2018-11-23 13:28 ` Andrew Cooper
  2018-11-23 13:40   ` Juergen Gross
  2018-11-26 10:58 ` Jan Beulich
       [not found] ` <5BFBD1C602000078001FFE09@suse.com>
  2 siblings, 1 reply; 22+ messages in thread
From: Andrew Cooper @ 2018-11-23 13:28 UTC (permalink / raw)
  To: Juergen Gross, xen-devel
  Cc: sstabellini, wei.liu2, konrad.wilk, George.Dunlap, ian.jackson,
	tim, julien.grall, jbeulich

On 23/11/2018 13:25, Juergen Gross wrote:
> In debug builds the hypervisor will deliberately clobber processed
> elements of the multicall structure. In order to ease diagnostic data
> printout in the affected guest only clobber elements which didn't
> return an error.
>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  xen/common/multicall.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/xen/common/multicall.c b/xen/common/multicall.c
> index 5a199ebf8f..48622619ce 100644
> --- a/xen/common/multicall.c
> +++ b/xen/common/multicall.c
> @@ -74,6 +74,7 @@ do_multicall(
>          ASSERT_NOT_IN_ATOMIC();
>  
>  #ifndef NDEBUG
> +        if ( (long)mcs->call.result >= 0 )

While I appreciate your point and agree that this is good in principle,
the failure condition is per-hypercall, and not always negative.

I've tried playing a similar game before and couldn't come up with a
viable option.

~Andrew


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-23 13:28 ` Andrew Cooper
@ 2018-11-23 13:40   ` Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-23 13:40 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel
  Cc: sstabellini, wei.liu2, konrad.wilk, George.Dunlap, ian.jackson,
	tim, julien.grall, jbeulich

On 23/11/2018 14:28, Andrew Cooper wrote:
> On 23/11/2018 13:25, Juergen Gross wrote:
>> In debug builds the hypervisor will deliberately clobber processed
>> elements of the multicall structure. In order to ease diagnostic data
>> printout in the affected guest only clobber elements which didn't
>> return an error.
>>
>> Signed-off-by: Juergen Gross <jgross@suse.com>
>> ---
>>  xen/common/multicall.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/xen/common/multicall.c b/xen/common/multicall.c
>> index 5a199ebf8f..48622619ce 100644
>> --- a/xen/common/multicall.c
>> +++ b/xen/common/multicall.c
>> @@ -74,6 +74,7 @@ do_multicall(
>>          ASSERT_NOT_IN_ATOMIC();
>>  
>>  #ifndef NDEBUG
>> +        if ( (long)mcs->call.result >= 0 )
> 
> While I appreciate your point and agree that this is good in principle,
> the failure condition is per-hypercall, and not always negative.

In theory you may be right.

OTOH the related code in the Linux kernel checks exactly what I am
aiming at: in case any multicall element has a negative result, this is
regarded as an error and a WARN is issued.

A negative return value of the multicall hypercall itself is regarded as
a BUG().

> I've tried playing a similar game before and couldn't come up with a
> viable option.

In the end the result will be better than today, even if the heuristic
for detecting an error in a multicall entry might not be optimal.


Juergen
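
A simplified sketch of the guest-side checks described above (the
function name is hypothetical and the real Linux code differs in detail):

/* Flush a batch of multicall entries and check the results. */
static void flush_multicalls(struct multicall_entry *entries, unsigned int nr)
{
        unsigned int i;

        /* A failure of the multicall hypercall itself is fatal. */
        BUG_ON(HYPERVISOR_multicall(entries, nr) < 0);

        /* Any element with a negative result is treated as an error. */
        for (i = 0; i < nr; i++)
                WARN((long)entries[i].result < 0,
                     "multicall entry %u failed: result %ld\n",
                     i, (long)entries[i].result);
}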


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-23 13:25 [PATCH] xen: only clobber multicall elements without error Juergen Gross
  2018-11-23 13:28 ` Andrew Cooper
@ 2018-11-26 10:58 ` Jan Beulich
  2018-11-26 14:54   ` Ian Jackson
       [not found] ` <5BFBD1C602000078001FFE09@suse.com>
  2 siblings, 1 reply; 22+ messages in thread
From: Jan Beulich @ 2018-11-26 10:58 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

>>> On 23.11.18 at 14:25, <jgross@suse.com> wrote:
> In debug builds the hypervisor will deliberately clobber processed
> elements of the multicall structure. In order to ease diagnostic data
> printout in the affected guest only clobber elements which didn't
> return an error.

Besides what Andrew has said such a relaxation reduces
the guarding against bad guest side code. If a guest really
wishes to produce diagnostics, I think it should go to the
lengths of copying arguments (if they can't be re-calculated
anyway). Suppressing the clobbering in more cases merely
invites guests to read the arguments after the call, which
they simply should not do. Not clobbering the values in
release builds is a performance choice, and we ought to be
allowed to change our opinion regarding this implementation
detail at any point in time.

Jan




* Re: [PATCH] xen: only clobber multicall elements without error
       [not found] ` <5BFBD1C602000078001FFE09@suse.com>
@ 2018-11-26 13:52   ` Juergen Gross
  2018-11-26 14:01     ` Jan Beulich
       [not found]     ` <5BFBFCD3020000780020004F@suse.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-26 13:52 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

On 26/11/2018 11:58, Jan Beulich wrote:
>>>> On 23.11.18 at 14:25, <jgross@suse.com> wrote:
>> In debug builds the hypervisor will deliberately clobber processed
>> elements of the multicall structure. In order to ease diagnostic data
>> printout in the affected guest only clobber elements which didn't
>> return an error.
> 
> Besides what Andrew has said such a relaxation reduces
> the guarding against bad guest side code. If a guest really
> wishes to produce diagnostics, I think it should go to the
> lengths of copying arguments (if they can't be re-calculated
> anyway). Suppressing the clobbering in more cases merely
> invites guests to read the arguments after the call, which
> they simply should not do. Not clobbering the values in
> release builds is a performance choice, and we ought to be
> allowed to change our opinion regarding this implementation
> detail at any point in time.

Right. And not copying the values before the call is a performance
choice on the guest side, as errors are not the common case.

I know there is no guarantee for the guest that the values are preserved
after the call, but in the error case (which should be _very_ rare)
having them preserved will make diagnosis much easier.

I don't think the hypervisor should explicitly try to make it as hard as
possible for the guest to find problems in the code.


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-26 13:52   ` Juergen Gross
@ 2018-11-26 14:01     ` Jan Beulich
       [not found]     ` <5BFBFCD3020000780020004F@suse.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2018-11-26 14:01 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
> I don't think the hypervisor should explicitly try to make it as hard as
> possible for the guest to find problems in the code.

That's indeed not the hypervisor's goal. Instead it tries to make
it as hard as possible for the guest (developer) to make wrong
assumptions.

Jan




* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-26 14:23       ` Juergen Gross
  2018-11-26 14:58         ` Jan Beulich
       [not found]         ` <5BFC0A1302000078002000E1@suse.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-26 14:23 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

On 26/11/2018 15:01, Jan Beulich wrote:
>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>> I don't think the hypervisor should explicitly try to make it as hard as
>> possible for the guest to find problems in the code.
> 
> That's indeed not the hypervisor's goal. Instead it tries to make
> it as hard as possible for the guest (developer) to make wrong
> assumptions.

Let's look at the current example why I wrote this patch:

The Linux kernel's use of multicalls should never trigger any single
call to return an error (return value < 0). A kernel compiled for
productive use will catch such errors, but has no knowledge which
single call has failed, as it doesn't keep track of the single entries
(non-productive kernels have an option available in the respective
source to copy the entries before doing the multicall in order to have
some diagnostic data available in case of such an error). Catching an
error from a multicall right now means a WARN() with a stack backtrace
(for the multicall itself, not for the entry causing the error).

I have a customer report for a case where such a backtrace was produced
and a kernel crash some seconds later, obviously due to illegally
unmapped memory pages resulting from the failed multicall. Unfortunately
there are multiple possibilities what might have gone wrong and I don't
know which one was the culprit. The problem can't be a very common one,
because there is only one such report right now, which might depend on
a special driver.

Finding this bug without a known reproducer and the current amount of
diagnostic data is next to impossible. So I'd like to have more data
available without having to hurt performance for the 99.999999% of the
cases where nothing bad happens.

In case you have an idea how to solve this problem in another way I'd be
happy to follow that route. I'd really like to be able to have a better
clue in case such an error occurs in future.


Juergen
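
A sketch of the debug-only snapshot described above (a fragment; the
MC_DEBUG name, the debug_copy buffer and the dump format are assumptions
for the example, not the actual Linux code):

#if MC_DEBUG
        /* Snapshot the entries before issuing the multicall, so something
         * meaningful can be printed even if Xen clobbers them afterwards. */
        memcpy(debug_copy, entries, nr * sizeof(*entries));
#endif

        BUG_ON(HYPERVISOR_multicall(entries, nr) < 0);

        for (i = 0; i < nr; i++) {
                if ((long)entries[i].result >= 0)
                        continue;
                WARN(1, "multicall entry %u failed: result %ld\n",
                     i, (long)entries[i].result);
#if MC_DEBUG
                /* Dump the pre-call snapshot rather than the (possibly
                 * clobbered) live entry. */
                pr_err("  op %lu args %lx %lx %lx %lx %lx %lx\n",
                       debug_copy[i].op,
                       debug_copy[i].args[0], debug_copy[i].args[1],
                       debug_copy[i].args[2], debug_copy[i].args[3],
                       debug_copy[i].args[4], debug_copy[i].args[5]);
#endif
        }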


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-26 10:58 ` Jan Beulich
@ 2018-11-26 14:54   ` Ian Jackson
  2018-11-26 15:25     ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Ian Jackson @ 2018-11-26 14:54 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Juergen Gross, Stefano Stabellini, Wei Liu,
	Konrad Rzeszutek Wilk, George Dunlap, Andrew Cooper, Tim Deegan,
	Julien Grall, xen-devel

Jan Beulich writes ("Re: [PATCH] xen: only clobber multicall elements without error"):
> On 23.11.18 at 14:25, <jgross@suse.com> wrote:
> > In debug builds the hypervisor will deliberately clobber processed
> > elements of the multicall structure. In order to ease diagnostic data
> > printout in the affected guest only clobber elements which didn't
> > return an error.
> 
> Besides what Andrew has said such a relaxation reduces
> the guarding against bad guest side code. If a guest really
> wishes to produce diagnostics, I think it should go to the
> lengths of copying arguments (if they can't be re-calculated
> anyway). Suppressing the clobbering in more cases merely
> invites guests to read the arguments after the call, which
> they simply should not do. Not clobbering the values in
> release builds is a performance choice, and we ought to be
> allowed to change our opinion regarding this implementation
> detail at any point in time.

Maybe they could be clobbered losslessly ?  Eg, by xoring with 0xaa or
something.

Ian.
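
A sketch of what such lossless clobbering could look like on the
hypervisor side (the pattern, the function name and the entry layout are
assumptions; nothing like this exists in Xen today):

/* XOR the guest-visible entry with a fixed pattern instead of overwriting
 * it: guests that wrongly reuse the arguments still break, but a debugger
 * who knows the pattern can recover the original values. */
#define MC_CLOBBER_PATTERN 0xaaaaaaaaUL

static void clobber_entry(struct multicall_entry *ent)
{
    unsigned int i;

    ent->op ^= MC_CLOBBER_PATTERN;
    for ( i = 0; i < ARRAY_SIZE(ent->args); i++ )
        ent->args[i] ^= MC_CLOBBER_PATTERN;
    /* ent->result is left untouched so the guest still sees the status. */
}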


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-26 14:23       ` Juergen Gross
@ 2018-11-26 14:58         ` Jan Beulich
       [not found]         ` <5BFC0A1302000078002000E1@suse.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2018-11-26 14:58 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>>> I don't think the hypervisor should explicitly try to make it as hard as
>>> possible for the guest to find problems in the code.
>> 
>> That's indeed not the hypervisor's goal. Instead it tries to make
>> it as hard as possible for the guest (developer) to make wrong
>> assumptions.
> 
> Let's look at the current example why I wrote this patch:
> 
> The Linux kernel's use of multicalls should never trigger any single
> call to return an error (return value < 0). A kernel compiled for
> productive use will catch such errors, but has no knowledge which
> single call has failed, as it doesn't keep track of the single entries
> (non-productive kernels have an option available in the respective
> source to copy the entries before doing the multicall in order to have
> some diagnostic data available in case of such an error). Catching an
> error from a multicall right now means a WARN() with a stack backtrace
> (for the multicall itself, not for the entry causing the error).
> 
> I have a customer report for a case where such a backtrace was produced
> and a kernel crash some seconds later, obviously due to illegally
> unmapped memory pages resulting from the failed multicall. Unfortunately
> there are multiple possibilities what might have gone wrong and I don't
> know which one was the culprit. The problem can't be a very common one,
> because there is only one such report right now, which might depend on
> a special driver.
> 
> Finding this bug without a known reproducer and the current amount of
> diagnostic data is next to impossible. So I'd like to have more data
> available without having to hurt performance for the 99.999999% of the
> cases where nothing bad happens.
> 
> In case you have an idea how to solve this problem in another way I'd be
> happy to follow that route. I'd really like to be able to have a better
> clue in case such an error occurs in future.

Since you have a production kernel, I assume you also have a
production hypervisor. This hypervisor doesn't clobber the
arguments if I'm not mistaken. Therefore
- in the debugging scenario you (can) have all data available by
  virtue of the information getting copied in the kernel,
- in the release scenario you have all data available since it's
  left un-clobbered.
Am I missing anything (I don't view mixed debug/release setups
of kernel and hypervisor as overly important here)?

Jan




* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-26 14:54   ` Ian Jackson
@ 2018-11-26 15:25     ` Juergen Gross
  2018-11-26 15:40       ` Ian Jackson
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-11-26 15:25 UTC (permalink / raw)
  To: Ian Jackson, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Julien Grall,
	xen-devel

On 26/11/2018 15:54, Ian Jackson wrote:
> Jan Beulich writes ("Re: [PATCH] xen: only clobber multicall elements without error"):
>> On 23.11.18 at 14:25, <jgross@suse.com> wrote:
>>> In debug builds the hypervisor will deliberately clobber processed
>>> elements of the multicall structure. In order to ease diagnostic data
>>> printout in the affected guest only clobber elements which didn't
>>> return an error.
>>
>> Besides what Andrew has said such a relaxation reduces
>> the guarding against bad guest side code. If a guest really
>> wishes to produce diagnostics, I think it should go to the
>> lengths of copying arguments (if they can't be re-calculated
>> anyway). Suppressing the clobbering in more cases merely
>> invites guests to read the arguments after the call, which
>> they simply should not do. Not clobbering the values in
>> release builds is a performance choice, and we ought to be
>> allowed to change our opinion regarding this implementation
>> detail at any point in time.
> 
> Maybe they could be clobbered losslessly ?  Eg, by xoring with 0xaa or
> something.

This would be rather hacky: I'd need to find out if clobbering was
performed or not and the way of clobbering would be kind of an interface
which I guess we'd like to avoid.


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-26 15:29           ` Juergen Gross
  2018-11-26 16:01             ` Julien Grall
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-11-26 15:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

On 26/11/2018 15:58, Jan Beulich wrote:
>>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>>>> I don't think the hypervisor should explicitly try to make it as hard as
>>>> possible for the guest to find problems in the code.
>>>
>>> That's indeed not the hypervisor's goal. Instead it tries to make
>>> it as hard as possible for the guest (developer) to make wrong
>>> assumptions.
>>
>> Let's look at the current example why I wrote this patch:
>>
>> The Linux kernel's use of multicalls should never trigger any single
>> call to return an error (return value < 0). A kernel compiled for
>> productive use will catch such errors, but has no knowledge which
>> single call has failed, as it doesn't keep track of the single entries
>> (non-productive kernels have an option available in the respective
>> source to copy the entries before doing the multicall in order to have
>> some diagnostic data available in case of such an error). Catching an
>> error from a multicall right now means a WARN() with a stack backtrace
>> (for the multicall itself, not for the entry causing the error).
>>
>> I have a customer report for a case where such a backtrace was produced
>> and a kernel crash some seconds later, obviously due to illegally
>> unmapped memory pages resulting from the failed multicall. Unfortunately
>> there are multiple possibilities what might have gone wrong and I don't
>> know which one was the culprit. The problem can't be a very common one,
>> because there is only one such report right now, which might depend on
>> a special driver.
>>
>> Finding this bug without a known reproducer and the current amount of
>> diagnostic data is next to impossible. So I'd like to have more data
>> available without having to hurt performance for the 99.999999% of the
>> cases where nothing bad happens.
>>
>> In case you have an idea how to solve this problem in another way I'd be
>> happy to follow that route. I'd really like to be able to have a better
>> clue in case such an error occurs in future.
> 
> Since you have a production kernel, I assume you also have a
> production hypervisor. This hypervisor doesn't clobber the
> arguments if I'm not mistaken. Therefore
> - in the debugging scenario you (can) have all data available by
>   virtue of the information getting copied in the kernel,
> - in the release scenario you have all data available since it's
>   left un-clobbered.
> Am I missing anything (I don't view mixed debug/release setups
> of kernel and hypervisor as overly important here)?

No, you are missing nothing here. OTOH a debug hypervisor destroying
debug data is kind of weird, so I posted this patch.

I'll add the related Linux kernel patch (in case it is acked by Boris)
with or without this hypervisor patch, but I thought it would be better
to have the hypervisor patch in place, especially as e.g. a hypervisor
from xen-unstable might have a bug which could be easier to diagnose
with this patch in place.


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-26 15:25     ` Juergen Gross
@ 2018-11-26 15:40       ` Ian Jackson
  0 siblings, 0 replies; 22+ messages in thread
From: Ian Jackson @ 2018-11-26 15:40 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Tim Deegan, Julien Grall,
	Jan Beulich, xen-devel

Juergen Gross writes ("Re: [PATCH] xen: only clobber multicall elements without error"):
> On 26/11/2018 15:54, Ian Jackson wrote:
> > Maybe they could be clobbered losslessly ?  Eg, by xoring with 0xaa or
> > something.
> 
> This would be rather hacky: I'd need to find out if clobbering was
> performed or not and the way of clobbering would be kind of an interface
> which I guess we'd like to avoid.

You can bake the precise method of clobbering into your kernel
debugging dump stuff, along with a mechanism which guesses whether
things have been clobbered, so long as you don't go utterly wrong if
your guesses are wrong.

After all, Xen doesn't guarantee to preserve the arguments at all, so
you mustn't rely on them being preserved anyway.

This seems fine for a debug dump mechanism - debug dump mechanisms are
generally best effort anyway.

Ian.
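
A sketch of the kind of best-effort guessing suggested here, on the
guest side (the pattern has to match whatever the hypervisor build uses
and is purely an assumption, as are the names):

#define MC_CLOBBER_PATTERN 0xaaaaaaaaUL   /* assumed hypervisor pattern */

/* Guess whether Xen XOR-clobbered the entry and, if so, undo it for the
 * debug dump.  A wrong guess only makes the dump less readable. */
static void mc_maybe_unclobber(struct multicall_entry *e)
{
        unsigned int i;

        /* Hypercall numbers are small; an implausibly large op whose
         * un-XORed value is plausible suggests the entry was clobbered. */
        if (e->op < 64 || (e->op ^ MC_CLOBBER_PATTERN) >= 64)
                return;

        e->op ^= MC_CLOBBER_PATTERN;
        for (i = 0; i < ARRAY_SIZE(e->args); i++)
                e->args[i] ^= MC_CLOBBER_PATTERN;
}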


* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-26 16:01             ` Julien Grall
  2018-11-26 16:17               ` Juergen Gross
  0 siblings, 1 reply; 22+ messages in thread
From: Julien Grall @ 2018-11-26 16:01 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, xen-devel



On 26/11/2018 15:29, Juergen Gross wrote:
> On 26/11/2018 15:58, Jan Beulich wrote:
>>>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
>>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>>>>> I don't think the hypervisor should explicitly try to make it as hard as
>>>>> possible for the guest to find problems in the code.
>>>>
>>>> That's indeed not the hypervisor's goal. Instead it tries to make
>>>> it as hard as possible for the guest (developer) to make wrong
>>>> assumptions.
>>>
>>> Let's look at the current example why I wrote this patch:
>>>
>>> The Linux kernel's use of multicalls should never trigger any single
>>> call to return an error (return value < 0). A kernel compiled for
>>> productive use will catch such errors, but has no knowledge which
>>> single call has failed, as it doesn't keep track of the single entries
>>> (non-productive kernels have an option available in the respective
>>> source to copy the entries before doing the multicall in order to have
>>> some diagnostic data available in case of such an error). Catching an
>>> error from a multicall right now means a WARN() with a stack backtrace
>>> (for the multicall itself, not for the entry causing the error).
>>>
>>> I have a customer report for a case where such a backtrace was produced
>>> and a kernel crash some seconds later, obviously due to illegally
>>> unmapped memory pages resulting from the failed multicall. Unfortunately
>>> there are multiple possibilities what might have gone wrong and I don't
>>> know which one was the culprit. The problem can't be a very common one,
>>> because there is only one such report right now, which might depend on
>>> a special driver.
>>>
>>> Finding this bug without a known reproducer and the current amount of
>>> diagnostic data is next to impossible. So I'd like to have more data
>>> available without having to hurt performance for the 99.999999% of the
>>> cases where nothing bad happens.
>>>
>>> In case you have an idea how to solve this problem in another way I'd be
>>> happy to follow that route. I'd really like to be able to have a better
>>> clue in case such an error occurs in future.
>>
>> Since you have a production kernel, I assume you also have a
>> production hypervisor. This hypervisor doesn't clobber the
>> arguments if I'm not mistaken. Therefore
>> - in the debugging scenario you (can) have all data available by
>>    virtue of the information getting copied in the kernel,
>> - in the release scenario you have all data available since it's
>>    left un-clobbered.
>> Am I missing anything (I don't view mixed debug/release setups
>> of kernel and hypervisor as overly important here)?
> 
> No, you are missing nothing here. OTOH a debug hypervisor destroying
> debug data is kind of weird, so I posted this patch.

This is a quite common approach if you want to force the other entity not to
rely on some fields. This also follows what we do for hypercalls (at least on Arm).

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-26 16:17               ` Juergen Gross
  2018-11-26 16:50                 ` Jan Beulich
       [not found]                 ` <5BFC2462020000780020022A@suse.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-26 16:17 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, xen-devel

On 26/11/2018 17:01, Julien Grall wrote:
> 
> 
> On 26/11/2018 15:29, Juergen Gross wrote:
>> On 26/11/2018 15:58, Jan Beulich wrote:
>>>>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
>>>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>>>>>> I don't think the hypervisor should explicitly try to make it as
>>>>>> hard as
>>>>>> possible for the guest to find problems in the code.
>>>>>
>>>>> That's indeed not the hypervisor's goal. Instead it tries to make
>>>>> it as hard as possible for the guest (developer) to make wrong
>>>>> assumptions.
>>>>
>>>> Let's look at the current example why I wrote this patch:
>>>>
>>>> The Linux kernel's use of multicalls should never trigger any single
>>>> call to return an error (return value < 0). A kernel compiled for
>>>> productive use will catch such errors, but has no knowledge which
>>>> single call has failed, as it doesn't keep track of the single entries
>>>> (non-productive kernels have an option available in the respective
>>>> source to copy the entries before doing the multicall in order to have
>>>> some diagnostic data available in case of such an error). Catching an
>>>> error from a multicall right now means a WARN() with a stack backtrace
>>>> (for the multicall itself, not for the entry causing the error).
>>>>
>>>> I have a customer report for a case where such a backtrace was produced
>>>> and a kernel crash some seconds later, obviously due to illegally
>>>> unmapped memory pages resulting from the failed multicall.
>>>> Unfortunately
>>>> there are multiple possibilities what might have gone wrong and I don't
>>>> know which one was the culprit. The problem can't be a very common one,
>>>> because there is only one such report right now, which might depend on
>>>> a special driver.
>>>>
>>>> Finding this bug without a known reproducer and the current amount of
>>>> diagnostic data is next to impossible. So I'd like to have more data
>>>> available without having to hurt performance for the 99.999999% of the
>>>> cases where nothing bad happens.
>>>>
>>>> In case you have an idea how to solve this problem in another way
>>>> I'd be
>>>> happy to follow that route. I'd really like to be able to have a better
>>>> clue in case such an error occurs in future.
>>>
>>> Since you have a production kernel, I assume you also have a
>>> production hypervisor. This hypervisor doesn't clobber the
>>> arguments if I'm not mistaken. Therefore
>>> - in the debugging scenario you (can) have all data available by
>>>    virtue of the information getting copied in the kernel,
>>> - in the release scenario you have all data available since it's
>>>    left un-clobbered.
>>> Am I missing anything (I don't view mixed debug/release setups
>>> of kernel and hypervisor as overly important here)?
>>
>> No, you are missing nothing here. OTOH a debug hypervisor destroying
>> debug data is kind of weird, so I posted this patch.
> 
> This is a quite common approach if you want to enforce the other entity
> to not rely on some fields. This also follows what we do for hypercalls
> (at least on Arm).

I don't question that general mechanism.

What I question is doing the clobbering in this case where the caller
would need to take special measures for copying the needed debug data
which will hit performance. So not doing the clobbering would only be
in the very rare error case, not in the common case where the guest
should really have no need to see the preserved data.

I really fail to see why it is so bad to not clobber data in a case
which normally should never occur. The only outcome of clobbering the
data in the error case is making diagnosis of that error much harder.
Its not as if there would be secret data suddenly made available to the
guest. Its just avoiding the need for the guest to copy the data for
each multicall for the very unlikely chance an error might occur. And we
are not speaking of a hypercall issued then and now, but of the path hit
for nearly every memory-management action and context switch of PV
guests. So doing always the copy would really be visible.


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-26 16:17               ` Juergen Gross
@ 2018-11-26 16:50                 ` Jan Beulich
       [not found]                 ` <5BFC2462020000780020022A@suse.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2018-11-26 16:50 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

>>> On 26.11.18 at 17:17, <jgross@suse.com> wrote:
> I really fail to see why it is so bad to not clobber data in a case
> which normally should never occur.

See Andrew's original reply. You're also clobbering on potential
success paths.

Jan




* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-27  7:37                   ` Juergen Gross
  2018-11-27  9:24                     ` Jan Beulich
  0 siblings, 1 reply; 22+ messages in thread
From: Juergen Gross @ 2018-11-27  7:37 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

On 26/11/2018 17:50, Jan Beulich wrote:
>>>> On 26.11.18 at 17:17, <jgross@suse.com> wrote:
>> I really fail to see why it is so bad to not clobber data in a case
>> which normally should never occur.
> 
> See Andrew's original reply. You're also clobbering on potential
> success paths.

I think you are missing a "not" here.

But yes, I agree. This is a downside _I_ think we can live with.

How many cases (with productive hypervisors) are known where multicall
parameters have been illegally re-used? I guess the number is exactly 0.

I know of at least one case where not clobbering would be useful for
diagnostic reasons.


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
  2018-11-27  7:37                   ` Juergen Gross
@ 2018-11-27  9:24                     ` Jan Beulich
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Beulich @ 2018-11-27  9:24 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

>>> On 27.11.18 at 08:37, <jgross@suse.com> wrote:
> On 26/11/2018 17:50, Jan Beulich wrote:
>>>>> On 26.11.18 at 17:17, <jgross@suse.com> wrote:
>>> I really fail to see why it is so bad to not clobber data in a case
>>> which normally should never occur.
>> 
>> See Andrew's original reply. You're also clobbering on potential
>> success paths.
> 
> I think you are missing a "not" here.
> 
> But yes, I agree. This is a downside _I_ think we can live with.
> 
> How many cases (with productive hypervisors) are known where multicall
> parameters have been illegally re-used? I guess the number is exactly 0.

I'm not aware of any either, but this means exactly nothing.

>> I know of at least one case where not clobbering would be useful for
>> diagnostic reasons.

Well, I'm in agreement with Julien that the behavior between ordinary
hypercalls and multicalls would better be as consistent as possible.

Jan




* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-27  9:28 Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-27  9:28 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan,
	Julien Grall, xen-devel

On 27/11/2018 10:24, Jan Beulich wrote:
>>>> On 27.11.18 at 08:37, <jgross@suse.com> wrote:
>> On 26/11/2018 17:50, Jan Beulich wrote:
>>>>>> On 26.11.18 at 17:17, <jgross@suse.com> wrote:
>>>> I really fail to see why it is so bad to not clobber data in a case
>>>> which normally should never occur.
>>>
>>> See Andrew's original reply. You're also clobbering on potential
>>> success paths.
>>
>> I think you are missing a "not" here.
>>
>> But yes, I agree. This is a downside _I_ think we can live with.
>>
>> How many cases (with productive hypervisors) are known where multicall
>> parameters have been illegally re-used? I guess the number is exactly 0.
> 
> I'm not aware of any either, but this means exactly nothing.
> 
>> I know of at least one case where not clobbering would be useful for
>> diagnostic reasons.
> 
> Well, I'm in agreement with Julien that the behavior between ordinary
> hypercalls and multicalls would better be as consistent as possible.

I will no longer push for the patch then. I still believe your decision
is based on wrong priorities, but I don't want to waste our time any
longer.


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-27  9:25 Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-27  9:25 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, xen-devel

On 27/11/2018 10:16, Julien Grall wrote:
> Hi,
> 
> On 11/27/18 7:34 AM, Juergen Gross wrote:
>> On 26/11/2018 17:51, Julien Grall wrote:
>>>
>>>
>>> On 26/11/2018 16:17, Juergen Gross wrote:
>>>> On 26/11/2018 17:01, Julien Grall wrote:
>>>>>
>>>>>
>>>>> On 26/11/2018 15:29, Juergen Gross wrote:
>>>>>> On 26/11/2018 15:58, Jan Beulich wrote:
>>>>>>>>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
>>>>>>>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>>>>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>>> Secondly, you are introducing an ABI change without explicitly telling
>>> it. This should be clarified in the public interface and probably the
>>> multicall code to avoid re-introducing the clobbering in the future.
>>
>> No, I don't think I'm introducing an ABI change. There is no guarantee
>> Xen will clobber all multicall arguments. In fact a NDEBUG hypervisor
>> never does. The guest must not rely on non-clobbered values, of course.
> 
> The last sentence is what the ABI promises you. The rest is an
> implementation decision to avoid performance impact.
>
> If you don't change the ABI, then how will the guest be able to know the
> values are correct? Furthermore, as your change is not documented, how
> do you guarantee that this will not be reverted in the future?

This is for diagnostic purposes only. Common sense should be applied when
interpreting the printed data, just as with e.g. a stack backtrace.

The guest doesn't have to know (and shouldn't) whether the data is
clobbered or not. The person trying to locate a bug in the guest will be
better able to do so if at least some of the additional data isn't
clobbered. That's the reason for the patch.


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-27  9:16 Julien Grall
  0 siblings, 0 replies; 22+ messages in thread
From: Julien Grall @ 2018-11-27  9:16 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, xen-devel

Hi,

On 11/27/18 7:34 AM, Juergen Gross wrote:
> On 26/11/2018 17:51, Julien Grall wrote:
>>
>>
>> On 26/11/2018 16:17, Juergen Gross wrote:
>>> On 26/11/2018 17:01, Julien Grall wrote:
>>>>
>>>>
>>>> On 26/11/2018 15:29, Juergen Gross wrote:
>>>>> On 26/11/2018 15:58, Jan Beulich wrote:
>>>>>>>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
>>>>>>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>>>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>> Secondly, you are introducing an ABI change without explicitly telling
>> it. This should be clarified in the public interface and probably the
>> multicall code to avoid re-introducing the clobbering in the future.
> 
> No, I don't think I'm introducing an ABI change. There is no guarantee
> Xen will clobber all multicall arguments. In fact a NDEBUG hypervisor
> never does. The guest must not rely on non-clobbered values, of course.

The last sentence is what the ABI promises you. The rest is an
implementation decision to avoid performance impact.

If you don't change the ABI, then how will the guest be able to know the
values are correct? Furthermore, as your change is not documented, how
do you guarantee that this will not be reverted in the future?

Cheers,

-- 
Julien Grall


* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-27  7:34 Juergen Gross
  0 siblings, 0 replies; 22+ messages in thread
From: Juergen Gross @ 2018-11-27  7:34 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, xen-devel

On 26/11/2018 17:51, Julien Grall wrote:
> 
> 
> On 26/11/2018 16:17, Juergen Gross wrote:
>> On 26/11/2018 17:01, Julien Grall wrote:
>>>
>>>
>>> On 26/11/2018 15:29, Juergen Gross wrote:
>>>> On 26/11/2018 15:58, Jan Beulich wrote:
>>>>>>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
>>>>>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>>>>>>>> I don't think the hypervisor should explicitly try to make it as
>>>>>>>> hard as
>>>>>>>> possible for the guest to find problems in the code.
>>>>>>>
>>>>>>> That's indeed not the hypervisor's goal. Instead it tries to make
>>>>>>> it as hard as possible for the guest (developer) to make wrong
>>>>>>> assumptions.
>>>>>>
>>>>>> Let's look at the current example why I wrote this patch:
>>>>>>
>>>>>> The Linux kernel's use of multicalls should never trigger any single
>>>>>> call to return an error (return value < 0). A kernel compiled for
>>>>>> productive use will catch such errors, but has no knowledge which
>>>>>> single call has failed, as it doesn't keep track of the single
>>>>>> entries
>>>>>> (non-productive kernels have an option available in the respective
>>>>>> source to copy the entries before doing the multicall in order to
>>>>>> have
>>>>>> some diagnostic data available in case of such an error). Catching an
>>>>>> error from a multicall right now means a WARN() with a stack
>>>>>> backtrace
>>>>>> (for the multicall itself, not for the entry causing the error).
>>>>>>
>>>>>> I have a customer report for a case where such a backtrace was
>>>>>> produced
>>>>>> and a kernel crash some seconds later, obviously due to illegally
>>>>>> unmapped memory pages resulting from the failed multicall.
>>>>>> Unfortunately
>>>>>> there are multiple possibilities what might have gone wrong and I
>>>>>> don't
>>>>>> know which one was the culprit. The problem can't be a very common
>>>>>> one,
>>>>>> because there is only one such report right now, which might
>>>>>> depend on
>>>>>> a special driver.
>>>>>>
>>>>>> Finding this bug without a known reproducer and the current amount of
>>>>>> diagnostic data is next to impossible. So I'd like to have more data
>>>>>> available without having to hurt performance for the 99.999999% of
>>>>>> the
>>>>>> cases where nothing bad happens.
>>>>>>
>>>>>> In case you have an idea how to solve this problem in another way
>>>>>> I'd be
>>>>>> happy to follow that route. I'd really like to be able to have a
>>>>>> better
>>>>>> clue in case such an error occurs in future.
>>>>>
>>>>> Since you have a production kernel, I assume you also have a
>>>>> production hypervisor. This hypervisor doesn't clobber the
>>>>> arguments if I'm not mistaken. Therefore
>>>>> - in the debugging scenario you (can) have all data available by
>>>>>     virtue of the information getting copied in the kernel,
>>>>> - in the release scenario you have all data available since it's
>>>>>     left un-clobbered.
>>>>> Am I missing anything (I don't view mixed debug/release setups
>>>>> of kernel and hypervisor as overly important here)?
>>>>
>>>> No, you are missing nothing here. OTOH a debug hypervisor destroying
>>>> debug data is kind of weird, so I posted this patch.
>>>
>>> This is a quite common approach if you want to enforce the other entity
>>> to not rely on some fields. This also follows what we do for hypercalls
>>> (at least on Arm).
>>
>> I don't question that general mechanism.
>>
>> What I question is doing the clobbering in this case where the caller
>> would need to take special measures for copying the needed debug data
>> which will hit performance. So not doing the clobbering would only be
>> in the very rare error case, not in the common case where the guest
>> should really have no need to see the preserved data.
>>
>> I really fail to see why it is so bad to not clobber data in a case
>> which normally should never occur.
>>  The only outcome of clobbering the
>> data in the error case is making diagnosis of that error much harder.
>> Its not as if there would be secret data suddenly made available to the
>> guest. Its just avoiding the need for the guest to copy the data for
>> each multicall for the very unlikely chance an error might occur. And we
>> are not speaking of a hypercall issued then and now, but of the path hit
>> for nearly every memory-management action and context switch of PV
>> guests. So doing always the copy would really be visible.
> 
> For a first, now the behavior will not be the same as when handling a
> single hypercall. I would much prefer if the behavior is kept the same
> everywhere.

Clobbering the multicall parameters in memory and clobbering the
registers of a hypercall are different things to begin with. Register
clobbering is very valuable, as it is easy to mess up register usage on
the guest side (I just had this issue in grub2, where re-using a register
value would only occur without diagnostic prints enabled, due to
inlining). I fail to see how it is possible to re-use multicall arguments
other than by explicitly programming it this way, which _is_ wrong, of
course.

> Secondly, you are introducing an ABI change without explicitly telling
> it. This should be clarified in the public interface and probably the
> multicall code to avoid re-introducing the clobbering in the future.

No, I don't think I'm introducing an ABI change. There is no guarantee
Xen will clobber all multicall arguments. In fact a NDEBUG hypervisor
never does. The guest must not rely on non-clobbered values, of course.

> But I am concerned that the conditional clobbering (i.e depending on the
> return) will be overlooked by the guest and will probably introduce some
> more interesting^weird bug.

Maybe I'm looking at it too much from the Linux kernel side.

OTOH I still believe the pros of my patch (better diagnostics in case of
rare, but actually observed, failures) outweigh the cons (theoretical
bugs which are not known today and can show up with NDEBUG hypervisors
nevertheless).

I can live with my patch being rejected, given that the NDEBUG behavior
is not to clobber the data anyway, but I still think the reasons for
rejecting the patch are based on the wrong priorities (purely theoretical
assumptions weighed above actual problems).


Juergen


* Re: [PATCH] xen: only clobber multicall elements without error
@ 2018-11-26 16:51 Julien Grall
  0 siblings, 0 replies; 22+ messages in thread
From: Julien Grall @ 2018-11-26 16:51 UTC (permalink / raw)
  To: Juergen Gross, Jan Beulich
  Cc: Stefano Stabellini, Wei Liu, Konrad Rzeszutek Wilk,
	George Dunlap, Andrew Cooper, Ian Jackson, Tim Deegan, xen-devel



On 26/11/2018 16:17, Juergen Gross wrote:
> On 26/11/2018 17:01, Julien Grall wrote:
>>
>>
>> On 26/11/2018 15:29, Juergen Gross wrote:
>>> On 26/11/2018 15:58, Jan Beulich wrote:
>>>>>>> On 26.11.18 at 15:23, <jgross@suse.com> wrote:
>>>>> On 26/11/2018 15:01, Jan Beulich wrote:
>>>>>>>>> On 26.11.18 at 14:52, <jgross@suse.com> wrote:
>>>>>>> I don't think the hypervisor should explicitly try to make it as
>>>>>>> hard as
>>>>>>> possible for the guest to find problems in the code.
>>>>>>
>>>>>> That's indeed not the hypervisor's goal. Instead it tries to make
>>>>>> it as hard as possible for the guest (developer) to make wrong
>>>>>> assumptions.
>>>>>
>>>>> Let's look at the current example why I wrote this patch:
>>>>>
>>>>> The Linux kernel's use of multicalls should never trigger any single
>>>>> call to return an error (return value < 0). A kernel compiled for
>>>>> productive use will catch such errors, but has no knowledge which
>>>>> single call has failed, as it doesn't keep track of the single entries
>>>>> (non-productive kernels have an option available in the respective
>>>>> source to copy the entries before doing the multicall in order to have
>>>>> some diagnostic data available in case of such an error). Catching an
>>>>> error from a multicall right now means a WARN() with a stack backtrace
>>>>> (for the multicall itself, not for the entry causing the error).
>>>>>
>>>>> I have a customer report for a case where such a backtrace was produced
>>>>> and a kernel crash some seconds later, obviously due to illegally
>>>>> unmapped memory pages resulting from the failed multicall.
>>>>> Unfortunately
>>>>> there are multiple possibilities what might have gone wrong and I don't
>>>>> know which one was the culprit. The problem can't be a very common one,
>>>>> because there is only one such report right now, which might depend on
>>>>> a special driver.
>>>>>
>>>>> Finding this bug without a known reproducer and the current amount of
>>>>> diagnostic data is next to impossible. So I'd like to have more data
>>>>> available without having to hurt performance for the 99.999999% of the
>>>>> cases where nothing bad happens.
>>>>>
>>>>> In case you have an idea how to solve this problem in another way
>>>>> I'd be
>>>>> happy to follow that route. I'd really like to be able to have a better
>>>>> clue in case such an error occurs in future.
>>>>
>>>> Since you have a production kernel, I assume you also have a
>>>> production hypervisor. This hypervisor doesn't clobber the
>>>> arguments if I'm not mistaken. Therefore
>>>> - in the debugging scenario you (can) have all data available by
>>>>     virtue of the information getting copied in the kernel,
>>>> - in the release scenario you have all data available since it's
>>>>     left un-clobbered.
>>>> Am I missing anything (I don't view mixed debug/release setups
>>>> of kernel and hypervisor as overly important here)?
>>>
>>> No, you are missing nothing here. OTOH a debug hypervisor destroying
>>> debug data is kind of weird, so I posted this patch.
>>
>> This is a quite common approach if you want to enforce the other entity
>> to not rely on some fields. This also follows what we do for hypercalls
>> (at least on Arm).
> 
> I don't question that general mechanism.
> 
> What I question is doing the clobbering in this case where the caller
> would need to take special measures for copying the needed debug data
> which will hit performance. So not doing the clobbering would only be
> in the very rare error case, not in the common case where the guest
> should really have no need to see the preserved data.
> 
> I really fail to see why it is so bad to not clobber data in a case
> which normally should never occur.
>  The only outcome of clobbering the
> data in the error case is making diagnosis of that error much harder.
> Its not as if there would be secret data suddenly made available to the
> guest. Its just avoiding the need for the guest to copy the data for
> each multicall for the very unlikely chance an error might occur. And we
> are not speaking of a hypercall issued then and now, but of the path hit
> for nearly every memory-management action and context switch of PV
> guests. So doing always the copy would really be visible.

For a start, the behavior will now not be the same as when handling a single
hypercall. I would much prefer the behavior to be kept the same everywhere.

Secondly, you are introducing an ABI change without explicitly telling it. This 
should be clarified in the public interface and probably the multicall code to 
avoid re-introducing the clobbering in the future.

But I am concerned that the conditional clobbering (i.e depending on the return) 
will be overlooked by the guest and will probably introduce some more 
interesting^weird bug.

Cheers,

-- 
Julien Grall

