* [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash
@ 2014-08-08 14:22 Vitaly Kuznetsov
  2014-08-08 15:03 ` Jan Beulich
  2014-08-08 15:05 ` David Vrabel
  0 siblings, 2 replies; 7+ messages in thread
From: Vitaly Kuznetsov @ 2014-08-08 14:22 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Jones, David Vrabel

When EVTCHNOP_reset is performed, the last_vcpu_id attribute is not
cleared by __evtchn_close(). If last_vcpu_id != 0 for a particular
event channel and that channel is used for event delivery (to another
vcpu) before EVTCHNOP_init_control has been done for
vcpu == last_vcpu_id, the following crash is observed:

 ...
 (XEN) Xen call trace:
 (XEN)    [<ffff82d080127785>] _spin_lock_irqsave+0x5/0x70
 (XEN)    [<ffff82d0801097db>] evtchn_fifo_set_pending+0xdb/0x370
 (XEN)    [<ffff82d080107146>] evtchn_send+0xd6/0x160
 (XEN)    [<ffff82d080107df9>] do_event_channel_op+0x6a9/0x16c0
 (XEN)    [<ffff82d0801ce800>] vmx_intr_assist+0x30/0x480
 (XEN)    [<ffff82d080219e99>] syscall_enter+0xa9/0xae

This happens because lock_old_queue() does not check that the VCPU's
control block exists, and after EVTCHNOP_reset they are all cleaned up.

I suggest we fix the issue twice: reset last_vcpu_id to 0 in
__evtchn_close() and add an appropriate check to lock_old_queue(), as a
lost event is much better than a hypervisor crash.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 xen/common/event_channel.c | 3 +++
 xen/common/event_fifo.c    | 9 +++++++++
 2 files changed, 12 insertions(+)

diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
index a7becae..67b9d53 100644
--- a/xen/common/event_channel.c
+++ b/xen/common/event_channel.c
@@ -578,6 +578,9 @@ static long __evtchn_close(struct domain *d1, int port1)
     chn1->state          = ECS_FREE;
     chn1->notify_vcpu_id = 0;
 
+    /* Reset last_vcpu_id to vcpu0 as control block can be freed */
+    chn1->last_vcpu_id = 0;
+
     xsm_evtchn_close_post(chn1);
 
  out:
diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
index 51b4ff6..e4bef80 100644
--- a/xen/common/event_fifo.c
+++ b/xen/common/event_fifo.c
@@ -61,6 +61,15 @@ static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
     for ( try = 0; try < 3; try++ )
     {
         v = d->vcpu[evtchn->last_vcpu_id];
+
+        if ( !v->evtchn_fifo )
+        {
+            gdprintk(XENLOG_ERR,
+                     "domain %d vcpu %d has no control block!\n",
+                     d->domain_id, v->vcpu_id);
+            return NULL;
+        }
+
         old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
 
         spin_lock_irqsave(&old_q->lock, *flags);
-- 
1.9.3
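
The guard added to lock_old_queue() above can be modelled outside the
hypervisor. The following is a minimal sketch, assuming simplified
stand-in structures (struct fifo_queue, struct fifo_vcpu, struct vcpu,
struct evtchn here are illustrative, not Xen's real definitions) and a
hypothetical helper find_old_queue() that omits the locking and retry
loop of the real function:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; not Xen's real definitions. */
#define MODEL_MAX_PRIORITIES 16

struct fifo_queue { int placeholder; };
struct fifo_vcpu  { struct fifo_queue queue[MODEL_MAX_PRIORITIES]; };
struct vcpu       { int vcpu_id; struct fifo_vcpu *evtchn_fifo; };
struct evtchn     { unsigned int last_vcpu_id; unsigned int last_priority; };

/*
 * Hypothetical helper: return the queue the event was last linked on,
 * or NULL when the VCPU recorded in last_vcpu_id has no control block.
 * Without the NULL check, the address computed below would be derived
 * from a NULL pointer, which is what the crash trace shows being locked.
 */
static struct fifo_queue *find_old_queue(struct vcpu **vcpus,
                                         const struct evtchn *chn)
{
    struct vcpu *v;

    assert(vcpus != NULL && chn != NULL);

    v = vcpus[chn->last_vcpu_id];
    if ( v->evtchn_fifo == NULL )
        return NULL;

    return &v->evtchn_fifo->queue[chn->last_priority];
}
```

A caller that receives NULL has to drop the event, which is the "lost
event" trade-off the commit message describes.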


* Re: [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash
  2014-08-08 14:22 [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash Vitaly Kuznetsov
@ 2014-08-08 15:03 ` Jan Beulich
  2014-08-08 15:05 ` David Vrabel
  1 sibling, 0 replies; 7+ messages in thread
From: Jan Beulich @ 2014-08-08 15:03 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: xen-devel, Andrew Jones, David Vrabel

>>> On 08.08.14 at 16:22, <vkuznets@redhat.com> wrote:
> --- a/xen/common/event_fifo.c
> +++ b/xen/common/event_fifo.c
> @@ -61,6 +61,15 @@ static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
>      for ( try = 0; try < 3; try++ )
>      {
>          v = d->vcpu[evtchn->last_vcpu_id];
> +
> +        if ( !v->evtchn_fifo )
> +        {
> +            gdprintk(XENLOG_ERR,
> +                     "domain %d vcpu %d has no control block!\n",
> +                     d->domain_id, v->vcpu_id);

But not gdprintk() please (it also prints the current vCPU/domain,
confusing everyone looking at such a message), and make use of %pv.

Jan
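
To illustrate the request above: the sketch below builds the message
with the affected VCPU's identity only, with no "current" domain/vCPU
prefix. It assumes that Xen's %pv format renders a struct vcpu pointer
as d<domain>v<vcpu>; the helper name format_missing_block() and the
use of standard snprintf() are stand-ins, since %pv itself is not
available outside the hypervisor's printk.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the diagnostic the way a %pv-based
 * printk would be expected to (assumption: "d<domid>v<vcpuid>").
 * Returns the value of snprintf().
 */
static int format_missing_block(char *buf, size_t len,
                                int domain_id, int vcpu_id)
{
    assert(buf != NULL && len > 0);

    return snprintf(buf, len, "d%dv%d has no control block\n",
                    domain_id, vcpu_id);
}
```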


* Re: [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash
  2014-08-08 14:22 [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash Vitaly Kuznetsov
  2014-08-08 15:03 ` Jan Beulich
@ 2014-08-08 15:05 ` David Vrabel
  2014-08-08 15:17   ` Vitaly Kuznetsov
  1 sibling, 1 reply; 7+ messages in thread
From: David Vrabel @ 2014-08-08 15:05 UTC (permalink / raw)
  To: Vitaly Kuznetsov, xen-devel; +Cc: Andrew Jones

On 08/08/14 15:22, Vitaly Kuznetsov wrote:
> When EVTCHNOP_reset is being performed last_vcpu_id attribute is not being
> cleaned by __evtchn_close(). In case last_vcpu_id != 0 for a particular
> event channel and this event channel is going to be used for event delivery
> (for another vcpu) before EVTCHNOP_init_control for vcpu == last_vcpu_id
> was done the following crash is observed:
> 
>  ...
>  (XEN) Xen call trace:
>  (XEN)    [<ffff82d080127785>] _spin_lock_irqsave+0x5/0x70
>  (XEN)    [<ffff82d0801097db>] evtchn_fifo_set_pending+0xdb/0x370
>  (XEN)    [<ffff82d080107146>] evtchn_send+0xd6/0x160
>  (XEN)    [<ffff82d080107df9>] do_event_channel_op+0x6a9/0x16c0
>  (XEN)    [<ffff82d0801ce800>] vmx_intr_assist+0x30/0x480
>  (XEN)    [<ffff82d080219e99>] syscall_enter+0xa9/0xae
> 
> This happens because lock_old_queue() does not check VCPU's control
> block existence and after EVTCHNOP_reset they are all cleaned.
> 
> I suggest we fix the issue twice: reset last_vcpu_id to 0 in __evtchn_close()
> and add appropriate check to lock_old_queue() as lost event is much better
> than hypervisor crash.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  xen/common/event_channel.c | 3 +++
>  xen/common/event_fifo.c    | 9 +++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
> index a7becae..67b9d53 100644
> --- a/xen/common/event_channel.c
> +++ b/xen/common/event_channel.c
> @@ -578,6 +578,9 @@ static long __evtchn_close(struct domain *d1, int port1)
>      chn1->state          = ECS_FREE;
>      chn1->notify_vcpu_id = 0;
>  
> +    /* Reset last_vcpu_id to vcpu0 as control block can be freed */
> +    chn1->last_vcpu_id = 0;

This is broken if the event channel is closed and rebound while the
event is linked.

You can only safely clear chn->last_vcpu_id during evtchn_fifo_destroy().

You also need to clear last_priority.

> +
>      xsm_evtchn_close_post(chn1);
>  
>   out:
> diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
> index 51b4ff6..e4bef80 100644
> --- a/xen/common/event_fifo.c
> +++ b/xen/common/event_fifo.c
> @@ -61,6 +61,15 @@ static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
>      for ( try = 0; try < 3; try++ )
>      {
>          v = d->vcpu[evtchn->last_vcpu_id];
> +
> +        if ( !v->evtchn_fifo )
> +        {
> +            gdprintk(XENLOG_ERR,
> +                     "domain %d vcpu %d has no control block!\n",
> +                     d->domain_id, v->vcpu_id);
> +            return NULL;
> +        }

I think this check needs to be in evtchn_fifo_init() to prevent the
event from being bound to VCPU that does not have a control block.

> +
>          old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
>  
>          spin_lock_irqsave(&old_q->lock, *flags);

David
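
The point about clearing both fields once the FIFO state is destroyed
can be sketched as a loop over all channels. This is only a model:
struct chan_fifo_state and clear_last_queue_info() are hypothetical
names, and resetting last_priority to 0 is an assumption (the thread
does not say which value it should take).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-channel state; not Xen's struct evtchn. */
struct chan_fifo_state {
    unsigned int last_vcpu_id;
    unsigned int last_priority;
};

/*
 * Forget where each channel was last queued. Meant to run only after
 * the FIFO state has been torn down, so that no event can still be
 * linked on the queue these fields refer to.
 */
static void clear_last_queue_info(struct chan_fifo_state *chns, size_t nr)
{
    assert(chns != NULL || nr == 0);

    for ( size_t i = 0; i < nr; i++ )
    {
        chns[i].last_vcpu_id = 0;
        chns[i].last_priority = 0; /* assumed reset value */
    }
}
```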


* Re: [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash
  2014-08-08 15:05 ` David Vrabel
@ 2014-08-08 15:17   ` Vitaly Kuznetsov
  2014-08-08 17:20     ` David Vrabel
  0 siblings, 1 reply; 7+ messages in thread
From: Vitaly Kuznetsov @ 2014-08-08 15:17 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Andrew Jones

David Vrabel <david.vrabel@citrix.com> writes:

> On 08/08/14 15:22, Vitaly Kuznetsov wrote:
>> When EVTCHNOP_reset is being performed last_vcpu_id attribute is not being
>> cleaned by __evtchn_close(). In case last_vcpu_id != 0 for a particular
>> event channel and this event channel is going to be used for event delivery
>> (for another vcpu) before EVTCHNOP_init_control for vcpu == last_vcpu_id
>> was done the following crash is observed:
>> 
>>  ...
>>  (XEN) Xen call trace:
>>  (XEN)    [<ffff82d080127785>] _spin_lock_irqsave+0x5/0x70
>>  (XEN)    [<ffff82d0801097db>] evtchn_fifo_set_pending+0xdb/0x370
>>  (XEN)    [<ffff82d080107146>] evtchn_send+0xd6/0x160
>>  (XEN)    [<ffff82d080107df9>] do_event_channel_op+0x6a9/0x16c0
>>  (XEN)    [<ffff82d0801ce800>] vmx_intr_assist+0x30/0x480
>>  (XEN)    [<ffff82d080219e99>] syscall_enter+0xa9/0xae
>> 
>> This happens because lock_old_queue() does not check VCPU's control
>> block existence and after EVTCHNOP_reset they are all cleaned.
>> 
>> I suggest we fix the issue twice: reset last_vcpu_id to 0 in __evtchn_close()
>> and add appropriate check to lock_old_queue() as lost event is much better
>> than hypervisor crash.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  xen/common/event_channel.c | 3 +++
>>  xen/common/event_fifo.c    | 9 +++++++++
>>  2 files changed, 12 insertions(+)
>> 
>> diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c
>> index a7becae..67b9d53 100644
>> --- a/xen/common/event_channel.c
>> +++ b/xen/common/event_channel.c
>> @@ -578,6 +578,9 @@ static long __evtchn_close(struct domain *d1, int port1)
>>      chn1->state          = ECS_FREE;
>>      chn1->notify_vcpu_id = 0;
>>  
>> +    /* Reset last_vcpu_id to vcpu0 as control block can be freed */
>> +    chn1->last_vcpu_id = 0;
>
> This is broken if the event channel is closed and rebound while the
> event is linked.
>
> You can only safely clear chn->last_vcpu_id during evtchn_fifo_destroy().
>
> You also need to clear last_priority.
>

Thanks. Alternatively, I can do that in evtchn_reset() after
evtchn_fifo_destroy(), as it is the only path leading to the issue. I
wanted to avoid that so as not to add another loop over all event
channels.

>> +
>>      xsm_evtchn_close_post(chn1);
>>  
>>   out:
>> diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c
>> index 51b4ff6..e4bef80 100644
>> --- a/xen/common/event_fifo.c
>> +++ b/xen/common/event_fifo.c
>> @@ -61,6 +61,15 @@ static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
>>      for ( try = 0; try < 3; try++ )
>>      {
>>          v = d->vcpu[evtchn->last_vcpu_id];
>> +
>> +        if ( !v->evtchn_fifo )
>> +        {
>> +            gdprintk(XENLOG_ERR,
>> +                     "domain %d vcpu %d has no control block!\n",
>> +                     d->domain_id, v->vcpu_id);
>> +            return NULL;
>> +        }
>
> I think this check needs to be in evtchn_fifo_init() to prevent the
> event from being bound to VCPU that does not have a control block.
>

I *think* that is not the issue here: the event is being bound to a VCPU
whose control block is initialized. But last_vcpu_id for this particular
event channel points to some other VCPU which has not initialized its
control block yet (so d->vcpu[evtchn->last_vcpu_id]->evtchn_fifo is
NULL). There should be no path into such a situation (once we clear
last_vcpu_id); I just wanted to put a reasonable message here in case
something changes in the future.

>> +
>>          old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
>>  
>>          spin_lock_irqsave(&old_q->lock, *flags);
>
> David

-- 
  Vitaly


* Re: [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash
  2014-08-08 15:17   ` Vitaly Kuznetsov
@ 2014-08-08 17:20     ` David Vrabel
  2014-08-11  9:01       ` Vitaly Kuznetsov
  2014-08-11 10:03       ` Vitaly Kuznetsov
  0 siblings, 2 replies; 7+ messages in thread
From: David Vrabel @ 2014-08-08 17:20 UTC (permalink / raw)
  To: Vitaly Kuznetsov; +Cc: xen-devel, Andrew Jones

On 08/08/14 16:17, Vitaly Kuznetsov wrote:
> I *think* it is not the issue here - the event is being bound to VCPU
> with this block initialized. But last_vcpu_id for this particular event
> channel points to some other VCPU which has not initialized its control
> block yet (so d->vcpu[evtchn->last_vcpu_id]->evtchn_fifo is NULL). There
> is no path to get in such situation (after we clear last_vcpu_id), I
> just wanted to put reasonable message here in case something will change
> in future.

Then evtchn_fifo_init() needs to check that both the new VCPU and the
VCPU named by last_vcpu_id have control blocks.

I much prefer failing the bind up front to detecting the problem later.

David
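
A minimal sketch of the up-front check described above, assuming
simplified stand-in types (struct vcpu_model, struct evtchn_model) and
a hypothetical fifo_bind_check() that would run at bind time; the use
of -EINVAL as the failure code is also an assumption:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; not Xen's real structures. */
struct vcpu_model   { const void *control_block; };
struct evtchn_model { unsigned int notify_vcpu_id;
                      unsigned int last_vcpu_id; };

/*
 * Refuse the bind unless both the target VCPU and the VCPU the channel
 * was last queued on have a control block. Returns 0 on success and
 * -EINVAL otherwise.
 */
static int fifo_bind_check(const struct vcpu_model *vcpus, size_t nr_vcpus,
                           const struct evtchn_model *chn)
{
    assert(vcpus != NULL && chn != NULL);

    if ( chn->notify_vcpu_id >= nr_vcpus || chn->last_vcpu_id >= nr_vcpus )
        return -EINVAL;

    if ( vcpus[chn->notify_vcpu_id].control_block == NULL )
        return -EINVAL;

    if ( vcpus[chn->last_vcpu_id].control_block == NULL )
        return -EINVAL;

    return 0;
}
```

Failing here moves the error to the bind, so the delivery path never
sees a VCPU without a control block.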


* Re: [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash
  2014-08-08 17:20     ` David Vrabel
@ 2014-08-11  9:01       ` Vitaly Kuznetsov
  2014-08-11 10:03       ` Vitaly Kuznetsov
  1 sibling, 0 replies; 7+ messages in thread
From: Vitaly Kuznetsov @ 2014-08-11  9:01 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Andrew Jones

David Vrabel <david.vrabel@citrix.com> writes:

> Then evtchn_fifo_init() needs to check both the new VCPU and
> last_vcpu_id have control blocks.
>
> I much prefer failing the bind up front than detecting the problem later.
>

Can we assume that VCPU0 always has its control block initialized (as we
need to reset notify_vcpu_id to something in case it points to a VCPU
which does not have its control block)? Switching to FIFO ABI implies
initializing at least one control block (as it's done from
evtchn_fifo_init_control()) but *in theory* it can be any VCPU, not only
VCPU0.

Alternatively, we can search for the first VCPU with an initialized
control block and redirect notify_vcpu_id there. last_vcpu_id can then
always be reset to notify_vcpu_id, as it is already checked.

-- 
  Vitaly
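
The alternative floated above (find the first VCPU with a control
block, redirect notify_vcpu_id there, then reset last_vcpu_id to match)
could look roughly like this. All names (struct vcpu_slot,
struct chan_binding, first_initialized_vcpu(), redirect_binding()) are
hypothetical, and this is an exploratory sketch, not an approach the
thread settles on:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not Xen's real structures. */
struct vcpu_slot    { const void *control_block; };
struct chan_binding { unsigned int notify_vcpu_id;
                      unsigned int last_vcpu_id; };

/* Index of the first VCPU with a control block, or -1 if none has one. */
static int first_initialized_vcpu(const struct vcpu_slot *vcpus, size_t nr)
{
    for ( size_t i = 0; i < nr; i++ )
        if ( vcpus[i].control_block != NULL )
            return (int)i;

    return -1;
}

/*
 * If the channel's target VCPU has no control block, redirect it to the
 * first VCPU that has one. last_vcpu_id then follows notify_vcpu_id,
 * which is known to be valid. Returns 0 on success, -1 if no VCPU has
 * a control block.
 */
static int redirect_binding(const struct vcpu_slot *vcpus, size_t nr,
                            struct chan_binding *chn)
{
    assert(vcpus != NULL && chn != NULL);

    if ( chn->notify_vcpu_id >= nr ||
         vcpus[chn->notify_vcpu_id].control_block == NULL )
    {
        int target = first_initialized_vcpu(vcpus, nr);

        if ( target < 0 )
            return -1;

        chn->notify_vcpu_id = (unsigned int)target;
    }

    chn->last_vcpu_id = chn->notify_vcpu_id;
    return 0;
}
```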


* Re: [PATCH] evtchn: clean last_vcpu_id on EVTCHNOP_reset to avoid crash
  2014-08-08 17:20     ` David Vrabel
  2014-08-11  9:01       ` Vitaly Kuznetsov
@ 2014-08-11 10:03       ` Vitaly Kuznetsov
  1 sibling, 0 replies; 7+ messages in thread
From: Vitaly Kuznetsov @ 2014-08-11 10:03 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Andrew Jones

David Vrabel <david.vrabel@citrix.com> writes:

> Then evtchn_fifo_init() needs to check both the new VCPU and
> last_vcpu_id have control blocks.
>
> I much prefer failing the bind up front than detecting the problem later.

I discovered an issue with this approach: in the current Linux kernel,
xen_setup_timer() is called earlier than EVTCHNOP_init_control for all
secondary VCPUs (but after VCPU0 has switched the ABI to FIFO). This
means evtchn_fifo_init() is called for a VIRQ event channel whose
notify_vcpu_id points to a VCPU with an uninitialized control block. And
if we reset it to VCPU0 we'll break everything...

-- 
  Vitaly

