* Re: Issue policing writes from Xen to PV domain memory
@ 2014-05-08 22:58 Joe Epstein
  0 siblings, 0 replies; 28+ messages in thread
From: Joe Epstein @ 2014-05-08 22:58 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Tim Deegan (tim@xen.org), Aravindh Puthiyaparambil (aravindp)


Hi Jan,

It's a good question.  Capturing Xen-initiated access would be good for
some use cases, but I had thought of that as a future enhancement.

Joe


On Wed, May 7, 2014 at 11:26 PM, Jan Beulich <JBeulich@suse.com> wrote:

> >>> On 07.05.14 at 21:37, <aravindp@cisco.com> wrote:
> > I dug in further to figure out if there is any difference between HVM
> and PV
> > domains with policing writes emanating from Xen. I started with how the
> > runstate area in the guest is updated. It is done using
> __copy_to_guest().
> > Here is the flow for PV and HVM.
> >
> > For PV:
> > __copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest
> >
> > I think in the above scenario, the page permissions that are present in
> the
> > shadow are adhered to for PV domains running with shadow and hence
> faults can
> > occur.
> >
> > For HVM:
> > __copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest ->
> > copy_to_user_hvm -> hvm_copy_to_guest_virt_nofault  -> __hvm_copy(flags =
> > HVMCOPY_to_guest | HVMCOPY_no_fault | HVMCOPY_virt)
> >
> > If I look in __hvm_copy(), I see that access permissions are not adhered
> to.
> > Writes to guest memory will go through even if the p2m_access type for
> that
> > page has it set as non-writable.  So it seems that we do not police
> writes to
> > guest memory that emanate from Xen even for the HVM case. Is my reading
> of
> > the code correct?
>
> It would seem so, but the question (including whether actual behavior
> matches intentions) really ought to be answered by the original author
> of the code - let's see if he's still around under his email address from
> back then... Joe?
>
> Jan
>
>
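
To make the asymmetry above concrete, here is a rough C sketch of the two write paths. This is illustrative only, not actual Xen code; the mapping helpers are invented stand-ins. For PV, the hypervisor's write goes through the CPU's MMU, so shadow PTE rights apply and the write can raise #PF; for HVM, __hvm_copy() translates the guest address in software and writes through a hypervisor mapping, so the p2m_access_t stored for the page is never consulted.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for Xen's domain-page mapping API. */
extern void *map_guest_frame(uint64_t gfn);
extern void unmap_guest_frame(void *p);

/*
 * PV-style: a direct CPU write through a virtual mapping whose rights
 * come from the shadow page tables, so the MMU can fault on it.
 */
static void pv_style_write(void *guest_va, const void *src, size_t n)
{
    memcpy(guest_va, src, n);   /* shadow PTE rights enforced; #PF possible */
}

/*
 * HVM-style: the guest address is translated in software, the frame is
 * mapped into Xen, and the write lands on that mapping; the recorded
 * p2m_access_t is never checked, so no violation is ever reported.
 */
static void hvm_style_write(uint64_t gfn, const void *src, size_t n)
{
    void *p = map_guest_frame(gfn);
    memcpy(p, src, n);          /* software walk; p2m_access_t not consulted */
    unmap_guest_frame(p);
}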


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-28 20:54                                   ` Jan Beulich
@ 2014-05-28 21:10                                     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-28 21:10 UTC (permalink / raw)
  To: Jan Beulich, andreslc; +Cc: xen-devel, tim

>> I am re-opening this thread due to "Xen 4.5 Development update"
>> thread. I was under the impression that you wanted to keep PV on par
>> with HVM and not allow policing of Xen writes to guest memory. So I
>> continued down that path by detecting if the write is coming from Xen
>> and allowing the write fault to be handled gracefully and not passed
>> to the mem_access listener. Please let me know if you are fine with this approach.
>
>Yes, I'm fine with the approach. Without knowing how you intend to achieve
>the graceful handling of write faults from the hypervisor, I obviously can't say
>whether whatever you're doing is going to be acceptable...

OK, I will send out an RFC v2 in the next couple of days to get your feedback about this.

Thanks,
Aravindh


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-28 19:03                                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-28 20:54                                   ` Jan Beulich
  2014-05-28 21:10                                     ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2014-05-28 20:54 UTC (permalink / raw)
  To: aravindp, andreslc; +Cc: xen-devel, tim

>>> "Aravindh Puthiyaparambil (aravindp)" <aravindp@cisco.com> 05/28/14 9:04 PM >>>
> I am re-opening this thread due to "Xen 4.5 Development update" thread. I was
> under the impression that you wanted to keep PV on par with HVM and not
> allow policing of Xen writes to guest memory. So I continued down that path by
> detecting if the write is coming from Xen and allowing the write fault to be handled
> gracefully and not passed to the mem_access listener. Please let me know if you
> are fine with this approach.

Yes, I'm fine with the approach. Without knowing how you intend to achieve the
graceful handling of write faults from the hypervisor, I obviously can't say whether
whatever you're doing is going to be acceptable...

Jan


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-16 16:21                               ` Andres Lagar-Cavilla
  2014-05-16 21:16                                 ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-28 19:03                                 ` Aravindh Puthiyaparambil (aravindp)
  2014-05-28 20:54                                   ` Jan Beulich
  1 sibling, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-28 19:03 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, Jan Beulich; +Cc: xen-devel, Tim Deegan (tim@xen.org)

>If you want to single step ... well that goes to the point Jan raises below. You
>can't single-step the hypervisor. I guess the right question to ask is "What
>would GDB do?". Imagine you are tracing a user-space process with shared
>memory and a separate process starts changing that shared memory?
>
>I think that spells the answer.
>
>We have the luxury of defining this interface, and the responsibility to keep it
>as stable as possible. I think that for extending to PV we should either (i) live
>with this event merge and not pretend we can even remotely single-step
>hypervisor accesses to shared memory -- but keep in mind the potential for
>ring exhaustion exists; or (ii) discard altogether accesses by Xen to shared
>memory from the set of trackable operations.
>
>Aravindh, is there a reason not to choose (ii)?
>> And no, I have no good idea about how to deal with the situation. I
>> continue to think that this is the wrong overall approach though, due
>> to the resultant inconsistency with HVM (where Xen writes are being
>> ignored), i.e. I believe you ought to rather think about making these
>> not fault, or deal with the faults gracefully and without needing to
>> emulate the instructions. One possibility I can see would be to clear
>> CR0.WP in the fault handler (and set a flag that this was done),
>> restoring it on the path back out of guest memory access function(s).
>> But that of course implies that the situation can only ever arise for
>> writes. And it would need to be done extremely carefully to make
>> sure you don't inadvertently allow writes to regions that are truly
>> write protected.
>
>As per above, we could forsake this and choose (ii).
>
>>> PS: I did try running with a hacked up version where the max ring buffer
>>> entries (nr_ents) is 1 and I could not make this situation happen but I guess
>>> it is still theoretically possible. Or am I mistaken?
>
>> No, you aren't - if the ring doesn't get consumed from, it will
>> unavoidably become full at some point.
>
>Yes the fundamental problem (with PV) remains, if we choose (i).
>
>Andres
>
>I am fine with going with option (ii) and keep things consistent with HVM.
>
>Jan, I will resubmit this patch with the other changes you and Tim asked for.

Jan,

I am re-opening this thread due to "Xen 4.5 Development update" thread. I was under the impression that you wanted to keep PV on par with HVM and not allow policing of Xen writes to guest memory. So I continued down that path by detecting if the write is coming from Xen and allowing the write fault to be handled gracefully and not passed to the mem_access listener. Please let me know if you are fine with this approach.

Thanks,
Aravindh


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-16 16:21                               ` Andres Lagar-Cavilla
@ 2014-05-16 21:16                                 ` Aravindh Puthiyaparambil (aravindp)
  2014-05-28 19:03                                 ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-16 21:16 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, Jan Beulich; +Cc: xen-devel, Tim Deegan (tim@xen.org)



>>> I took your advice and tried to stop the second invocation of wait_event()
>>> by using the method suggested by Andres which is to merge the recursive
>>> faults.
>>>
>>> <quote>
>>> 1. read last unconsumed
>>> 2. compare
>>> 3. different? write into ring
>>> 4. same? atomic_read consumer index to ensure still unconsumed
>>> 5. consumed? write into ring <still a possible race below, but not a tragedy>
>>> 6. not consumed? drop…
>>> <end quote>
>>
>> Whether that's safe/correct/... I have to leave to Andres; it doesn't
>> feel well to do things like that.
>
> I'm neither thrilled nor disgusted. If you want to keep track of the first time something happens, this is perfectly fine.
>
> If you want to single step ... well that goes to the point Jan raises below. You can't single-step the hypervisor. I guess the right question to ask is "What would GDB do?". Imagine you are tracing a user-space process with shared memory and a separate process starts changing that shared memory?
>
> I think that spells the answer.
>
> We have the luxury of defining this interface, and the responsibility to keep it as stable as possible. I think that for extending to PV we should either (i) live with this event merge and not pretend we can even remotely single-step hypervisor accesses to shared memory -- but keep in mind the potential for ring exhaustion exists; or (ii) discard altogether accesses by Xen to shared memory from the set of trackable operations.
>
> Aravindh, is there a reason not to choose (ii)?
>
>>> +bool_t mem_event_check_duplicate(struct mem_event_domain *med,
>>> +                                 mem_event_request_t *req)
>>> +{
>>> +    mem_event_front_ring_t *front_ring;
>>> +    mem_event_request_t *ureq;
>>> +    RING_IDX req_event;
>>> +    bool_t rc = 0;
>>> +
>>> +    mem_event_ring_lock(med);
>>> +
>>> +    /* Due to the reservations, this step must succeed. */
>>> +    front_ring = &med->front_ring;
>>> +
>>> +    /* Index of last unconsumed request */
>>> +    req_event = front_ring->sring->req_event - 1;
>>> +    ureq = RING_GET_REQUEST(front_ring, req_event);
>>> +
>>> +    if ( req->gfn != ureq->gfn )
>>> +        goto out;
>>> +    if ( current->vcpu_id != ureq->vcpu_id )
>>> +        goto out;
>>> +    if ( req->access_r != ureq->access_r )
>>> +        goto out;
>>> +    if ( req->access_w != ureq->access_w )
>>> +        goto out;
>>> +    if ( req->access_x != ureq->access_x )
>>> +        goto out;
>>> +
>>> +    /* Check consumer index has not moved */
>>> +    if ( req_event == read_atomic(&front_ring->sring->req_event) - 1 )
>>> +        rc = 1;
>>
>> This should all be one big if() imo.
>
>>> With this I am able to prevent recursive identical faults in the runstate
>>> area causing the assert to fire. However there still exists a corner case. I
>>> still do not know what to do about the situation when the ring is full, the
>>> last unconsumed event was not for the runstate area, the runstate area is
>>> marked not writable and the guest or xen writes to it. I _think_ this would
>>> again cause the assert to fire. Can you give me some guidance as to what to
>>> do in this situation? Should I corner case the runstate area?
>>
>> No, that wouldn't help as that's not the only thing Xen writes to guest
>> memory.
>>
>> And no, I have no good idea about how to deal with the situation. I
>> continue to think that this is the wrong overall approach though, due
>> to the resultant inconsistency with HVM (where Xen writes are being
>> ignored), i.e. I believe you ought to rather think about making these
>> not fault, or deal with the faults gracefully and without needing to
>> emulate the instructions. One possibility I can see would be to clear
>> CR0.WP in the fault handler (and set a flag that this was done),
>> restoring it on the path back out of guest memory access function(s).
>> But that of course implies that the situation can only ever arise for
>> writes. And it would need to be done extremely carefully to make
>> sure you don't inadvertently allow writes to regions that are truly
>> write protected.
>
> As per above, we could forsake this and choose (ii).
>
>>> PS: I did try running with a hacked up version where the max ring buffer
>>> entries (nr_ents) is 1 and I could not make this situation happen but I guess
>>> it is still theoretically possible. Or am I mistaken?
>>
>> No, you aren't - if the ring doesn't get consumed from, it will
>> unavoidably become full at some point.
>
> Yes the fundamental problem (with PV) remains, if we choose (i).
>
> Andres

I am fine with going with option (ii) and keep things consistent with HVM.

Jan, I will resubmit this patch with the other changes you and Tim asked for.

Thanks,
Aravindh
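
For readers who want to poke at Andres's merge idea outside Xen, below is a minimal, self-contained model of the compare-then-recheck pattern that mem_event_check_duplicate() uses, written with C11 atomics. The ring and request layouts are simplified stand-ins, not Xen's. The point of the final re-read is that a consumer racing past between the comparison and the re-check can only demote a real duplicate to a non-duplicate, i.e. cost one extra event, never lose one.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct req { unsigned gfn, vcpu, access_w; };

struct ring {
    struct req slot[8];
    _Atomic unsigned cons;              /* stand-in for sring->req_event */
};

/*
 * True if 'r' duplicates the last unconsumed request and that request
 * is still unconsumed after the comparison.
 */
static bool is_duplicate(struct ring *rg, const struct req *r)
{
    unsigned idx = atomic_load(&rg->cons) - 1;      /* last unconsumed */
    const struct req *u = &rg->slot[idx & 7];

    if ( r->gfn != u->gfn || r->vcpu != u->vcpu || r->access_w != u->access_w )
        return false;

    /* Re-check that the consumer index has not moved meanwhile. */
    return idx == atomic_load(&rg->cons) - 1;
}

int main(void)
{
    struct ring rg = { .cons = 1 };
    rg.slot[0] = (struct req){ .gfn = 0x134eb, .vcpu = 0, .access_w = 1 };
    struct req again = rg.slot[0];

    printf("duplicate: %d\n", is_duplicate(&rg, &again));   /* prints 1 */
    return 0;
}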



* Re: Issue policing writes from Xen to PV domain memory
  2014-05-16  6:48                             ` Jan Beulich
@ 2014-05-16 16:21                               ` Andres Lagar-Cavilla
  2014-05-16 21:16                                 ` Aravindh Puthiyaparambil (aravindp)
  2014-05-28 19:03                                 ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 2 replies; 28+ messages in thread
From: Andres Lagar-Cavilla @ 2014-05-16 16:21 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Tim Deegan (tim@xen.org), Aravindh Puthiyaparambil (aravindp)



On Fri, May 16, 2014 at 2:48 AM, Jan Beulich <JBeulich@suse.com> wrote:

> >>> On 16.05.14 at 06:11, <aravindp@cisco.com> wrote:
> > I took your advice and tried to stop the second invocation of
> wait_event()
> > by using the method suggested by Andres which is to merge the recursive
> > faults.
> >
> > <quote>
> > 1. read last unconsumed
> > 2. compare
> > 3. different? write into ring
> > 4. same? atomic_read consumer index to ensure still unconsumed
> > 5. consumed? write into ring <still a possible race below, but not a
> tragedy>
> > 6. not consumed? drop…
> > <end quote>
>
> Whether that's safe/correct/... I have to leave to Andres; it doesn't
> feel well to do things like that.
>
I'm neither thrilled nor disgusted. If you want to keep track of the first
time something happens, this is perfectly fine.

If you want to single step ... well that goes to the point Jan raises
below. You can't single-step the hypervisor. I guess the right question to
ask is "What would GDB do?". Imagine you are tracing a user-space process
with shared memory and a separate process starts changing that shared
memory?

I think that spells the answer.

We have the luxury of defining this interface, and the responsibility to
keep it as stable as possible. I think that for extending to PV we should
either (i) live with this event merge and not pretend we can even remotely
single-step hypervisor accesses to shared memory -- but keep in mind the
potential for ring exhaustion exists; or (ii) discard altogether accesses
by Xen to shared memory from the set of trackable operations.

Aravindh, is there a reason not to choose (ii)?


>
> > +bool_t mem_event_check_duplicate(struct mem_event_domain *med,
> > +                                 mem_event_request_t *req)
> > +{
> > +    mem_event_front_ring_t *front_ring;
> > +    mem_event_request_t *ureq;
> > +    RING_IDX req_event;
> > +    bool_t rc = 0;
> > +
> > +    mem_event_ring_lock(med);
> > +
> > +    /* Due to the reservations, this step must succeed. */
> > +    front_ring = &med->front_ring;
> > +
> > +    /* Index of last unconsumed request */
> > +    req_event = front_ring->sring->req_event - 1;
> > +    ureq = RING_GET_REQUEST(front_ring, req_event);
> > +
> > +    if ( req->gfn != ureq->gfn )
> > +        goto out;
> > +    if ( current->vcpu_id != ureq->vcpu_id )
> > +        goto out;
> > +    if ( req->access_r != ureq->access_r )
> > +        goto out;
> > +    if ( req->access_w != ureq->access_w )
> > +        goto out;
> > +    if ( req->access_x != ureq->access_x )
> > +        goto out;
> > +
> > +    /* Check consumer index has not moved */
> > +    if ( req_event == read_atomic(&front_ring->sring->req_event) - 1 )
> > +        rc = 1;
>
> This should all be one big if() imo.
>
> > With this I am able to prevent recursive identical faults in the runstate
> > area causing the assert to fire. However there still exists a corner
> case. I
> > still do not know what to do about the situation when the ring is full,
> the
> > last unconsumed event was not for the runstate area, the runstate area is
> > marked not writable and the guest or xen writes to it. I _think_ this
> would
> > again cause the assert to fire. Can you give me some guidance as to what
> to
> > do in this situation? Should I corner case the runstate area?
>
> No, that wouldn't help as that's not the only thing Xen writes to guest
> memory.
>
> And no, I have no good idea about how to deal with the situation. I
> continue to think that this is the wrong overall approach though, due
> to the resultant inconsistency with HVM (where Xen writes are being
> ignored), i.e. I believe you ought to rather think about making these
> not fault, or deal with the faults gracefully and without needing to
> emulate the instructions. One possibility I can see would be to clear
> CR0.WP in the fault handler (and set a flag that this was done),
> restoring it on the path back out of guest memory access function(s).
> But that of course implies that the situation can only ever arise for
> writes. And it would need to be done extremely carefully to make
> sure you don't inadvertently allow writes to regions that are truly
> write protected.
>
As per above, we could forsake this and choose (ii).

>
> > PS: I did try running with a hacked up version where the max ring buffer
> > entries (nr_ents) is 1 and I could not make this situation happen but I
> guess
> > it is still theoretically possible. Or am I mistaken?
>
> No, you aren't - if the ring doesn't get consumed from, it will
> unavoidably become full at some point.
>
Yes the fundamental problem (with PV) remains, if we choose (i).
Andres


> Jan
>



-- 
Scalable, Efficient, Instant-On Virtualization

Andres Lagar-Cavilla
Chief Scientist
Phone: 647-778-4380
Email: andres@gridcentric.com


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-16  4:11                           ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-16  6:48                             ` Jan Beulich
  2014-05-16 16:21                               ` Andres Lagar-Cavilla
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2014-05-16  6:48 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp)
  Cc: Andres Lagar-Cavilla (andreslc@gridcentric.ca),
	Tim Deegan (tim@xen.org),
	xen-devel

>>> On 16.05.14 at 06:11, <aravindp@cisco.com> wrote:
> I took your advice and tried to stop the second invocation of wait_event() 
> by using the method suggested by Andres which is to merge the recursive 
> faults.
> 
> <quote>
> 1. read last unconsumed
> 2. compare
> 3. different? write into ring
> 4. same? atomic_read consumer index to ensure still unconsumed 
> 5. consumed? write into ring <still a possible race below, but not a tragedy> 
> 6. not consumed? drop…
> <end quote>

Whether that's safe/correct/... I have to leave to Andres; it doesn't
feel well to do things like that.

> +bool_t mem_event_check_duplicate(struct mem_event_domain *med,
> +                                 mem_event_request_t *req)
> +{
> +    mem_event_front_ring_t *front_ring;
> +    mem_event_request_t *ureq;
> +    RING_IDX req_event;
> +    bool_t rc = 0;
> +
> +    mem_event_ring_lock(med);
> +
> +    /* Due to the reservations, this step must succeed. */
> +    front_ring = &med->front_ring;
> +
> +    /* Index of last unconsumed request */
> +    req_event = front_ring->sring->req_event - 1;
> +    ureq = RING_GET_REQUEST(front_ring, req_event);
> +
> +    if ( req->gfn != ureq->gfn )
> +        goto out;
> +    if ( current->vcpu_id != ureq->vcpu_id )
> +        goto out;
> +    if ( req->access_r != ureq->access_r )
> +        goto out;
> +    if ( req->access_w != ureq->access_w )
> +        goto out;
> +    if ( req->access_x != ureq->access_x )
> +        goto out;
> +
> +    /* Check consumer index has not moved */
> +    if ( req_event == read_atomic(&front_ring->sring->req_event) - 1 )
> +        rc = 1;

This should all be one big if() imo.

> With this I am able to prevent recursive identical faults in the runstate 
> area causing the assert to fire. However there still exists a corner case. I 
> still do not know what to do about the situation when the ring is full, the 
> last unconsumed event was not for the runstate area, the runstate area is 
> marked not writable and the guest or xen writes to it. I _think_ this would 
> again cause the assert to fire. Can you give me some guidance as to what to 
> do in this situation? Should I corner case the runstate area?

No, that wouldn't help as that's not the only thing Xen writes to guest
memory.

And no, I have no good idea about how to deal with the situation. I
continue to think that this is the wrong overall approach though, due
to the resultant inconsistency with HVM (where Xen writes are being
ignored), i.e. I believe you ought to rather think about making these
not fault, or deal with the faults gracefully and without needing to
emulate the instructions. One possibility I can see would be to clear
CR0.WP in the fault handler (and set a flag that this was done),
restoring it on the path back out of guest memory access function(s).
But that of course implies that the situation can only ever arise for
writes. And it would need to be done extremely carefully to make
sure you don't inadvertently allow writes to regions that are truly
write protected.

> PS: I did try running with a hacked up version where the max ring buffer 
> entries (nr_ents) is 1 and I could not make this situation happen but I guess 
> it is still theoretically possible. Or am I mistaken?

No, you aren't - if the ring doesn't get consumed from, it will
unavoidably become full at some point.

Jan
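
For concreteness, a minimal sketch of the CR0.WP idea Jan outlines above, in Xen-flavoured C. The per-CPU flag and both hook points are hypothetical, and a real patch would need the careful auditing Jan warns about, since clearing CR0.WP lets supervisor writes bypass every read-only PTE, not just the policed pages.

/* Sketch only, not actual Xen code. */
static DEFINE_PER_CPU(bool_t, wp_was_cleared);

/* In the #PF handler, for a write fault whose RIP lies in Xen text: */
static void allow_hypervisor_write(void)
{
    write_cr0(read_cr0() & ~X86_CR0_WP);    /* supervisor writes now ignore R/W */
    this_cpu(wp_was_cleared) = 1;
}

/* On the exit path of the guest memory access function(s): */
static void maybe_restore_wp(void)
{
    if ( this_cpu(wp_was_cleared) )
    {
        this_cpu(wp_was_cleared) = 0;
        write_cr0(read_cr0() | X86_CR0_WP);
    }
}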


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-12  8:25                         ` Jan Beulich
  2014-05-12 18:27                           ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-16  4:11                           ` Aravindh Puthiyaparambil (aravindp)
  2014-05-16  6:48                             ` Jan Beulich
  1 sibling, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-16  4:11 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Tim Deegan (tim@xen.org),
	Andres Lagar-Cavilla (andreslc@gridcentric.ca)

>> Let me step through this. The pagefault for runstate occurs and wait_event()
>> gets called and the ring is full.
>>
>> wait_event():
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )   ==> This will return EBUSY
>>             break;
>>         wait();                                              ==> Write to runstate area occurs...
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> ...now this write to runstate will again cause a pagefault and we will end
>> up in wait_event() again. The previous attempt would have called
>> prepare_to_wait() but finish_wait() was not called.
>> finish_wait()->__finish_wait() is where wqv->esp gets reset to NULL. So now:
>>
>> wait_event():
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);                           ==> Assertion 'wqv->esp == 0' fails
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>             break;
>>         wait();
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> Does this make sense?
>
>Of course. But it still doesn't mean you can replace the wait() here
>with something else (e.g. the effectively spin wait loop you suggested
>earlier). As said earlier on, you need to find a way to avoid the second
>invocation of wait_event(), rather than a way to make it "work".

I took your advice and tried to stop the second invocation of wait_event() by using the method suggested by Andres which is to merge the recursive faults.

<quote>
1. read last unconsumed
2. compare
3. different? write into ring
4. same? atomic_read consumer index to ensure still unconsumed 
5. consumed? write into ring <still a possible race below, but not a tragedy> 
6. not consumed? drop…
<end quote>

Here is the gist of the patch. I have only included the pertinent part for now.

diff --git a/xen/arch/x86/mm/mem_event.c b/xen/arch/x86/mm/mem_event.c
index 9b45504..2b697d7 100644
--- a/xen/arch/x86/mm/mem_event.c
+++ b/xen/arch/x86/mm/mem_event.c
@@ -267,6 +267,49 @@ void mem_event_mark_and_pause(struct vcpu *v, struct mem_event_domain *med)
 }
 
 /*
+ * Check if duplicate requests are being placed in the ring. The check is not
+ * race free but the worst that can happen is the caller detecting an already
+ * consumed request as duplicate. This is safe for sh_page_fault() which is the
+ * only caller of this function.
+ */
+bool_t mem_event_check_duplicate(struct mem_event_domain *med,
+                                 mem_event_request_t *req)
+{
+    mem_event_front_ring_t *front_ring;
+    mem_event_request_t *ureq;
+    RING_IDX req_event;
+    bool_t rc = 0;
+
+    mem_event_ring_lock(med);
+
+    /* Due to the reservations, this step must succeed. */
+    front_ring = &med->front_ring;
+
+    /* Index of last unconsumed request */
+    req_event = front_ring->sring->req_event - 1;
+    ureq = RING_GET_REQUEST(front_ring, req_event);
+
+    if ( req->gfn != ureq->gfn )
+        goto out;
+    if ( current->vcpu_id != ureq->vcpu_id )
+        goto out;
+    if ( req->access_r != ureq->access_r )
+        goto out;
+    if ( req->access_w != ureq->access_w )
+        goto out;
+    if ( req->access_x != ureq->access_x )
+        goto out;
+
+    /* Check consumer index has not moved */
+    if ( req_event == read_atomic(&front_ring->sring->req_event) - 1 )
+        rc = 1;
+
+ out:
+    mem_event_ring_unlock(med);
+    return rc;
+}
+
+/*

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index aee061c..235bfa7 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3065,22 +3064,6 @@ static int sh_page_fault(struct vcpu *v,
                 break;
             }
 
-            /*
-             * Do not police writes to guest memory emanating from the Xen
-             * kernel. Trying to do so will cause the same pagefault to occur
-             * over and over again with an event being sent to the access
-             * listener for each fault. If the access listener's vcpu is not
-             * scheduled during this time, the violation is never resolved and
-             * will eventually end with the host crashing.
-             */
-            if ( (violation && access_w) &&
-                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
-            {
-                violation = 0;
-                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
-                                    p2m_ram_rw, p2m_access_rw);
-            }
-
             if ( violation )
             {
                 paddr_t gpa = (mfn_x(gmfn) << PAGE_SHIFT) +
@@ -3095,7 +3078,16 @@ static int sh_page_fault(struct vcpu *v,
                                   (access_w ? 'w' : '-'),
                                   (access_x ? 'x' : '-'),
                                   mfn_x(gmfn), p2ma);
-                    /* Rights not promoted, vcpu paused, work here is done */
+
+                    if ( !mem_event_check_duplicate(&d->mem_event->access,
+                                                    req_ptr) )
+                        /* Rights not promoted, vcpu paused, work here is done */
+                        goto out_put_gfn;
+
+                    /* Duplicate event detected. Unpause the vcpu and exit */
+                    xfree(req_ptr);
+                    req_ptr = NULL;
+                    vcpu_unpause(current);
                     goto out_put_gfn;
                 }
             }

With this I am able to prevent recursive identical faults in the runstate area causing the assert to fire. However there still exists a corner case. I still do not know what to do about the situation when the ring is full, the last unconsumed event was not for the runstate area, the runstate area is marked not writable and the guest or xen writes to it. I _think_ this would again cause the assert to fire. Can you give me some guidance as to what to do in this situation? Should I corner case the runstate area?

PS: I did try running with a hacked up version where the max ring buffer entries (nr_ents) is 1 and I could not make this situation happen but I guess it is still theoretically possible. Or am I mistaken?

Thanks,
Aravindh


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-12  8:25                         ` Jan Beulich
@ 2014-05-12 18:27                           ` Aravindh Puthiyaparambil (aravindp)
  2014-05-16  4:11                           ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-12 18:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Tim Deegan (tim@xen.org)

>>>That would effectively be a spin wait then, which is surely not the right thing.
>>>But I don't follow your explanation above anyway - when coming back
>>>from wait(), the state is the same as the original one, so the page
>>>fault handling continues, it's not being retried.
>>
>> Let me step through this. The pagefault for runstate occurs and
>> wait_event() gets called and the ring is full.
>>
>> wait_event():
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )   ==> This will return EBUSY
>>             break;
>>         wait();                                              ==> Write to runstate area occurs...
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> ...now this write to runstate will again cause a pagefault and we will
>> end up in wait_event() again. The previous attempt would have called
>> prepare_to_wait() but finish_wait() was not called.
>> finish_wait()->__finish_wait() is where wqv->esp gets reset to NULL. So now:
>>
>> wait_event():
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);                           ==> Assertion 'wqv->esp == 0' fails
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>             break;
>>         wait();
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> Does this make sense?
>
>Of course. But it still doesn't mean you can replace the wait() here with
>something else (e.g. the effectively spin wait loop you suggested earlier). As
>said earlier on, you need to find a way to avoid the second invocation of
>wait_event(), rather than a way to make it "work".

OK, I will look into that. Thanks for your time.

Aravindh


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-09 17:48                       ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-12  8:25                         ` Jan Beulich
  2014-05-12 18:27                           ` Aravindh Puthiyaparambil (aravindp)
  2014-05-16  4:11                           ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Beulich @ 2014-05-12  8:25 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, Tim Deegan (tim@xen.org)

>>> On 09.05.14 at 19:48, <aravindp@cisco.com> wrote:
>>> Looking at what the solution for the ring being full in the PV case
>>> whether we are policing Xen writes or not, calling wait() will not
>>> work due to the scenario I had mentioned a while back and is shown above
>>> in the stack trace. I am repeating that flow here
>>> mem_event_claim_slot() ->
>>> 	mem_event_wait_slot() ->
>>> 		 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)
>>>
>>> wait_event() macro looks like this:
>>> do {
>>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>>         break;
>>>     for ( ; ; ) {
>>>         prepare_to_wait(&med->wq);
>>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>>             break;
>>>         wait();
>>>     }
>>>     finish_wait(&med->wq);
>>> } while (0)
>>>
>>> In the case where the ring is full, wait() gets called and the cpu
>>> gets scheduled away. But since it is in middle of a pagefault, when it
>>> runs again it ends up in handle_exception_saved and the same pagefault is
>>> tried again. But since finish_wait() never ends up being called wqv->esp never
>>> becomes 0 and hence the assert fires on the next go around. So I think
>>> we should be calling process_pending_softirqs() instead of wait() for PV domains.
>>
>>That would effectively be a spin wait then, which is surely not the right thing.
>>But I don't follow your explanation above anyway - when coming back from
>>wait(), the state is the same as the original one, so the page fault handling
>>continues, it's not being retried.
> 
> Let me step through this. The pagefault for runstate occurs and wait_event() 
> gets called and the ring is full. 
> 
> wait_event():
> do { 
>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
>         break; 
>     for ( ; ; ) { 
>         prepare_to_wait(&med->wq); 
>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )   ==> This will return EBUSY
>             break; 
>         wait();                                                                                                   ==> Write to runstate area occurs...
>     } 
>     finish_wait(&med->wq);
> } while (0)
> 
> ...now this write to runstate will again cause a pagefault and we will end 
> up in wait_event() again. The previous attempt would have called 
> prepare_to_wait() but finish_wait() was not called. 
> finish_wait()->__finish_wait() is where wqv->esp gets reset to NULL. So now:
> 
> wait_event():
> do { 
>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
>         break; 
>     for ( ; ; ) { 
>         prepare_to_wait(&med->wq);                                                 ==> Assertion 'wqv->esp == 0' fails
>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
>             break; 
>         wait();
>     } 
>     finish_wait(&med->wq);
> } while (0)
> 
> Does this make sense?

Of course. But it still doesn't mean you can replace the wait() here
with something else (e.g. the effectively spin wait loop you suggested
earlier). As said earlier on, you need to find a way to avoid the second
invocation of wait_event(), rather than a way to make it "work".

Jan


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-09 15:09 ` Andres Lagar-Cavilla
@ 2014-05-09 18:22   ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 0 replies; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-09 18:22 UTC (permalink / raw)
  To: Andres Lagar-Cavilla, xen-devel
  Cc: Tim Deegan (tim@xen.org), Joe Epstein, Jan Beulich

>Aravindh, I thought of a solution to your problem. It seems the ring collapses
>under recursive identical faults. If they are all the same, you can merge them
>up. It's relatively simple and robust logic to compare against the last un-
>consumed element in the ring
>
>1. read last unconsumed
>2. compare
>3. different? write into ring
>4. same? atomic_read consumer index to ensure still unconsumed 5.
>consumed? write into ring <still a possible race below, but not a tragedy> 6.
>not consumed? drop…
>
>Kind of like what remote display protocols do quite aggressively. More
>aggressive batching and reordering is hard with a consumer as decoupled as in
>the Xen shared rings case. I'm pretty hopeful this will solve this one problem.

This sounds very promising. I will give it a shot. Thanks so much for the idea.

>But am not sure how to solve the very general problem of aggressive
>overflow in a scheduling context that does not allow exiting to schedule() and
>thus wait().

Yes, this is the larger problem. I am hoping Jan or Tim will have an idea about what to do as we are entering areas in Xen that I am not very familiar with.

Thanks,
Aravindh


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-09  6:49                     ` Jan Beulich
@ 2014-05-09 17:48                       ` Aravindh Puthiyaparambil (aravindp)
  2014-05-12  8:25                         ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-09 17:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Tim Deegan (tim@xen.org)

>> The above sequence of events would not cause an EPT violation (I
>> realize this can happen only in the guest context) or pagefault (The
>> page is present and marked writable in the guest). All that would
>> happen is an event to be sent to the listener from __hvm_copy() and no
>cascading faults will occur.
>
>I have to admit that I'm unable to spot where such an event gets sent.

I had prefaced this by mentioning, "Now take the similar scenario in the hypothetical HVM + EPT case where we are policing Xen writes to guest memory." The code would look something like this if it were implemented. Please note this was just a quick, hacky write-up :-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index da220bf..8ce99d5 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2750,6 +2750,65 @@ static enum hvm_copy_result __hvm_copy(
             return HVMCOPY_unhandleable;
         }
 
+        if ( unlikely(mem_event_check_ring(&curr->domain->mem_event->access)) )
+        {
+            struct p2m_domain *p2m = p2m_get_hostp2m(curr->domain);
+            mfn_t gmfn;
+            p2m_type_t _p2mt;
+            p2m_access_t p2ma;
+            bool_t violation = 0;
+            mem_event_request_t *req_ptr = NULL;
+
+            gmfn = p2m->get_entry(p2m, gfn, &_p2mt, &p2ma, 0, NULL);
+
+            /* If the access is against the permissions, then send to mem_event */
+            switch (p2ma)
+            {
+            case p2m_access_n:
+            case p2m_access_n2rwx:
+            case p2m_access_x:
+                violation = 1;
+                break;
+
+            case p2m_access_w:
+            case p2m_access_wx:
+                if ( flags & HVMCOPY_from_guest )
+                    violation = 1;
+                break;
+
+            case p2m_access_r:
+            case p2m_access_rx:
+            case p2m_access_rx2rw:
+                if ( flags & HVMCOPY_to_guest )
+                    violation = 1;
+                break;
+
+            case p2m_access_rw:
+            case p2m_access_rwx:
+                break;
+            }
+
+            if ( violation )
+            {
+                paddr_t gpa = (gfn << PAGE_SHIFT) +
+                              (flags & HVMCOPY_virt ?
+                               (addr & ((1 << PAGE_SHIFT) - 1)) : 0);
+
+                if ( !p2m_mem_access_check(gpa, (flags & HVMCOPY_virt ? 1 : 0),
+                                           addr, 0, 1, 0, &req_ptr) )
+                {
+                    if ( unlikely(req_ptr != NULL) )
+                    {
+                        mem_access_send_req(curr->domain, req_ptr);
+                        xfree(req_ptr);
+                    }
+                    put_page(page);
+                    return HVMCOPY_unhandleable;
+                }
+            }
+        }
+
         p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
         if ( flags & HVMCOPY_to_guest )

>What is quite clear is that you don't want multiple nested runstate area
>updates to occur in the first place, i.e. you need to properly deal with the
>_first_ page fault occurring instead of waiting until multiple such events pile
>up. And I'm afraid doing so will require some (perhaps gross) hackery...

OK, I will take a stab at this. 

>> Looking at what the solution for the ring being full in the PV case
>> whether we are policing Xen writes or not, calling wait() will not
>> work due to the scenario I had mentioned a while back and is shown above
>> in the stack trace. I am repeating that flow here
>> mem_event_claim_slot() ->
>> 	mem_event_wait_slot() ->
>> 		 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)
>>
>> wait_event() macro looks like this:
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>             break;
>>         wait();
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> In the case where the ring is full, wait() gets called and the cpu
>> gets scheduled away. But since it is in middle of a pagefault, when it
>> runs again it ends up in handle_exception_saved and the same pagefault is
>> tried again. But since finish_wait() never ends up being called wqv->esp never
>> becomes 0 and hence the assert fires on the next go around. So I think
>> we should be calling process_pending_softirqs() instead of wait() for PV domains.
>
>That would effectively be a spin wait then, which is surely not the right thing.
>But I don't follow your explanation above anyway - when coming back from
>wait(), the state is the same as the original one, so the page fault handling
>continues, it's not being retried.

Let me step through this. The pagefault for runstate occurs and wait_event() gets called and the ring is full. 

wait_event():
do { 
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
        break; 
    for ( ; ; ) { 
        prepare_to_wait(&med->wq); 
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )   ==> This will return EBUSY
            break; 
        wait();                                                                                                   ==> Write to runstate area occurs...
    } 
    finish_wait(&med->wq);
} while (0)

...now this write to runstate will again cause a pagefault and we will end up in wait_event() again. The previous attempt would have called prepare_to_wait() but finish_wait() was not called. finish_wait()->__finish_wait() is where wqv->esp gets reset to NULL. So now:

wait_event():
do { 
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
        break; 
    for ( ; ; ) { 
        prepare_to_wait(&med->wq);                                                 ==> Assertion 'wqv->esp == 0' fails
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
            break; 
        wait();
    } 
    finish_wait(&med->wq);
} while (0)

Does this make sense?

Thanks,
Aravindh
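
The assertion scenario stepped through above can be reproduced outside Xen with a toy model of the waitqueue bookkeeping. A minimal, self-contained sketch follows; wqv->esp, which the real prepare_to_wait() uses to record the saved stack, is reduced to a flag:

#include <assert.h>
#include <stddef.h>

struct waitqueue_vcpu { void *esp; };   /* toy stand-in for Xen's */

static void prepare_to_wait(struct waitqueue_vcpu *wqv)
{
    assert(wqv->esp == NULL);           /* the assertion that fires */
    wqv->esp = (void *)1;               /* "stack saved" */
}

static void finish_wait(struct waitqueue_vcpu *wqv)
{
    wqv->esp = NULL;                    /* the only place esp is reset */
}

int main(void)
{
    struct waitqueue_vcpu wqv = { NULL };

    prepare_to_wait(&wqv);  /* first wait_event(): ring is full */
    /* wait() -> context switch -> runstate write -> nested #PF ... */
    prepare_to_wait(&wqv);  /* nested wait_event(): assertion fails here */
    finish_wait(&wqv);      /* never reached */
    return 0;
}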


* Re: Issue policing writes from Xen to PV domain memory
       [not found] <mailman.17701.1399617131.24322.xen-devel@lists.xen.org>
@ 2014-05-09 15:09 ` Andres Lagar-Cavilla
  2014-05-09 18:22   ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 28+ messages in thread
From: Andres Lagar-Cavilla @ 2014-05-09 15:09 UTC (permalink / raw)
  To: xen-devel
  Cc: Tim Deegan (tim@xen.org), Joe Epstein, Jan Beulich,
	Aravindh Puthiyaparambil

> Hi Jan,
> 
> It's a good question.  Capturing Xen-initiated access would be good for
> some use cases, but I had thought of that as a future enhancement.
> 
> Joe

Well, there is more than that. Does Xen-access capture foreign writable mappings via grants and dom0 map_foreign_bulk and friends?

Aravindh, I thought of a solution to your problem. It seems the ring collapses under recursive identical faults. If they are all the same, you can merge them up. It's relatively simple and robust logic to compare against the last un-consumed element in the ring

1. read last unconsumed
2. compare
3. different? write into ring
4. same? atomic_read consumer index to ensure still unconsumed
5. consumed? write into ring
<still a possible race below, but not a tragedy>
6. not consumed? drop…

Kind of like what remote display protocols do quite aggressively. More aggressive batching and reordering is hard with a consumer as decoupled as in the Xen shared rings case. I'm pretty hopeful this will solve this one problem.

But am not sure how to solve the very general problem of aggressive overflow in a scheduling context that does not allow exiting to schedule() and thus wait().

Andres
> 
> 
> On Wed, May 7, 2014 at 11:26 PM, Jan Beulich <JBeulich@suse.com> wrote:
> 
>>>>> On 07.05.14 at 21:37, <aravindp@cisco.com> wrote:
>>> I dug in further to figure out if there is any difference between HVM
>> and PV
>>> domains with policing writes emanating from Xen. I started with how the
>>> runstate area in the guest is updated. It is done using
>> __copy_to_guest().
>>> Here is the flow for PV and HVM.
>>> 
>>> For PV:
>>> __copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest
>>> 
>>> I think in the above scenario, the page permissions that are present in
>> the
>>> shadow are adhered to for PV domains running with shadow and hence
>> faults can
>>> occur.
>>> 
>>> For HVM:
>>> __copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest ->
>>> copy_to_user_hvm -> hvm_copy_to_guest_virt_nofault  -> __hvm_copy(flags =
>>> HVMCOPY_to_guest | HVMCOPY_no_fault | HVMCOPY_virt)
>>> 
>>> If I look in __hvm_copy(), I see that access permissions are not adhered
>> to.
>>> Writes to guest memory will go through even if the p2m_access type for
>> that
>>> page has it set as non-writable.  So it seems that we do not police
>> writes to
>>> guest memory that emanate from Xen even for the HVM case. Is my reading
>> of
>>> the code correct?
>> 
>> It would seem so, but the question (including whether actual behavior
>> matches intentions) really ought to be answered by the original author
>> of the code - let's see if he's still around under his email address from
>> back then... Joe?
>> 
>> Jan
>> 
>> 


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-09  2:42                   ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-09  6:49                     ` Jan Beulich
  2014-05-09 17:48                       ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2014-05-09  6:49 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, Tim Deegan (tim@xen.org)

>>> On 09.05.14 at 04:42, <aravindp@cisco.com> wrote:
> The above sequence of events would not cause an EPT violation (I realize this 
> can happen only in the guest context) or pagefault (The page is present and 
> marked writable in the guest). All that would happen is an event to be sent 
> to the listener from __hvm_copy() and no cascading faults will occur.

I have to admit that I'm unable to spot where such an event gets sent.

> The one thing I am not able to figure out is why doesn't the listener i.e. 
> Dom0's VCPU get to run and process the events in the window between access 
> enable and process events loop. I am not familiar with the Xen scheduler well 
> enough to know how it would react to cascading pagefaults occurring for a 
> guest area in the Xen context like it is happening above. I have even tried 
> pinning Dom0 and the guest on different CPUs but this still occurs. I would 
> be grateful if you could provide some insight here.

I'm in no way convinced that this is scheduler related at all, which your
pinning experiment would appear to confirm.

What is quite clear is that you don't want multiple nested runstate
area updates to occur in the first place, i.e. you need to properly
deal with the _first_ page fault occurring instead of waiting until
multiple such events pile up. And I'm afraid doing so will require some
(perhaps gross) hackery...

> Looking at what the solution for the ring being full in the PV case whether 
> we are policing Xen writes or not, calling wait() will not work due to the 
> scenario I had mentioned a while back and is shown above in the stack trace. 
> I am repeating that flow here
> mem_event_claim_slot() -> 
> 	mem_event_wait_slot() ->
> 		 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)
> 
> wait_event() macro looks like this:
> do { 
>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
>         break; 
>     for ( ; ; ) { 
>         prepare_to_wait(&med->wq); 
>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
>             break; 
>         wait(); 
>     } 
>     finish_wait(&med->wq); 
> } while (0)
> 
> In the case where the ring is full, wait() gets called and the cpu gets 
> scheduled away. But since it is in middle of a pagefault, when it runs again 
> it ends up in handle_exception_saved and the same pagefault is tried again. 
> But since finish_wait() never ends up being called wqv->esp never becomes 0 and 
> hence the assert fires on the next go around. So I think we should be calling 
> process_pending_softirqs() instead of wait() for PV domains.

That would effectively be a spin wait then, which is surely not the right
thing. But I don't follow your explanation above anyway - when coming
back from wait(), the state is the same as the original one, so the page
fault handling continues, it's not being retried.

Jan


* Re: Issue policing writes from Xen to PV domain memory
  2014-05-07  6:36                 ` Jan Beulich
  2014-05-07 19:37                   ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-09  2:42                   ` Aravindh Puthiyaparambil (aravindp)
  2014-05-09  6:49                     ` Jan Beulich
  1 sibling, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-09  2:42 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Tim Deegan (tim@xen.org)

>I can only repeat what I said above: You first need to understand why the ring
>is (or appears to be) full. But even with that clarified you still need to have a
>proper solution for the case where the ring might end up being full for valid
>reasons. 

I continued digging into whether the ring is actually getting full. It is indeed getting full, with events for the runstate area. If I run with my patch to wait_event(), the first time the listener looks for unconsumed ring requests it finds around 50-60 of them, all pointing to the runstate area page. This is proved further below.

I then moved on to figuring out why it is getting full. I added the following patch on top of the one I submitted. It adds a couple of prints to sh_page_fault(), allows Xen writes to be policed, and prints out runstate_guest in VCPUOP_register_runstate_memory_area.

diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index aee061c..569cea8 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2846,7 +2846,7 @@ static int sh_page_fault(struct vcpu *v,
     int fast_emul = 0;
 #endif
 
-    SHADOW_PRINTK("d:v=%u:%u va=%#lx err=%u, rip=%lx\n",
+    SHADOW_ERROR("d:v=%u:%u va=%#lx err=%u, rip=%lx\n",
                   v->domain->domain_id, v->vcpu_id, va, regs->error_code,
                   regs->eip);
 
@@ -3073,6 +3073,7 @@ static int sh_page_fault(struct vcpu *v,
              * scheduled during this time, the violation is never resolved and
              * will eventually end with the host crashing.
              */
+#if 0
             if ( (violation && access_w) &&
                  (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
             {
@@ -3080,6 +3081,7 @@ static int sh_page_fault(struct vcpu *v,
                 rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
                                     p2m_ram_rw, p2m_access_rw);
             }
+#endif
 
             if ( violation )
             {
@@ -3089,7 +3091,7 @@ static int sh_page_fault(struct vcpu *v,
                                            access_r, access_w, access_x,
                                            &req_ptr) )
                 {
-                    SHADOW_PRINTK("Page access %c%c%c for gmfn=%"PRI_mfn" "
+                    SHADOW_ERROR("Page access %c%c%c for gmfn=%"PRI_mfn" "
                                   "p2ma: %d\n",
                                   (access_r ? 'r' : '-'),
                                   (access_w ? 'w' : '-'),

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 4291e29..3281195 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -1186,6 +1186,7 @@ long do_vcpu_op(int cmd, int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
 
         rc = 0;
         runstate_guest(v) = area.addr.h;
+        gdprintk(XENLOG_DEBUG, "runstate_guest: %p\n", runstate_guest(v).p);
 
         if ( v == current )
         {
On attaching the xen-access listener to the PV domain, I see the following:

<Xen serial output>

(d2) mapping kernel into physical memory
(d2) about to get started...
(XEN) domain.c:1189:d2v0 runstate_guest: ffffffff81cefd80
(XEN) domain.c:1189:d2v0 runstate_guest: ffff88000fc0bd80
<PV guest is up and running now>
<Attaching xen-access listener>
(XEN) sh error: sh_page_fault__guest_4(): d:v=2:0 va=0xffff88000fc0bd80 err=2, rip=ffff82d08018e516
(XEN) sh error: sh_page_fault__guest_4(): Page access -w- for gmfn=134eb p2ma: 5
<the above prints are repeated 61 times>
(XEN) Assertion 'wqv->esp == 0' failed at wait.c:133
(XEN) ----[ Xen-4.5-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82d08012feb6>] prepare_to_wait+0x61/0x243
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: 0000000000000100   rbx: 0000000000000100   rcx: 0000000000000000
(XEN) rdx: ffff82d0802eff00   rsi: 0000000000000000   rdi: ffff830032f1a000
(XEN) rbp: ffff83003ffef4c8   rsp: ffff83003ffef498   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000020   r11: 000000000000000a
(XEN) r12: ffff830032f0cc70   r13: ffff830032f0ca30   r14: ffff83003ffeff18
(XEN) r15: ffff830032f1a000   cr0: 000000008005003b   cr4: 00000000000426b0
(XEN) cr3: 00000000130f3000   cr2: ffff88000fc0bd80
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83003ffef498:
(XEN)    ffff83003ffef4e4 ffff830032f0c9f0 ffff830032f0ca30 ffff830032f24000
(XEN)    ffff83003ffef7a8 0000000000000006 ffff83003ffef508 ffff82d0801f23f1
(XEN)    000001003ffef528 fffffff000000030 ffff83003ffef540 ffff830032f24000
(XEN)    ffff830032f15bc0 ffff830032f24000 ffff83003ffef528 ffff82d0801f7003
(XEN)    0000000000000000 ffff830032f1a000 ffff83003ffef758 ffff82d08021ab68
(XEN)    ffff830000000005 0000000000000002 000000000000000b 0000000000000058
(XEN)    ffff82d080319b78 00000000000003f0 0000000000000000 0000000000000880
(XEN)    ffff83003faf8430 0000000000000108 0000000004040008 ffff82d080319b68
(XEN)    000ffff88000fc0b fffeffff80166300 0000000000013400 ffff88000fc0bd80
(XEN)    0000000000000001 0000000000000000 00000000000134eb 00000000000134eb
(XEN)    ffff83003ffef5e8 0000000000000046 ffff83003ffef608 0000000000000040
(XEN)    ffff83003ffef658 0000000000000046 ffff830032f15bc0 0000000000000005
(XEN)    000000fc04040008 ffff83003ffef688 ffff83003ffef658 0000000000000040
(XEN)    ffff82d0802eff00 0000000000000001 ffff83003ffef700 0000002fbfcd8300
(XEN)    ffff83003ffef668 ffff82d080181fc2 ffff83003ffef678 ffff82d0801824e6
(XEN)    ffff83003ffef6c8 ffff82d080128a81 0000000000000001 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff83003fec9130 ffff83003faf8840
(XEN)    0000000000000001 ffff830032f0ccd0 ffff83003ffef778 ffff82d08011f191
(XEN)    0000000000000000 ffff88000fc0bd80 00000000333bc067 00000000333b8067
(XEN)    0000000039f8a067 80100000134eb067 0000000000014ec2 00000000000333bc
(XEN) Xen call trace:
(XEN)    [<ffff82d08012feb6>] prepare_to_wait+0x61/0x243
(XEN)    [<ffff82d0801f23f1>] __mem_event_claim_slot+0x56/0xb2
(XEN)    [<ffff82d0801f7003>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ab68>] sh_page_fault__guest_4+0x1d0a/0x24a2
(XEN)    [<ffff82d08018e2fd>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022adfd>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018e516>] __copy_to_user_ll+0x2a/0x37            <==== pagefault happens here
(XEN)    [<ffff82d08015fe7d>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fe97>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d08015ff71>] context_switch+0xb2/0xf05
(XEN)    [<ffff82d080125bd3>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d0801283f1>] wait+0x9/0xb
(XEN)    [<ffff82d0801f2407>] __mem_event_claim_slot+0x6c/0xb2
(XEN)    [<ffff82d0801f7003>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ab68>] sh_page_fault__guest_4+0x1d0a/0x24a2
(XEN)    [<ffff82d08018e2fd>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022adfd>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018e516>] __copy_to_user_ll+0x2a/0x37              <==== pagefault happens here
(XEN)    [<ffff82d08015fe7d>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fe97>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d080160db1>] context_switch+0xef2/0xf05
(XEN)    [<ffff82d080125bd3>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d080128959>] __do_softirq+0x81/0x8c
(XEN)    [<ffff82d0801289b2>] do_softirq+0x13/0x15
(XEN)    [<ffff82d08015d2ae>] idle_loop+0x62/0x72
(XEN) 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'wqv->esp == 0' failed at wait.c:133
(XEN) ****************************************

Xen-access output:

# ./xen-access 2 write
xenaccess init
max_pages = 10100
ring_mfn: 0x30f90
starting write 2
Got event from Xen

<host crashed>

In the above, you can see that the RIP where the fault occurs matches that of __copy_to_user_ll() and that the VA is the runstate area registered by the guest.

When a listener attaches to a running domain, shadow mode is enabled and shadow_blow_tables() is called. This means new shadow entries are created with the default access permission, which is not writable. The first and only RIP I see in the prints above comes from Xen; guest code does not even get a chance to execute. I suspect this is because Xen is trying to schedule the guest, and hence to update the guest runstate area, which causes the pagefault and an event to be sent to the ring. Now take the analogous scenario in the hypothetical HVM + EPT case where we police Xen writes to guest memory. The same sequence of events would cause neither an EPT violation (I realize those can happen only in the guest context) nor a pagefault (the page is present and marked writable in the guest). All that would happen is an event being sent to the listener from __hvm_copy(), and no cascading faults would occur. In the PV case, however, the pagefault occurs in the Xen context and, I am guessing, has to be handled right away, as there is no pausing the CPU. I would venture to guess that if we implemented mem_access for HVM guests running with shadow pagetables, the PV situation would arise there too.
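
(To make the hypothetical HVM behaviour concrete, here is a minimal sketch of what such policing inside __hvm_copy() might look like. This is not existing Xen code: mem_access_send_req() and the mem_event_request_t fields are the real interfaces visible in the stack traces above, but the surrounding control flow, and the p2ma local holding the page's p2m_access, are assumptions.)

/* Hypothetical sketch only: notify the listener from __hvm_copy() without
 * ever faulting, so no cascade of re-executed faults can build up. */
if ( (flags & HVMCOPY_to_guest) && p2ma == p2m_access_r )
{
    mem_event_request_t req = {
        .reason   = MEM_EVENT_REASON_VIOLATION,
        .gfn      = gfn,
        .access_w = 1,
    };

    mem_access_send_req(curr->domain, &req);  /* report the write ... */
}
memcpy(p, buf, count);                        /* ... but still complete it */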

The one thing I am not able to figure out is why the listener, i.e. Dom0's vCPU, does not get to run and process the events in the window between enabling access and entering the event-processing loop. I am not familiar enough with the Xen scheduler to know how it reacts to cascading pagefaults on a guest area occurring in Xen context, as is happening above. I have even tried pinning Dom0 and the guest to different CPUs, but the crash still occurs. I would be grateful if you could provide some insight here.

As for a solution to the ring being full in the PV case, whether we are policing Xen writes or not, calling wait() will not work, due to the scenario I mentioned a while back, which is shown in the stack trace above. I am repeating that flow here:
mem_event_claim_slot() -> 
	mem_event_wait_slot() ->
		 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)

wait_event() macro looks like this:
do {                                                        \
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )      \
        break;                                              \
    for ( ; ; ) {                                           \
        prepare_to_wait(&med->wq);                          \
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )  \
            break;                                          \
        wait();                                             \
    }                                                       \
    finish_wait(&med->wq);                                  \
} while (0)

In the case where the ring is full, wait() gets called and the CPU gets scheduled away. But since it is in the middle of a pagefault, when it runs again it ends up in handle_exception_saved and the same pagefault is retried. And since finish_wait() never ends up being called, wqv->esp never becomes 0, and hence the assert fires on the next go-around. So I think we should be calling process_pending_softirqs() instead of wait() for PV domains. If there is a better solution, please let me know and I will look into implementing it.
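
(A minimal sketch of that idea, for concreteness; this is hypothetical code, not a tested patch. mem_event_wait_try_grab() and process_pending_softirqs() are the existing functions named above, but the helper itself is an assumption.)

/* Sketch only: claim a ring slot from Xen context by polling instead of
 * sleeping on the per-vCPU wait queue, so finish_wait() bookkeeping is
 * never needed and wqv->esp stays 0. */
static int mem_event_claim_slot_poll(struct mem_event_domain *med)
{
    int rc;

    while ( mem_event_wait_try_grab(med, &rc) == -EBUSY )
        process_pending_softirqs();  /* give the listener time to drain */

    return rc;
}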

Thanks,
Aravindh

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-07 19:37                   ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-08  6:26                     ` Jan Beulich
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Beulich @ 2014-05-08  6:26 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp), jepstein98
  Cc: xen-devel, TimDeegan(tim@xen.org)

>>> On 07.05.14 at 21:37, <aravindp@cisco.com> wrote:
> I dug in further to figure out if there is any difference between HVM and PV 
> domains with policing writes emanating from Xen. I started with how the 
> runstate area in the guest is updated. It is done using __copy_to_guest(). 
> Here is the flow for PV and HVM.
> 
> For PV:
> __copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest 
> 
> I think in the above scenario, the page permissions that are present in the 
> shadow are adhered to for PV domains running with shadow and hence faults can 
> occur.
> 
> For HVM:
> __copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest -> 
> copy_to_user_hvm -> hvm_copy_to_guest_virt_nofault  -> __hvm_copy(flags = 
> HVMCOPY_to_guest | HVMCOPY_no_fault | HVMCOPY_virt)
> 
> If I look in __hvm_copy(), I see that access permissions are not adhered to. 
> Writes to guest memory will go through even if the p2m_access type for that 
> page has it set as non-writable.  So it seems that we do not police writes to 
> guest memory that emanate from Xen even for the HVM case. Is my reading of 
> the code correct?

It would seem so, but the question (including whether actual behavior
matches intentions) really ought to be answered by the original author
of the code - let's see if he's still around under his email address from
back then... Joe?

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-07  6:36                 ` Jan Beulich
@ 2014-05-07 19:37                   ` Aravindh Puthiyaparambil (aravindp)
  2014-05-08  6:26                     ` Jan Beulich
  2014-05-09  2:42                   ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-07 19:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, TimDeegan(tim@xen.org)

>> This happens only when the access listener first attaches to a PV domain for
>> write events. It occurs in the window between mem_access has been enabled and
>> the listener is ready to handle events. I have not seen it happen once the
>> listener starts handling events. And I have not run in to this issue with
>> execute violations.
>
>Which is still only describing observations, not anything leading to an
>understanding of why this is happening.

>I can only repeat what I said above: You first need to understand
>why the ring is (or appears to be) full. But even with that clarified
>you still need to have a proper solution for the case where the ring
>might end up being full for valid reasons. And as also said earlier, I
>am of the opinion that the behavior regarding Xen accesses would
>ideally not differ between HVM, PVH, and PV.

I dug in further to figure out if there is any difference between HVM and PV domains with policing writes emanating from Xen. I started with how the runstate area in the guest is updated. It is done using __copy_to_guest(). Here is the flow for PV and HVM.

For PV:
__copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest 

I think in the above scenario, the page permissions that are present in the shadow are adhered to for PV domains running with shadow and hence faults can occur.
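
(For reference, the writer that faults in the traces, simplified from xen/arch/x86/domain.c and paraphrased from memory, with error handling elided; for PV, the __copy_to_guest() below resolves through the guest's page tables, i.e. through their shadows when shadow mode is on, which is why a non-writable shadow PTE makes it fault.)

/* Simplified sketch of the runstate writer, not verbatim. */
static void update_runstate_area(struct vcpu *v)
{
    if ( guest_handle_is_null(runstate_guest(v)) )
        return;

    /* For PV this write walks the (shadowed) guest page tables. */
    __copy_to_guest(runstate_guest(v), &v->runstate, 1);
}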

For HVM:
__copy_to_guest -> __copy_to_guest_offset -> __raw_copy_to_guest -> copy_to_user_hvm -> hvm_copy_to_guest_virt_nofault  -> __hvm_copy(flags = HVMCOPY_to_guest | HVMCOPY_no_fault | HVMCOPY_virt)

If I look in __hvm_copy(), I see that access permissions are not adhered to. Writes to guest memory will go through even if the p2m_access type for that page has it set as non-writable.  So it seems that we do not police writes to guest memory that emanate from Xen even for the HVM case. Is my reading of the code correct?
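
(Paraphrasing the relevant part of the __hvm_copy() to-guest path from memory, not verbatim, to show that only the p2m type, and never the p2m_access, is consulted there:)

/* Sketch of the existing behaviour: the gfn lookup returns the p2m type
 * only, so listener-imposed access restrictions are never seen here. */
mfn = mfn_x(get_gfn_unshare(curr->domain, gfn, &p2mt));
/* ... bail out here for paging/grant/non-RAM types, based on p2mt ... */
p = (char *)map_domain_page(mfn) + (addr & ~PAGE_MASK);
if ( flags & HVMCOPY_to_guest )
    memcpy(p, buf, count);   /* completes regardless of p2m_access */
unmap_domain_page(p);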

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-06 19:01               ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-07  6:36                 ` Jan Beulich
  2014-05-07 19:37                   ` Aravindh Puthiyaparambil (aravindp)
  2014-05-09  2:42                   ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Beulich @ 2014-05-07  6:36 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, TimDeegan(tim@xen.org)

>>> On 06.05.14 at 21:01, <aravindp@cisco.com> wrote:
> This happens only when the access listener first attaches to a PV domain for 
> write events. It occurs in the window between mem_access has been enabled and 
> the listener is ready to handle events. I have not seen it happen once the 
> listener starts handling events. And I have not run in to this issue with 
> execute violations. 

Which is still only describing observations, not anything leading to an
understanding of why this is happening.

> So I tried a couple of things. One was to check if the ring was full before 
> sending a write violation emanating from Xen. If the ring was full instead of 
> calling mem_access_send_req() which would end up calling schedule(),  I 
> allowed the PTE to be created with the same default permissions which would 
> then cause the pagefault to be retried indirectly giving the listener more 
> time to handle events. This way the write would be policed. This is what the 
> change looks like on top of the original patch.
> ...
> +unsigned int mem_event_check_ring_availability(struct mem_event_domain *med)
> +{
> +    int avail_req = RING_FREE_REQUESTS(&med->front_ring);
> +
> +    mem_event_ring_lock(med);
> +    avail_req = mem_event_ring_available(med);
> +    mem_event_ring_unlock(med);
> +    return avail_req;
> +}

With this it is already clear that this can't be a solution: By the time the
caller evaluates the return value, the ring may have become full.
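
(To illustrate: a race-free variant would have to reserve the slot while still holding the ring lock. The following is a hypothetical sketch only, assuming the target_producers bookkeeping that mem_event_wait_try_grab() already uses.)

/* Sketch: reserve a slot atomically instead of reporting availability
 * that may be stale by the time the caller acts on it. */
static int mem_event_try_claim_slot(struct mem_event_domain *med)
{
    int rc = -EBUSY;

    mem_event_ring_lock(med);
    if ( mem_event_ring_available(med) > 0 )
    {
        med->target_producers++;   /* slot now reserved for this vCPU */
        rc = 0;
    }
    mem_event_ring_unlock(med);

    return rc;
}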

> The other approach I took was to detect if the write violation was emanating 
> from Xen when the ring was full and sending that fact down to wait_event(). 
> If the context was Xen, then instead of calling wait(), wait_event() would 
> call process_pending_softirqs(). This is how wait_event() would look with 
> this change:
> ...

A common code change like this can only ever be acceptable if it
addresses a common problem, but definitely not to work around a
non-understood problem.

I can only repeat what I said above: You first need to understand
why the ring is (or appears to be) full. But even with that clarified
you still need to have a proper solution for the case where the ring
might end up being full for valid reasons. And as also said earlier, I
am of the opinion that the behavior regarding Xen accesses would
ideally not differ between HVM, PVH, and PV.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-06  7:59             ` Jan Beulich
@ 2014-05-06 19:01               ` Aravindh Puthiyaparambil (aravindp)
  2014-05-07  6:36                 ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-06 19:01 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, TimDeegan(tim@xen.org)

>> It looks like the nested attempts to wait() happens only when the ring
>> is full. The flow is
>> mem_event_claim_slot() ->
>> 	mem_event_wait_slot() ->
>> 		 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)
>>
>> wait_event() macro looks like this:
>> do {                                            \
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )    \
>>         break;                                  \
>>     for ( ; ; ) {                               \
>>         prepare_to_wait(&med->wq);                   \
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )    \
>>             break;                              \
>>         wait();                                 \
>>     }                                           \
>>     finish_wait(&med->wq);                           \
>> } while (0)
>>
>> In the case where the ring is full, wait() gets called and the cpu
>> gets scheduled away. But since it is in middle of a pagefault, when it
>> runs again it ends up in handle_exception_saved and the same pagefault is tried again.
>> But since finish_wait() never ends up being called wqv->esp never
>> becomes 0 and hence the assert fires on the next go around. Am I on the right track?
>
>Looks like so, with the caveat that it's then unclear to me why the ring would
>be full in the first place - shouldn't it get drained of earlier requests quite
>quickly? But anyway, even if this didn't occur immediately, but only rarely
>after many hours of running, it would still need taking care of.

This happens only when the access listener first attaches to a PV domain for write events. It occurs in the window after mem_access has been enabled but before the listener is ready to handle events. I have not seen it happen once the listener starts handling events, and I have not run into this issue with execute violations. 

So I tried a couple of things. One was to check whether the ring was full before sending a write violation emanating from Xen. If the ring was full, then instead of calling mem_access_send_req(), which would end up calling schedule(), I allowed the PTE to be created with the same default permissions, which causes the pagefault to be retried, indirectly giving the listener more time to handle events. This way the write would still be policed. This is what the change looks like on top of the original patch:

diff --git a/xen/arch/x86/mm/mem_event.c b/xen/arch/x86/mm/mem_event.c
index 9b45504..ea84bd8 100644
--- a/xen/arch/x86/mm/mem_event.c
+++ b/xen/arch/x86/mm/mem_event.c
@@ -121,6 +121,15 @@ static unsigned int mem_event_ring_available(struct mem_event_domain *med)
     return avail_req;
 }
 
+unsigned int mem_event_check_ring_availability(struct mem_event_domain *med)
+{
+    int avail_req = RING_FREE_REQUESTS(&med->front_ring);
+
+    mem_event_ring_lock(med);
+    avail_req = mem_event_ring_available(med);
+    mem_event_ring_unlock(med);
+    return avail_req;
+}
 /*
  * mem_event_wake_blocked() will wakeup vcpus waiting for room in the
  * ring. These vCPUs were paused on their way out after placing an event,
diff --git a/xen/arch/x86/mm/shadow/multi.c b/xen/arch/x86/mm/shadow/multi.c
index aee061c..a827ca1 100644
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3066,20 +3065,16 @@ static int sh_page_fault(struct vcpu *v,
             }
 
             /*
-             * Do not police writes to guest memory emanating from the Xen
-             * kernel. Trying to do so will cause the same pagefault to occur
-             * over and over again with an event being sent to the access
-             * listener for each fault. If the access listener's vcpu is not
-             * scheduled during this time, the violation is never resolved and
-             * will eventually end with the host crashing.
+             * Xen is the source of the write violation and, if it is allowed
+             * to go through when the ring is full, it will end up in a nested
+             * wait() and the host will crash. Instead of passing the violation
+             * to the listener, we let the pagefault be retried, indirectly
+             * giving the listener more time to handle the existing violations.
              */
             if ( (violation && access_w) &&
-                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
-            {
-                violation = 0;
-                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
-                                    p2m_ram_rw, p2m_access_rw);
-            }
+                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) &&
+                 (!mem_event_check_ring_availability(&d->mem_event->access)) )
+                    violation = 0;
 
             if ( violation )
             {

The other approach I took was to detect whether the write violation was emanating from Xen while the ring was full, and to send that fact down to wait_event(). If the context was Xen, then instead of calling wait(), wait_event() would call process_pending_softirqs(). This is how wait_event() would look with this change:

<In sh_page_fault()>

            if ( (violation && access_w) &&
                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) &&
                 (!mem_event_check_ring_availability(&d->mem_event->access)) )
                xen_ctxt = 1;

<In wait.h>
#define wait_event(wq, condition, xen_ctxt)     \
do {                                            \
    if ( condition )                            \
        break;                                  \
    for ( ; ; ) {                               \
        if ( !xen_ctxt )                        \
            prepare_to_wait(&wq);               \
        if ( condition )                        \
            break;                              \
        if ( !xen_ctxt )                        \
            wait();                             \
        else                                    \
            process_pending_softirqs();         \
    }                                           \
    if ( !xen_ctxt )                            \
        finish_wait(&wq);                       \
} while (0)

Is either of the above solutions viable?

Thanks,
Aravindh

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-05 19:27           ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-06  7:59             ` Jan Beulich
  2014-05-06 19:01               ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2014-05-06  7:59 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, TimDeegan(tim@xen.org)

>>> On 05.05.14 at 21:27, <aravindp@cisco.com> wrote:
> It looks like the nested attempts to wait() happens only when the ring is 
> full. The flow is
> mem_event_claim_slot() -> 
> 	mem_event_wait_slot() ->
> 		 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)
> 
> wait_event() macro looks like this:
> do {                                            \
>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )    \
>         break;                                  \
>     for ( ; ; ) {                               \
>         prepare_to_wait(&med->wq);                   \
>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )    \
>             break;                              \
>         wait();                                 \
>     }                                           \
>     finish_wait(&med->wq);                           \
> } while (0)
> 
> In the case where the ring is full, wait() gets called and the cpu gets 
> scheduled away. But since it is in middle of a pagefault, when it runs again 
> it ends up in handle_exception_saved and the same pagefault is tried again. 
> But since finish_wait() never ends up being called wqv->esp never becomes 0 and 
> hence the assert fires on the next go around. Am I on the right track?

Looks like so, with the caveat that it's then unclear to me why the
ring would be full in the first place - shouldn't it get drained of earlier
requests quite quickly? But anyway, even if this didn't occur
immediately, but only rarely after many hours of running, it would
still need taking care of.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-05  6:21         ` Jan Beulich
@ 2014-05-05 19:27           ` Aravindh Puthiyaparambil (aravindp)
  2014-05-06  7:59             ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-05 19:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, TimDeegan(tim@xen.org)

>>>>>>> On adding some debugging, I discovered that it happens after mem_access
>>>>>>> is enabled but xen-access has not started handling events. After
>>>>>>> comparing the stack trace and gla in question There are multiple write
>>>>>>> faults to the runstate_guest(v), each causing an event to be sent to
>>>>>>> xen-access. Since the listener is not handling events yet, the fault
>>>>>>> continues to occur. I am not sure why the listener does not get a chance
>>>>>>> to run.  I also do not follow is that why there are multiple faults as
>>>>>>> the vcpu should have been paused after the first event was sent to
>>>>>>> xen-access and only be resumed after violation has been resolved and when
>>>>>>> it calls xc_access_resume(), which ends up unpausing the vcpu. Or is this
>>>>>>> occurring because runstate_guest(v).p is being accessed from Xen?
>>>>>>
>>>>>>The runstate changes (and hence needs to get written) as a side
>>>>>>effect of pausing the guest (as can be seen from the stack trace).
>>>>>>The first question that needs clarification (for me at least, since
>>>>>>I don't know much about the access stuff for HVM) is how the same
>>>>>>situation gets handled
>>>>>>there: Do Xen writes to HVM guest memory get intercepted? Other
>>>>>>than for PV, they're not going through the same page tables, so
>>>>>>special precautions would be needed to filter them. Quite obviously
>>>>>>(I think) if they're not being filtered for HVM, then they shouldn't be for PV.
>>>>>
>>>>> AFAIK, they are not being filtered for HVM. I brought up a HVM
>>>>> domain and printed out the runstate_guest area value when it is
>>>>> registered. I then ran the xen-access test program to monitor for writes.
>>>>> Interestingly I never saw a GLA that matched the runstate_guest area.
>>>>> This is not the case for a PV domain as it is one of the first
>>>>> violations
>> the
>>>xen-
>>>>access test program sees.
>>>>> Is this an EPT vs regular page table difference as in one case it is shared?
>>>>
>>>>Yes, as said in my earlier reply. And when hypervisor writes aren't
>>>>subject to filtering in HVM, you probably want/need to make things
>>>>behave that way for PV too (albeit it may not be trivial, as you
>>>>clearly don't want to start emulating hypervisor accesses).
>>>
>>>The main issue I am seeing is the same violation being sent multiple
>>>times to the mem_event queue and the listener not getting scheduled to
>>>run. I would think that the same issue would happen with a HVM guest
>>>but in that case I don't even see a violation of the runstate_guest
>>>area. I guess I should try figuring out why wqv->esp becomes 0 instead
>>>of trying to stop multiple events for the same violations being sent.
>>>If you have a suggestion in the area I should be looking at, it would be welcome.
>>
>> I meant to say why wqv->esp becomes non-zero in the message above.
>
>That's pretty obvious - this is a per-vCPU field, so there can't be two nested
>attempts to wait, yet the call stack you got made pretty clear that this is what
>is happening (and what is expected to be happening - the vCPU gets a
>wakeup, needs to write the change runstate, faults, wants the mem-event to
>be delivered, hence needs to be put to sleep, which in turn requires another
>change to the runstate).

It looks like the nested attempt to wait() happens only when the ring is full. The flow is:
mem_event_claim_slot() -> 
	mem_event_wait_slot() ->
		 wait_event(mem_event_wait_try_grab(med, &rc) != -EBUSY)

wait_event() macro looks like this:
do {                                            \
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )                            \
        break;                                  \
    for ( ; ; ) {                               \
        prepare_to_wait(&med->wq);                   \
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )                        \
            break;                              \
        wait();                                 \
    }                                           \
    finish_wait(&med->wq);                           \
} while (0)

In the case where the ring is full, wait() gets called and the CPU gets scheduled away. But since it is in the middle of a pagefault, when it runs again it ends up in handle_exception_saved and the same pagefault is tried again. And since finish_wait() never ends up being called, wqv->esp never becomes 0, and hence the assert fires on the next go-around. Am I on the right track?
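
(That is exactly what the assertion guards: each vCPU has a single wait-queue context. Paraphrasing xen/common/wait.c from memory, not verbatim:)

/* wqv->esp records the saved stack of a parked vCPU.  It is set when the
 * vCPU parks itself and cleared only by finish_wait(), so entering
 * prepare_to_wait() twice without an intervening finish_wait(), as in the
 * nested-fault case above, trips the assertion seen in the crash. */
static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
{
    ASSERT(wqv->esp == 0);   /* wait.c:133 in this tree */
    /* ... save the hypervisor stack into wqv->stack, record wqv->esp,
     * and schedule away ... */
}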

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-02 20:52       ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-05  6:21         ` Jan Beulich
  2014-05-05 19:27           ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2014-05-05  6:21 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, TimDeegan(tim@xen.org)

>>> On 02.05.14 at 22:52, <aravindp@cisco.com> wrote:
>>>>>> On adding some debugging, I discovered that it happens after mem_access
>>>>>> is enabled but xen-access has not started handling events. After
>>>>>> comparing the stack trace and gla in question There are multiple write
>>>>>> faults to the runstate_guest(v), each causing an event to be sent to
>>>>>> xen-access. Since the listener is not handling events yet, the fault
>>>>>> continues to occur. I am not sure why the listener does not get a chance
>>>>>> to run.  I also do not follow is that why there are multiple faults as
>>>>>> the vcpu should have been paused after the first event was sent to
>>>>>> xen-access and only be resumed after violation has been resolved and
>>>>>> when it calls xc_access_resume(), which ends up unpausing the vcpu. Or
>>>>>> is this occurring because runstate_guest(v).p is being accessed from Xen?
>>>>>
>>>>>The runstate changes (and hence needs to get written) as a side effect
>>>>>of pausing the guest (as can be seen from the stack trace). The first
>>>>>question that needs clarification (for me at least, since I don't know
>>>>>much about the access stuff for HVM) is how the same situation gets
>>>>>handled
>>>>>there: Do Xen writes to HVM guest memory get intercepted? Other than
>>>>>for PV, they're not going through the same page tables, so special
>>>>>precautions would be needed to filter them. Quite obviously (I think)
>>>>>if they're not being filtered for HVM, then they shouldn't be for PV.
>>>>
>>>> AFAIK, they are not being filtered for HVM. I brought up a HVM domain
>>>> and printed out the runstate_guest area value when it is registered. I
>>>> then ran the xen-access test program to monitor for writes.
>>>> Interestingly I never saw a GLA that matched the runstate_guest area.
>>>> This is not the case for a PV domain as it is one of the first violations the xen-access test program sees.
>>>> Is this an EPT vs regular page table difference as in one case it is shared?
>>>
>>>Yes, as said in my earlier reply. And when hypervisor writes aren't subject to
>>>filtering in HVM, you probably want/need to make things behave that way for
>>>PV too (albeit it may not be trivial, as you clearly don't want to start
>>>emulating hypervisor accesses).
>>
>>The main issue I am seeing is the same violation being sent multiple times to
>>the mem_event queue and the listener not getting scheduled to run. I would
>>think that the same issue would happen with a HVM guest but in that case I
>>don't even see a violation of the runstate_guest area. I guess I should try
>>figuring out why wqv->esp becomes 0 instead of trying to stop multiple
>>events for the same violations being sent. If you have a suggestion in the
>>area I should be looking at, it would be welcome.
> 
> I meant to say why wqv->esp becomes non-zero in the message above.

That's pretty obvious - this is a per-vCPU field, so there can't be two
nested attempts to wait, yet the call stack you got made pretty clear
that this is what is happening (and what is expected to be happening -
the vCPU gets a wakeup, needs to write the change runstate, faults,
wants the mem-event to be delivered, hence needs to be put to sleep,
which in turn requires another change to the runstate).

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-02  9:03     ` Jan Beulich
  2014-05-02 17:00       ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-02 20:52       ` Aravindh Puthiyaparambil (aravindp)
  2014-05-05  6:21         ` Jan Beulich
  1 sibling, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-02 20:52 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, TimDeegan(tim@xen.org)

>>>>> On adding some debugging, I discovered that it happens after mem_access is
>>>>> enabled but xen-access has not started handling events. After comparing the
>>>>> stack trace and gla in question There are multiple write faults to the
>>>>> runstate_guest(v), each causing an event to be sent to xen-access. Since the
>>>>> listener is not handling events yet, the fault continues to occur. I am not
>>>>> sure why the listener does not get a chance to run.  I also do not follow is
>>>>> that why there are multiple faults as the vcpu should have been paused after
>>>>> the first event was sent to xen-access and only be resumed after violation
>>>>> has been resolved and when it calls xc_access_resume(), which ends up
>>>>> unpausing the vcpu. Or is this occurring because runstate_guest(v).p is
>>>>> being accessed from Xen?
>>>>
>>>>The runstate changes (and hence needs to get written) as a side effect
>>>>of pausing the guest (as can be seen from the stack trace). The first
>>>>question that needs clarification (for me at least, since I don't know
>>>>much about the access stuff for HVM) is how the same situation gets
>>>>handled
>>>>there: Do Xen writes to HVM guest memory get intercepted? Other than
>>>>for PV, they're not going through the same page tables, so special
>>>>precautions would be needed to filter them. Quite obviously (I think)
>>>>if they're not being filtered for HVM, then they shouldn't be for PV.
>>>
>>> AFAIK, they are not being filtered for HVM. I brought up a HVM domain
>>> and printed out the runstate_guest area value when it is registered. I
>>> then ran the xen-access test program to monitor for writes.
>>> Interestingly I never saw a GLA that matched the runstate_guest area.
>>> This is not the case for a PV domain as it is one of the first violations the xen-access test program sees.
>>> Is this an EPT vs regular page table difference as in one case it is shared?
>>
>>Yes, as said in my earlier reply. And when hypervisor writes aren't subject to
>>filtering in HVM, you probably want/need to make things behave that way for
>>PV too (albeit it may not be trivial, as you clearly don't want to start
>>emulating hypervisor accesses).
>
>The main issue I am seeing is the same violation being sent multiple times to
>the mem_event queue and the listener not getting scheduled to run. I would
>think that the same issue would happen with a HVM guest but in that case I
>don't even see a violation of the runstate_guest area. I guess I should try
>figuring out why wqv->esp becomes 0 instead of trying to stop multiple
>events for the same violations being sent. If you have a suggestion in the area
>I should be looking at, it would be welcome.

I meant to say why wqv->esp becomes non-zero in the message above.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-05-02  9:03     ` Jan Beulich
@ 2014-05-02 17:00       ` Aravindh Puthiyaparambil (aravindp)
  2014-05-02 20:52       ` Aravindh Puthiyaparambil (aravindp)
  1 sibling, 0 replies; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-05-02 17:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, TimDeegan(tim@xen.org)

>>>> On adding some debugging, I discovered that it happens after mem_access is
>>>> enabled but xen-access has not started handling events. After comparing the
>>>> stack trace and gla in question There are multiple write faults to the
>>>> runstate_guest(v), each causing an event to be sent to xen-access. Since the
>>>> listener is not handling events yet, the fault continues to occur. I am not
>>>> sure why the listener does not get a chance to run.  I also do not follow is
>>>> that why there are multiple faults as the vcpu should have been paused after
>>>> the first event was sent to xen-access and only be resumed after violation
>>>> has been resolved and when it calls xc_access_resume(), which ends up
>>>> unpausing the vcpu. Or is this occurring because runstate_guest(v).p is
>>>> being accessed from Xen?
>>>
>>>The runstate changes (and hence needs to get written) as a side effect
>>>of pausing the guest (as can be seen from the stack trace). The first
>>>question that needs clarification (for me at least, since I don't know
>>>much about the access stuff for HVM) is how the same situation gets
>>>handled
>>>there: Do Xen writes to HVM guest memory get intercepted? Other than
>>>for PV, they're not going through the same page tables, so special
>>>precautions would be needed to filter them. Quite obviously (I think)
>>>if they're not being filtered for HVM, then they shouldn't be for PV.
>>
>> AFAIK, they are not being filtered for HVM. I brought up a HVM domain
>> and printed out the runstate_guest area value when it is registered. I
>> then ran the xen-access test program to monitor for writes.
>> Interestingly I never saw a GLA that matched the runstate_guest area.
>> This is not the case for a PV domain as it is one of the first violations the xen-access test program sees.
>> Is this an EPT vs regular page table difference as in one case it is shared?
>
>Yes, as said in my earlier reply. And when hypervisor writes aren't subject to
>filtering in HVM, you probably want/need to make things behave that way for
>PV too (albeit it may not be trivial, as you clearly don't want to start emulating
>hypervisor accesses).

The main issue I am seeing is the same violation being sent multiple times to the mem_event queue and the listener not getting scheduled to run. I would think that the same issue would happen with a HVM guest but in that case I don't even see a violation of the runstate_guest area. I guess I should try figuring out why wqv->esp becomes 0 instead of trying to stop multiple events for the same violations being sent. If you have a suggestion in the area I should be looking at, it would be welcome.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-04-30 23:00   ` Aravindh Puthiyaparambil (aravindp)
@ 2014-05-02  9:03     ` Jan Beulich
  2014-05-02 17:00       ` Aravindh Puthiyaparambil (aravindp)
  2014-05-02 20:52       ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Beulich @ 2014-05-02  9:03 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, TimDeegan(tim@xen.org)

>>> On 01.05.14 at 01:00, <aravindp@cisco.com> wrote:
>>> On adding some debugging, I discovered that it happens after mem_access is
>>> enabled but xen-access has not started handling events. After comparing the
>>> stack trace and gla in question There are multiple write faults to the
>>> runstate_guest(v), each causing an event to be sent to xen-access. Since the
>>> listener is not handling events yet, the fault continues to occur. I am not
>>> sure why the listener does not get a chance to run.  I also do not follow is
>>> that why there are multiple faults as the vcpu should have been paused after
>>> the first event was sent to xen-access and only be resumed after violation
>>> has been resolved and when it calls xc_access_resume(), which ends up
>>> unpausing the vcpu. Or is this occurring because runstate_guest(v).p is
>>> being accessed from Xen?
>>
>>The runstate changes (and hence needs to get written) as a side effect
>>of pausing the guest (as can be seen from the stack trace). The first
>>question that needs clarification (for me at least, since I don't know much
>>about the access stuff for HVM) is how the same situation gets handled
>>there: Do Xen writes to HVM guest memory get intercepted? Other than
>>for PV, they're not going through the same page tables, so special
>>precautions would be needed to filter them. Quite obviously (I think) if
>>they're not being filtered for HVM, then they shouldn't be for PV.
> 
> AFAIK, they are not being filtered for HVM. I brought up a HVM domain and 
> printed out the runstate_guest area value when it is registered. I then ran 
> the xen-access test program to monitor for writes. Interestingly I never saw a 
> GLA that matched the runstate_guest area. This is not the case for a PV 
> domain as it is one of the first violations the xen-access test program sees. 
> Is this an EPT vs regular page table difference as in one case it is shared?

Yes, as said in my earlier reply. And when hypervisor writes aren't
subject to filtering in HVM, you probably want/need to make things
behave that way for PV too (albeit it may not be trivial, as you
clearly don't want to start emulating hypervisor accesses).

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-04-30  7:08 ` Jan Beulich
@ 2014-04-30 23:00   ` Aravindh Puthiyaparambil (aravindp)
  2014-05-02  9:03     ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-04-30 23:00 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Tim Deegan(tim@xen.org)

>> On adding some debugging, I discovered that it happens after mem_access is
>> enabled but xen-access has not started handling events. After comparing the
>> stack trace and gla in question There are multiple write faults to the
>> runstate_guest(v), each causing an event to be sent to xen-access. Since the
>> listener is not handling events yet, the fault continues to occur. I am not
>> sure why the listener does not get a chance to run.  I also do not follow is
>> that why there are multiple faults as the vcpu should have been paused after
>> the first event was sent to xen-access and only be resumed after violation
>> has been resolved and when it calls xc_access_resume(), which ends up
>> unpausing the vcpu. Or is this occurring because runstate_guest(v).p is
>> being accessed from Xen?
>
>The runstate changes (and hence needs to get written) as a side effect
>of pausing the guest (as can be seen from the stack trace). The first
>question that needs clarification (for me at least, since I don't know much
>about the access stuff for HVM) is how the same situation gets handled
>there: Do Xen writes to HVM guest memory get intercepted? Other than
>for PV, they're not going through the same page tables, so special
>precautions would be needed to filter them. Quite obviously (I think) if
>they're not being filtered for HVM, then they shouldn't be for PV.

AFAIK, they are not being filtered for HVM. I brought up a HVM domain and printed out the runstate_guest area value when it is registered. I then ran the xen-access test program to monitor for writes. Interestingly I never saw a GLA that matched the runstate_guest area. This is not the case for a PV domain as it is one of the first violations the xen-access test program sees. Is this an EPT vs regular page table difference as in one case it is shared?

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Issue policing writes from Xen to PV domain memory
  2014-04-30  0:38 Aravindh Puthiyaparambil (aravindp)
@ 2014-04-30  7:08 ` Jan Beulich
  2014-04-30 23:00   ` Aravindh Puthiyaparambil (aravindp)
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2014-04-30  7:08 UTC (permalink / raw)
  To: Aravindh Puthiyaparambil (aravindp); +Cc: xen-devel, Tim Deegan(tim@xen.org)

>>> On 30.04.14 at 02:38, <aravindp@cisco.com> wrote:
> On adding some debugging, I discovered that it happens after mem_access is 
> enabled but xen-access has not started handling events. After comparing the 
> stack trace and gla in question There are multiple write faults to the 
> runstate_guest(v), each causing an event to be sent to xen-access. Since the 
> listener is not handling events yet, the fault continues to occur. I am not 
> sure why the listener does not get a chance to run.  I also do not follow is 
> that why there are multiple faults as the vcpu should have been paused after 
> the first event was sent to xen-access and only be resumed after violation has 
> been resolved and when it calls xc_access_resume(), which ends up unpausing 
> the vcpu. Or is this occurring because runstate_guest(v).p is being accessed 
> from Xen? 

The runstate changes (and hence needs to get written) as a side effect
of pausing the guest (as can be seen from the stack trace). The first
question that needs clarification (for me at least, since I don't know much
about the access stuff for HVM) is how the same situation gets handled
there: Do Xen writes to HVM guest memory get intercepted? Other than
for PV, they're not going through the same page tables, so special
precautions would be needed to filter them. Quite obviously (I think) if
they're not being filtered for HVM, then they shouldn't be for PV.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Issue policing writes from Xen to PV domain memory
@ 2014-04-30  0:38 Aravindh Puthiyaparambil (aravindp)
  2014-04-30  7:08 ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Aravindh Puthiyaparambil (aravindp) @ 2014-04-30  0:38 UTC (permalink / raw)
  To: xen-devel; +Cc: Tim Deegan (tim@xen.org), Jan Beulich (JBeulich@suse.com)

This is an issue I am running into as part of my PV mem_access work. 
http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg03802.html

The code in question is part of this patch:
http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg03805.html

I am running into an issue when trying to police writes from Xen to the PV domain's memory, such as the runstate_guest area. If I run the xen-access test program to get write events for a PV domain, it causes the host to crash immediately after enabling mem_access, with the following trace:

(XEN) Assertion 'wqv->esp == 0' failed at wait.c:133
(XEN) ----[ Xen-4.5-unstable  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82d08012ffa6>] prepare_to_wait+0x61/0x243
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: 0000000000000100   rbx: 0000000000000100   rcx: ffff82d0802f99cb
(XEN) rdx: ffff82d0802eff00   rsi: 0000000000000000   rdi: ffff83003cc7a000
(XEN) rbp: ffff83003ffe74c8   rsp: ffff83003ffe7498   r8:  000000000001ffff
(XEN) r9:  ffff83003fec0000   r10: 0000000000000030   r11: 0000000000000010
(XEN) r12: ffff83003cc7cb40   r13: ffff830032f44920   r14: ffff83003ffe7f18
(XEN) r15: ffff83003cc7a000   cr0: 000000008005003b   cr4: 00000000000426b0
(XEN) cr3: 0000000012fed000   cr2: ffff88000fc0bd80
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff83003ffe7498:
(XEN)    ffff83003ffe74e4 ffff830032f448e0 ffff830032f44920 ffff830032f45000
(XEN)    ffff83003ffe77a8 0000000000000006 ffff83003ffe7508 ffff82d0801f0ff1
(XEN)    000001003ffe7528 fffffff080243060 00000000000134cb ffff830032f45000
(XEN)    ffff830012ed69e0 ffff830032f45000 ffff83003ffe7528 ffff82d0801f5e78
(XEN)    0000000000000000 ffff83003cc7a000 ffff83003ffe7758 ffff82d08021ae90
(XEN)    ffff830000000005 00000000000333d5 000000000000000b 0000000000000058
(XEN)    ffff82d080319f78 00000000000003f0 0000000000000000 0000000000000880
(XEN)    383283003ffe75d8 0000000000000108 353483003ffe75e8 ffff82d080319f68
(XEN)    000ffff88000fc0b 0000000000000200 00000000000134cb ffff88000fc0bd80
(XEN)    0000000000000201 ffff82d000000000 00000000000134cb 00000000000134cb
(XEN)    623783003ffe7710 ffff82d080313330 ffff83003ffe76f8 ffff82d08012fb27
(XEN)    0030830000000000 ffff82d080363231 ffff830012ed69e0 0000000000000005
(XEN)    0000000000000400 ffff82d0802f99cc 0000000000000001 ffff82d0802f9d7f
(XEN)    ffff83003ffe7770 ffff82d08027358e ffff83003ffe7758 ffff82d0802f9980
(XEN)    0000000000000000 ffff82d0802f9d7f 383583003ffe77a0 ffff82d080333966
(XEN)    ffff83003ffe7788 ffff82d08012fb27 0000000500000000 ffff82d080273590
(XEN)    ffff83003ffe76d8 ffff82d08012f02a 0000000000000400 ffff82d0802f99c8
(XEN)    0000000000000002 ffff82d0802f9d7f ffff83003ffe7800 ffff82d080271e6c
(XEN)    ffff83003ffe77e8 ffff88000fc0bd80 00000000333bd067 00000000333b9067
(XEN)    0000000039f8b067 80100000134cb067 00000000000148fd 00000000000333bd
(XEN) Xen call trace:
(XEN)    [<ffff82d08012ffa6>] prepare_to_wait+0x61/0x243
(XEN)    [<ffff82d0801f0ff1>] __mem_event_claim_slot+0x56/0xb2
(XEN)    [<ffff82d0801f5e78>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ae90>] sh_page_fault__guest_4+0x1e50/0x2690
(XEN)    [<ffff82d08018dce1>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022bc3d>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018defa>] __copy_to_user_ll+0x2a/0x37
(XEN)    [<ffff82d08015fb0f>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fb29>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d08015fc03>] context_switch+0xb2/0xf03
(XEN)    [<ffff82d080125c23>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d080128491>] wait+0x9/0xb
(XEN)    [<ffff82d0801f1007>] __mem_event_claim_slot+0x6c/0xb2
(XEN)    [<ffff82d0801f5e78>] mem_access_send_req+0x2e/0x5a
(XEN)    [<ffff82d08021ae90>] sh_page_fault__guest_4+0x1e50/0x2690
(XEN)    [<ffff82d08018dce1>] do_page_fault+0x386/0x532
(XEN)    [<ffff82d08022bc3d>] handle_exception_saved+0x2e/0x6c
(XEN)    [<ffff82d08018defa>] __copy_to_user_ll+0x2a/0x37
(XEN)    [<ffff82d08015fb0f>] update_runstate_area+0xfd/0x106
(XEN)    [<ffff82d08015fb29>] _update_runstate_area+0x11/0x39
(XEN)    [<ffff82d080160a41>] context_switch+0xef0/0xf03
(XEN)    [<ffff82d080125c23>] schedule+0x5c8/0x5d7
(XEN)    [<ffff82d080128a49>] __do_softirq+0x81/0x8c
(XEN)    [<ffff82d080128aa2>] do_softirq+0x13/0x15
(XEN)    [<ffff82d08015cf45>] idle_loop+0x66/0x76
(XEN) 

On adding some debugging, I discovered that it happens after mem_access is enabled but xen-access has not started handling events. After comparing the stack trace and the gla in question, I see multiple write faults to runstate_guest(v), each causing an event to be sent to xen-access. Since the listener is not handling events yet, the fault continues to occur. I am not sure why the listener does not get a chance to run. What I also do not follow is why there are multiple faults, as the vcpu should have been paused after the first event was sent to xen-access and only resumed after the violation has been resolved, when the listener calls xc_access_resume(), which ends up unpausing the vcpu. Or is this occurring because runstate_guest(v).p is being accessed from Xen? 

My "fix" for this is to not police writes if they are originating from the Xen kernel. I should also mention that while debugging before the hacks went in, if I added too much debug to Xen, it would slow things down allowing xen-access to start and things would proceed smoothly from there on. In those cases I would see multiple consecutive events for runstate_guest area but since they get resolved "quick" enough, the crash does not occur.

Please advise as to how I should proceed in solving this issue.

Thanks,
Aravindh

^ permalink raw reply	[flat|nested] 28+ messages in thread
