From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Aravindh Puthiyaparambil (aravindp)" <aravindp@cisco.com>
Subject: Re: Issue policing writes from Xen to PV domain memory
Date: Fri, 9 May 2014 17:48:34 +0000
Message-ID: <97A500D504438F4ABC02EBA81613CC63317ED7CE@xmb-aln-x02.cisco.com>
References: <97A500D504438F4ABC02EBA81613CC63317E7BBC@xmb-aln-x02.cisco.com>
	<5360BDA7020000780000D9F4@g0-1-119.ukb-fw-asa.gns.novell.com>
	<97A500D504438F4ABC02EBA81613CC63317E95B0@xmb-aln-x02.cisco.com>
	<53637B80020000780000E5F1@mail.emea.novell.com>
	<97A500D504438F4ABC02EBA81613CC63317EA268@xmb-aln-x02.cisco.com>
	<53674A17020000780000ED37@mail.emea.novell.com>
	<97A500D504438F4ABC02EBA81613CC63317EA99C@xmb-aln-x02.cisco.com>
	<5368B299020000780000F2FD@mail.emea.novell.com>
	<97A500D504438F4ABC02EBA81613CC63317EBC1B@xmb-aln-x02.cisco.com>
	<5369F079020000780000FAA8@mail.emea.novell.com>
	<97A500D504438F4ABC02EBA81613CC63317ED4A5@xmb-aln-x02.cisco.com>
	<536C96B20200007800010BD7@mail.emea.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta14.messagelabs.com ([193.109.254.103])
	by lists.xen.org with esmtp (Exim 4.72)
	(envelope-from <aravindp@cisco.com>) id 1WiouV-000133-Ed
	for xen-devel@lists.xenproject.org; Fri, 09 May 2014 17:48:39 +0000
In-Reply-To: <536C96B20200007800010BD7@mail.emea.novell.com>
Content-Language: en-US
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich <JBeulich@suse.com>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>, "TimDeegan(tim@xen.org)" <tim@xen.org>
List-Id: xen-devel@lists.xenproject.org

>> The above sequence of events would not cause an EPT violation (I
>> realize this can happen only in the guest context) or pagefault (The
>> page is present and marked writable in the guest). All that would
>> happen is an event to be sent to the listener from __hvm_copy() and no
>cascading faults will occur.
>
>I have to admit that I'm unable to spot where such an event gets sent.

I had preceded this by mentioning "Now take the similar scenario in the hypothetical HVM + EPT case where we are policing Xen writes to guest memory." The code would look something like this if it was implemented. Please note this was just a quick hacky write up :-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index da220bf..8ce99d5 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -2750,6 +2750,65 @@ static enum hvm_copy_result __hvm_copy(
             return HVMCOPY_unhandleable;
         }
 
+        if ( unlikely(mem_event_check_ring(&curr->domain->mem_event->access)) )
+        {
+            struct p2m_domain *p2m = p2m_get_hostp2m(curr->domain);
+            mfn_t gmfn;
+            p2m_type_t _p2mt;
+            p2m_access_t p2ma;
+            bool_t violation = 0;
+            mem_event_request_t *req_ptr = NULL;
+
+            gmfn = p2m->get_entry(p2m, gfn, &_p2mt, &p2ma, 0, NULL);;
+
+            /* If the access is against the permissions, then send to mem_event */
+            switch (p2ma)
+            {
+            case p2m_access_n:
+            case p2m_access_n2rwx:
+            case p2m_access_x:
+                violation = 1;
+                break;
+
+            case p2m_access_w:
+            case p2m_access_wx:
+                if ( flags & HVMCOPY_from_guest )
+                    violation = 1;
+                break;
+
+            case p2m_access_r:
+            case p2m_access_rx:
+            case p2m_access_rx2rw:
+                if ( flags & HVMCOPY_to_guest )
+                    violation = 1;
+                break;
+
+            case p2m_access_rw:
+            case p2m_access_rwx:
+                break;
+            }
+
+            if ( violation )
+            {
+                paddr_t gpa = (gfn << PAGE_SHIFT) +
+                              (flags & HVMCOPY_virt ?
+                               (addr & ((1 << PAGE_SHIFT) - 1)) : 0);
+
+                if ( !p2m_mem_access_check(gpa, (flags & HVMCOPY_virt ? 1 : 0),
+                                           addr, 0, 1, 0, &req_ptr) )
+                {
+                    if ( unlikely(req_ptr != NULL) )
+                    {
+                        mem_access_send_req(curr->domain, req_ptr);
+                        xfree(req_ptr);
+                    }
+                    put_page(page);
+                    return HVMCOPY_unhandleable;
+                }
+            }
+        }
+
         p = (char *)__map_domain_page(page) + (addr & ~PAGE_MASK);
 
         if ( flags & HVMCOPY_to_guest )

>What is quite clear is that you don't want multiple nested runstate area
>updates to occur in the first place, i.e. you need to properly deal with the
>_first_ page fault occurring instead of waiting until multiple such events pile
>up. And I'm afraid doing so will require some (perhaps gross) hackery...

OK, I will take a stab at this. 

>> Looking at what the solution for the ring being full in the PV case
>> whether we are policing Xen writes or not, calling wait() will not
>> work due to the scenario I had mentioned a while back and is shown above
>in the stack trace.
>> I am repeating that flow here
>> mem_event_claim_slot() ->
>> 	mem_event_wait_slot() ->
>> 		 wait_event(mem_event_wait_try_grab(med, &rc) != -
>EBUSY)
>>
>> wait_event() macro looks like this:
>> do {
>>     if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>         break;
>>     for ( ; ; ) {
>>         prepare_to_wait(&med->wq);
>>         if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )
>>             break;
>>         wait();
>>     }
>>     finish_wait(&med->wq);
>> } while (0)
>>
>> In the case where the ring is full, wait() gets called and the cpu
>> gets scheduled away. But since it is in middle of a pagefault, when it
>> runs again it ends up in handle_exception_saved and the same pagefault is
>tried again.
>> But since finish_wait() never ends up being called wqv->esp never
>> becomes 0 and hence the assert fires on the next go around. So I think
>> we should be calling
>> process_pending_softirqs() instead of wait() for PV domains.
>
>That would effectively be a spin wait then, which is surely not the right thing.
>But I don't follow your explanation above anyway - when coming back from
>wait(), the state is the same as the original one, so the page fault handling
>continues, it's not being retried.

Let me step through this. The pagefault for runstate occurs and wait_event() gets called and the ring is full. 

wait_event():
do { 
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
        break; 
    for ( ; ; ) { 
        prepare_to_wait(&med->wq); 
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )   ==> This will return EBUSY
            break; 
        wait();                                                                                                   ==> Write to runstate area occurs...
    } 
    finish_wait(&med->wq);
} while (0)

...now this write to runstate will again cause a pagefault and we will end up in wait_event() again. The previous attempt would have called prepare_to_wait() but finish_wait() was not called. finish_wait()->__finish_wait() is where wqv->esp gets reset to NULL. So now:

wait_event():
do { 
    if ( mem_event_wait_try_grab(med, &rc) != -EBUSY )       ==> This will return EBUSY
        break; 
    for ( ; ; ) { 
        prepare_to_wait(&med->wq);                                                 ==> Assertion 'wqv->esp == 0' fails
        if ( mem_event_wait_try_grab(med, &rc) != -EBUSY ) 
            break; 
        wait();
    } 
    finish_wait(&med->wq);
} while (0)

Does this make sense?

Thanks,
Aravindh