From mboxrd@z Thu Jan  1 00:00:00 1970
From: Razvan Cojocaru <rcojocaru@bitdefender.com>
Subject: Re: [PATCH RFC V9 4/5] xen,
 libxc: Request page fault injection via libxc
Date: Tue, 02 Sep 2014 12:18:06 +0300
Message-ID: <54058B4E.9060001@bitdefender.com>
References: <1409226482-12657-1-git-send-email-rcojocaru@bitdefender.com>	<1409226482-12657-4-git-send-email-rcojocaru@bitdefender.com>	<53FF36A1020000780002EAED@mail.emea.novell.com>	<53FF1BD8.5010401@bitdefender.com>	<53FF38A6020000780002EB2B@mail.emea.novell.com>	<54002F43.4070802@bitdefender.com>	<5400638A020000780002EFD6@mail.emea.novell.com>	<540421E1.9020505@bitdefender.com>	<540453C8020000780002F59C@mail.emea.novell.com>	<54045E7C.50604@bitdefender.com>
	<54047D1D020000780002F73A@mail.emea.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
Received: from mail6.bemta4.messagelabs.com ([85.158.143.247])
	by lists.xen.org with esmtp (Exim 4.72)
	(envelope-from <rcojocaru@bitdefender.com>) id 1XOkEB-0000t0-MS
	for xen-devel@lists.xenproject.org; Tue, 02 Sep 2014 09:18:15 +0000
In-Reply-To: <54047D1D020000780002F73A@mail.emea.novell.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Jan Beulich <JBeulich@suse.com>
Cc: kevin.tian@intel.com, eddie.dong@intel.com, stefano.stabellini@eu.citrix.com, andrew.cooper3@citrix.com, tim@xen.org, jun.nakajima@intel.com, xen-devel@lists.xenproject.org, ian.jackson@eu.citrix.com, ian.campbell@citrix.com
List-Id: xen-devel@lists.xenproject.org

On 09/01/2014 03:05 PM, Jan Beulich wrote:
>>>> On 01.09.14 at 13:54, <rcojocaru@bitdefender.com> wrote:
>> On 09/01/2014 12:08 PM, Jan Beulich wrote:
>>>>>> On 01.09.14 at 09:36, <rcojocaru@bitdefender.com> wrote:
>>>> On 08/29/2014 12:27 PM, Jan Beulich wrote:
>>>>>>>> On 29.08.14 at 09:44, <rcojocaru@bitdefender.com> wrote:
>>>>>> I do understand the preference for a VCPU-based mechanism from a
>>>>>> concurrency point of view, but that would simply potentially fail for
>>>>>> us, hence defeating the purpose of the patch. I'm also not sure how that
>>>>>> would be useful in the general case either, since the same problem that
>>>>>> applies to us would seem to apply to the general case as well.
>>>>>
>>>>> Yeah, the whole thing probably needs a bit more thinking so that the
>>>>> interface doesn't end up being a BitDefender-special. Indeed together
>>>>> with the address space qualification, the interface might not be very
>>>>> useful when made vCPU-bound. And taking it a little further into the
>>>>> "generic" direction, allowing this to only inject #PF doesn't make a
>>>>> very nice interface either. Plus we already have HVMOP_inject_trap,
>>>>> i.e. your first line of thinking (and eventual explaining as the
>>>>> motivation for a patch) should be why that can't be used.
>>>>
>>>> I'd say that it's memory-introspection specific rather than 3rd-party
>>>> vendor specific. Without this this patch, memory-introspection support
>>>> in general is impacted / less flexible, since there's no other way to
>>>> bring swapped out pages back in.
>>>>
>>>> For all the reasons you've explained (at least as far as I understand
>>>> it) there's not much room to go more generic - so maybe just renaming
>>>> the libxc wrapper to something more specific (
>>>> xc_domain_request_usermode_pagefault?) is the solution here?
>>>
>>> Maybe, but only after you explained why the existing interface can
>>> neither be used nor suitably extended.
>>
>> As far as I understand the HVMOP_inject_trap interface, it is simply (in
>> this case) a way to trigger the equivalent of hvm_inject_page_fault()
>> from userspace (via libxc).
>>
>> We need two additional checks:
>>
>> 1. That CR3 matches, because the way the application works, we need to
>> inject a page fault related to the address space of whatever process is
>> interesting at the time, and
>>
>> 2. That we're in usermode (the CPL check), because we know that it's
>> safe to inject a page fault when we're in user mode, but it's not always
>> safe to do so in kernel mode.
>>
>> The current interface seems to be a low-level, basic wrapper around
>> hvm_inject_trap().
>>
>> What we're trying to do is ask for a page fault when we're both A) in
>> usermode, and B) when a target process matches - and are only interested
>> in page faults, no other trap vector.
>>
>> Technically, the checks could, probably, be moved into (and new
>> parameters added to) hvm_inject_trap() & friends, but it seems unlikely
>> that users other than memory introspection applications would be
>> interested in them, while they would possibly add to the complexity of
>> the interface. The rest of the clients would have to carry dummy
>> parameters around to use it.
> 
> I'm not sure: Injecting faults with certain restrictions (like in your
> case CPL and CR3) does seem to make quite some sense in
> general when the fault isn't intended to be terminal (of a process
> or the entire VM). It's just that so far we used fault injection only
> to indicate error conditions.
> 
> But as always - let's see if others are of differing opinion before
> settling on either route.

While waiting for other opinions, I've started looking at the
HVMOP_inject_trap code, and it uses a per-VCPU data-saving model similar
to the one we've been discussing (and agreed that does not work for the
explained use-cases).

The first giveaway is the libxc function:

/*
 * Injects a hardware/software CPU trap, to take effect the next time
the HVM
 * resumes.
 */
int xc_hvm_inject_trap(
    xc_interface *xch, domid_t dom, int vcpu, uint32_t vector,
    uint32_t type, uint32_t error_code, uint32_t insn_len,
    uint64_t cr2);

which takes an "int vcpu" parameter. Then, this happens in
xen/arch/x86/hvm/hvm.c:

    case HVMOP_inject_trap:
    {
        xen_hvm_inject_trap_t tr;
        struct domain *d;
        struct vcpu *v;

        if ( copy_from_guest(&tr, arg, 1 ) )
            return -EFAULT;

        rc = rcu_lock_remote_domain_by_id(tr.domid, &d);
        if ( rc != 0 )
            return rc;

        rc = -EINVAL;
        if ( !is_hvm_domain(d) )
            goto param_fail8;

        rc = xsm_hvm_control(XSM_DM_PRIV, d, op);
        if ( rc )
            goto param_fail8;

        rc = -ENOENT;
        if ( tr.vcpuid >= d->max_vcpus || (v = d->vcpu[tr.vcpuid]) == NULL )
            goto param_fail8;

        if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
            rc = -EBUSY;
        else
        {
            v->arch.hvm_vcpu.inject_trap.vector = tr.vector;
            v->arch.hvm_vcpu.inject_trap.type = tr.type;
            v->arch.hvm_vcpu.inject_trap.error_code = tr.error_code;
            v->arch.hvm_vcpu.inject_trap.insn_len = tr.insn_len;
            v->arch.hvm_vcpu.inject_trap.cr2 = tr.cr2;
            rc = 0;
        }

    param_fail8:
        rcu_unlock_domain(d);
        break;
    }

The "v->arch.hvm_vcpu.inject_trap.<member> = initializer;" code again
points to a per-VCPU implementation.

And finally, hvm_do_resume() is also VCPU-dependent:

void hvm_do_resume(struct vcpu *v)
{
    struct domain *d = v->domain;
    struct hvm_ioreq_server *s;

    check_wakeup_from_wait();

    if ( is_hvm_vcpu(v) )
        pt_restore_timer(v);

    list_for_each_entry ( s,
                          &d->arch.hvm_domain.ioreq_server.list,
                          list_entry )
    {
        struct hvm_ioreq_vcpu *sv;

        list_for_each_entry ( sv,
                              &s->ioreq_vcpu_list,
                              list_entry )
        {
            if ( sv->vcpu == v )
            {
                if ( !hvm_wait_for_io(sv, get_ioreq(s, v)) )
                    return;

                break;
            }
        }
    }

    /* Inject pending hw/sw trap */
    if ( v->arch.hvm_vcpu.inject_trap.vector != -1 )
    {
        hvm_inject_trap(&v->arch.hvm_vcpu.inject_trap);
        v->arch.hvm_vcpu.inject_trap.vector = -1;
    }
}

While we need to set the data per-domain and have whatever VCPU inject
the page fault - _but_only_if_ it's in usermode and its CR3 points to
something interesting.

Hope I've been following the code correctly.


Thanks,
Razvan Cojocaru