Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access

From: "Aravindh Puthiyaparambil (aravindp)" <aravindp@cisco.com>
To: Tim Deegan <tim@xen.org>
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	Keir Fraser <keir@xen.org>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
Date: Thu, 1 May 2014 19:26:17 +0000	[thread overview]
Message-ID: <97A500D504438F4ABC02EBA81613CC63317E9B13@xmb-aln-x02.cisco.com> (raw)
In-Reply-To: <20140501143920.GD86038@deinos.phlegethon.org>

>Thanks for your patch.  I have to say, this looks a lot less terrifying than I
>thought it would. :)

I am glad to hear that. I too feared the worst :-) On that note many thanks to you and Jan for taking the time for giving valuable feedback.

>> Shadow mem_access mode
>> ----------------------
>> Add a new shadow mode for mem_access. This should only be enabled by a
>> mem_access listener when it calls
>> XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE
>> for a PV domain.
>
>Do we need a specific mode for this?  If the default behaviour is to allow all
>accesses until told otherwise, then it can be enabled at all times.

You might be right. I will look in to getting rid of it.

>> P2M changes
>> -----------
>> Add a new p2m implementation for mem_access. Central to this is the
>> access lookup table. This is an array of mfns. Each mfn is a page
>> allocated from the domain's shadow memory. The pages hold the
>> p2m_access_t values for each guest gmfn. p2m_mem_access_set_entry()
>> sets the access value of the mfn given as input and blows the shadow
>> entries for the mfn. p2m_mem_access_get_entry() returns the access
>> value of the mfn given as input.
>
>I think this array needs to be a trie, at least; it may make sense to reuse the
>p2m_pt implementation and just restrict/ignore the types and MFNs.
>That way you won't have to reinvent the wheel, esp. around avoiding long-
>running operations.

I like the idea you mentioned below of stashing the access type in the shadow_flags field of struct page_info. If that does not work out I will look in to reusing the p2m_pt implementation.

>> +/* Convert access restrictions to page table flags */ void
>> +p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access) {
>> +
>> +    /* Restrict with access permissions */
>> +    switch (access)
>> +    {
>> +        case p2m_access_r:
>> +            *flags &= ~(_PAGE_RW);
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);
>
>IIUC this function is called to add further restrictions to an existing set of flags.
>In that case, it should never set the _PAGE_PRESENT bit.  Rather, it should
>only ever _reduce_ permissions (i.e. set _NX or clear other bits).  Then...
>
>> +            break;
>> +        case p2m_access_rx:
>> +        case p2m_access_rx2rw:
>> +            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
>> +            *flags |= _PAGE_PRESENT;
>> +            break;
>> +        case p2m_access_rw:
>> +            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +        case p2m_access_rwx:
>> +        default:
>> +            *flags &= ~(_PAGE_NX_BIT);
>> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
>> +            break;
>> +    }
>> +
>> +    // Allow more restrictive guest flags to be propagated instead of access
>> +    // permissions
>> +    if ( !(gflags & _PAGE_RW) )
>> +        *flags &= ~(_PAGE_RW);
>> +    if ( gflags & _PAGE_NX_BIT )
>> +        *flags |= _PAGE_NX_BIT;
>
>..you won't need these either.

That makes sense. I will follow your suggestion.

>> +}
>> +
>> +/*
>> + * Set the page permission of the mfn. This in effect removes all
>> +shadow
>> + * mappings of that mfn. The access type of that mfn is stored in the
>> +access
>> + * lookup table.
>> + */
>> +static int
>> +p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long
>gfn, mfn_t mfn,
>> +                         unsigned int page_order, p2m_type_t p2mt,
>> +                         p2m_access_t p2ma) {
>> +    struct domain *d = p2m->domain;
>> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
>> +    uint table_idx;
>> +    uint page_idx;
>> +    uint8_t *access_table_page;
>> +
>> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table !=
>NULL);
>> +
>> +    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions
>*/
>> +    if ( unlikely(p2ma != p2m_access_r &&
>> +                  p2ma != p2m_access_rw &&
>> +                  p2ma != p2m_access_rx &&
>> +                  p2ma != p2m_access_rwx &&
>> +                  p2ma != p2m_access_rx2rw) )
>> +        return -EINVAL;
>> +
>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>> +        return -ENOENT;
>
>If we're only ever setting access permissions for a guest's own pages (i.e. not
>setting them for grant mappings or foreign mappings) then the access
>permissions lookup table could potentially be shared among all PV VMs.  It
>might even be possible to fold the access-type into the shadow_flags field in
>struct page_info and avoid having a lookup table at all.  If that's possible, I
>think it will make some thing a _lot_ easier.

This idea looks very promising. I remember mentioning in this is some of my earlier emails but got side tracked thinking PV memory is contiguous and I could use a simple indexed lookup table. So I will resubmit this patch series after I implement a new version of p2m-ma using the shadow_flags or reusing the p2m-pt implementation.

>> -    /* Sanity check the call */
>> -    if ( d == current->domain || (d->arch.paging.mode & mode) == mode )
>> +    /*
>> +     * Sanity check the call.
>> +     * Do not allow shadow mem_access mode to be enabled when
>> +     * log-dirty or translate mode is enabled.
>
>Why not log-dirty?  That would mean you can't use mem-access and live
>migration at the same time.

I was afraid they would conflict and prevent log-dirty from working correctly. But I forgot about it being used during live-migration. Let me see if I can get both of them to live together.

>> +    /* Propagate access permissions */
>> +    if ( unlikely((level == 1) &&
>> +                  mem_event_check_ring(&d->mem_event->access) &&
>> +                  !sh_mfn_is_a_page_table(target_mfn)) )
>> +    {
>> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
>> +        p2m_access_t a;
>> +        p2m_type_t t;
>> +        mfn_t mfn;
>> +        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0,
>> + NULL);
>
>Ah, no.  This access lookup ought to happen as part of the original p2m lookup
>that produced target_mfn.  Having a second lookup here is unpleasant and
>confusing -- this code keeps the distinction between gfn and mfn pretty
>strictly, so passing an mfn into a p2m lookup is wrong, even though you
>happen to know they're the same namespace in this PV guest.
>
>Of course, if you can hide the access type in the struct page_info, then it's
>correct to recover it here.

OK, I am hoping to do that here. 

>> +            /*
>> +             * Do not police writes to guest memory emanating from the Xen
>> +             * kernel. Trying to do so will cause the same pagefault to occur
>> +             * over and over again with an event being sent to the access
>> +             * listener for each fault. If the access listener's vcpu is not
>> +             * scheduled during this time, the violation is never resolved and
>> +             * will eventually end with the host crashing.
>> +             */
>> +            if ( (violation && access_w) &&
>> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>> +            {
>> +                violation = 0;
>> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>> +                                    p2m_ram_rw, p2m_access_rw);
>> +            }
>
>That's exploitable, surely: the guest just needs to make a hypercall with a
>chosen buffer address to give itself _rw access to a page.
>(Which I guess is OK for your intended use of rw<->rx but not in
>general.)

I agree but could not come up with a way to get it to work. I have sent a separate email regarding this issue. 

Cheers,
Aravindh