From: Tim Deegan <tim@xen.org>
To: Aravindh Puthiyaparambil <aravindp@cisco.com>
Cc: xen-devel@lists.xenproject.org, Keir Fraser <keir@xen.org>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Ian Campbell <ian.campbell@citrix.com>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
Date: Thu, 1 May 2014 16:39:20 +0200	[thread overview]
Message-ID: <20140501143920.GD86038@deinos.phlegethon.org> (raw)
In-Reply-To: <1398746705-6658-2-git-send-email-aravindp@cisco.com>

Hi,

Thanks for your patch.  I have to say, this looks a lot less
terrifying than I thought it would. :)

At 21:45 -0700 on 28 Apr (1398717902), Aravindh Puthiyaparambil wrote:
> Shadow mem_access mode
> ----------------------
> Add a new shadow mode for mem_access. This should only be enabled by a
> mem_access listener when it calls XEN_DOMCTL_MEM_EVENT_OP_ACCESS_ENABLE
> for a PV domain.

Do we need a specific mode for this?  If the default behaviour is to
allow all accesses until told otherwise, then it can be enabled at all
times.

> P2M changes
> -----------
> Add a new p2m implementation for mem_access. Central to this is the
> access lookup table. This is an array of mfns. Each mfn is a page
> allocated from the domain's shadow memory. The pages hold the
> p2m_access_t values for each guest gmfn. p2m_mem_access_set_entry()
> sets the access value of the mfn given as input and blows the shadow
> entries for the mfn. p2m_mem_access_get_entry() returns the access
> value of the mfn given as input.

I think this array needs to be a trie, at least; it may make sense to reuse
the p2m_pt implementation and just restrict/ignore the types and MFNs.
That way you won't have to reinvent the wheel, esp. around avoiding
long-running operations.
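(Whether or not you end up reusing p2m_pt, the shape of what's needed is just a sparse gfn -> access map.  A minimal sketch of the radix-trie idea, with illustrative names and sizes rather than anything from the tree -- a real implementation would add levels to cover the full gfn space and allocate from shadow memory:)

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define LEVEL_BITS      9                    /* 512 slots: one page of PTE-sized entries */
#define LEVEL_SIZE      (1UL << LEVEL_BITS)
#define ACCESS_DEFAULT  0xffu                /* "no restriction yet" */

/* Two-level radix trie: leaves are allocated only when an access
 * restriction is first set in that gfn range, so a sparse gfn space
 * costs nothing and no long-running initialisation pass is needed. */
struct access_trie {
    uint8_t *leaf[LEVEL_SIZE];
};

static uint8_t trie_get(const struct access_trie *t, unsigned long gfn)
{
    const uint8_t *leaf = t->leaf[(gfn >> LEVEL_BITS) & (LEVEL_SIZE - 1)];

    return leaf ? leaf[gfn & (LEVEL_SIZE - 1)] : ACCESS_DEFAULT;
}

static int trie_set(struct access_trie *t, unsigned long gfn, uint8_t a)
{
    uint8_t **slot = &t->leaf[(gfn >> LEVEL_BITS) & (LEVEL_SIZE - 1)];

    if ( !*slot )
    {
        if ( !(*slot = malloc(LEVEL_SIZE)) )
            return -1;
        memset(*slot, ACCESS_DEFAULT, LEVEL_SIZE);
    }
    (*slot)[gfn & (LEVEL_SIZE - 1)] = a;
    return 0;
}
```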

> +/* Convert access restrictions to page table flags */
> +void p2m_access_to_flags(u32 *flags, u32 gflags, p2m_access_t access)
> +{
> +
> +    /* Restrict with access permissions */
> +    switch (access)
> +    {
> +        case p2m_access_r:
> +            *flags &= ~(_PAGE_RW);
> +            *flags |= (_PAGE_NX_BIT|_PAGE_PRESENT);

IIUC this function is called to add further restrictions to an
existing set of flags.  In that case, it should never set the
_PAGE_PRESENT bit.  Rather, it should only ever _reduce_ permissions 
(i.e. set _NX or clear other bits).  Then...

> +            break;
> +        case p2m_access_rx:
> +        case p2m_access_rx2rw:
> +            *flags &= ~(_PAGE_NX_BIT|_PAGE_RW);
> +            *flags |= _PAGE_PRESENT;
> +            break;
> +        case p2m_access_rw:
> +            *flags |= (_PAGE_NX_BIT|_PAGE_RW|_PAGE_PRESENT);
> +            break;
> +        case p2m_access_rwx:
> +        default:
> +            *flags &= ~(_PAGE_NX_BIT);
> +            *flags |= (_PAGE_RW|_PAGE_PRESENT);
> +            break;
> +    }
> +
> +    // Allow more restrictive guest flags to be propagated instead of access
> +    // permissions
> +    if ( !(gflags & _PAGE_RW) )
> +        *flags &= ~(_PAGE_RW);
> +    if ( gflags & _PAGE_NX_BIT )
> +        *flags |= _PAGE_NX_BIT;

..you won't need these either.
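(I.e., something shaped like this -- a sketch with illustrative bit values and a local enum, not the real x86 definitions.  Because it only ever clears _PAGE_RW or sets _PAGE_NX_BIT, any stricter guest flags survive automatically and the trailing fixups above become unnecessary:)

```c
#include <stdint.h>

#define _PAGE_RW     (1u << 1)   /* illustrative positions only */
#define _PAGE_NX_BIT (1u << 2)

typedef enum {
    p2m_access_r, p2m_access_rw, p2m_access_rx,
    p2m_access_rx2rw, p2m_access_rwx
} p2m_access_t;

/* Restrict-only: may remove write or execute permission from the
 * flags it is given, but never adds a permission. */
static uint32_t p2m_access_restrict_flags(uint32_t flags, p2m_access_t a)
{
    switch ( a )
    {
    case p2m_access_r:
        flags &= ~_PAGE_RW;      /* no write */
        flags |= _PAGE_NX_BIT;   /* no execute */
        break;
    case p2m_access_rx:
    case p2m_access_rx2rw:
        flags &= ~_PAGE_RW;      /* no write; leave NX as the guest set it */
        break;
    case p2m_access_rw:
        flags |= _PAGE_NX_BIT;   /* no execute */
        break;
    case p2m_access_rwx:
    default:
        break;                   /* no additional restriction */
    }
    return flags;
}
```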

> +}
> +
> +/*
> + * Set the page permission of the mfn. This in effect removes all shadow
> + * mappings of that mfn. The access type of that mfn is stored in the access
> + * lookup table.
> + */
> +static int
> +p2m_mem_access_set_entry(struct p2m_domain *p2m, unsigned long gfn, mfn_t mfn,
> +                         unsigned int page_order, p2m_type_t p2mt,
> +                         p2m_access_t p2ma)
> +{
> +    struct domain *d = p2m->domain;
> +    mfn_t *access_lookup_table = p2m->access_lookup_table;
> +    uint table_idx;
> +    uint page_idx;
> +    uint8_t *access_table_page;
> +
> +    ASSERT(shadow_mode_mem_access(d) && access_lookup_table != NULL);
> +
> +    /* For PV domains we only support rw, rx, rx2rw, rwx access permissions */
> +    if ( unlikely(p2ma != p2m_access_r &&
> +                  p2ma != p2m_access_rw &&
> +                  p2ma != p2m_access_rx &&
> +                  p2ma != p2m_access_rwx &&
> +                  p2ma != p2m_access_rx2rw) )
> +        return -EINVAL;
> +
> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
> +        return -ENOENT;

If we're only ever setting access permissions for a guest's own pages
(i.e. not setting them for grant mappings or foreign mappings) then
the access permissions lookup table could potentially be shared among
all PV VMs.  It might even be possible to fold the access-type into
the shadow_flags field in struct page_info and avoid having a lookup
table at all.  If that's possible, I think it will make some things a
_lot_ easier.
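(Purely illustrative -- the field width and bit choice would need auditing against the existing shadow_flags users -- but the accessors would be trivial:)

```c
#include <stdint.h>

/* Steal four spare bits of a per-page flags word (in Xen this would
 * be page_info.shadow_flags) to hold the p2m_access_t, so no
 * separate per-domain lookup table is needed at all. */
#define SH_ACCESS_SHIFT 28
#define SH_ACCESS_MASK  (0xfu << SH_ACCESS_SHIFT)

static uint32_t sh_flags_set_access(uint32_t sflags, unsigned int a)
{
    return (sflags & ~SH_ACCESS_MASK) | ((a & 0xfu) << SH_ACCESS_SHIFT);
}

static unsigned int sh_flags_get_access(uint32_t sflags)
{
    return (sflags & SH_ACCESS_MASK) >> SH_ACCESS_SHIFT;
}
```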

(Based on this, and the worries about gfn-space being sparse, I'm not
going to give detailed feedback about the implementation of the lookup
table in this version; Jan's already pointed out some flaws there.)

> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));

Beware: the guest can set this m2p entry to any value it likes...

> +    /*
> +     * Values with the MSB set denote MFNs that aren't really part of the
> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
> +     * Nothing to do here.
> +     */
> +    if ( unlikely(!VALID_M2P(gfn)) )
> +        return 0;

...effectively preventing the tools from setting the access type.
That's also relevant below where you use _rw access for any frame with
a bogus gfn.

> +    if ( gfn > (d->tot_pages - 1) )
> +        return -EINVAL;

Yeah, as others have pointed out, that's not a good check even if the
guest is well-behaved.

> @@ -3179,8 +3183,14 @@ static int shadow_one_bit_enable(struct domain *d, u32 mode)
>  {
>      ASSERT(paging_locked_by_me(d));
>  
> -    /* Sanity check the call */
> -    if ( d == current->domain || (d->arch.paging.mode & mode) == mode )
> +    /*
> +     * Sanity check the call.
> +     * Do not allow shadow mem_access mode to be enabled when
> +     * log-dirty or translate mode is enabled.

Why not log-dirty?  That would mean you can't use mem-access and live
migration at the same time. 

> --- a/xen/arch/x86/mm/shadow/multi.c
> +++ b/xen/arch/x86/mm/shadow/multi.c
> @@ -38,6 +38,8 @@
>  #include <asm/hvm/cacheattr.h>
>  #include <asm/mtrr.h>
>  #include <asm/guest_pt.h>
> +#include <asm/mem_event.h>
> +#include <asm/mem_access.h>
>  #include <public/sched.h>
>  #include "private.h"
>  #include "types.h"
> @@ -625,6 +627,20 @@ _sh_propagate(struct vcpu *v,
>              }
>      }
>  
> +    /* Propagate access permissions */
> +    if ( unlikely((level == 1) &&
> +                  mem_event_check_ring(&d->mem_event->access) &&
> +                  !sh_mfn_is_a_page_table(target_mfn)) )
> +    {
> +        struct p2m_domain *p2m = p2m_get_hostp2m(d);
> +        p2m_access_t a;
> +        p2m_type_t t;
> +        mfn_t mfn;
> +        mfn = p2m->get_entry(p2m, mfn_x(target_mfn), &t, &a, 0, NULL);

Ah, no.  This access lookup ought to happen as part of the original
p2m lookup that produced target_mfn.  Having a second lookup here is
unpleasant and confusing -- this code keeps the distinction between
gfn and mfn pretty strictly, so passing an mfn into a p2m lookup is
wrong, even though you happen to know they're the same namespace in
this PV guest.
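(That distinction is exactly what the typesafe-wrapper idiom is for -- a sketch of the pattern, with illustrative definitions: a gfn can't be passed where an mfn is expected without an explicit, greppable conversion.)

```c
/* Distinct wrapper structs: the compiler rejects mixing them up,
 * even though both are just an unsigned long underneath. */
typedef struct { unsigned long v; } gfn_t;
typedef struct { unsigned long v; } mfn_t;

static inline gfn_t _gfn(unsigned long g) { gfn_t x = { g }; return x; }
static inline mfn_t _mfn(unsigned long m) { mfn_t x = { m }; return x; }
static inline unsigned long gfn_x(gfn_t g) { return g.v; }
static inline unsigned long mfn_x(mfn_t m) { return m.v; }
```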

Of course, if you can hide the access type in the struct page_info,
then it's correct to recover it here.

> +            /*
> +             * Do not police writes to guest memory emanating from the Xen
> +             * kernel. Trying to do so will cause the same pagefault to occur
> +             * over and over again with an event being sent to the access
> +             * listener for each fault. If the access listener's vcpu is not
> +             * scheduled during this time, the violation is never resolved and
> +             * will eventually end with the host crashing.
> +             */
> +            if ( (violation && access_w) &&
> +                 (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
> +            {
> +                violation = 0;
> +                rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
> +                                    p2m_ram_rw, p2m_access_rw);
> +            }

That's exploitable, surely: the guest just needs to make a hypercall
with a chosen buffer address to give itself _rw access to a page.
(Which I guess is OK for your intended use of rw<->rx but not in
general.)

Cheers,

Tim.

