From: "Jan Beulich"
Subject: Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
Date: Wed, 30 Apr 2014 07:32:29 +0100
Message-ID: <5360B51D020000780000D9A2@g0-1-119.ukb-fw-asa.gns.novell.com>
In-Reply-To: <97A500D504438F4ABC02EBA81613CC63317E7AC4@xmb-aln-x02.cisco.com>
References: <1398746705-6658-1-git-send-email-aravindp@cisco.com>
 <1398746705-6658-2-git-send-email-aravindp@cisco.com>
 <535F83FE020000780000D397@nat28.tlf.novell.com>
 <97A500D504438F4ABC02EBA81613CC63317E7AC4@xmb-aln-x02.cisco.com>
To: "Aravindh Puthiyaparambil (aravindp)"
Cc: "xen-devel@lists.xenproject.org", Keir Fraser, Ian Jackson,
 Ian Campbell, Tim Deegan
List-Id: xen-devel@lists.xenproject.org

>>> On 30.04.14 at 01:10, wrote:
>>> +    default:
>>> +        *flags &= ~(_PAGE_NX_BIT);
>>> +        *flags |= (_PAGE_RW|_PAGE_PRESENT);
>>> +        break;
>>> +    }
>>
>> Tim will have the final say here, but I'd suggest dropping all these
>> needless parentheses (I see only one case where they're really
>> needed).
>
> OK, I will drop all the needless ones. Could you please point out the case
> where they are needed? I couldn't find anything in CODING_STYLE that
> dictates what should be done.

Convention is to use them when operator precedence isn't considered
obvious. What counts as "obvious" here differs between people, but at
least the precedence rules defined by school mathematics should generally
not require parenthesizing. In the case above that means no parentheses
around the right-hand side expressions of the assignment operators.
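Concretely, for the hunk quoted above that would be (equivalent code,
just without the redundant parentheses):

    default:
        *flags &= ~_PAGE_NX_BIT;
        *flags |= _PAGE_RW | _PAGE_PRESENT;
        break;
    }

Both ~ and | bind more tightly than the assignment operators, so neither
pair of parentheses changes the meaning.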
>>> +
>>> +    if ( page_get_owner(mfn_to_page(mfn)) != d )
>>> +        return -ENOENT;
>>> +
>>> +    gfn = get_gpfn_from_mfn(mfn_x(mfn));
>>
>> You get "gfn" passed in and blindly overwrite it here? _If_ you need to
>> do this lookup, shouldn't you instead check that it matches the passed-in
>> one?
>
> The "gfn" and "mfn" passed in to the function should be the same, as it is
> a PV guest. That is why I am overwriting "gfn" to get the "gpfn" from the
> machine_to_phys_mapping.

Since this overwrites/discards a function parameter, it is surely worth a
comment (or a pseudo-comment like ASSERT(gfn == mfn_x(mfn))).

>>> +
>>> +    /*
>>> +     * Values with the MSB set denote MFNs that aren't really part of the
>>> +     * domain's pseudo-physical memory map (e.g., the shared info frame).
>>> +     * Nothing to do here.
>>> +     */
>>> +    if ( unlikely(!VALID_M2P(gfn)) )
>>> +        return 0;
>>> +
>>> +    if ( gfn > (d->tot_pages - 1) )
>>> +        return -EINVAL;
>>
>> Hardly - a PV domain can easily make its address space sparse, and
>> iirc pv-ops Linux even does so by default to avoid PFN/MFN collisions
>> on MMIO space. (And as a side note, this would better be "gfn >=
>> d->tot_pages".)
>
> I was using this as a guard against going out of bounds in the access
> lookup table, which is created based on the domain's tot_pages. Please let
> me know what I should be using instead of tot_pages when doing this.

Afaict, for PV guests there's absolutely nothing you can use to bound the
range.

>>> +    paging_lock(d);
>>> +
>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>> +    access_table_page[page_idx] = p2ma;
>>> +    unmap_domain_page(access_table_page);
>>> +
>>> +    if ( sh_remove_all_mappings(d->vcpu[0], mfn) )
>>
>> Is there anything guaranteeing d->vcpu and d->vcpu[0] being non-
>> NULL?
>
> I am not sure. I see shadow_track_dirty_vram() calling this without a
> check. Maybe I should check and call it only if d->vcpu is not NULL.

You ought to check both, at least via ASSERT().

>>> +    table_idx = MEM_ACCESS_TABLE_IDX(gfn);
>>> +    page_idx = MEM_ACCESS_PAGE_IDX(gfn);
>>> +
>>> +    access_table_page = map_domain_page(mfn_x(access_lookup_table[table_idx]));
>>> +
>>> +    /* This is a hint to take the default permissions */
>>> +    if ( access_table_page[page_idx] == p2m_access_n )
>>> +        access_table_page[page_idx] = p2m->default_access;
>>
>> We're in "get" here - why does that modify any global state?
>
> I do not initialize the access lookup table to the default value, but the
> table is zeroed as part of initialization when each page is allocated
> using shadow_alloc_p2m_page(). I then use the p2m_access_n / 0 value as a
> hint that the default value should be returned, as p2m_access_n is not
> valid for PV domains. If you prefer that I initialize the table to the
> default value, I will gladly do so.

I think implying _n to mean "default" is at least a latent bug anyway
(since that precludes setting that type). And _if_ "get" really does any
modification, that minimally requires a big comment. Furthermore, with it
being possible for the default type to change, the net effect might easily
be different between setting the default type at initialization time and
setting it when the first lookup happens.

>>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>>> +{
>>> +    struct domain *d = p2m->domain;
>>> +    mfn_t *access_lookup_table;
>>> +    uint32_t nr_access_table_pages;
>>> +    uint32_t ctr;
>>> +
>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>>
>> This surely is going to be an order > 0 allocation, and you even risk
>> it being an order > MAX_ORDER one. The former is disallowed at
>> runtime by convention/agreement, the latter is an outright bug.
>
> Sorry, I was not aware of the convention. What should I be doing here
> instead?

Especially with the above note about GFN space being effectively unbounded
for PV guests, you need to find a different data structure to represent
the information you need. Perhaps something similar to what log-dirty mode
uses, except that you would need more levels in the extreme case, and
hence it might be worthwhile making the actually used number of levels
dynamic.
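To give an idea of the direction I'm thinking of, here is a lookup-side
sketch (untested; the access_root and access_levels fields, the macros,
and the function name are all made up, and allocation/growing of the tree
as well as locking are omitted). Interior pages hold the MFNs of the next
level's pages, leaf pages pack one access byte per GFN, and slots left
zero denote unpopulated paths, so a sparse and effectively unbounded GFN
space only costs pages along the paths actually in use:

#define ACCESS_TABLE_BITS (PAGE_SHIFT - 3)  /* 512 mfn_t slots per page */
#define ACCESS_LEAF_BITS  PAGE_SHIFT        /* 1 byte per GFN in leaves */

static p2m_access_t access_tree_get(struct p2m_domain *p2m,
                                    unsigned long gfn)
{
    unsigned int lvl = p2m->access_levels;
    mfn_t mfn = p2m->access_root;
    p2m_access_t a;
    uint8_t *leaf;

    /* GFNs beyond what the current number of levels covers take the
     * default. */
    if ( !lvl ||
         gfn >> (ACCESS_LEAF_BITS + (lvl - 1) * ACCESS_TABLE_BITS) )
        return p2m->default_access;

    /* Walk the interior levels down to the leaf page. */
    while ( --lvl )
    {
        mfn_t *table = map_domain_page(mfn_x(mfn));
        unsigned int idx =
            (gfn >> (ACCESS_LEAF_BITS + (lvl - 1) * ACCESS_TABLE_BITS)) &
            ((1U << ACCESS_TABLE_BITS) - 1);

        mfn = table[idx];
        unmap_domain_page(table);

        /* Table pages are zero-filled when allocated, so MFN 0 marks a
         * path that was never populated. */
        if ( !mfn_x(mfn) )
            return p2m->default_access;
    }

    leaf = map_domain_page(mfn_x(mfn));
    a = leaf[gfn & ((1UL << ACCESS_LEAF_BITS) - 1)];
    unmap_domain_page(leaf);

    return a;
}

The "set" side would then allocate table pages (e.g. via
shadow_alloc_p2m_page()) and add further levels at the top as the highest
GFN seen grows.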
>>> @@ -1414,6 +1419,8 @@ long p2m_set_mem_access(struct domain *d, unsigned long pfn, uint32_t nr,
>>>      if ( pfn == ~0ul )
>>>      {
>>>          p2m->default_access = a;
>>> +        if ( is_pv_domain(d) )
>>> +            return p2m_mem_access_set_default(p2m);
>>
>> Is it really correct to set p2m->default_access _before_ calling that
>> function? Perhaps it wouldn't be correct doing it the other way
>> around either - I suppose you need to hold the necessary lock across
>> both operations.
>
> I do not see the problem here. p2m_mem_access_set_default() just blows the
> shadows away after taking the paging_lock. There is no actual setting of a
> default permission for all pages in the PV case.

But the function having a return value means it might fail, in which case
you shouldn't have set the new default.

>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v, mfn_t gmfn)
>>>      if ( !(shadow_mode_external(v->domain)
>>>             && (page->count_info & PGC_count_mask) <= 3
>>>             && ((page->u.inuse.type_info & PGT_count_mask)
>>> -               == !!is_xen_heap_page(page))) )
>>> +               == !!is_xen_heap_page(page)))
>>> +         && !(shadow_mode_mem_access(v->domain)) )
>>
>> You're breaking indentation, there are pointless parentheses again,
>> but most importantly - why?
>
> Sorry, I meant to ask a question about this in my patch message and mark
> this as a workaround. I am seeing the message on all
> sh_remove_all_mappings() calls, and I was not able to figure out why this
> was happening. I just added this as a workaround. I was hoping you or Tim
> could shed more light on this.

I'm afraid that without detail on which pages this triggers, and without
you having spent at least some time finding out where the stray/extra
references may be coming from, it's going to be hard to help.

>>> +    /*
>>> +     * Do not police writes to guest memory emanating from the Xen
>>> +     * kernel. Trying to do so will cause the same pagefault to occur
>>> +     * over and over again with an event being sent to the access
>>> +     * listener for each fault. If the access listener's vcpu is not
>>> +     * scheduled during this time, the violation is never resolved and
>>> +     * will eventually end with the host crashing.
>>> +     */
>>> +    if ( (violation && access_w) &&
>>> +         (regs->eip >= XEN_VIRT_START && regs->eip <= XEN_VIRT_END) )
>>> +    {
>>> +        violation = 0;
>>> +        rc = p2m->set_entry(p2m, gfn_x(gfn), gmfn, PAGE_ORDER_4K,
>>> +                            p2m_ram_rw, p2m_access_rw);
>>> +    }
>>
>> This looks more like a hack than a proper solution - shouldn't the
>> listener be allowed to know of hypervisor side accesses?
>
> Ideally the listener should know of the hypervisor side accesses, but I
> ran into cases where the listener would not get to run and the pagefault
> would be retried over and over, eventually causing the host to crash. I
> guess more detail about the problem is needed. I will send out a separate
> email regarding this, CCing you and Tim.

And if it's intended as a workaround until a proper resolution is found,
you probably should say so explicitly in the comment, avoiding the need
for reviewers to point out that this is a problem.

>>> +/* Number of access table pages for a PV domain */
>>> +#define get_domain_nr_access_table_pages(d) \
>>> +    DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>>
>> And once again. I wonder whether this code was tested on a suitably
>> big pv-ops PV guest.
>
> The largest PV domain I tried this with was 28GB.

Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
worked.

Jan
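P.S.: As to the p2m_set_mem_access() ordering question above, what I have
in mind is something along these lines (an untested sketch only, leaving
aside whether the paging lock needs to be held across both operations):

    if ( pfn == ~0ul )
    {
        p2m_access_t old_access = p2m->default_access;

        p2m->default_access = a;
        if ( is_pv_domain(d) )
        {
            long rc = p2m_mem_access_set_default(p2m);

            /* Don't retain a default that we failed to put in effect. */
            if ( rc )
                p2m->default_access = old_access;
            return rc;
        }
    }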