From: "Jan Beulich" <JBeulich@suse.com>
To: "Aravindh Puthiyaparambil (aravindp)" <aravindp@cisco.com>
Cc: Keir Fraser <keir@xen.org>,
	Ian Campbell <ian.campbell@citrix.com>,
	"Andres Lagar-Cavilla (andreslc@gridcentric.ca)"
	<andreslc@gridcentric.ca>, Tim Deegan <tim@xen.org>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
	Ian Jackson <ian.jackson@eu.citrix.com>
Subject: Re: [PATCH RFC 1/4] x86/mm: Shadow and p2m changes for PV mem_access
Date: Fri, 02 May 2014 09:26:44 +0100	[thread overview]
Message-ID: <536372E4020000780000E5A3@mail.emea.novell.com> (raw)
In-Reply-To: <97A500D504438F4ABC02EBA81613CC63317E9527@xmb-aln-x02.cisco.com>

>>> On 01.05.14 at 00:20, <aravindp@cisco.com> wrote:
>>I think implying _n to mean "default" is at least a latent bug anyway (since that
>>precludes setting that type). And _if_ "get" really does any modification, that
>>minimally would require a big comment.
> 
> For PV domains access_n is precluded.
> 
>>Furthermore, with it being possible for the default type to change, the net
>>effect might easily be different between setting the default type at
>>initialization time or when the first lookup happens.
> 
> I am not sure that I follow your concern here. Setting default access does 
> not mean that the access table gets cleared. So if a particular MFN entry's 
> access value changed and the default value was changed, then the MFN should 
> retain the access value it was changed to and not the default value. The new 
> default value will only be returned for MFNs for which set_entry was not 
> called on and I think that is correct.

There might still be a difference in when the designated access takes
effect: I would assume that a request for the default type should obtain
the default set at the time of the request, not the one in effect at the
time the entry first gets used.

>>>>> +int p2m_mem_access_init(struct p2m_domain *p2m)
>>>>> +{
>>>>> +    struct domain *d = p2m->domain;
>>>>> +    mfn_t *access_lookup_table;
>>>>> +    uint32_t nr_access_table_pages;
>>>>> +    uint32_t ctr;
>>>>> +
>>>>> +    nr_access_table_pages = get_domain_nr_access_table_pages(d);
>>>>> +    access_lookup_table = xzalloc_array(mfn_t, nr_access_table_pages);
>>>>
>>>>This surely is going to be an order > 0 allocation, and you even risk
>>>>it being an order > MAX_ORDER one. The former is disallowed at runtime
>>>>by convention/agreement, the latter is an outright bug.
>>>
>>> Sorry, I was not aware of the convention. What should I be doing here
>>> instead?
>>
>>Especially with the above note about GFN space being effectively unbounded
>>for PV guests, you need to find a different data structure to represent the
>>information you need. Perhaps something similar to what log-dirty mode
>>uses, except that you would need more levels in the extreme case, and hence
>>it might be worthwhile making the actually used number of levels dynamic?
> 
> I am a little confused here. In my initial discussions about this I was told 
> that PV guests have bounded and contiguous memory. This is why I took the 
> approach of an indexable array. 
> 
> " OTOH, all you need is a byte per pfn, and the great thing is that in PV 
> domains, the physmap is bounded and continuous. Unlike HVM and its PCI holes, 
> etc, which demand the sparse tree structure. So you can allocate an easily 
> indexable array, notwithstanding super page concerns (I think/hope)."
> 
> http://lists.xenproject.org/archives/html/xen-devel/2013-11/msg03860.html 
> 
> Did I misunderstand something here? 

What he stated is the situation at domain start. Due to ballooning and
other kinds of manipulations (as said, pv-ops re-arranges things to
avoid having memory [PFNs] overlap MMIO [MFNs]) this may change
immediately after the domain has started.

And even if it was a plain contiguous linear address space, you still
wouldn't be permitted to allocate a flat one-byte-per-page array, as
for large domains this might still overflow the MAX_ORDER allocation
constraint.

>>>>> @@ -2443,7 +2443,8 @@ int sh_remove_all_mappings(struct vcpu *v,
>>>>mfn_t gmfn)
>>>>>          if ( !(shadow_mode_external(v->domain)
>>>>>                 && (page->count_info & PGC_count_mask) <= 3
>>>>>                 && ((page->u.inuse.type_info & PGT_count_mask)
>>>>> -                   == !!is_xen_heap_page(page))) )
>>>>> +                       == !!is_xen_heap_page(page)))
>>>>> +                    && !(shadow_mode_mem_access(v->domain)) )
>>>>
>>>>You're breaking indentation, there are pointless parentheses again,
>>>>but most importantly - why?
>>>
>>> Sorry, I meant to ask a question about this in my patch message and
>>> mark this as a workaround. I am seeing the message on all
>>> sh_remove_all_mappings() and I was not able to figure out why this was
>>> happening. I just added this as a work around. I was hoping you or Tim
>>would shed more light on this.
>>
>>I'm afraid that without detail on which pages this triggers on, and you
>>at least having spent some time finding out where the stray/extra
>>references may be coming from, it's going to be hard to help.
> 
> Here are some of the prints I saw. I typically saw it for every set_entry() 
> call.

There are some rather large usage counts among the examples you
gave - these surely need to be understood rather than blindly ignored.

>>>>> +/* Number of access table pages for a PV domain */
>>>>> +#define get_domain_nr_access_table_pages(d) \
>>>>> +        DIV_ROUND_UP(P2M_ACCESS_SIZE * (d->tot_pages - 1), PAGE_SIZE)
>>>>
>>>>And once again. I wonder whether this code was tested on a suitably
>>>>big pv-ops PV guest.
>>>
>>> The largest PV domain I tried this with was 28GB.
>>
>>Which ought to have had GFNs larger than ->tot_pages, so I wonder how this
>>worked.
> 
> It would have had MFNs larger than tot_pages. I don't understand how it 
> would have had GFNs larger than tot_pages.

Once again, because it re-arranges its PFN space to avoid MMIO holes
(plus - applicable to non-pv-ops too - it may be started ballooned or
balloon out memory subsequently).

Jan
