Re: [PATCH v4 3/3] x86/ioreq server: Add HVMOP to map guest ram with p2m_ioreq_server to an ioreq server.

From: Yu Zhang <yu.c.zhang@linux.intel.com>
To: George Dunlap <george.dunlap@citrix.com>,
	Jan Beulich <JBeulich@suse.com>
Cc: Kevin Tian <kevin.tian@intel.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Tim Deegan <tim@xen.org>,
	xen-devel@lists.xen.org, Paul Durrant <paul.durrant@citrix.com>,
	zhiyuan.lv@intel.com, JunNakajima <jun.nakajima@intel.com>
Subject: Re: [PATCH v4 3/3] x86/ioreq server: Add HVMOP to map guest ram with p2m_ioreq_server to an ioreq server.
Date: Fri, 24 Jun 2016 12:16:08 +0800	[thread overview]
Message-ID: <576CB408.90104@linux.intel.com> (raw)
In-Reply-To: <b0f86fb1-d408-ac80-060b-274c7559f922@citrix.com>

On 6/23/2016 6:33 PM, George Dunlap wrote:
> On 23/06/16 08:37, Yu Zhang wrote:
>> On 6/22/2016 7:33 PM, George Dunlap wrote:
>>> On 22/06/16 11:07, Yu Zhang wrote:
>>>> On 6/22/2016 5:47 PM, George Dunlap wrote:
>>>>> On 22/06/16 10:29, Jan Beulich wrote:
>>>>>>>>> On 22.06.16 at 11:16, <george.dunlap@citrix.com> wrote:
>>>>>>> On 22/06/16 07:39, Jan Beulich wrote:
>>>>>>>>>>> On 21.06.16 at 16:38, <george.dunlap@citrix.com> wrote:
>>>>>>>>> On 21/06/16 10:47, Jan Beulich wrote:
>>>>>>>>>>>>>> And then - didn't we mean to disable that part of XenGT during
>>>>>>>>>>>>>> migration, i.e. temporarily accept the higher performance
>>>>>>>>>>>>>> overhead without the p2m_ioreq_server entries? In which case
>>>>>>>>>>>>>> flipping everything back to p2m_ram_rw after (completed or
>>>>>>>>>>>>>> canceled) migration would be exactly what we want. The (new
>>>>>>>>>>>>>> or previous) ioreq server should attach only afterwards, and
>>>>>>>>>>>>>> can then freely re-establish any p2m_ioreq_server entries it
>>>>>>>>>>>>>> deems necessary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Well, I agree this part of XenGT should be disabled during
>>>>>>>>>>>>> migration.
>>>>>>>>>>>>> But in such
>>>>>>>>>>>>> case I think it's device model's job to trigger the p2m type
>>>>>>>>>>>>> flipping(i.e. by calling
>>>>>>>>>>>>> HVMOP_set_mem_type).
>>>>>>>>>>>> I agree - this would seem to be the simpler model here, despite
>>>>>>>>>>>> (as
>>>>>>>>>>>> George validly says) the more consistent model would be for the
>>>>>>>>>>>> hypervisor to do the cleanup. Such cleanup would imo be
>>>>>>>>>>>> reasonable
>>>>>>>>>>>> only if there was an easy way for the hypervisor to enumerate
>>>>>>>>>>>> all
>>>>>>>>>>>> p2m_ioreq_server pages.
>>>>>>>>>>> Well, for me, the "easy way" means we should avoid traversing
>>>>>>>>>>> the whole ept
>>>>>>>>>>> paging structure all at once, right?
>>>>>>>>>> Yes.
>>>>>>>>> Does calling p2m_change_entry_type_global() not satisfy this
>>>>>>>>> requirement?
>>>>>>>> Not really - that addresses the "low overhead" aspect, but not the
>>>>>>>> "enumerate all such entries" one.
>>>>>>> I'm sorry, I think I'm missing something here.  What do we need the
>>>>>>> enumeration for?
>>>>>> We'd need that if we were to do the cleanup in the hypervisor (as
>>>>>> we can't rely on all p2m entry re-calculation to have happened by
>>>>>> the time a new ioreq server registers for the type).
>>>>> So you're afraid of this sequence of events?
>>>>> 1) Server A de-registered, triggering a ioreq_server -> ram_rw type
>>>>> change
>>>>> 2) gfn N is marked as misconfigured
>>>>> 3) Server B registers and marks gfn N as ioreq_server
>>>>> 4) When N is accessed, the misconfiguration is resolved incorrectly to
>>>>> ram_rw
>>>>>
>>>>> But that can't happen, because misconfigured entries are resolved
>>>>> before
>>>>> setting a p2m entry; so at step 3, gfn N will be first set to
>>>>> (non-misconfigured) ram_rw, then changed to (non-misconfigured)
>>>>> ioreq_server.
>>>>>
>>>>> Or is there another sequence of events that I'm missing?
>>>> Thanks for your reply, George. :)
>>>> If no log dirty is triggered during this process, your sequence is
>>>> correct.
>>>> However, if log dirty is triggered, we'll met problems. I have described
>>>> this
>>>> in previous mails :
>>>>
>>>> http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02426.html
>>>>
>>>> on Jun 20
>>>>
>>>> and
>>>>
>>>> http://lists.xenproject.org/archives/html/xen-devel/2016-06/msg02575.html
>>>>
>>>> on Jun 21
>>> Right -- sorry, now I see the issue:
>>>
>>> 1. Server A marks gfn X as ioreq_server
>>> 2. Server A deregisters, gfn X misconfigured
>>> 3. Server B registers, marks gfn Y as ioreq_server
>>> 4. Logdirty mode enabled; gfn Y misconfigured
>>> 5. When X or Y are accessed, resolve_misconfigure() has no way of
>>> telling whether the entry is from server A (which should be set to
>>> logdirty) or from server B (which should be left as ioreq_server).
>> Exactly.  :)
>> Another simpler scenario would be
>> 1. Server A marks gfn X as p2m_ioreq_server;
>> 2. Logdirty mode enabled; gfn X misconfigured;
>> 3. When X is written, it will not cause ept vioalation, but ept
>> misconfig, and the
>> resolve_misconfig() would set gfn X back to p2m_ram_rw, thereafter we
>> can not
>> track access to X;
> Right, so this is a reason that simply making misconfigurations always
> resolve ioreq_server into ram_rw isn't compatible with logdirty.
>
>> Note: Not resetting the p2m type for p2m_ioreq_server when
>> p2m->ioreq_server is
>> not NULL is suitable for this simpler scenario, but is not correct if
>> take your scenario
>> into account.
>>
>> The core reason is I could not find a simple solution in
>> resolve_misconfig() to handle
>> handle both the outdated p2m_ioreq_server entries, the in-use ones and
>> to support
>> the logdirty feature at the same time.
> Indeed; and as I said, the real problem is that
> p2m_change_entry_type_global() isn't really properly abstracted; in
> order to use it you need to know how it works and be careful not to use
> it at the wrong time.
>
> Short-term, thinking through a handful of the scenarios we want to
> support should be good enough.  Long-term, making it more robust so that
> we don't have to think so hard about is probably better.
>
>>> In a sense this is a deficiency in the change_entry_type_global()
>>> interface.  A common OS principle is "make the common case fast, and the
>>> uncommon case correct".  The scenario described above seems to me to be
>>> an uncommon case which is handled quickly but incorrectly; ideally we
>>> should handle it correctly, even if it's not very quick.
>>>
>>> Synchronously resolving a previous misconfig is probably the most
>>> straightforward thing to do.  It could be done at point #3, when an M->N
>>> type change is not complete and a new p2m entry of type M is written; it
>>> could be at point #4, when an N->O type change is initiated while an
>>> M->N type change hasn't completed.  Or it could be when an N->O type
>>> change happens while there are unfinished M->N transitions *and*
>>> post-type-change M entries.
>> Sorry, I did not quite get it.  Could you please elaborate more? Thanks! :)
> Well the basic idea is to make change_entry_type_global() *appear* to
> all external callers as though the change happened immediately.  And the
> basic problem is that at the moment, you can start a second
> change_entry_type_global() before the first one has actually finished
> changing all the types it was meant to change.  So the improvement is to
> make sure that all the types which need to be changed actually get
> changed before the second invocation starts.
>
> The absolute simplest thing to do would be to make it actually search
> through the p2m table and make the change immediately.  But we'd like to
> avoid this if we can because it's so slow.
>
> So the next simplest thing to do would be that when someone calls
> change_entry_type_global() a second time, you go through every
> misconfigured entry and resolve it, so that the new
> change_entry_type_global() starts with a clean slate.  This should be
> faster than just sweeping the whole p2m table, since we only need to
> check the p2m entries that haven't been touched since we did the type
> change, but it may still be a lot of work.
>
> So there are other optimizations we might be able to make to try to
> avoid going through and re-syncing things; and those are the examples
> that I gave.
>
>>> For the time being though, this will fail at #4, right?  That is,
>>> logdirty mode cannot be enabled while server B is registered?
>>>
>>> That does mean we'd be forced to sort out the situation before we allow
>>> logdirty and ioreq_server to be used at the same time, but that doesn't
>>> really seem like such a bad idea to me.
>> One solution I thought of is to just return failure in
>> hap_enable_log_dirty()
>> if p2m->ioreq.server is not NULL. But I did not choose such approach,
>> because:
>>
>> 1> I still want to keep the logdirty feature so that XenGT can use it to
>> keep track
>> of the dirty rams when we support live migration in the future;
> But that's not something you can set in stone -- you can have it return
> -EBUSY now, and then at such time as you add dirty vram support for
> ioreq_server p2m types (which I don't think should be hard at all,
> actually), you can remove that restriction.

OK. Returning -EBUSY is fine to me.
In fact, I do not really worry about the tracking of ioreq_server gfns, 
I was
worrying about the normal dirty ram pages. But if p2m_change_entry_type()
can be enhanced in the future, I can remove that restriction then. :)

> The fact is that at he moment, setting logdirty *will not work*; and the
> best interface is one in which broken things cannot happen even by
> accident.  Regardless of what we end up deciding wrt Xen changing the
> entries, I think that Xen should refuse to enable logdirty mode when
> there is an ioreq server registered for ioreq_server p2m entries.
>
>> 2> I also agree with Paul's argument: it is device model's duty to do
>> the p2m type
>> resetting work.
> But that's not really the point.  We all agree that people should look
> where they're going and it's generally an individual's responsibility
> not to fall into pits or run into things in the road or on the sidewalk.
>   But that doesn't mean that it's therefore OK to dig a pit with spikes
> at the bottom in the middle of where people normally walk or cycle.
> Because even though it is an individual person's responsibility not to
> walk into holes, occasionally people are distracted or don't see well or
> make mistakes; and the consequences for being temporarily distracted
> should never be "falling onto a bed of sharp spikes". :-)  If you do
> have to dig a hole in the sidewalk, then at very least you need to put a
> physical barrier around it, so that the consequences for being
> distracted are "runs into a barrier" rather than "falls into a pit".
> But the best of all, if you can manage it, is not to dig the hole at all.
>
> Similarly, even if it could be in theory the device model's duty to
> reset the p2m entries it changed, it's still the case that programmers
> make mistakes.  When those mistakes happen, we at very least want it to
> be as easy to figure out what the problem is as possible; and if we can,
> we want to make those mistakes completely harmless.
>
> Making it possible for ioreq server A's entries to remain outstanding
> after ioreq server B connects is the programming equivalent of leaving a
> big open hole in the middle of a sidewalk: it means that when there's a
> mistake made, there aren't any obvious immediate failures that tell you,
> "Server A forgot to release some entries".  Instead, you will get random
> failures, as bits of memory behave strangely or run very slowly for no
> apparent reason.
>
> Having it impossible to connect ioreq server B if there are still
> outstanding entries from ioreq server A is the equivalent of digging a
> hole and then putting up a barrier.  Now you get a failure when you try
> it, and you're told exactly what the problem is -- the last guy didn't
> release all his entries.  If you don't have the code you're still a bit
> stuck, but at least you know it just doesn't work, rather than failing
> in mysterious and difficult to detect ways later.
>
> Having Xen reset the entries just makes this entire problem go away --
> it's like accessing whatever you needed to access by digging sideways
> from the storm drains, rather than digging a hole in the sidewalk.
>
> It's all well and good to say, "It's the device model's responsibility",
> but we need to plan on programmers making mistakes.  (And we also need
> to plan for administrators making mistakes, which is why I think
> returning -EBUSY when you try to enable logdirty with

Hah. This is a very good metaphor. I am convinced. :)
Though I have doubts about how to refactor the 
p2m_change_entry_type_global() interface,
I'm now willing to take your suggestions:
a> still need the p2m resetting when ioreq server is unbounded;
b> disable log dirty feature if one ioreq server is bounded.

Does anyone else has different opinions? Thanks!

>>> I'm still open to being convinced, but at the moment it really seems to
>>> me like improving the situation is the better long-term option.
>>>
>> Thanks for all your advices, George. I'm also willing to taking other
>> advices, if we have
>> a more acceptable(for you, Jan and other maintainers) resync approach in
>> hypervisor,
>> I'd like  to add this. If the code is too complicated, I can submit it
>> in a separate new
>> patchset. :)
> Well I think sometime in early July I should be able to make some time
> to take a look at it properly.  Maybe I can start with a "draft" patch,
> and you can take it and fix it up and make it work.  Or maybe I'll find
> it's actually too complicated, and then agree with you that relying on
> the server to clean up after itself is the only option. :-)

Thank you, George. I definitely would like to take this work.
And before that, I think disable the log dirty could be OK for me(
after all, making vGPU live migratible requires more features added).

Yu

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel