From: George Dunlap <george.dunlap@citrix.com>
To: David Vrabel <david.vrabel@citrix.com>,
	Jan Beulich <JBeulich@suse.com>,
	Kevin Tian <kevin.tian@intel.com>
Cc: Lars Kurth <lars.kurth@citrix.com>, Feng Wu <feng.wu@intel.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Dario Faggioli <dario.faggioli@citrix.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: vmx: VT-d posted-interrupt core logic handling
Date: Thu, 10 Mar 2016 10:46:55 +0000
Message-ID: <56E1509F.1060108@citrix.com>
In-Reply-To: <56E14DDA.3040500@citrix.com>

On 10/03/16 10:35, David Vrabel wrote:
> On 10/03/16 10:18, Jan Beulich wrote:
>>>>> On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
>>>>  From: Tian, Kevin
>>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>>
>>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>>
>>>>>
>>>>>> There are many linked-list usages in the Xen hypervisor today,
>>>>>> each with a different theoretical maximum size. The closest
>>>>>> one to PI is probably the usage in tmem (pool->share_list),
>>>>>> which is page based and so could grow 'overly large'. Other
>>>>>> examples are orders of magnitude smaller, e.g. s->ioreq_vcpu_list
>>>>>> in the ioreq server (which could be 8K in the above example), and
>>>>>> d->arch.hvm_domain.msixtbl_list in MSI-X virtualization (which
>>>>>> could be 2^11 per the spec). Do we also want to construct some
>>>>>> artificial scenarios to examine those, since in actual operation
>>>>>> lists with thousands of entries may also become a problem?
>>>>>>
>>>>>> I just want to figure out how best we can handle all the related
>>>>>> linked-list usages in the current hypervisor.
>>>>>
>>>>> As you say, those are (perhaps with the exception of tmem, which
>>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>>> isn't on by default) in the order of a few thousand list elements.
>>>>> And as mentioned above, different bounds apply for lists traversed
>>>>> in interrupt context versus those traversed only in "normal" context.
>>>>>
>>>>
>>>> That's a good point. Interrupt context should have more restrictions.
>>>
>>> Hi, Jan,
>>>
>>> I've been thinking about your earlier idea of an evenly distributed list:
>>>
>>> --
>>> Ah, right, I think that limitation was mentioned before, yet I've
>>> forgotten about it again. But that only slightly alters the
>>> suggestion: distributing vCPU-s evenly would then require changing
>>> their placement on the pCPU in the course of entering
>>> blocked state.
>>> --
>>>
>>> Actually, after more thinking, there is no hard requirement that
>>> the vcpu must block on the pcpu which is configured in the 'NDST'
>>> field of that vcpu's PI descriptor. What really matters is that the
>>> vcpu is added to the linked list of that very pcpu, so that when a
>>> PI notification comes we can always find the vcpu struct in that
>>> pcpu's linked list. Of course, one drawback of such a placement is
>>> the additional IPI incurred in the wake-up path.
>>>
>>> Then one possible optimized policy within vmx_vcpu_block could 
>>> be:
>>>
>>> (Say PCPU1 is the pcpu which VCPU1 is currently blocked on.)
>>> - As long as the number of vcpus in the linked list on PCPU1 is
>>> below a threshold (say 16), add VCPU1 to that list, with NDST set
>>> to PCPU1. Upon a PI notification on PCPU1, the local linked list is
>>> searched to find VCPU1, which is then unblocked on PCPU1;
>>>
>>> - Otherwise, add VCPU1 to the list of a PCPU2 chosen by a simple
>>> distribution algorithm (e.g. based on vcpu_id/vm_id). VCPU1 still
>>> blocks on PCPU1, but NDST is set to PCPU2. Upon a notification on
>>> PCPU2, the local linked list is searched to find VCPU1, and an IPI
>>> is then sent to PCPU1 to unblock VCPU1;
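
As a purely illustrative sketch of the placement policy described
above (all structures, fields and helpers below are made up for the
example, not taken from the actual series), the blocking side might
look roughly like this:

    /* Hypothetical per-pcpu list of blocked vcpus with PI set up. */
    struct pi_blocked_list {
        spinlock_t lock;
        struct list_head head;
        unsigned int nr_entries;
    };
    static DEFINE_PER_CPU(struct pi_blocked_list, pi_blocked);

    #define PI_LIST_THRESHOLD 16   /* the example threshold from above */

    /* Hypothetical: called when VCPU1 (v) blocks on v->processor (PCPU1).
     * 'pi_blocked_entry' and 'pi_desc.ndst' are illustrative vcpu fields. */
    static void pi_place_blocked_vcpu(struct vcpu *v)
    {
        unsigned int dest = v->processor;                    /* PCPU1 */
        struct pi_blocked_list *pbl = &per_cpu(pi_blocked, dest);

        if ( pbl->nr_entries >= PI_LIST_THRESHOLD )
        {
            /* Spill over to a PCPU2 picked by a simple hash; the vcpu
             * itself keeps blocking on PCPU1.  (Assumes online pcpus
             * are numbered 0 .. num_online_cpus() - 1 for simplicity.) */
            dest = (v->domain->domain_id + v->vcpu_id) % num_online_cpus();
            pbl = &per_cpu(pi_blocked, dest);
        }

        spin_lock(&pbl->lock);
        list_add_tail(&v->pi_blocked_entry, &pbl->head);
        pbl->nr_entries++;
        spin_unlock(&pbl->lock);

        /* Route the posted-interrupt notification to 'dest'. */
        write_atomic(&v->pi_desc.ndst, cpu_physical_id(dest));
    }
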
>>
>> Sounds possible, if the lock handling can be got right. But of
>> course there can't be any hard limit like 16, at least not alone
>> (on a system with extremely many mostly idle vCPU-s we'd
>> need to allow larger counts - see my earlier explanations in this
>> regard).
> 
> You could also consider only waking the first N VCPUs and just making
> the rest runnable.  If you wake more VCPUs than PCPUs at the same time
> most of them won't actually be scheduled.

"Waking" a vcpu means "changing from blocked to runnable", so those two
things are the same.  And I can't figure out what you mean instead --
can you elaborate?

Waking up 1000 vcpus is going to take strictly more time than checking
whether there's a PI interrupt pending on 1000 vcpus to see if they need
to be woken up.
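
To put some concreteness behind that: on a PI notification, the handler
only needs a cheap per-vcpu test on that pcpu's blocked list, and only
pays the full wakeup cost for vcpus that actually have an interrupt
posted.  A rough sketch, reusing the hypothetical per-cpu list and vcpu
fields from the sketch above (again, names are illustrative, not the
exact ones from the series):

    /* Hypothetical PI notification handler body on this pcpu
     * (locking elided for brevity). */
    struct pi_blocked_list *pbl = &this_cpu(pi_blocked);
    struct vcpu *v;

    list_for_each_entry ( v, &pbl->head, pi_blocked_entry )
        /* Cheap check: is the "outstanding notification" (ON) bit of
         * the PI descriptor set? */
        if ( v->pi_desc.on )
            vcpu_unblock(v);   /* only now pay the cost of a wakeup */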

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Thread overview: 53+ messages
2016-02-29  3:00 [PATCH v14 0/2] Add VT-d Posted-Interrupts support Feng Wu
2016-02-29  3:00 ` [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling Feng Wu
2016-02-29 13:33   ` Jan Beulich
2016-02-29 13:52     ` Dario Faggioli
2016-03-01  5:39       ` Wu, Feng
2016-03-01  9:24         ` Jan Beulich
2016-03-01 10:16     ` George Dunlap
2016-03-01 13:06       ` Wu, Feng
2016-03-01  5:24   ` Tian, Kevin
2016-03-01  5:39     ` Wu, Feng
2016-03-04 22:00   ` Ideas " Konrad Rzeszutek Wilk
2016-03-07 11:21     ` George Dunlap
2016-03-07 15:53       ` Konrad Rzeszutek Wilk
2016-03-07 16:19         ` Dario Faggioli
2016-03-07 20:23           ` Konrad Rzeszutek Wilk
2016-03-08 12:02         ` George Dunlap
2016-03-08 13:10           ` Wu, Feng
2016-03-08 14:42             ` George Dunlap
2016-03-08 15:42               ` Jan Beulich
2016-03-08 17:05                 ` George Dunlap
2016-03-08 17:26                   ` Jan Beulich
2016-03-08 18:38                     ` George Dunlap
2016-03-09  5:06                       ` Wu, Feng
2016-03-09 13:39                       ` Jan Beulich
2016-03-09 16:01                         ` George Dunlap
2016-03-09 16:31                           ` Jan Beulich
2016-03-09 16:23                         ` On setting clear criteria for declaring a feature acceptable (was "vmx: VT-d posted-interrupt core logic handling") George Dunlap
2016-03-09 16:58                           ` On setting clear criteria for declaring a feature acceptable Jan Beulich
2016-03-09 18:02                           ` On setting clear criteria for declaring a feature acceptable (was "vmx: VT-d posted-interrupt core logic handling") David Vrabel
2016-03-10  1:15                             ` Wu, Feng
2016-03-10  9:30                             ` George Dunlap
2016-03-10  5:09                           ` Tian, Kevin
2016-03-10  8:07                             ` vmx: VT-d posted-interrupt core logic handling Jan Beulich
2016-03-10  8:43                               ` Tian, Kevin
2016-03-10  9:05                                 ` Jan Beulich
2016-03-10  9:20                                   ` Tian, Kevin
2016-03-10 10:05                                   ` Tian, Kevin
2016-03-10 10:18                                     ` Jan Beulich
2016-03-10 10:35                                       ` David Vrabel
2016-03-10 10:46                                         ` George Dunlap [this message]
2016-03-10 11:16                                           ` David Vrabel
2016-03-10 11:49                                             ` George Dunlap
2016-03-10 13:24                                             ` Jan Beulich
2016-03-10 11:00                                       ` George Dunlap
2016-03-10 11:21                                         ` Dario Faggioli
2016-03-10 13:36                                     ` Wu, Feng
2016-05-17 13:27                                       ` Konrad Rzeszutek Wilk
2016-05-19  7:22                                         ` Wu, Feng
2016-03-10 10:41                               ` George Dunlap
2016-03-09  5:22                   ` Ideas Re: [PATCH v14 1/2] " Wu, Feng
2016-03-09 11:25                     ` George Dunlap
2016-03-09 12:06                       ` Wu, Feng
2016-02-29  3:00 ` [PATCH v14 2/2] Add a command line parameter for VT-d posted-interrupts Feng Wu
