From: George Dunlap <george.dunlap@citrix.com>
To: Jan Beulich <JBeulich@suse.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Feng Wu <feng.wu@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>, Keir Fraser <keir@xen.org>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Dario Faggioli <dario.faggioli@citrix.com>,
	"xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Subject: Re: Ideas Re: [PATCH v14 1/2] vmx: VT-d posted-interrupt core logic handling
Date: Tue, 8 Mar 2016 17:05:47 +0000
Message-ID: <56DF066B.3090106@citrix.com>
In-Reply-To: <56DF00D902000078000DA7C1@prv-mh.provo.novell.com>

On 08/03/16 15:42, Jan Beulich wrote:
>>>> On 08.03.16 at 15:42, <George.Dunlap@eu.citrix.com> wrote:
>> On Tue, Mar 8, 2016 at 1:10 PM, Wu, Feng <feng.wu@intel.com> wrote:
>>>> -----Original Message-----
>>>> From: George Dunlap [mailto:george.dunlap@citrix.com]
>> [snip]
>>>> It seems like there are a couple of ways we could approach this:
>>>>
>>>> 1. Try to optimize the reverse look-up code so that it's not a linear
>>>> linked list (getting rid of the theoretical fear)
>>>
>>> Good point.
>>>
>>>>
>>>> 2. Try to test engineered situations where we expect this to be a
>>>> problem, to see how big of a problem it is (proving the theory to be
>>>> accurate or inaccurate in this case)
>>>
>>> Maybe we can run an SMP guest with all the vcpus pinned to a dedicated
>>> pCPU, run some benchmark in the guest with VT-d PI and without VT-d
>>> PI, and then see the performance difference between these two scenarios.
>>
>> This would give us an idea what the worst-case scenario would be.
> 
> How would a single VM ever give us an idea about the worst
> case? Something getting close to worst case is a ton of single
> vCPU guests all temporarily pinned to one and the same pCPU
> (could be multi-vCPU ones, but the more vCPU-s the more
> artificial this pinning would become) right before they go into
> blocked state (i.e. through one of the two callers of
> arch_vcpu_block()), the pinning removed while blocked, and
> then all getting woken at once.

Why would removing the pinning be important?

And I guess it doesn't actually require all the VMs to be *receiving*
interrupts; it just requires them to be *capable* of receiving
interrupts, for there to be a long chain all blocked on the same
physical cpu.

> 
>>  But
>> pinning all vcpus to a single pcpu isn't really a sensible use case we
>> want to support -- if you have to do something stupid to get a
>> performance regression, then as far as I'm concerned it's not a
>> problem.
>>
>> Or to put it a different way: If we pin 10 vcpus to a single pcpu and
>> then pound them all with posted interrupts, and there is *no*
>> significant performance regression, then that will conclusively prove
>> that the theoretical performance regression is of no concern, and we
>> can enable PI by default.
> 
> The point isn't the pinning. The point is what pCPU they're on when
> going to sleep. And that could involve quite a few more than just
> 10 vCPU-s, provided they all sleep long enough.
> 
> And the "theoretical performance regression is of no concern" is
> also not a proper way of looking at it, I would say: Even if such
> a situation would happen extremely rarely, if it can happen at all,
> it would still be a security issue.

What I'm trying to get at is -- exactly what situation?  What actually
constitutes a problematic interrupt latency / interrupt processing
workload, how many vcpus must be sleeping on the same pcpu to actually
risk triggering that latency / workload, and how feasible is it that
such a situation would arise in a reasonable scenario?

If 200us is too long, and it only takes 3 sleeping vcpus to get there,
then yes, there is a genuine problem we need to try to address before we
turn it on by default.  If we say that up to 500us is tolerable, and it
takes 100 sleeping vcpus to reach that latency, then this is something I
don't really think we need to worry about.

"I think something bad may happen" is a really difficult to work with.
"I want to make sure that even a high number of blocked cpus won't cause
the interrupt latency to exceed 500us; and I want it to be basically
impossible for the interrupt latency to exceed 5ms under any
circumstances" is a concrete target someone can either demonstrate that
they meet, or aim for when trying to improve the situation.

Feng: It should be pretty easy for you to:
* Implement a modified version of Xen where
 - *All* vcpus get put on the waitqueue
 - Measure how long it takes to run the loop in pi_wakeup_interrupt
   (see the sketch below)
* Have one VM receiving posted interrupts on a regular basis.
* Slowly increase the number of vcpus blocked on a single cpu (e.g., by
creating more guests), stopping when you either reach 500us or 500
vcpus. :-)
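
For the measurement itself, something along these lines ought to be
enough.  This is only a sketch: I'm using NOW() rather than raw TSC
reads for simplicity, the body of the handler is elided, and
pi_account_latency() is an invented helper (see the next snippet):

    /* Sketch: time one invocation of the wakeup handler's list walk. */
    static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
    {
        s_time_t start = NOW(), delta;

        /* ... existing walk of this pcpu's blocked-vcpu list,
         * kicking any vcpu whose descriptor has the ON bit set ... */

        delta = NOW() - start;      /* nanoseconds spent in the walk */
        pi_account_latency(delta);  /* invented -- see below */
    }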

To report the measurements, you could either create a Xen trace record
and use xentrace_format or xenalyze to plot the results; or you could
create some software performance counters for different "buckets" --
less than 100us, 100-200us, 200-300us, 300-400us, 400-500us, and more
than 500us.

Or you could printk the min / average / max every 5000 interrupts or so. :-)
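
By way of illustration, the bucket counters could be as simple as this
(again just a sketch; all names are invented, and the same helper can
cover the printk-every-5000-interrupts variant):

    #define PI_LAT_BUCKETS 6  /* <100us, 100-200, ..., 400-500, >500us */
    static DEFINE_PER_CPU(unsigned long, pi_lat_bucket[PI_LAT_BUCKETS]);
    static DEFINE_PER_CPU(unsigned long, pi_lat_count);

    static void pi_account_latency(s_time_t delta)
    {
        unsigned int i = delta / MICROSECS(100);

        if ( i >= PI_LAT_BUCKETS )
            i = PI_LAT_BUCKETS - 1;
        this_cpu(pi_lat_bucket)[i]++;

        /* Dump the histogram every 5000 interrupts. */
        if ( ++this_cpu(pi_lat_count) % 5000 == 0 )
            printk("PI wakeup CPU%u: %lu %lu %lu %lu %lu %lu\n",
                   smp_processor_id(),
                   this_cpu(pi_lat_bucket)[0], this_cpu(pi_lat_bucket)[1],
                   this_cpu(pi_lat_bucket)[2], this_cpu(pi_lat_bucket)[3],
                   this_cpu(pi_lat_bucket)[4], this_cpu(pi_lat_bucket)[5]);
    }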

To test, it seems like a network benchmark with short packet lengths
should be able to trigger large numbers of interrupts; and it can also
let you know if / when adding more vcpus has a performance impact.

Or alternatively, you could try to come up with a quicker reverse-lookup
algorithm. :-)
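
For what it's worth, there may not be an obvious key to hash on here:
the wakeup vector doesn't carry any vcpu identity, so some scan of
candidates looks unavoidable, and "quicker" probably means bounding how
long any one pcpu's list can get.  A purely illustrative sketch of that
idea (every name is invented, and the NDST re-pointing it implies is
exactly the kind of detail that would need thinking through):

    #define PI_BLOCKED_CAP 16
    static DEFINE_PER_CPU(unsigned int, pi_blocked_cnt);

    /* Pick a pcpu whose blocking list still has room, so no single
     * list -- and hence no single wakeup-interrupt walk -- can grow
     * without bound. */
    static unsigned int pi_pick_blocking_cpu(unsigned int cpu)
    {
        unsigned int c = cpu;

        while ( per_cpu(pi_blocked_cnt, c) >= PI_BLOCKED_CAP )
        {
            c = cpumask_cycle(c, &cpu_online_map);
            if ( c == cpu )  /* all lists full: fall back to home pcpu */
                break;
        }
        return c;
    }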

 -George

