From: George Dunlap <george.dunlap@citrix.com>
To: Chao Gao <chao.gao@intel.com>, xen-devel@lists.xen.org
Cc: Kevin Tian <kevin.tian@intel.com>, Wei Liu <wei.liu2@citrix.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	George Dunlap <George.Dunlap@eu.citrix.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Ian Jackson <ian.jackson@eu.citrix.com>,
	Jan Beulich <jbeulich@suse.com>
Subject: Re: [PATCH 0/4] mitigate the per-pCPU blocking list may be too long
Date: Wed, 26 Apr 2017 17:39:57 +0100	[thread overview]
Message-ID: <15f405cc-04aa-ac3d-8ae2-17f684b21d36@citrix.com> (raw)
In-Reply-To: <1493167967-74144-1-git-send-email-chao.gao@intel.com>

On 26/04/17 01:52, Chao Gao wrote:
> VT-d PI introduces a per-pCPU blocking list to track the vCPUs
> blocked on each pCPU. Theoretically, there can be up to 32K domains
> on a single host, with 128 vCPUs per domain. If all vCPUs are blocked
> on the same pCPU, 4M vCPUs end up in the same list, and traversing
> such a list consumes too much time. We have discussed this issue in
> [1,2,3].
> 
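
(For context, a rough sketch of the kind of per-pCPU list and wakeup
handler being described here; the types and names below are simplified
illustrations, not the actual Xen code.)

    /* Sketch only; assumes the usual Xen headers (xen/list.h,
     * xen/percpu.h, xen/sched.h, xen/spinlock.h). */
    struct pi_blocking_list {
        spinlock_t lock;
        struct list_head vcpus;   /* vCPUs blocked with this pCPU as their
                                     wakeup-interrupt destination */
        unsigned int count;       /* current number of entries */
    };

    static DEFINE_PER_CPU(struct pi_blocking_list, pi_blocking);

    /*
     * Wakeup-interrupt handler: walk the whole list and unblock every
     * vCPU whose posted-interrupt descriptor has a pending notification.
     * With millions of entries, this O(n) walk under a lock in interrupt
     * context is what becomes prohibitively expensive.
     */
    static void pi_wakeup_interrupt(unsigned int cpu)
    {
        struct pi_blocking_list *pl = &per_cpu(pi_blocking, cpu);
        struct vcpu *v, *tmp;

        spin_lock(&pl->lock);
        list_for_each_entry_safe ( v, tmp, &pl->vcpus, pi_blocking_entry )
            if ( pi_test_on(&v->pi_desc) )  /* ON set: interrupt pending */
                vcpu_unblock(v);
        spin_unlock(&pl->lock);
    }
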
> To mitigate this issue, we proposed the following two methods [3]:
> 1. Evenly distributing all the blocked vCPUs among all pCPUs.

So you're not actually distributing the *vcpus* among the pcpus (which
would imply some interaction with the scheduler); you're distributing
the vcpu PI wake-up interrupt between pcpus.  Is that right?

Doesn't having a PI on a different pcpu than where the vcpu is running
mean at least one IPI to wake up that vcpu?  If so, aren't we imposing a
constant overhead on basically every single interrupt, as well as
increasing the IPI traffic, in order to avoid a highly unlikely
theoretical corner case?

A general maxim in OS development is "Make the common case fast, and the
uncommon case correct."  It seems like it would be better in the common
case to have the PI vectors on the pcpu on which the vcpu is running,
and only implement the balancing when the list starts to get too long.
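
Something along these lines, purely as an illustration building on the
sketch earlier in this mail (PI_LIST_LIMIT, the per-pCPU count and
pi_least_loaded_pcpu() are made-up names, not a concrete proposal):

    /*
     * Keep the PI wakeup vector on the pCPU the vCPU runs on in the
     * common case, and fall back to another pCPU only once that pCPU's
     * blocking list has already grown too long.
     */
    #define PI_LIST_LIMIT 64    /* arbitrary threshold for the sketch */

    static unsigned int pi_pick_dest_pcpu(const struct vcpu *v)
    {
        unsigned int cpu = v->processor;

        if ( per_cpu(pi_blocking, cpu).count <= PI_LIST_LIMIT )
            return cpu;                 /* common case: wakeup is local */

        return pi_least_loaded_pcpu();  /* rare case: spill elsewhere */
    }

That way the extra IPI and balancing cost would only be paid once the
unlikely pile-up actually starts to happen.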

What do you think?

> 2. Don't put the blocked vCPUs which won't be woken by the wakeup
> interrupt into the per-pCPU list.
> 
> Patch 1/4 adds a trace event for adding an entry to the PI blocking
> list. With this patch, data can be collected to help validate the
> following patches.
> 
> Patch 2/4 randomly distributes entries (vCPUs) among all online
> pCPUs, which can theoretically reduce the maximum number of entries
> in a list by a factor of N, where N is the number of pCPUs.
> 
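
(As an illustration of the random spreading in patch 2, with made-up
helper names rather than the patch code:)

    /*
     * Pick an online pCPU at random as the blocked vCPU's wakeup
     * destination, so that no single per-pCPU list grows N times longer
     * than the average.  The trade-off is that waking the vCPU then
     * usually needs a cross-pCPU IPI, since the chosen pCPU is rarely
     * the one the vCPU runs on.
     */
    static unsigned int pi_spread_dest_pcpu(void)
    {
        /* pick_nth_online_pcpu() and prandom() are made-up names */
        return pick_nth_online_pcpu(prandom() % num_online_cpus());
    }
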
> Patch 3/4 adds a reference count to the vCPU's pi_desc. When the
> pi_desc is recorded in an IRTE, the refcount is incremented by 1;
> when the pi_desc is cleared from an IRTE, it is decremented by 1.
> 
> In patch 4/4, a vCPU is added to the PI blocking list only if its
> pi_desc is referenced by at least one IRTE.
> 
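
(Again only to illustrate the refcounting scheme of patches 3 and 4,
with made-up names; pi_irte_refcnt here stands for a hypothetical
atomic_t added to struct vcpu:)

    /*
     * Count how many IRTEs currently reference a vCPU's pi_desc, and
     * put the vCPU on a per-pCPU blocking list only while that count is
     * non-zero: otherwise no device can post an interrupt to it, so no
     * wakeup interrupt will ever target it.
     */
    static void pi_desc_irte_set(struct vcpu *v)
    {
        atomic_inc(&v->pi_irte_refcnt);
    }

    static void pi_desc_irte_clear(struct vcpu *v)
    {
        atomic_dec(&v->pi_irte_refcnt);
    }

    static bool pi_needs_blocking_list(const struct vcpu *v)
    {
        return atomic_read(&v->pi_irte_refcnt) != 0;
    }
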
> I tested this series in the following scenario:
> * one guest with 128 vCPUs and a NIC assigned to it
> * all 128 vCPUs pinned to one pCPU
> * xentrace used to collect events for 5 minutes
> 
> I compared the maximum number of entries in one list and the number
> of events (adding an entry to the PI blocking list) with and without
> the latter three patches. Here is the result:
> ---------------------------------------------------
> |      Items      | Maximum of #entries | #events |
> ---------------------------------------------------
> | W/  the patches |          6          |  22740  |
> ---------------------------------------------------
> | W/O the patches |         128         |  46481  |
> ---------------------------------------------------

Any chance you could trace how long the list traversal took?  It would
be good for future reference to have an idea what kinds of timescales
we're talking about.

 -George


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
