From: George Dunlap <dunlapg@umich.edu>
To: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Subject: Re: [Hackathon minutes] PV frontends/backends and NUMA machines
Date: Mon, 20 May 2013 14:48:50 +0100
Message-ID: <CAFLBxZbonzEwo4mF6PTSq6WQjU2haN_Ray-Z_3Td83i=f7zsbA@mail.gmail.com>
In-Reply-To: <alpine.DEB.2.02.1305201443510.4799@kaball.uk.xensource.com>

On Mon, May 20, 2013 at 2:44 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> Hi all,
> these are my notes from the discussion that we had at the Hackathon
> regarding PV frontends and backends running on NUMA machines.
>
>
> ---
>
> The problem: how can we make sure that frontends and backends run in the
> same NUMA node?
>
> We would need to run one backend kthread per NUMA node: we already have
> one kthread per netback vif (one per guest); we could pin each of them
> to a different NUMA node, the same one the frontend is running on.
>
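
Just to make the kthread-pinning idea concrete, here is a rough, untested
sketch of what the dom0 side might look like.  The xenvif names are only
approximate, and how the vif learns which node its frontend is on is
hand-waved:

/*
 * Untested sketch: start a per-vif netback kthread restricted to the
 * cpus of one NUMA node.  The xenvif names are approximate, and how
 * the vif learns its frontend's node is hand-waved.
 */
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/topology.h>

static struct task_struct *start_vif_kthread(struct xenvif *vif, int node)
{
    struct task_struct *task;

    task = kthread_create(xenvif_kthread, vif, "vif%d-node%d",
                          vif->domid, node);
    if (IS_ERR(task))
        return task;

    /* Keep the thread on the (dom0 v)cpus that belong to 'node'. */
    set_cpus_allowed_ptr(task, cpumask_of_node(node));
    wake_up_process(task);
    return task;
}
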
> But that means that dom0 would be running on several NUMA nodes at once;
> how much of a performance penalty would that be?
> We would need to export NUMA information to dom0, so that dom0 can make
> smart decisions on memory allocations, and we would also need to allocate
> memory for dom0 from multiple nodes.
>
> We need a way to automatically allocate the initial dom0 memory in Xen
> in a NUMA-aware way and we need Xen to automatically create one dom0 vcpu
> per NUMA node.
>
> After dom0 boots, the toolstack is going to decide where to place any
> new guests: it allocates the memory from the NUMA node it wants to run
> the guest on, and it asks dom0 to allocate the kthread on that node too
> (maybe by writing the NUMA node to xenstore).
>
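
The xenstore part could be as small as the toolstack writing the chosen
node under the vif's backend path; a sketch using libxenstore (the
"numa-node" key is made up, nothing reads it today):

/*
 * Sketch: record the chosen NUMA node for a vif under its backend path
 * so that dom0 can pin the matching kthread.  The "numa-node" key is
 * made up.
 */
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <xenstore.h>

static bool write_vif_numa_node(struct xs_handle *xsh, int domid,
                                int devid, int node)
{
    char path[128], val[16];

    snprintf(path, sizeof(path),
             "/local/domain/0/backend/vif/%d/%d/numa-node", domid, devid);
    snprintf(val, sizeof(val), "%d", node);

    return xs_write(xsh, XBT_NULL, path, val, strlen(val));
}
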
> We need to make sure that the interrupts/MSIs coming from the NIC arrive
> on the same pcpu that is running the vcpu that needs to receive them.
> We need to do irq balancing in dom0; Xen will then automatically make the
> physical MSIs follow the vcpu.
>
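
For the dom0 side of that, irq balancing ultimately comes down to writing
a cpu mask into /proc/irq/<N>/smp_affinity; something like this (sketch
only, and it only handles vcpu numbers below 32 for brevity):

/*
 * Sketch: steer one NIC queue's irq to a given dom0 vcpu by writing the
 * affinity mask; Xen should then make the physical MSI follow that vcpu.
 */
#include <stdio.h>

static int set_irq_affinity(int irq, unsigned int vcpu)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (!f)
        return -1;

    /* smp_affinity takes a hex cpu mask; single-vcpu mask here. */
    fprintf(f, "%x\n", 1u << vcpu);
    fclose(f);
    return 0;
}
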
> If the card is multiqueue, we need to make sure that we use the multiple
> queues so that we can have different sources of interrupts/MSIs for
> each vif. This allows us to independently notify each dom0 vcpu.

So the work items I remember are as follows:
1. Implement NUMA affinity for vcpus
2. Implement Guest NUMA support for PV guests
3. Teach Xen how to make a sensible NUMA allocation layout for dom0
4. Teach the toolstack to pin the netback threads to dom0 vcpus
running on the correct node(s)

Dario will do #1.  I volunteered to take a stab at #2 and #3.  #4 we
should be able to do independently of 2 and 3 -- it should give a
slight performance improvement due to cache proximity even if dom0
memory is striped across the nodes.
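
For what it's worth, the mechanical part of #4 in the toolstack probably
boils down to a sched_setaffinity() call on the kthread's pid, roughly
like this (sketch; finding the pid and the right vcpu set is the
interesting bit, and it assumes the kthread accepts affinity changes):

/*
 * Sketch for #4: pin a netback kthread (by its pid in dom0) to the set
 * of dom0 vcpus that run on the frontend's node.  How the toolstack
 * finds the pid and the vcpu list is hand-waved.
 */
#define _GNU_SOURCE
#include <sched.h>

static int pin_netback_thread(pid_t kthread_pid,
                              const int *vcpus, int nr_vcpus)
{
    cpu_set_t mask;
    int i;

    CPU_ZERO(&mask);
    for (i = 0; i < nr_vcpus; i++)
        CPU_SET(vcpus[i], &mask);

    return sched_setaffinity(kthread_pid, sizeof(mask), &mask);
}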

Does someone want to volunteer to take a look at #4?  I suspect that
the core technical implementation will be simple, but getting a stable
design that everyone is happy with for the future will take a
significant number of iterations.  Learn from my fail w/ USB hot-plug
in 4.3, and start the design process early. :-)

 -George

