All of lore.kernel.org
 help / color / mirror / Atom feed
* [Hackathon minutes] PV frontends/backends and NUMA machines
@ 2013-05-20 13:44 Stefano Stabellini
  2013-05-20 13:48 ` George Dunlap
  0 siblings, 1 reply; 28+ messages in thread
From: Stefano Stabellini @ 2013-05-20 13:44 UTC (permalink / raw)
  To: xen-devel

Hi all,
these are my notes from the discussion that we had at the Hackathon
regarding PV frontends and backends running on NUMA machines.


---

The problem: how can we make sure that frontends and backends run in the
same NUMA node?

We would need to run one backend kthread per NUMA node: we have already
one kthread per netback vif (one per guest), we could pin each of them
on a different NUMA node, the same one the frontend is running on.

But that means that dom0 would be running on several NUMA nodes at once,
how much of a performance penalty would that be?
We would need to export NUMA information to dom0, so that dom0 can make
smart decisions on memory allocations and we would also need to allocate
memory for dom0 from multiple nodes.

We need a way to automatically allocate the initial dom0 memory in Xen
in a NUMA-aware way and we need Xen to automatically create one dom0 vcpu
per NUMA node.

After dom0 boots, the toolstack is going to decide where to place any
new guests: it allocates the memory from the NUMA node it wants to run
the guest on and it is going to ask dom0 to allocate the kthread from
that node too. (Maybe writing the NUMA node on xenstore.)

We need to make sure that the interrupts/MSIs coming from the NIC arrive
on the same pcpu that is running the vcpu that needs to receive it.
We need to do irqbalacing in dom0, then Xen automatically will make the
physical MSIs follow the vcpu automatically.

If the card is multiqueue we need to make sure that we use the multiple
queues so that we can have difference sources of interrupts/MSIs for
each vif. This allows us to independently notify each dom0 vcpu.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-20 13:44 [Hackathon minutes] PV frontends/backends and NUMA machines Stefano Stabellini
@ 2013-05-20 13:48 ` George Dunlap
  2013-05-21  8:32   ` Tim Deegan
                     ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: George Dunlap @ 2013-05-20 13:48 UTC (permalink / raw)
  To: Stefano Stabellini; +Cc: xen-devel

On Mon, May 20, 2013 at 2:44 PM, Stefano Stabellini
<stefano.stabellini@eu.citrix.com> wrote:
> Hi all,
> these are my notes from the discussion that we had at the Hackathon
> regarding PV frontends and backends running on NUMA machines.
>
>
> ---
>
> The problem: how can we make sure that frontends and backends run in the
> same NUMA node?
>
> We would need to run one backend kthread per NUMA node: we have already
> one kthread per netback vif (one per guest), we could pin each of them
> on a different NUMA node, the same one the frontend is running on.
>
> But that means that dom0 would be running on several NUMA nodes at once,
> how much of a performance penalty would that be?
> We would need to export NUMA information to dom0, so that dom0 can make
> smart decisions on memory allocations and we would also need to allocate
> memory for dom0 from multiple nodes.
>
> We need a way to automatically allocate the initial dom0 memory in Xen
> in a NUMA-aware way and we need Xen to automatically create one dom0 vcpu
> per NUMA node.
>
> After dom0 boots, the toolstack is going to decide where to place any
> new guests: it allocates the memory from the NUMA node it wants to run
> the guest on and it is going to ask dom0 to allocate the kthread from
> that node too. (Maybe writing the NUMA node on xenstore.)
>
> We need to make sure that the interrupts/MSIs coming from the NIC arrive
> on the same pcpu that is running the vcpu that needs to receive it.
> We need to do irqbalacing in dom0, then Xen automatically will make the
> physical MSIs follow the vcpu automatically.
>
> If the card is multiqueue we need to make sure that we use the multiple
> queues so that we can have difference sources of interrupts/MSIs for
> each vif. This allows us to independently notify each dom0 vcpu.

So the work items I remember are as follows:
1. Implement NUMA affinity for vcpus
2. Implement Guest NUMA support for PV guests
3. Teach Xen how to make a sensible NUMA allocation layout for dom0
4. Teach the toolstack to pin the netback threads to dom0 vcpus
running on the correct node (s)

Dario will do #1.  I volunteered to take a stab at #2 and #3.  #4 we
should be able to do independently of 2 and 3 -- it should give a
slight performance improvement due to cache proximity even if dom0
memory is striped across the nodes.

Does someone want to volunteer to take a look at #4?  I suspect that
the core technical implementation will be simple, but getting a stable
design that everyone is happy with for the future will take a
significant number of iterations.  Learn from my fail w/ USB hot-plug
in 4.3, and start the design process early. :-)

 -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-20 13:48 ` George Dunlap
@ 2013-05-21  8:32   ` Tim Deegan
  2013-05-21  8:47     ` George Dunlap
  2013-05-21  8:44   ` Roger Pau Monné
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 28+ messages in thread
From: Tim Deegan @ 2013-05-21  8:32 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Stefano Stabellini

At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
> So the work items I remember are as follows:
> 1. Implement NUMA affinity for vcpus
> 2. Implement Guest NUMA support for PV guests
> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0

Does Xen need to do this?  Or could dom0 sort that out for itself after
boot?

Tim.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-20 13:48 ` George Dunlap
  2013-05-21  8:32   ` Tim Deegan
@ 2013-05-21  8:44   ` Roger Pau Monné
  2013-05-21  9:24     ` Wei Liu
  2013-05-21 11:10   ` Dario Faggioli
  2013-05-22  1:28   ` Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 28+ messages in thread
From: Roger Pau Monné @ 2013-05-21  8:44 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Stefano Stabellini

On 20/05/13 15:48, George Dunlap wrote:
> On Mon, May 20, 2013 at 2:44 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
>> Hi all,
>> these are my notes from the discussion that we had at the Hackathon
>> regarding PV frontends and backends running on NUMA machines.
>>
>>
>> ---
>>
>> The problem: how can we make sure that frontends and backends run in the
>> same NUMA node?
>>
>> We would need to run one backend kthread per NUMA node: we have already
>> one kthread per netback vif (one per guest), we could pin each of them
>> on a different NUMA node, the same one the frontend is running on.
>>
>> But that means that dom0 would be running on several NUMA nodes at once,
>> how much of a performance penalty would that be?
>> We would need to export NUMA information to dom0, so that dom0 can make
>> smart decisions on memory allocations and we would also need to allocate
>> memory for dom0 from multiple nodes.
>>
>> We need a way to automatically allocate the initial dom0 memory in Xen
>> in a NUMA-aware way and we need Xen to automatically create one dom0 vcpu
>> per NUMA node.
>>
>> After dom0 boots, the toolstack is going to decide where to place any
>> new guests: it allocates the memory from the NUMA node it wants to run
>> the guest on and it is going to ask dom0 to allocate the kthread from
>> that node too. (Maybe writing the NUMA node on xenstore.)
>>
>> We need to make sure that the interrupts/MSIs coming from the NIC arrive
>> on the same pcpu that is running the vcpu that needs to receive it.
>> We need to do irqbalacing in dom0, then Xen automatically will make the
>> physical MSIs follow the vcpu automatically.
>>
>> If the card is multiqueue we need to make sure that we use the multiple
>> queues so that we can have difference sources of interrupts/MSIs for
>> each vif. This allows us to independently notify each dom0 vcpu.
> 
> So the work items I remember are as follows:
> 1. Implement NUMA affinity for vcpus
> 2. Implement Guest NUMA support for PV guests
> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0
> 4. Teach the toolstack to pin the netback threads to dom0 vcpus
> running on the correct node (s)
> 
> Dario will do #1.  I volunteered to take a stab at #2 and #3.  #4 we
> should be able to do independently of 2 and 3 -- it should give a
> slight performance improvement due to cache proximity even if dom0
> memory is striped across the nodes.
> 
> Does someone want to volunteer to take a look at #4?  I suspect that
> the core technical implementation will be simple, but getting a stable
> design that everyone is happy with for the future will take a
> significant number of iterations.  Learn from my fail w/ USB hot-plug
> in 4.3, and start the design process early. :-)

#4 is easy to implement from my POV in blkback, you just need to write a
node in the xenstore backend directory that tells blkback to pin the
created kthread to a specific NUMA node, and make sure that the memory
used for that blkback instance is allocated from inside the kthread. My
indirect descriptors series already removes any shared structures
between different blkback instances, so some part of this work is
already done. And I guess that something similar could be implemented
for QEMU/Qdisk from the toolstack level (pin the qemu process to a
specific NUMA node).

I'm already quite familiar with the blkback code, so I can take care of
#4 for blkback.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  8:32   ` Tim Deegan
@ 2013-05-21  8:47     ` George Dunlap
  2013-05-21  8:49       ` George Dunlap
                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: George Dunlap @ 2013-05-21  8:47 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Stefano Stabellini

On Tue, May 21, 2013 at 9:32 AM, Tim Deegan <tim@xen.org> wrote:
> At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
>> So the work items I remember are as follows:
>> 1. Implement NUMA affinity for vcpus
>> 2. Implement Guest NUMA support for PV guests
>> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0
>
> Does Xen need to do this?  Or could dom0 sort that out for itself after
> boot?

There are two aspects of this.  First would be, if dom0.nvcpus <
host.npcpus, to place the vcpus reasonably on the various numa nodes.

The second is to make the pfn -> NUMA node layout reasonable.  At the
moment, as I understand it, pfns will be striped across nodes.  In
theory dom0 could deal with this, but it seems like in practice it's
going to be nasty trying to sort that stuff out.  It would be much
better, if you have (say) 4 nodes and 4GiB of memory assigned to dom0,
to have pfn 0-1G on node 0, 1-2G on node 2, &c.

 -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  8:47     ` George Dunlap
@ 2013-05-21  8:49       ` George Dunlap
  2013-05-21 10:03         ` Dario Faggioli
  2013-05-21  9:20       ` Tim Deegan
  2013-05-21 10:06       ` Jan Beulich
  2 siblings, 1 reply; 28+ messages in thread
From: George Dunlap @ 2013-05-21  8:49 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Stefano Stabellini

On Tue, May 21, 2013 at 9:47 AM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Tue, May 21, 2013 at 9:32 AM, Tim Deegan <tim@xen.org> wrote:
>> At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
>>> So the work items I remember are as follows:
>>> 1. Implement NUMA affinity for vcpus
>>> 2. Implement Guest NUMA support for PV guests
>>> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0
>>
>> Does Xen need to do this?  Or could dom0 sort that out for itself after
>> boot?
>
> There are two aspects of this.  First would be, if dom0.nvcpus <
> host.npcpus, to place the vcpus reasonably on the various numa nodes.

And indeed, if dom0.nvcpus < host.nnodes, to try to minimize the
distance from the vcpus to all nodes on the system.

 -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  8:47     ` George Dunlap
  2013-05-21  8:49       ` George Dunlap
@ 2013-05-21  9:20       ` Tim Deegan
  2013-05-21  9:45         ` George Dunlap
  2013-05-21  9:53         ` Dario Faggioli
  2013-05-21 10:06       ` Jan Beulich
  2 siblings, 2 replies; 28+ messages in thread
From: Tim Deegan @ 2013-05-21  9:20 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Stefano Stabellini

At 09:47 +0100 on 21 May (1369129629), George Dunlap wrote:
> On Tue, May 21, 2013 at 9:32 AM, Tim Deegan <tim@xen.org> wrote:
> > At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
> >> So the work items I remember are as follows:
> >> 1. Implement NUMA affinity for vcpus
> >> 2. Implement Guest NUMA support for PV guests
> >> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0
> >
> > Does Xen need to do this?  Or could dom0 sort that out for itself after
> > boot?
> 
> There are two aspects of this.  First would be, if dom0.nvcpus <
> host.npcpus, to place the vcpus reasonably on the various numa nodes.

Well, that part at least seems like it can be managed quite nicely from
dom0 userspace, in a Xen init script.  But...

> The second is to make the pfn -> NUMA node layout reasonable.  At the
> moment, as I understand it, pfns will be striped across nodes.  In
> theory dom0 could deal with this, but it seems like in practice it's
> going to be nasty trying to sort that stuff out.  It would be much
> better, if you have (say) 4 nodes and 4GiB of memory assigned to dom0,
> to have pfn 0-1G on node 0, 1-2G on node 2, &c.

Yeah, I can see that fixing that post-hoc would be a PITA.  I guess if
you figure out the vcpu assignments at dom0-build time, the normal NUMA
memory allocation code will just DTRT (since that's what you'd want for
a comparable domU)?

Tim.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  8:44   ` Roger Pau Monné
@ 2013-05-21  9:24     ` Wei Liu
  2013-05-21  9:53       ` George Dunlap
  0 siblings, 1 reply; 28+ messages in thread
From: Wei Liu @ 2013-05-21  9:24 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: George Dunlap, xen-devel, wei.liu2, Stefano Stabellini

On Tue, May 21, 2013 at 10:44:02AM +0200, Roger Pau Monné wrote:
[...]
> > 4. Teach the toolstack to pin the netback threads to dom0 vcpus
> > running on the correct node (s)
> > 
> > Dario will do #1.  I volunteered to take a stab at #2 and #3.  #4 we
> > should be able to do independently of 2 and 3 -- it should give a
> > slight performance improvement due to cache proximity even if dom0
> > memory is striped across the nodes.
> > 
> > Does someone want to volunteer to take a look at #4?  I suspect that
> > the core technical implementation will be simple, but getting a stable
> > design that everyone is happy with for the future will take a
> > significant number of iterations.  Learn from my fail w/ USB hot-plug
> > in 4.3, and start the design process early. :-)
> 
> #4 is easy to implement from my POV in blkback, you just need to write a
> node in the xenstore backend directory that tells blkback to pin the
> created kthread to a specific NUMA node, and make sure that the memory
> used for that blkback instance is allocated from inside the kthread. My
> indirect descriptors series already removes any shared structures
> between different blkback instances, so some part of this work is
> already done. And I guess that something similar could be implemented
> for QEMU/Qdisk from the toolstack level (pin the qemu process to a
> specific NUMA node).
> 
> I'm already quite familiar with the blkback code, so I can take care of
> #4 for blkback.
> 

So the core thing in netback is almost ready, I trust Linux scheduler
now and don't pin kthread at all but relevant code shuold be easy to
add. I just checked my code, all memory allocation is already node
awared.

As for the toolstack part, I'm not sure writing the initial node to
xenstore will be sufficient. Do we do inter-node migration? If so
frontend / backend should also update xenstore information as it
migrates?

IIRC the memory of a guest is striped through nodes, if it is this case,
how can pinning benefit? (I might be talking crap as I don't know much
about NUMA and its current status in Xen)


Wei.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  9:20       ` Tim Deegan
@ 2013-05-21  9:45         ` George Dunlap
  2013-05-21 10:24           ` Tim Deegan
  2013-05-21  9:53         ` Dario Faggioli
  1 sibling, 1 reply; 28+ messages in thread
From: George Dunlap @ 2013-05-21  9:45 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Stefano Stabellini

On 05/21/2013 10:20 AM, Tim Deegan wrote:
> At 09:47 +0100 on 21 May (1369129629), George Dunlap wrote:
>> On Tue, May 21, 2013 at 9:32 AM, Tim Deegan <tim@xen.org> wrote:
>>> At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
>>>> So the work items I remember are as follows:
>>>> 1. Implement NUMA affinity for vcpus
>>>> 2. Implement Guest NUMA support for PV guests
>>>> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0
>>>
>>> Does Xen need to do this?  Or could dom0 sort that out for itself after
>>> boot?
>>
>> There are two aspects of this.  First would be, if dom0.nvcpus <
>> host.npcpus, to place the vcpus reasonably on the various numa nodes.
>
> Well, that part at least seems like it can be managed quite nicely from
> dom0 userspace, in a Xen init script.  But...
>
>> The second is to make the pfn -> NUMA node layout reasonable.  At the
>> moment, as I understand it, pfns will be striped across nodes.  In
>> theory dom0 could deal with this, but it seems like in practice it's
>> going to be nasty trying to sort that stuff out.  It would be much
>> better, if you have (say) 4 nodes and 4GiB of memory assigned to dom0,
>> to have pfn 0-1G on node 0, 1-2G on node 2, &c.
>
> Yeah, I can see that fixing that post-hoc would be a PITA.  I guess if
> you figure out the vcpu assignments at dom0-build time, the normal NUMA
> memory allocation code will just DTRT (since that's what you'd want for
> a comparable domU)?

I'm not sure why you think so -- for one, please correct me if I'm 
wrong, but NUMA affinity is a domain construct, not a vcpu construct. 
Memory is allocated on behalf of a domain, not a vcpu, and is allocated 
a batch at a time.  So how is the memory allocator supposed to know that 
the current allocation request is in the middle of the second gigabyte 
of a 4G total, and thus to allocate from node 1?

What we would want for a comparable domU -- a domU that was NUMA-aware 
-- was to have the pfn layout in batches across the nodes to which it 
will be pinned.  E.g., if a domU has its NUMA affinity set to nodes 2-3, 
then you'd want the first half of the pfns to come from node 2, the 
second half from node 3.

In both cases, the domain builder will need to call the allocator with 
specific numa nodes for specific regions of the PFN space.

  -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  9:24     ` Wei Liu
@ 2013-05-21  9:53       ` George Dunlap
  2013-05-21 10:17         ` Dario Faggioli
  0 siblings, 1 reply; 28+ messages in thread
From: George Dunlap @ 2013-05-21  9:53 UTC (permalink / raw)
  To: Wei Liu; +Cc: xen-devel, Stefano Stabellini, Roger Pau Monné

[remembering to cc the list this time]

On Tue, May 21, 2013 at 10:24 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Tue, May 21, 2013 at 10:44:02AM +0200, Roger Pau Monné wrote:
> [...]
>> > 4. Teach the toolstack to pin the netback threads to dom0 vcpus
>> > running on the correct node (s)
>> >
>> > Dario will do #1.  I volunteered to take a stab at #2 and #3.  #4 we
>> > should be able to do independently of 2 and 3 -- it should give a
>> > slight performance improvement due to cache proximity even if dom0
>> > memory is striped across the nodes.
>> >
>> > Does someone want to volunteer to take a look at #4?  I suspect that
>> > the core technical implementation will be simple, but getting a stable
>> > design that everyone is happy with for the future will take a
>> > significant number of iterations.  Learn from my fail w/ USB hot-plug
>> > in 4.3, and start the design process early. :-)
>>
>> #4 is easy to implement from my POV in blkback, you just need to write a
>> node in the xenstore backend directory that tells blkback to pin the
>> created kthread to a specific NUMA node, and make sure that the memory
>> used for that blkback instance is allocated from inside the kthread. My
>> indirect descriptors series already removes any shared structures
>> between different blkback instances, so some part of this work is
>> already done. And I guess that something similar could be implemented
>> for QEMU/Qdisk from the toolstack level (pin the qemu process to a
>> specific NUMA node).
>>
>> I'm already quite familiar with the blkback code, so I can take care of
>> #4 for blkback.
>>
>
> So the core thing in netback is almost ready, I trust Linux scheduler
> now and don't pin kthread at all but relevant code shuold be easy to
> add. I just checked my code, all memory allocation is already node
> awared.
>
> As for the toolstack part, I'm not sure writing the initial node to
> xenstore will be sufficient. Do we do inter-node migration? If so
> frontend / backend should also update xenstore information as it
> migrates?

We can of course migrate the vcpus, but migrating the actual memory
from one node to another is pretty tricky, particularly for PV guests.
 It won't be something that happens very often; when it does, we will
need to sort out migrating the backend threads.

> IIRC the memory of a guest is striped through nodes, if it is this case,
> how can pinning benefit? (I might be talking crap as I don't know much
> about NUMA and its current status in Xen)

It's striped across nodes *of its NUMA affinity*.  So if you have a
4-node box, and you set its NUMA affinity to node 3, then the
allocator will try to get all of the memory from node 3.  If its
affinity is set to {2,3}, then the allocator will stripe it across
nodes 2 and 3.

 -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  9:20       ` Tim Deegan
  2013-05-21  9:45         ` George Dunlap
@ 2013-05-21  9:53         ` Dario Faggioli
  1 sibling, 0 replies; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21  9:53 UTC (permalink / raw)
  To: Tim Deegan; +Cc: George Dunlap, xen-devel, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1760 bytes --]

On mar, 2013-05-21 at 10:20 +0100, Tim Deegan wrote:
> At 09:47 +0100 on 21 May (1369129629), George Dunlap wrote:
> > The second is to make the pfn -> NUMA node layout reasonable.  At the
> > moment, as I understand it, pfns will be striped across nodes.  In
> > theory dom0 could deal with this, but it seems like in practice it's
> > going to be nasty trying to sort that stuff out.  It would be much
> > better, if you have (say) 4 nodes and 4GiB of memory assigned to dom0,
> > to have pfn 0-1G on node 0, 1-2G on node 2, &c.
> 
> Yeah, I can see that fixing that post-hoc would be a PITA.  
>
Indeed! :-P

> I guess if
> you figure out the vcpu assignments at dom0-build time, the normal NUMA
> memory allocation code will just DTRT (since that's what you'd want for
> a comparable domU)?
> 
Well, we need to check what actually happens deeper, but I don't think
that is going to be enough, and that is true for DomUs as well.

In fact, what we have right now (for DomUs) is: memory is allocated from
a subset of the host nodes and vCPUs pefers to be scheduled on the pCPUs
of those nodes. However, what a true 'guest NUMA awareness' prescribes
is that you know what memory "belongs to" (i.e., is accessed quicker
from) each vCPU, and this is something we don't have.

So, yes, I think creating a node-affinity for Dom0 early enough would be
a reasonable first step, and would already help quite a bit. However,
that would just mean that we'll (going back to George's example) get 1G
of memory from node0, 1G of memory from node1, etc. What we want is to
force pfns 0-1G to be actually allocated out of node0, and so on and so
forth... And that is something that I don't think the current code can
guarantee.

Dario


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  8:49       ` George Dunlap
@ 2013-05-21 10:03         ` Dario Faggioli
  0 siblings, 0 replies; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21 10:03 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Tim Deegan, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1185 bytes --]

On mar, 2013-05-21 at 09:49 +0100, George Dunlap wrote:
> On Tue, May 21, 2013 at 9:47 AM, George Dunlap
> > There are two aspects of this.  First would be, if dom0.nvcpus <
> > host.npcpus, to place the vcpus reasonably on the various numa nodes.
> 
> And indeed, if dom0.nvcpus < host.nnodes, to try to minimize the
> distance from the vcpus to all nodes on the system.
> 
Yep, we definitely want that too. I have patches for the toolstack
counterpart of this (when speaking about DomUs), but I was using a
quadratic (in the number of nodes) algorithm, and we need to find
something better to scan and make decisions out of the distances matrix,
and that is especially true if we're in Xen rather than in libxl I
guess. :-P

Also, IONUMA machines can possibly introduce additional constraints: for
instance, you almost likely want a vCPU on all the nodes that have an IO
controller attached!

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  8:47     ` George Dunlap
  2013-05-21  8:49       ` George Dunlap
  2013-05-21  9:20       ` Tim Deegan
@ 2013-05-21 10:06       ` Jan Beulich
  2013-05-21 10:30         ` Dario Faggioli
  2 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2013-05-21 10:06 UTC (permalink / raw)
  To: George Dunlap, Tim Deegan; +Cc: xen-devel, Stefano Stabellini

>>> On 21.05.13 at 10:47, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Tue, May 21, 2013 at 9:32 AM, Tim Deegan <tim@xen.org> wrote:
>> At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
>>> So the work items I remember are as follows:
>>> 1. Implement NUMA affinity for vcpus
>>> 2. Implement Guest NUMA support for PV guests
>>> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0
>>
>> Does Xen need to do this?  Or could dom0 sort that out for itself after
>> boot?
> 
> There are two aspects of this.  First would be, if dom0.nvcpus <
> host.npcpus, to place the vcpus reasonably on the various numa nodes.
> 
> The second is to make the pfn -> NUMA node layout reasonable.  At the
> moment, as I understand it, pfns will be striped across nodes.  In
> theory dom0 could deal with this, but it seems like in practice it's
> going to be nasty trying to sort that stuff out.  It would be much
> better, if you have (say) 4 nodes and 4GiB of memory assigned to dom0,
> to have pfn 0-1G on node 0, 1-2G on node 2, &c.

I have been having a todo list item since around the release of 4.2
to add support for "dom0_mem=node<n>" and
"dom0_vcpus=node<n>" command line options, which I would think
would be sufficient to deal with that. Sadly, I never had enough
spare time to actually implement this.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  9:53       ` George Dunlap
@ 2013-05-21 10:17         ` Dario Faggioli
  0 siblings, 0 replies; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21 10:17 UTC (permalink / raw)
  To: George Dunlap
  Cc: xen-devel, Wei Liu, Roger Pau Monné, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 2107 bytes --]

On mar, 2013-05-21 at 10:53 +0100, George Dunlap wrote:
> On Tue, May 21, 2013 at 10:24 AM, Wei Liu <wei.liu2@citrix.com> wrote:
> > So the core thing in netback is almost ready, I trust Linux scheduler
> > now and don't pin kthread at all but relevant code shuold be easy to
> > add. I just checked my code, all memory allocation is already node
> > awared.
> >
> > As for the toolstack part, I'm not sure writing the initial node to
> > xenstore will be sufficient. Do we do inter-node migration? If so
> > frontend / backend should also update xenstore information as it
> > migrates?
> 
> We can of course migrate the vcpus, but migrating the actual memory
> from one node to another is pretty tricky, particularly for PV guests.
>  It won't be something that happens very often; when it does, we will
> need to sort out migrating the backend threads.
> 
Indeed.

> > IIRC the memory of a guest is striped through nodes, if it is this case,
> > how can pinning benefit? (I might be talking crap as I don't know much
> > about NUMA and its current status in Xen)
> 
> It's striped across nodes *of its NUMA affinity*.  So if you have a
> 4-node box, and you set its NUMA affinity to node 3, then the
> allocator will try to get all of the memory from node 3.  If its
> affinity is set to {2,3}, then the allocator will stripe it across
> nodes 2 and 3.
> 
Right. And other than that, the whole point of work items 1, 2 and 3 (in
George's list, at the beginning of this thread) is to make this striping
even "wiser". So, not only 'memory comes from nodes {2,3}' but '_this_
memory comes from node {2} and _that_ memory comes from {3}'. That's why
we think pinning would do it, but you're right (Wei), that is not true
right now, it will only be when we'll get those work items done. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21  9:45         ` George Dunlap
@ 2013-05-21 10:24           ` Tim Deegan
  2013-05-21 10:28             ` George Dunlap
  0 siblings, 1 reply; 28+ messages in thread
From: Tim Deegan @ 2013-05-21 10:24 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Stefano Stabellini

At 10:45 +0100 on 21 May (1369133137), George Dunlap wrote:
> What we would want for a comparable domU -- a domU that was NUMA-aware 
> -- was to have the pfn layout in batches across the nodes to which it 
> will be pinned.  E.g., if a domU has its NUMA affinity set to nodes 2-3, 
> then you'd want the first half of the pfns to come from node 2, the 
> second half from node 3.
> 
> In both cases, the domain builder will need to call the allocator with 
> specific numa nodes for specific regions of the PFN space.

Ah, so that logic lives in the tools for domU?  I was misremembering. 
Anyway, I think I'm convinced that this is a reasonable thing to do
the dom0 building code. :)

Tim.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 10:24           ` Tim Deegan
@ 2013-05-21 10:28             ` George Dunlap
  2013-05-21 11:12               ` Dario Faggioli
  0 siblings, 1 reply; 28+ messages in thread
From: George Dunlap @ 2013-05-21 10:28 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel, Matt Wilson, Stefano Stabellini

On 05/21/2013 11:24 AM, Tim Deegan wrote:
> At 10:45 +0100 on 21 May (1369133137), George Dunlap wrote:
>> What we would want for a comparable domU -- a domU that was NUMA-aware
>> -- was to have the pfn layout in batches across the nodes to which it
>> will be pinned.  E.g., if a domU has its NUMA affinity set to nodes 2-3,
>> then you'd want the first half of the pfns to come from node 2, the
>> second half from node 3.
>>
>> In both cases, the domain builder will need to call the allocator with
>> specific numa nodes for specific regions of the PFN space.
>
> Ah, so that logic lives in the tools for domU?  I was misremembering.
> Anyway, I think I'm convinced that this is a reasonable thing to do
> the dom0 building code. :)

I don't think it lives anywhere at the moment -- I think at the moment 
the domain builder for both dom0 and domU just call the allocator 
without any directions, and the allocator reads the NUMA affinity mask 
for the domain.  But yes, when we do get guest NUMA support, I think the 
domain builder will be the right place to set up the guest NUMA layout, 
both for domUs and dom0.

Matt Wilson I think has some patches to do the domU layout for HVM 
guests -- if he could post those at some point in the next month, it 
might give a head start to the person implementing this (probably me at 
this point).

  -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 10:06       ` Jan Beulich
@ 2013-05-21 10:30         ` Dario Faggioli
  2013-05-21 10:43           ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21 10:30 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Tim Deegan, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1474 bytes --]

On mar, 2013-05-21 at 11:06 +0100, Jan Beulich wrote:
> At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
> > The second is to make the pfn -> NUMA node layout reasonable.  At the
> > moment, as I understand it, pfns will be striped across nodes.  In
> > theory dom0 could deal with this, but it seems like in practice it's
> > going to be nasty trying to sort that stuff out.  It would be much
> > better, if you have (say) 4 nodes and 4GiB of memory assigned to dom0,
> > to have pfn 0-1G on node 0, 1-2G on node 2, &c.
> 
> I have been having a todo list item since around the release of 4.2
> to add support for "dom0_mem=node<n>" and
> "dom0_vcpus=node<n>" command line options, which I would think
> would be sufficient to deal with that. 
>
I remember such discussion (which, BTW, is here:
http://lists.xen.org/archives/html/xen-devel/2012-08/msg00332.html ).
However, wasn't that supposed to help only in case you want to confine
Dom0 on one specific node?

That would definitely be already something, but not quite the same thing
that came up in Dublin, and that George was describing above (although I
agree it covers a sensible subset of it :-) ).

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 10:30         ` Dario Faggioli
@ 2013-05-21 10:43           ` Jan Beulich
  2013-05-21 10:58             ` Dario Faggioli
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2013-05-21 10:43 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Tim Deegan, Stefano Stabellini

>>> On 21.05.13 at 12:30, Dario Faggioli <raistlin@linux.it> wrote:
> On mar, 2013-05-21 at 11:06 +0100, Jan Beulich wrote:
>> At 14:48 +0100 on 20 May (1369061330), George Dunlap wrote:
>> > The second is to make the pfn -> NUMA node layout reasonable.  At the
>> > moment, as I understand it, pfns will be striped across nodes.  In
>> > theory dom0 could deal with this, but it seems like in practice it's
>> > going to be nasty trying to sort that stuff out.  It would be much
>> > better, if you have (say) 4 nodes and 4GiB of memory assigned to dom0,
>> > to have pfn 0-1G on node 0, 1-2G on node 2, &c.
>> 
>> I have been having a todo list item since around the release of 4.2
>> to add support for "dom0_mem=node<n>" and
>> "dom0_vcpus=node<n>" command line options, which I would think
>> would be sufficient to deal with that. 
>>
> I remember such discussion (which, BTW, is here:
> http://lists.xen.org/archives/html/xen-devel/2012-08/msg00332.html ).
> However, wasn't that supposed to help only in case you want to confine
> Dom0 on one specific node?
> 
> That would definitely be already something, but not quite the same thing
> that came up in Dublin, and that George was describing above (although I
> agree it covers a sensible subset of it :-) ).

I certainly meant to implement both such that multiple nodes would
be permitted.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 10:43           ` Jan Beulich
@ 2013-05-21 10:58             ` Dario Faggioli
  2013-05-21 11:47               ` Jan Beulich
  0 siblings, 1 reply; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21 10:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Tim Deegan, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1058 bytes --]

On mar, 2013-05-21 at 11:43 +0100, Jan Beulich wrote:
> >>> On 21.05.13 at 12:30, Dario Faggioli <raistlin@linux.it> wrote:
> > I remember such discussion (which, BTW, is here:
> > http://lists.xen.org/archives/html/xen-devel/2012-08/msg00332.html ).
> > However, wasn't that supposed to help only in case you want to confine
> > Dom0 on one specific node?
> > 
> > That would definitely be already something, but not quite the same thing
> > that came up in Dublin, and that George was describing above (although I
> > agree it covers a sensible subset of it :-) ).
> 
> I certainly meant to implement both such that multiple nodes would
> be permitted.
> 
Well, sure, but then, again, how do you control which (and not only how
much) memory is taken from which node?

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-20 13:48 ` George Dunlap
  2013-05-21  8:32   ` Tim Deegan
  2013-05-21  8:44   ` Roger Pau Monné
@ 2013-05-21 11:10   ` Dario Faggioli
  2013-05-23 17:21     ` Dario Faggioli
  2013-05-22  1:28   ` Konrad Rzeszutek Wilk
  3 siblings, 1 reply; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21 11:10 UTC (permalink / raw)
  To: George Dunlap; +Cc: andre.przywara, xen-devel, Matt Wilson, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 2015 bytes --]

On lun, 2013-05-20 at 14:48 +0100, George Dunlap wrote:
> So the work items I remember are as follows:
>
Thanks for sending this out.

> 1. Implement NUMA affinity for vcpus
> 2. Implement Guest NUMA support for PV guests
> 3. Teach Xen how to make a sensible NUMA allocation layout for dom0
> 4. Teach the toolstack to pin the netback threads to dom0 vcpus
> running on the correct node (s)
> 
> Dario will do #1.  I volunteered to take a stab at #2 and #3.
>
I've got half backed patches for #1 already, I'm sure I can submit them
as soon as 4.4 window opens.

Regarding #2 and #3, here they are the old patches (and other material)
that I've been able to find so far on the subject.

For HVM, the patches are originally from Andre (and I think they are the
ones Matt is going to refresh and try to upstream):
 - July 2008: http://lists.xen.org/archives/html/xen-devel/2008-07/msg00582.html
 - February 2010: http://old-list-archives.xen.org/archives/html/xen-devel/2010-02/msg00279.html

For PV, the patches are originally from Dulloor Rao, they come with a
lot of policing and cover a lot o stuff that are upstream now, but there
perhaps still are some useful bits:
 - April 2010: http://lists.xen.org/archives/html/xen-devel/2010-04/msg00103.html
 - August 2010: http://lists.xen.org/archives/html/xen-devel/2010-08/msg00008.html

The work on PV-NUMA appears to have been presented at a XenSummit,
together by Dulloor and Jun from Intel, and there hence are slides and a
video available:
 http://www.slideshare.net/xen_com_mgr/dulloor-xensummit#btnNext
 http://www.slideshare.net/xen_com_mgr/nakajima-numafinal
 http://vimeo.com/12295753

Personally, I'm still quite buried under that memory migration thing, so
I really appreciate you stepping up for this. That being said, I should
be able to help, especially considering I used to be familiar with the
Linux side of this (although, years have passed, and my forgetting rate
is really high! :-P).

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 10:28             ` George Dunlap
@ 2013-05-21 11:12               ` Dario Faggioli
  0 siblings, 0 replies; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21 11:12 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Tim Deegan, Matt Wilson, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1959 bytes --]

On mar, 2013-05-21 at 11:28 +0100, George Dunlap wrote:
> On 05/21/2013 11:24 AM, Tim Deegan wrote:
> > At 10:45 +0100 on 21 May (1369133137), George Dunlap wrote:
> >> What we would want for a comparable domU -- a domU that was NUMA-aware
> >> -- was to have the pfn layout in batches across the nodes to which it
> >> will be pinned.  E.g., if a domU has its NUMA affinity set to nodes 2-3,
> >> then you'd want the first half of the pfns to come from node 2, the
> >> second half from node 3.
> >>
> >> In both cases, the domain builder will need to call the allocator with
> >> specific numa nodes for specific regions of the PFN space.
> >
> > Ah, so that logic lives in the tools for domU?  I was misremembering.
> > Anyway, I think I'm convinced that this is a reasonable thing to do
> > the dom0 building code. :)
> 
> I don't think it lives anywhere at the moment -- 
>
Yep, the issue is there is no such logic at all yet! :-P

> I think at the moment 
> the domain builder for both dom0 and domU just call the allocator 
> without any directions, and the allocator reads the NUMA affinity mask 
> for the domain.
>
And stripes the allocation through such mask, yes, without caring at all
where specific regions end up... Actually, without knowing anything
about 'specific regions'.

>   But yes, when we do get guest NUMA support, I think the 
> domain builder will be the right place to set up the guest NUMA layout, 
> both for domUs and dom0.
> 
+1

> Matt Wilson I think has some patches to do the domU layout for HVM 
> guests -- if he could post those at some point in the next month, it 
> might give a head start to the person implementing this (probably me at 
> this point).
> 
Yes, that would surely help. In the meantime, I replied to your first
e-mail in this thread with the links to all the old patches I've been
able to find on this subject... Hope that helps too. :-)

Regards,
Dario


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 10:58             ` Dario Faggioli
@ 2013-05-21 11:47               ` Jan Beulich
  2013-05-21 13:43                 ` Dario Faggioli
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Beulich @ 2013-05-21 11:47 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: George Dunlap, xen-devel, Tim Deegan, Stefano Stabellini

>>> On 21.05.13 at 12:58, Dario Faggioli <raistlin@linux.it> wrote:
> On mar, 2013-05-21 at 11:43 +0100, Jan Beulich wrote:
>> >>> On 21.05.13 at 12:30, Dario Faggioli <raistlin@linux.it> wrote:
>> > I remember such discussion (which, BTW, is here:
>> > http://lists.xen.org/archives/html/xen-devel/2012-08/msg00332.html ).
>> > However, wasn't that supposed to help only in case you want to confine
>> > Dom0 on one specific node?
>> > 
>> > That would definitely be already something, but not quite the same thing
>> > that came up in Dublin, and that George was describing above (although I
>> > agree it covers a sensible subset of it :-) ).
>> 
>> I certainly meant to implement both such that multiple nodes would
>> be permitted.
>> 
> Well, sure, but then, again, how do you control which (and not only how
> much) memory is taken from which node?

Hmm, I may not have followed, but why is "which" important here at
all? The only (usual) restriction should apply regarding preservation of
memory below 4G.

Jan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 11:47               ` Jan Beulich
@ 2013-05-21 13:43                 ` Dario Faggioli
  2013-05-24 16:00                   ` George Dunlap
  0 siblings, 1 reply; 28+ messages in thread
From: Dario Faggioli @ 2013-05-21 13:43 UTC (permalink / raw)
  To: Jan Beulich; +Cc: George Dunlap, xen-devel, Tim Deegan, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1360 bytes --]

On mar, 2013-05-21 at 12:47 +0100, Jan Beulich wrote:
> >>> On 21.05.13 at 12:58, Dario Faggioli <raistlin@linux.it> wrote:
> > Well, sure, but then, again, how do you control which (and not only how
> > much) memory is taken from which node?
> 
> Hmm, I may not have followed, but why is "which" important here at
> all? The only (usual) restriction should apply regarding preservation of
> memory below 4G.
> 
It is if you want Dom0 to think it is running, say, on 2 nodes and
actually have the memory, say, in the range 0-1G accessed quicker from
d0v0 (vcpu0 of Dom0), and the vice versa with memory within 1-2G and
d0v1.

That enables NUMA optimization _inside_ Dom0, like the pinning of the
backends and all the other stuff discussed (during the Hackathon and) in
this thread.

However, to do that, we, I think, need to be able not only to specify
that we want 1G worth of memory on one specific node, but also to
request explicitly for some of Dom0's PFN to be here and for some others
to be there, as we were saying earlier in the thread with Tim.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-20 13:48 ` George Dunlap
                     ` (2 preceding siblings ...)
  2013-05-21 11:10   ` Dario Faggioli
@ 2013-05-22  1:28   ` Konrad Rzeszutek Wilk
  2013-05-22  7:44     ` Dario Faggioli
  3 siblings, 1 reply; 28+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-05-22  1:28 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Stefano Stabellini

On Mon, May 20, 2013 at 02:48:50PM +0100, George Dunlap wrote:
> On Mon, May 20, 2013 at 2:44 PM, Stefano Stabellini
> <stefano.stabellini@eu.citrix.com> wrote:
> > Hi all,
> > these are my notes from the discussion that we had at the Hackathon
> > regarding PV frontends and backends running on NUMA machines.
> >
> >
> > ---
> >
> > The problem: how can we make sure that frontends and backends run in the
> > same NUMA node?
> >
> > We would need to run one backend kthread per NUMA node: we have already
> > one kthread per netback vif (one per guest), we could pin each of them
> > on a different NUMA node, the same one the frontend is running on.
> >
> > But that means that dom0 would be running on several NUMA nodes at once,
> > how much of a performance penalty would that be?
> > We would need to export NUMA information to dom0, so that dom0 can make
> > smart decisions on memory allocations and we would also need to allocate
> > memory for dom0 from multiple nodes.
> >
> > We need a way to automatically allocate the initial dom0 memory in Xen
> > in a NUMA-aware way and we need Xen to automatically create one dom0 vcpu
> > per NUMA node.
> >
> > After dom0 boots, the toolstack is going to decide where to place any
> > new guests: it allocates the memory from the NUMA node it wants to run
> > the guest on and it is going to ask dom0 to allocate the kthread from
> > that node too. (Maybe writing the NUMA node on xenstore.)
> >
> > We need to make sure that the interrupts/MSIs coming from the NIC arrive
> > on the same pcpu that is running the vcpu that needs to receive it.
> > We need to do irqbalacing in dom0, then Xen automatically will make the
> > physical MSIs follow the vcpu automatically.
> >
> > If the card is multiqueue we need to make sure that we use the multiple
> > queues so that we can have difference sources of interrupts/MSIs for
> > each vif. This allows us to independently notify each dom0 vcpu.
> 
> So the work items I remember are as follows:
> 1. Implement NUMA affinity for vcpus
> 2. Implement Guest NUMA support for PV guests

Did anybody volunteer for this one?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-22  1:28   ` Konrad Rzeszutek Wilk
@ 2013-05-22  7:44     ` Dario Faggioli
  0 siblings, 0 replies; 28+ messages in thread
From: Dario Faggioli @ 2013-05-22  7:44 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: George Dunlap, xen-devel, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 907 bytes --]

On mar, 2013-05-21 at 21:28 -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, May 20, 2013 at 02:48:50PM +0100, George Dunlap wrote:
> > So the work items I remember are as follows:
> > 1. Implement NUMA affinity for vcpus
> > 2. Implement Guest NUMA support for PV guests
> 
> Did anybody volunteer for this one?
> 
George said (in the very mail you're replying to, in the part you're
cutting here! :-P) that he's going to give a look at it, and I said I'll
try to help... Of course in case no one else step up before we get to
it. :-)

http://web.archiveorange.com/archive/v/jwRyEmO2fQ5czV89pif8#naMMziS89rrR6vZ

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 11:10   ` Dario Faggioli
@ 2013-05-23 17:21     ` Dario Faggioli
  0 siblings, 0 replies; 28+ messages in thread
From: Dario Faggioli @ 2013-05-23 17:21 UTC (permalink / raw)
  To: George Dunlap; +Cc: andre.przywara, xen-devel, Matt Wilson, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 3333 bytes --]

On mar, 2013-05-21 at 13:10 +0200, Dario Faggioli wrote:
> Regarding #2 and #3, here they are the old patches (and other material)
> that I've been able to find so far on the subject.
> 
> For HVM, the patches are originally from Andre (and I think they are the
> ones Matt is going to refresh and try to upstream):
>  - July 2008: http://lists.xen.org/archives/html/xen-devel/2008-07/msg00582.html
>  - February 2010: http://old-list-archives.xen.org/archives/html/xen-devel/2010-02/msg00279.html
> 
> For PV, the patches are originally from Dulloor Rao, they come with a
> lot of policing and cover a lot o stuff that are upstream now, but there
> perhaps still are some useful bits:
>  - April 2010: http://lists.xen.org/archives/html/xen-devel/2010-04/msg00103.html
>  - August 2010: http://lists.xen.org/archives/html/xen-devel/2010-08/msg00008.html
> 
Ok, sounds like I was misremembering. I've looked more thoroughly at the
various threads and at the patches that come with them and it turned out
that Dulloor's Aug 2010 series is about NUMA for HVM guests as well.
That is to say:

For HVM guest NUMA, here's what we have had:
 - July 2008, from Andre: http://lists.xen.org/archives/html/xen-devel/2008-07/msg00582.html
 - February 2010, from Andre: http://old-list-archives.xen.org/archives/html/xen-devel/2010-02/msg00279.html
 - August 2010, from Dulloor: http://lists.xen.org/archives/html/xen-devel/2010-08/msg00008.html

For PV guest NUMA, here's what we have had:
 - April 2010, from Dulloor: http://lists.xen.org/archives/html/xen-devel/2010-04/msg00103.html
 - February 2010, from Dulloor: http://old-list-archives.xenproject.org/archives/html/xen-devel/2010-02/msg00630.html
 - from Dulloor (implementing some kind of NUMA aware ballooning):
   http://lists.xen.org/archives/html/xen-devel/2010-04/txtHhAq92jBSc.txt

Also, still for PV NUMA, this message (and of course the thread it comes
from) has some arguments about what are the advantages or disadvantages
of a couple of different interfaces, and I think it could be
interesting: http://lists.xen.org/archives/html/xen-devel/2010-04/msg00761.html

> The work on PV-NUMA appears to have been presented at a XenSummit,
> together by Dulloor and Jun from Intel, and there hence are slides and a
> video available:
>  http://www.slideshare.net/xen_com_mgr/dulloor-xensummit#btnNext
>  http://www.slideshare.net/xen_com_mgr/nakajima-numafinal
>  http://vimeo.com/12295753
> 
These, of course, are still valid.

So, whoever will get on to doing work on either PV or HVM NUMA, I think
it is well worthwhile to look at the threads and at the patches (the
Linux patches are there too). The code is not perfect and, as I already
said, there is a lot of overlap with stuff that are in right now, but
still, it probably is a better starting point than nothing! :-P

I'll "mirror" all this information on my NUMA roadmap Wiki page
(http://wiki.xen.org/wiki/Xen_NUMA_Roadmap), just to be even more sure
that we don't loose track of it.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-21 13:43                 ` Dario Faggioli
@ 2013-05-24 16:00                   ` George Dunlap
  2013-05-25 13:57                     ` Dario Faggioli
  0 siblings, 1 reply; 28+ messages in thread
From: George Dunlap @ 2013-05-24 16:00 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: xen-devel, Tim Deegan, Jan Beulich, Stefano Stabellini

On 21/05/13 14:43, Dario Faggioli wrote:
> On mar, 2013-05-21 at 12:47 +0100, Jan Beulich wrote:
>>>>> On 21.05.13 at 12:58, Dario Faggioli <raistlin@linux.it> wrote:
>>> Well, sure, but then, again, how do you control which (and not only how
>>> much) memory is taken from which node?
>> Hmm, I may not have followed, but why is "which" important here at
>> all? The only (usual) restriction should apply regarding preservation of
>> memory below 4G.
>>
> It is if you want Dom0 to think it is running, say, on 2 nodes and
> actually have the memory, say, in the range 0-1G accessed quicker from
> d0v0 (vcpu0 of Dom0), and the vice versa with memory within 1-2G and
> d0v1.
>
> That enables NUMA optimization _inside_ Dom0, like the pinning of the
> backends and all the other stuff discussed (during the Hackathon and) in
> this thread.
>
> However, to do that, we, I think, need to be able not only to specify
> that we want 1G worth of memory on one specific node, but also to
> request explicitly for some of Dom0's PFN to be here and for some others
> to be there, as we were saying earlier in the thread with Tim.

One thing that I wanted to add to this discussion -- unless there's some 
way for the toolstack to figure out, for each node, how much memory is 
currently free *and* how much memory could be freed by dom0 on that 
node, and a way to ask dom0 to free memory from a specific a node, then 
booting with dom0 having all the memory is basically going to make all 
of our NUMA work a noop.

We may end up having to switch from defaulting to giving dom0 and 
autoballooning to giving dom0 a fixed amount.

  -George

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Hackathon minutes] PV frontends/backends and NUMA machines
  2013-05-24 16:00                   ` George Dunlap
@ 2013-05-25 13:57                     ` Dario Faggioli
  0 siblings, 0 replies; 28+ messages in thread
From: Dario Faggioli @ 2013-05-25 13:57 UTC (permalink / raw)
  To: George Dunlap; +Cc: xen-devel, Tim Deegan, Jan Beulich, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 2634 bytes --]

On ven, 2013-05-24 at 17:00 +0100, George Dunlap wrote:
> On 21/05/13 14:43, Dario Faggioli wrote:
> > However, to do that, we, I think, need to be able not only to specify
> > that we want 1G worth of memory on one specific node, but also to
> > request explicitly for some of Dom0's PFN to be here and for some others
> > to be there, as we were saying earlier in the thread with Tim.
> 
> One thing that I wanted to add to this discussion -- unless there's some 
> way for the toolstack to figure out, for each node, how much memory is 
> currently free *and* how much memory could be freed by dom0 on that 
> node, 
>
"A way to tell how much free memory there is on a node", yes, there is
already. "A way to tell how much memory could be freed by dom0 on a
node", no, there isn't anything like that.

This has to do with NUMA aware ballooning, which is another big bullet
on my NUMA roadmap/TODO list, although it hasn't been mentioned in the
hackathon (or at least, I don't remember it being mentioned).

I proposed it for GSoC, and got mails from several people interested in
working on it. Among them, there is one that seems to be keen on doing
the job, even outside of GSoC, but he can't start right away, as he's
otherwise engaged for the forthcoming weeks. I can double check his
availability and investigate better his commitment...

Also, I think NUMA aware ballooning requires (or, at least, would be
easier to implement with) NUMA awareness in the guest (dom0 in this
case), so we have sort of a circular dependency here. :-D

> and a way to ask dom0 to free memory from a specific a node, then 
> booting with dom0 having all the memory is basically going to make all 
> of our NUMA work a noop.
> 
Booting *without* dom0_mem=XXX already hurts quite a bit, since the
current automatic NUMA placement code needs to know how much free memory
we have on each node, and having them completely filled by dom0 is all
but the best situation for it to reach a decent solution! :-(

> We may end up having to switch from defaulting to giving dom0 and 
> autoballooning to giving dom0 a fixed amount.
> 
Well, from a NUMA only point of view, that would be really desirable....
But I'm not sure we can go that far, especially because I see it very
hard to find a value that would make everyone happy.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2013-05-25 13:57 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-20 13:44 [Hackathon minutes] PV frontends/backends and NUMA machines Stefano Stabellini
2013-05-20 13:48 ` George Dunlap
2013-05-21  8:32   ` Tim Deegan
2013-05-21  8:47     ` George Dunlap
2013-05-21  8:49       ` George Dunlap
2013-05-21 10:03         ` Dario Faggioli
2013-05-21  9:20       ` Tim Deegan
2013-05-21  9:45         ` George Dunlap
2013-05-21 10:24           ` Tim Deegan
2013-05-21 10:28             ` George Dunlap
2013-05-21 11:12               ` Dario Faggioli
2013-05-21  9:53         ` Dario Faggioli
2013-05-21 10:06       ` Jan Beulich
2013-05-21 10:30         ` Dario Faggioli
2013-05-21 10:43           ` Jan Beulich
2013-05-21 10:58             ` Dario Faggioli
2013-05-21 11:47               ` Jan Beulich
2013-05-21 13:43                 ` Dario Faggioli
2013-05-24 16:00                   ` George Dunlap
2013-05-25 13:57                     ` Dario Faggioli
2013-05-21  8:44   ` Roger Pau Monné
2013-05-21  9:24     ` Wei Liu
2013-05-21  9:53       ` George Dunlap
2013-05-21 10:17         ` Dario Faggioli
2013-05-21 11:10   ` Dario Faggioli
2013-05-23 17:21     ` Dario Faggioli
2013-05-22  1:28   ` Konrad Rzeszutek Wilk
2013-05-22  7:44     ` Dario Faggioli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.