From: Dipankar Sarma <dipankar@in.ibm.com>
To: Alexander Graf <agraf@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Anthony Liguori <anthony@codemonkey.ws>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	bharata@linux.vnet.ibm.com, kvm list <kvm@vger.kernel.org>,
	qemu-devel Developers <qemu-devel@nongnu.org>,
	Chris Wright <chrisw@sous-sol.org>,
	Vaidyanathan S <svaidy@in.ibm.com>
Subject: Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 30 Nov 2011 21:52:37 +0530
Message-ID: <20111130162237.GC27308@in.ibm.com>
In-Reply-To: <4ECD3CBD.7010902@suse.de>

On Wed, Nov 23, 2011 at 07:34:37PM +0100, Alexander Graf wrote:
> On 11/23/2011 04:03 PM, Andrea Arcangeli wrote:
> >Hi!
> >
> >
> >In my view the trouble of the numa hard bindings is not the fact
> >they're hard and qemu has to also decide the location (in fact it
> >doesn't need to decide the location if you use cpusets and relative
> >mbinds). The bigger problem is the fact either the admin or the app
> >developer has to explicitly scan the numa physical topology (both cpus
> >and memory) and tell the kernel how much memory to bind to each
> >thread. ms_mbind/ms_tbind only partially solve that problem. They're
> >similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> >don't need an admin or a cpuset-job-scheduler (or a perl script) to
> >redistribute the hardware resources.
> 
> Well yeah, of course the guest needs to see some topology. I don't
> see why we'd have to actually scan the host for this though. All we
> need to tell the kernel is "this memory region is close to that
> thread".
> 
> So if you define "-numa node,mem=1G,cpus=0" then QEMU should be able
> to tell the kernel that this GB of RAM actually is close to that
> vCPU thread.
> 
> Of course the admin still needs to decide how to split up memory.
> That's the deal with emulating real hardware. You get the interfaces
> hardware gets :). However, if you follow a reasonable default
> strategy such as numa splitting your RAM into equal chunks between
> guest vCPUs you're probably close enough to optimal usage models. Or
> at least you could have a close enough approximation of how this
> mapping could work for the _guest_ regardless of the host and when
> you migrate it somewhere else it should also work reasonably well.
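
For reference, the hard binding Andrea mentions is what we get today
from mbind() with MPOL_F_RELATIVE_NODES. A rough, untested sketch
(link with -lnuma; MPOL_F_RELATIVE_NODES needs reasonably recent
headers):

    #include <stdio.h>
    #include <numaif.h>             /* mbind(); from libnuma */
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 30;     /* the 1G chunk from -numa node,mem=1G */
        void *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED)
            return 1;

        /* "Relative node 0" = first node of the cpuset we run in, so
         * the cpuset can be moved later without qemu ever seeing
         * physical node numbers. */
        unsigned long nodemask = 1UL << 0;
        if (mbind(ram, len, MPOL_BIND | MPOL_F_RELATIVE_NODES,
                  &nodemask, sizeof(nodemask) * 8, 0) != 0)
            perror("mbind");
        return 0;
    }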

Allowing specification of the NUMA nodes to qemu, letting qemu create
cpu+mem groupings (without binding), and leaving it to the kernel to
decide how to manage them seems like a reasonable incremental step
between no guest/host NUMA awareness and automatic NUMA configuration
in the host kernel. It would suffice for the current needs we see.
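
Per guest node, qemu would then only need something like the sketch
below. The ms_* signatures and the syscall numbers here are guesses
based on this thread's discussion; nothing like this exists in
mainline yet:

    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Placeholder syscall numbers -- illustration only. */
    #define __NR_ms_mbind   400
    #define __NR_ms_tbind   401

    static int ms_mbind(void *addr, unsigned long len, int group)
    {
        return syscall(__NR_ms_mbind, addr, len, group);
    }

    static int ms_tbind(pid_t tid, int group)
    {
        return syscall(__NR_ms_tbind, tid, group);
    }

    /* For "-numa node,mem=1G,cpus=0": put vCPU 0 and its 1G chunk in
     * the same group; the host kernel is then free to pick (and later
     * re-pick) the physical node for the group as a whole. */
    static void group_vcpu(pid_t vcpu_tid, void *chunk,
                           unsigned long chunk_len, int group)
    {
        ms_tbind(vcpu_tid, group);
        ms_mbind(chunk, chunk_len, group);
    }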

Besides migration, we also have use cases where we want large
multi-node VMs that are static (like LPARs); having the guest aware
of the topology is helpful there.

Also, if the topology does change due to migration or host kernel
decisions, we can use something like the VPHN (virtual processor
home node) capability on Power systems to have the guest kernel
update its topology knowledge; see arch/powerpc/mm/numa.c for the
details. Otherwise, as long as the host kernel maintains the
mappings requested by ms_tbind()/ms_mbind(), we can create the
guest topology correctly and optimize for NUMA. This would work
for us.
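
To make the default concrete: the equal-chunk split Alexander
suggests above would reduce to a loop like this in qemu (again only
a sketch, reusing the hypothetical group_vcpu() helper from earlier):

    /* One group per vCPU; guest RAM split into equal chunks. */
    static void default_numa_split(pid_t *vcpu_tids, int nr_vcpus,
                                   void *guest_ram, unsigned long ram_len)
    {
        unsigned long chunk = ram_len / nr_vcpus;
        int i;

        for (i = 0; i < nr_vcpus; i++)
            group_vcpu(vcpu_tids[i], (char *)guest_ram + i * chunk,
                       chunk, /* group id */ i);
    }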

Thanks
Dipankar

