From: Andrea Arcangeli <aarcange@redhat.com>
To: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	kvm list <kvm@vger.kernel.org>,
	qemu-devel Developers <qemu-devel@nongnu.org>,
	Alexander Graf <agraf@suse.de>,
	Chris Wright <chrisw@sous-sol.org>,
	bharata@linux.vnet.ibm.com, Vaidyanathan S <svaidy@in.ibm.com>
Subject: Re: [RFC PATCH] Exporting Guest RAM information for NUMA binding
Date: Wed, 30 Nov 2011 18:41:13 +0100
Message-ID: <20111130174113.GM23466@redhat.com>
In-Reply-To: <20111130162237.GC27308@in.ibm.com>

On Wed, Nov 30, 2011 at 09:52:37PM +0530, Dipankar Sarma wrote:
> create the guest topology correctly and optimize for NUMA. This
> would work for us.

Even in the case of 1 guest that fits in one node, you're not going to
max out the full bandwidth of all memory channels with this.

All qemu can do with ms_mbind/ms_tbind is create a vtopology that
matches the hardware topology. It has these limits (a rough sketch of
how qemu would use the interface follows the list):

1) it requires all userland applications to be modified to scan either
   the physical topology (when run on the host) or the vtopology (when
   run in a guest) to get the full benefit.

2) it breaks across live migration if the host physical topology changes

3) a single small guest that fits in one NUMA node, on an otherwise
   idle NUMA system, doesn't give the host kernel enough information

4) if used outside of qemu, a thread that allocates more memory than
   fits in one node still won't give the host kernel enough information.
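
To make the point concrete, here's a rough sketch of how qemu might use
such an interface. ms_mbind/ms_tbind were only proposed in this thread
and the signatures below are my guess, not the actual proposal, so take
it purely as illustration: guest RAM ranges and vcpu threads get grouped
into "virtual nodes" and the kernel decides where each group lives.

#include <stddef.h>
#include <sys/types.h>

/* assumed signatures for the proposed (unmerged) syscalls */
int ms_mbind(void *addr, size_t len, int vnode);  /* RAM range -> virtual node */
int ms_tbind(pid_t tid, int vnode);               /* thread    -> virtual node */

static void bind_guest_vnode(void *node_ram, size_t node_ram_len,
                             pid_t *vcpu_tids, int nr_vcpus, int vnode)
{
        int i;

        /* this guest RAM range belongs to virtual node 'vnode' */
        ms_mbind(node_ram, node_ram_len, vnode);

        /* and these vcpu threads should run near that memory */
        for (i = 0; i < nr_vcpus; i++)
                ms_tbind(vcpu_tids[i], vnode);
}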

About 3): if you've got just one guest that fits in one node, the vcpus
should probably be spread across all the nodes and the guest RAM
interleaved across them (like MPOL_INTERLEAVE). Even if the guest CPU
scheduler migrates guest processes the "wrong" way, the global memory
bandwidth will still be fully used, even though the vcpus will be
accessing remote memory. I've just seen benchmarks where no pinning runs
more than _twice_ as fast as pinning, with just 1 guest and only 10 vcpu
threads, probably because of that.
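
Just to illustrate the behavior I mean (this isn't in the patch): with
what exists today, the closest userland approximation is to interleave
the guest RAM over all nodes with libnuma, roughly like this:

#include <numa.h>        /* link with -lnuma */

static void interleave_guest_ram(void *guest_ram, size_t ram_size)
{
        if (numa_available() < 0)
                return;  /* kernel without NUMA support */

        /*
         * Spread the pages of the guest RAM round-robin over all nodes,
         * so even a single small guest keeps every memory channel busy.
         */
        numa_interleave_memory(guest_ram, ram_size, numa_all_nodes_ptr);
}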

About 4): even if the thread scans the NUMA topology, it won't be able
to give the kernel enough information to know which parts of its memory
will be used more or less heavily (it may be possible to call mbind and
vary the bindings at runtime, but that adds even more complexity left to
the programmer).
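
A hedged sketch of what such an application would have to keep doing by
hand (the function and the phase logic are made up for illustration, not
taken from any real program):

#include <numa.h>        /* link with -lnuma */

/*
 * Re-bind the currently hot region on every phase change of the
 * program; the kernel still only ever sees a static binding and the
 * burden of getting it right stays with the programmer.
 */
static void rebind_hot_region(void *hot, size_t len, int phase)
{
        int nr_nodes;

        if (numa_available() < 0)
                return;

        nr_nodes = numa_max_node() + 1;

        /* move the hot region to the node chosen for this phase */
        numa_tonode_memory(hot, len, phase % nr_nodes);
}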

If the vcpu is free to run on any node, and we have an automatic
vcpu<->memory affinity, then the memory will follow the vcpu. And the
scheduler domains should already optimize for maxing out the full
memory bandwidth of all channels.

Troubles 1/2/3/4 apply to the hard bindings as well, not just to
ms_mbind/ms_tbind.

In short it's an incremental step that moves some logic into the
kernel, but I don't see it solving all situations optimally, and it
shares a lot of the limits of the hard bindings.
