* How to determine the backing host physical memory for a given guest ?
@ 2012-05-09 13:05 Chegu Vinod
  2012-05-09 13:46 ` Avi Kivity
  0 siblings, 1 reply; 6+ messages in thread
From: Chegu Vinod @ 2012-05-09 13:05 UTC (permalink / raw)
  To: kvm


Hello,

On an 8-socket Westmere host I am attempting to run a single guest and 
characterize the virtualization overhead for a system-intensive 
workload (AIM7-high_systime) as the size of the guest scales (10-way/64G, 
20-way/128G, ... 80-way/512G). 

To compare the native and guest runs, I have been using "numactl" to 
control the cpu-node and memory-node bindings for the qemu instance.  For 
larger guest sizes I end up binding across multiple localities, e.g. for 
a 40-way guest:

numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
qemu-system-x86_64 -smp 40 -m 262144 \
<....>

I understand that the actual mappings from a guest virtual address to a 
host physical address can change over time. 

Is there a way to determine, at a given instant, which host NUMA node is 
providing the backing physical memory for the guest's kernel and for the 
apps actively running in the guest? 

I am guessing there is a better way (some tool, perhaps?) than just 
diffing the per-node memory usage from the before-and-after output of 
"numactl --hardware" on the host.

Thanks
Vinod





* Re: How to determine the backing host physical memory for a given guest ?
  2012-05-09 13:05 How to determine the backing host physical memory for a given guest ? Chegu Vinod
@ 2012-05-09 13:46 ` Avi Kivity
  2012-05-10  1:23   ` Chegu Vinod
  2012-05-10 15:34   ` Andrew Theurer
  0 siblings, 2 replies; 6+ messages in thread
From: Avi Kivity @ 2012-05-09 13:46 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: kvm

On 05/09/2012 04:05 PM, Chegu Vinod wrote:
> Hello,
>
> On an 8-socket Westmere host I am attempting to run a single guest and 
> characterize the virtualization overhead for a system-intensive 
> workload (AIM7-high_systime) as the size of the guest scales (10-way/64G, 
> 20-way/128G, ... 80-way/512G). 
>
> To compare the native and guest runs, I have been using "numactl" to 
> control the cpu-node and memory-node bindings for the qemu instance.  For 
> larger guest sizes I end up binding across multiple localities, e.g. for 
> a 40-way guest:
>
> numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
> qemu-system-x86_64 -smp 40 -m 262144 \
> <....>
>
> I understand that the actual mappings from a guest virtual address to a 
> host physical address can change over time. 
>
> Is there a way to determine, at a given instant, which host NUMA node is 
> providing the backing physical memory for the guest's kernel and for the 
> apps actively running in the guest? 
>
> I am guessing there is a better way (some tool, perhaps?) than just 
> diffing the per-node memory usage from the before-and-after output of 
> "numactl --hardware" on the host.
>

Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.
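For example, a minimal sketch of reading one pagemap entry for the qemu
process (run as root; the pid/vaddr arguments are placeholders and 4 KiB
base pages are assumed):

pid=$1        # the qemu-system-x86_64 pid
vaddr=$2      # a virtual address inside the guest-RAM mapping, e.g. from numa_maps
page_index=$(( vaddr / 4096 ))        # one 8-byte pagemap entry per 4 KiB page
dd if=/proc/$pid/pagemap bs=8 skip=$page_index count=1 2>/dev/null | od -An -t x8

If bit 63 of the printed (hex) value is set, the page is resident and bits
0-54 are the host page frame number; which NUMA node that frame belongs to
can then be looked up from the node's physical memory ranges (e.g. the
memory block links under /sys/devices/system/node/, where available).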

-- 
error compiling committee.c: too many arguments to function



* Re: How to determine the backing host physical memory for a given guest ?
  2012-05-09 13:46 ` Avi Kivity
@ 2012-05-10  1:23   ` Chegu Vinod
  2012-05-10 15:34   ` Andrew Theurer
  1 sibling, 0 replies; 6+ messages in thread
From: Chegu Vinod @ 2012-05-10  1:23 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On 5/9/2012 6:46 AM, Avi Kivity wrote:
> On 05/09/2012 04:05 PM, Chegu Vinod wrote:
>> Hello,
>>
>> On an 8-socket Westmere host I am attempting to run a single guest and
>> characterize the virtualization overhead for a system-intensive
>> workload (AIM7-high_systime) as the size of the guest scales (10-way/64G,
>> 20-way/128G, ... 80-way/512G).
>>
>> To compare the native and guest runs, I have been using "numactl" to
>> control the cpu-node and memory-node bindings for the qemu instance.  For
>> larger guest sizes I end up binding across multiple localities, e.g. for
>> a 40-way guest:
>>
>> numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
>> qemu-system-x86_64 -smp 40 -m 262144 \
>> <....>
>>
>> I understand that the actual mappings from a guest virtual address to a
>> host physical address can change over time.
>>
>> Is there a way to determine, at a given instant, which host NUMA node is
>> providing the backing physical memory for the guest's kernel and for the
>> apps actively running in the guest?
>>
>> I am guessing there is a better way (some tool, perhaps?) than just
>> diffing the per-node memory usage from the before-and-after output of
>> "numactl --hardware" on the host.
>>
> Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.
>

Thanks for the pointer, Avi! Will give it a try...

FYI: I tried the recent version of the "crash" utility 
(http://people.redhat.com/anderson/) with the upstream kvm.git kernel 
(3.4.0-rc4+), and it seems to provide VA -> PA mappings for a given app 
on a live system.
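
For reference, the kind of session involved is roughly (illustrative; the
pid and address below are placeholders):

crash> set <qemu-pid>                # switch the current context to that process
crash> vtop <user-virtual-address>   # show the VA -> PA translation for that context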

There also looks to be an extension to the crash utility, called 
qemu-vtop, which is supposed to give the GPA -> HVA -> HPA mappings. I 
need to give this a try and see if it works.

Thx!
Vinod




* Re: How to determine the backing host physical memory for a given guest ?
  2012-05-09 13:46 ` Avi Kivity
  2012-05-10  1:23   ` Chegu Vinod
@ 2012-05-10 15:34   ` Andrew Theurer
  2012-05-11  1:22     ` Chegu Vinod
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Theurer @ 2012-05-10 15:34 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Chegu Vinod, kvm

On 05/09/2012 08:46 AM, Avi Kivity wrote:
> On 05/09/2012 04:05 PM, Chegu Vinod wrote:
>> Hello,
>>
>> On an 8-socket Westmere host I am attempting to run a single guest and
>> characterize the virtualization overhead for a system-intensive
>> workload (AIM7-high_systime) as the size of the guest scales (10-way/64G,
>> 20-way/128G, ... 80-way/512G).
>>
>> To compare the native and guest runs, I have been using "numactl" to
>> control the cpu-node and memory-node bindings for the qemu instance.  For
>> larger guest sizes I end up binding across multiple localities, e.g. for
>> a 40-way guest:
>>
>> numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
>> qemu-system-x86_64 -smp 40 -m 262144 \
>> <....>
>>
>> I understand that the actual mappings from a guest virtual address to a
>> host physical address can change over time.
>>
>> Is there a way to determine, at a given instant, which host NUMA node is
>> providing the backing physical memory for the guest's kernel and for the
>> apps actively running in the guest?
>>
>> I am guessing there is a better way (some tool, perhaps?) than just
>> diffing the per-node memory usage from the before-and-after output of
>> "numactl --hardware" on the host.
>>
>
> Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.
>

You can look at /proc/<pid>/numa_maps and see all the mappings for the 
qemu process.  There should be one really large mapping for the guest 
memory, and that line shows (via its per-node page counts) how many 
pages each NUMA node is backing it with.  This will tell you how much 
comes from each node, but not specifically which page is mapped where.

Keep in mind that with the numactl binding you are using now, you will 
likely not get the benefit of the NUMA enhancements in the Linux kernel, 
in either the guest or the host.  There are a couple of reasons: (1) your 
guest does not have a NUMA topology defined (based on what I see from the 
qemu command above), so it will not do anything special based on the host 
topology.  Also, things that are normally broken down per NUMA node, like 
some spin-locks and sched-domains, are now system-wide/flat.  This is a 
big deal for the scheduler and for other things like kmem allocation.  
With a single 80-way VM with no NUMA, you will likely have massive 
spin-lock contention on some workloads. (2) Once the VM does have a NUMA 
topology (via qemu -numa), one still cannot manually set a mempolicy for 
the portion of VM memory that represents each NUMA node in the VM (or 
have this done automatically with something like autoNUMA).  Therefore, 
it is difficult to forcefully map each VM node's memory to the 
corresponding host node.

There are some things you can do to mitigate some of this.  Definitely 
define the VM to match the NUMA topology found on the host.  That will at 
least allow good scaling wrt locks and the scheduler in the guest.  As 
for getting memory placement close (a page in VM node X actually resides 
in host node X), you have to rely on vcpu pinning plus guest NUMA 
topology, combined with the default mempolicy in the guest and host.  As 
pages are faulted in the guest, the hope is that the vcpu which did the 
faulting is running in the right node (guest and host), its guest OS 
mempolicy ensures the page is allocated in the guest-local node, and that 
allocation causes a fault in qemu, which is -also- running on host node 
X.  The vcpu pinning is critical to get qemu to fault that memory in on 
the correct node.  Make sure you do not use numactl for any of this; I 
would suggest using libvirt and defining the vcpu pinning and the NUMA 
topology in the XML.
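
For what it's worth, once the guest is managed by libvirt the same pinning
can also be applied from the shell, one vcpu at a time (the domain name
here is a placeholder):

virsh vcpupin <domain> 0 0      # pin guest vcpu 0 to host cpu 0
virsh vcpupin <domain> 1 1
...

Defining it in the domain XML (<cputune>/<vcpupin>) has the same effect
and keeps it persistent.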

-Andrew Theurer



* Re: How to determine the backing host physical memory for a given guest ?
  2012-05-10 15:34   ` Andrew Theurer
@ 2012-05-11  1:22     ` Chegu Vinod
  2012-05-12  2:50       ` Chegu Vinod
  0 siblings, 1 reply; 6+ messages in thread
From: Chegu Vinod @ 2012-05-11  1:22 UTC (permalink / raw)
  To: kvm

Andrew Theurer <habanero <at> linux.vnet.ibm.com> writes:

> 
> On 05/09/2012 08:46 AM, Avi Kivity wrote:
> > On 05/09/2012 04:05 PM, Chegu Vinod wrote:
> >> Hello,
> >>
> >> On an 8-socket Westmere host I am attempting to run a single guest and
> >> characterize the virtualization overhead for a system-intensive
> >> workload (AIM7-high_systime) as the size of the guest scales (10-way/64G,
> >> 20-way/128G, ... 80-way/512G).
> >>
> >> To compare the native and guest runs, I have been using "numactl" to
> >> control the cpu-node and memory-node bindings for the qemu instance.  For
> >> larger guest sizes I end up binding across multiple localities, e.g. for
> >> a 40-way guest:
> >>
> >> numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
> >> qemu-system-x86_64 -smp 40 -m 262144 \
> >> <....>
> >>
> >> I understand that the actual mappings from a guest virtual address to a
> >> host physical address can change over time.
> >>
> >> Is there a way to determine, at a given instant, which host NUMA node is
> >> providing the backing physical memory for the guest's kernel and for the
> >> apps actively running in the guest?
> >>
> >> I am guessing there is a better way (some tool, perhaps?) than just
> >> diffing the per-node memory usage from the before-and-after output of
> >> "numactl --hardware" on the host.
> >>
> >
> > Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.
> >
> 
> You can look at /proc/<pid>/numa_maps and see all the mappings for the 
> qemu process.  There should be one really large mapping for the guest 
> memory, and that line shows (via its per-node page counts) how many 
> pages each NUMA node is backing it with.  This will tell you how much 
> comes from each node, but not specifically which page is mapped where.

Thanks. I will look at this in more detail.


> 
> Keep in mind that with the numactl binding you are using now, you will 
> likely not get the benefit of the NUMA enhancements in the Linux kernel, 
> in either the guest or the host.  There are a couple of reasons: (1) your 
> guest does not have a NUMA topology defined (based on what I see from the 
> qemu command above), so it will not do anything special based on the host 
> topology.  Also, things that are normally broken down per NUMA node, like 
> some spin-locks and sched-domains, are now system-wide/flat.  This is a 
> big deal for the scheduler and for other things like kmem allocation.  
> With a single 80-way VM with no NUMA, you will likely have massive 
> spin-lock contention on some workloads.


We had seen evidence of increased lock contention (via lockstat etc.) as 
the guest size increased. 

[On a related note: given the nature of this system-intensive workload, 
the combination of the ticket-based locks in the guest OS and the PLE 
handling code in the host kernel was not helping, so I temporarily worked 
around this. I hope to try out the PV lock changes soon.]

Regarding the -numa option:

I tried the -numa option earlier (about a month ago). The layout I 
specified didn't match the layout the guest saw. I haven't yet looked 
into the exact reason, but I learned that there was already an open issue: 

https://bugzilla.redhat.com/show_bug.cgi?id=816804 

I also remember noticing a warning message when the -numa option was used 
for a guest with more than 64 VCPUs (in my case, 80 VCPUs).  I will be 
looking at the code soon to see whether there is a limitation...

I have been using [more or less the upstream version of] qemu directly to 
start the guest.  For the 10-way, 20-way, 40-way and 60-way guest sizes I 
had been using numactl just to control the NUMA nodes where the guest ends 
up running. After the guest booted up I would set the affinity of the 
VCPUs (to specific cores on the host) via "taskset" (a touch painful 
compared to virsh vcpupin; the general idea is sketched below). For the 
80-way guest (on an 80-way host) I don't use numactl.
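
The per-vcpu pinning is roughly of this form (illustrative only; the
thread IDs come from the monitor's "info cpus" output or from
/proc/<qemu-pid>/task/, and the cpu numbers are made up):

taskset -pc 0 <tid-of-vcpu0-thread>     # pin the vcpu0 thread to host cpu 0
taskset -pc 1 <tid-of-vcpu1-thread>
...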

I noticed that pinning the VCPUs with "taskset" didn't always give better 
performance; perhaps this is due to the absence of a NUMA layout in the 
guest.


> (2) Once the VM does have a NUMA topology (via qemu -numa), one still 
> cannot manually set a mempolicy for the portion of VM memory that 
> represents each NUMA node in the VM (or have this done automatically 
> with something like autoNUMA).  Therefore, it is difficult to forcefully 
> map each VM node's memory to the corresponding host node.
> 
> There are some things you can do to mitigate some of this.  Definitely 
> define the VM to match the NUMA topology found on the host.


The native/host platform has multiple levels of NUMA...

node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  14  23  23  27  27  27  27 
  1:  14  10  23  23  27  27  27  27 
  2:  23  23  10  14  27  27  27  27 
  3:  23  23  14  10  27  27  27  27 
  4:  27  27  27  27  10  14  23  23 
  5:  27  27  27  27  14  10  23  23 
  6:  27  27  27  27  23  23  10  14 
  7:  27  27  27  27  23  23  14  10 

Qemu's -numa option seems to allow for only one level (i.e. specifying 
multiple sockets, etc.). Am I missing something?

> That will at least allow good scaling wrt locks and the scheduler in 
> the guest.  As for getting memory placement close (a page in VM node X 
> actually resides in host node X), you have to rely on vcpu pinning plus 
> guest NUMA topology, combined with the default mempolicy in the guest 
> and host.


I did recompile both the kernels with the SLUB allocator enabled...


> As pages are faulted in the guest, the hope is that the vcpu which did 
> the faulting is running in the right node (guest and host), its guest OS 
> mempolicy ensures the page is allocated in the guest-local node, and that 
> allocation causes a fault in qemu, which is -also- running on host node 
> X.  The vcpu pinning is critical to get qemu to fault that memory in on 
> the correct node.

In the absence of a NUMA layout in the guest it doesn't look like pinning 
helped... but I think I understand what you are saying. Thanks!

> Make sure you do not use numactl for any of this; I would suggest using 
> libvirt and defining the vcpu pinning and the NUMA topology in the XML.


I will try this in the coming days (waiting to get back on the system :)). 

Thanks for the detailed response! 

Vinod


> 
> -Andrew Theurer

* Re: How to determine the backing host physical memory for a given guest ?
  2012-05-11  1:22     ` Chegu Vinod
@ 2012-05-12  2:50       ` Chegu Vinod
  0 siblings, 0 replies; 6+ messages in thread
From: Chegu Vinod @ 2012-05-12  2:50 UTC (permalink / raw)
  To: kvm

Chegu Vinod <chegu_vinod <at> hp.com> writes:

> 
> Andrew Theurer <habanero <at> linux.vnet.ibm.com> writes:
> 

> Regarding the -numa option:
> 
> I tried the -numa option earlier (about a month ago). The layout I 
> specified didn't match the layout the guest saw. I haven't yet looked 
> into the exact reason, but I learned that there was already an open issue: 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=816804 
> 
> I also remember noticing a warning message when the -numa option was used 
> for a guest with more than 64 VCPUs (in my case, 80 VCPUs).  I will be 
> looking at the code soon to see whether there is a limitation...
> 


Just FYI...

I made some minor changes in qemu to allow for more than 64 CPUs while 
using the -numa option, and was able to boot an 80-way guest with 8 NUMA 
nodes in it (the general shape of such a command line is sketched below).
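
For illustration, such a command line generally looks something like this
(the nodeid/cpus/mem split shown is an assumption, and the -numa syntax
can differ between qemu versions):

qemu-system-x86_64 -smp 80 -m 524288 \
-numa node,nodeid=0,cpus=0-9,mem=65536 \
-numa node,nodeid=1,cpus=10-19,mem=65536 \
...(and so on up to nodeid=7) \
<....>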

With NUMA in the guest I got much better performance for the 
AIM7-high_systime workload (pretty close to native!). 

Given the nature of this system-intensive workload and its small memory 
footprint, I didn't pin the VCPUs.  I do expect to see a need for pinning 
the VCPUs when running memory-intensive workloads. 

I still have to try the experiments with the PV lock changes (instead of 
the current temporary workarounds). 

Thanks
Vinod




> > 
> > -Andrew Theurer

