* Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
@ 2012-04-11 17:21 Chegu Vinod
2012-04-12 18:21 ` Rik van Riel
0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-11 17:21 UTC (permalink / raw)
To: kvm
Hello,
While running AIM7 (workfile.high_systime) in a single 40-way (or a single
60-way) KVM guest, I noticed pretty bad performance when the guest was booted
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220
(RHEL6.2) kernel.
I am still trying to dig into the details here. I'm wondering if some changes
in the upstream kernel (i.e. since 2.6.32-220) might be causing this to show up
in a guest environment (especially for this system-intensive workload).
Has anyone else observed this kind of behavior? Is it a known issue with a fix
in the pipeline? If not, are there any special knobs/tunables that need to be
explicitly set/cleared when using newer kernels like 3.3.1 in a guest?
I have included some information below. Any pointers on what else I could
capture that would be helpful are also welcome.
Thanks!
Vinod
---
Platform used:
DL980 G7 (80 cores + 128G RAM). Hyper-threading is turned off.
Workload used:
AIM7 (workfile.high_systime), using RAM disks. This is
primarily a CPU-intensive workload; not much I/O.
Software used :
qemu-system-x86_64 : 1.0.50 (i.e. latest as of about a week or so ago).
Native/Host OS : 3.3.1 (SLUB allocator explicitly enabled)
Guest-RunA OS : 2.6.32-220 (i.e. RHEL6.2 kernel)
Guest-RunB OS : 3.3.1
Guest was pinned on:
numa nodes 4,5,6,7 -> 40 VCPUs + 64G (i.e. 40-way guest)
numa nodes 2,3,4,5,6,7 -> 60 VCPUs + 96G (i.e. 60-way guest)
For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better
than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run, the older
guest kernel was nearly 12x better!
For the Guest-RunB (3.3.1) case I ran "mpstat -P ALL 1" on the host and observed
that a very high percentage of time was being spent by the CPUs outside guest
mode, mostly in the host (i.e. sys). Looking at the "perf" traces, it seemed
there were long pauses in the guest, perhaps waiting on the zone->lru_lock in
release_pages(), which caused the VT PLE-related code to kick in on the host.
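(For reference, the host-side observation above can be sketched roughly as
follows. This is a sketch, not the exact commands used; the kvm:kvm_exit
tracepoint name assumes a 3.x-era perf/kernel.)

```shell
# Watch the per-CPU time split on the host: %guest vs %sys shows how much
# time is spent outside guest mode (one-second samples, all CPUs).
mpstat -P ALL 1

# Count VM exits system-wide for 10 seconds; a PLE storm shows up as a
# very high exit rate. Assumes the kvm:kvm_exit tracepoint is available.
perf stat -e 'kvm:kvm_exit' -a sleep 10
```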
I turned on function tracing and found that more time appears to be spent
around the lock code in the 3.3.1 guest than in the 2.6.32-220 guest. Here is a
small sampling of these traces. Notice the timestamp jump around
"_raw_spin_lock_irqsave <-release_pages" in the Guest-RunB case.
1) 40-way Guest-RunA (2.6.32-220 kernel):
-----------------------------------------
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
<...>-32147 [020] 145783.127452: native_flush_tlb <-flush_tlb_mm
<...>-32147 [020] 145783.127452: free_pages_and_swap_cache <-unmap_region
<...>-32147 [020] 145783.127452: lru_add_drain <-free_pages_and_swap_cache
<...>-32147 [020] 145783.127452: release_pages <-free_pages_and_swap_cache
<...>-32147 [020] 145783.127452: _spin_lock_irqsave <-release_pages
<...>-32147 [020] 145783.127452: __mod_zone_page_state <-release_pages
<...>-32147 [020] 145783.127452: mem_cgroup_del_lru_list <-release_pages
...
<...>-32147 [022] 145783.133536: release_pages <-free_pages_and_swap_cache
<...>-32147 [022] 145783.133536: _spin_lock_irqsave <-release_pages
<...>-32147 [022] 145783.133536: __mod_zone_page_state <-release_pages
<...>-32147 [022] 145783.133536: mem_cgroup_del_lru_list <-release_pages
<...>-32147 [022] 145783.133537: lookup_page_cgroup <-mem_cgroup_del_lru_list
2) 40-way Guest-RunB (3.3.1):
-----------------------------
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
<...>-16459 [009] .... 101757.383125: free_pages_and_swap_cache <-tlb_flush_mmu
<...>-16459 [009] .... 101757.383125: lru_add_drain <-free_pages_and_swap_cache
<...>-16459 [009] .... 101757.383125: release_pages <-free_pages_and_swap_cache
<...>-16459 [009] .... 101757.383125: _raw_spin_lock_irqsave <-release_pages
<...>-16459 [009] d... 101757.384861: mem_cgroup_lru_del_list <-release_pages
<...>-16459 [009] d... 101757.384861: lookup_page_cgroup <-mem_cgroup_lru_del_list
....
<...>-16459 [009] .N.. 101757.390385: release_pages <-free_pages_and_swap_cache
<...>-16459 [009] .N.. 101757.390385: _raw_spin_lock_irqsave <-release_pages
<...>-16459 [009] dN.. 101757.392983: mem_cgroup_lru_del_list <-release_pages
<...>-16459 [009] dN.. 101757.392983: lookup_page_cgroup <-mem_cgroup_lru_del_list
<...>-16459 [009] dN.. 101757.392983: __mod_zone_page_state <-release_pages
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-11 17:21 Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS Chegu Vinod
@ 2012-04-12 18:21 ` Rik van Riel
2012-04-16 3:04 ` Chegu Vinod
2012-04-16 12:18 ` Gleb Natapov
0 siblings, 2 replies; 9+ messages in thread
From: Rik van Riel @ 2012-04-12 18:21 UTC (permalink / raw)
To: Chegu Vinod; +Cc: kvm, Gleb Natapov
On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>
> Hello,
>
> While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> 60-way KVM guest) I noticed pretty bad performance when the guest was booted
> with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> (RHEL6.2) kernel.
> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> kernel was nearly 12x better !
> Turned on function tracing and found that there appears to be more time being
> spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> guest.
Looks like you may be running into the ticket spinlock
code. During the early RHEL 6 days, Gleb came up with a
patch to automatically disable ticket spinlocks when
running inside a KVM guest.
IIRC that patch got rejected upstream at the time,
with upstream developers preferring to wait for a
"better solution".
If such a better solution is not on its way upstream
now (two years later), maybe we should just merge
Gleb's patch upstream for the time being?
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-12 18:21 ` Rik van Riel
@ 2012-04-16 3:04 ` Chegu Vinod
2012-04-16 12:18 ` Gleb Natapov
1 sibling, 0 replies; 9+ messages in thread
From: Chegu Vinod @ 2012-04-16 3:04 UTC (permalink / raw)
To: kvm
Rik van Riel <riel <at> redhat.com> writes:
>
> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >
> > Hello,
> >
> > While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> > 60-way KVM guest) I noticed pretty bad performance when the guest was booted
> > with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> > (RHEL6.2) kernel.
>
> > For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> > the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> > kernel was nearly 12x better !
>
> > Turned on function tracing and found that there appears to be more time being
> > spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> > guest.
>
> Looks like you may be running into the ticket spinlock
> code. During the early RHEL 6 days, Gleb came up with a
> patch to automatically disable ticket spinlocks when
> running inside a KVM guest.
>
Thanks for the pointer.
Perhaps that is the issue.
I did look up that old discussion thread.
> IIRC that patch got rejected upstream at the time,
> with upstream developers preferring to wait for a
> "better solution".
>
> If such a better solution is not on its way upstream
> now (two years later), maybe we should just merge
> Gleb's patch upstream for the time being?
I also noticed a recent discussion thread (one that originated in the Xen
context):
http://article.gmane.org/gmane.linux.kernel.virtualization/15078
I am not yet sure whether this recent discussion is related to the older one
initiated by Gleb.
Thanks
Vinod
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-12 18:21 ` Rik van Riel
2012-04-16 3:04 ` Chegu Vinod
@ 2012-04-16 12:18 ` Gleb Natapov
2012-04-16 14:44 ` Chegu Vinod
1 sibling, 1 reply; 9+ messages in thread
From: Gleb Natapov @ 2012-04-16 12:18 UTC (permalink / raw)
To: Rik van Riel; +Cc: Chegu Vinod, kvm
On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >
> >Hello,
> >
> >While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> >60-way KVM guest) I noticed pretty bad performance when the guest was booted
> >with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> >(RHEL6.2) kernel.
>
> >For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> >the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> >kernel was nearly 12x better !
>
How many CPUs does your host have?
> >Turned on function tracing and found that there appears to be more time being
> >spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> >guest.
>
> Looks like you may be running into the ticket spinlock
> code. During the early RHEL 6 days, Gleb came up with a
> patch to automatically disable ticket spinlocks when
> running inside a KVM guest.
>
> IIRC that patch got rejected upstream at the time,
> with upstream developers preferring to wait for a
> "better solution".
>
> If such a better solution is not on its way upstream
> now (two years later), maybe we should just merge
> Gleb's patch upstream for the time being?
I think the pv spinlock work that is being actively discussed at the moment
should address the issue, but I am not sure anyone has tested it against a
non-ticket lock in a guest to see which one performs better.
--
Gleb.
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-16 12:18 ` Gleb Natapov
@ 2012-04-16 14:44 ` Chegu Vinod
2012-04-17 9:49 ` Gleb Natapov
0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-16 14:44 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Rik van Riel, kvm
On 4/16/2012 5:18 AM, Gleb Natapov wrote:
> On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
>> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>>> Hello,
>>>
>>> While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
>>> 60-way KVM guest) I noticed pretty bad performance when the guest was booted
>>> with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
>>> (RHEL6.2) kernel.
>>> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
>>> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
>>> kernel was nearly 12x better !
> How many CPUs your host has?
80 cores on the DL980 (i.e. 8 Westmere sockets).
I was using numactl to bind the qemu of the 40-way guest to NUMA
nodes 4-7 (or, for a 60-way guest, binding it to nodes 2-7):
/etc/qemu-ifup tap0

numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 \
  /usr/local/bin/qemu-system-x86_64 -enable-kvm \
  -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme \
  -m 65536 -smp 40 \
  -name vm1 \
  -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
  -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
  -monitor stdio \
  -net nic,macaddr=<..mac_addr..> \
  -net tap,ifname=tap0,script=no,downscript=no \
  -vnc :4

/etc/qemu-ifdown tap0
I knew that there would be a few additional temporary qemu worker threads
created, i.e. there will be some oversubscription.
I will have to retry with explicit pinning of the vcpus to host cores
(without using virsh).
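(The explicit pinning I have in mind is along these lines. This is a
hypothetical sketch: the core numbers and thread ids below are placeholders,
and the vcpu thread ids would come from "info cpus" in the qemu monitor.)

```shell
# In the qemu monitor, list the vcpu threads:
#   (qemu) info cpus
#   * CPU #0: ... thread_id=12345
#     CPU #1: ... thread_id=12346
#   ...

# On the host, pin each vcpu thread to its own core inside the bound
# nodes (core and thread-id numbers here are placeholders):
taskset -pc 40 12345   # vcpu0 -> host core 40
taskset -pc 41 12346   # vcpu1 -> host core 41
# ...and so on for the remaining vcpus
```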
>>> Turned on function tracing and found that there appears to be more time being
>>> spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
>>> guest.
>> Looks like you may be running into the ticket spinlock
>> code. During the early RHEL 6 days, Gleb came up with a
>> patch to automatically disable ticket spinlocks when
>> running inside a KVM guest.
>>
>> IIRC that patch got rejected upstream at the time,
>> with upstream developers preferring to wait for a
>> "better solution".
>>
>> If such a better solution is not on its way upstream
>> now (two years later), maybe we should just merge
>> Gleb's patch upstream for the time being?
> I think the pv spinlock that is actively discussed currently should
> address the issue, but I am not sure someone tests it against non-ticket
> lock in a guest to see which one performs better.
I did see that discussion...seems to have originated from the Xen context.
Vinod
>
> --
> Gleb.
>
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-16 14:44 ` Chegu Vinod
@ 2012-04-17 9:49 ` Gleb Natapov
2012-04-17 13:25 ` Chegu Vinod
0 siblings, 1 reply; 9+ messages in thread
From: Gleb Natapov @ 2012-04-17 9:49 UTC (permalink / raw)
To: Chegu Vinod; +Cc: Rik van Riel, kvm
On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
> On 4/16/2012 5:18 AM, Gleb Natapov wrote:
> >On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
> >>On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >>>Hello,
> >>>
> >>>While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> >>>60-way KVM guest) I noticed pretty bad performance when the guest was booted
> >>>with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> >>>(RHEL6.2) kernel.
> >>>For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> >>>the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> >>>kernel was nearly 12x better !
> >How many CPUs your host has?
>
> 80 Cores on the DL980. (i.e. 8 Westmere sockets).
>
So you are not oversubscribing CPUs at all. Are those real cores, or does that count include HT?
Do you have other CPU hogs running on the host while testing the guest?
> I was using numactl to bind the qemu of the 40-way guests to numa
> nodes : 4-7 ( or for a 60-way guest
> binding them to nodes 2-7)
>
> /etc/qemu-ifup tap0
>
> numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
> /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
> -enable-kvm \
> -m 65536 -smp 40 \
> -name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
> \
> -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
> -device virtio-blk-pci,scsi=off,bus=pci
> .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> -monitor stdio \
> -net nic,macaddr=<..mac_addr..> \
> -net tap,ifname=tap0,script=no,downscript=no \
> -vnc :4
>
> /etc/qemu-ifdown tap0
>
>
> I knew that there will be a few additional temporary qemu worker
> threads created... i.e. some over
> subscription will be there.
>
The 4 nodes above have 40 real cores, yes? Can you try running the upstream
kernel without any binding and check the performance?
>
> Will have to retry by doing some explicit pinning of the vcpus to
> native cores (without using virsh).
>
> >>>Turned on function tracing and found that there appears to be more time being
> >>>spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> >>>guest.
> >>Looks like you may be running into the ticket spinlock
> >>code. During the early RHEL 6 days, Gleb came up with a
> >>patch to automatically disable ticket spinlocks when
> >>running inside a KVM guest.
> >>
> >>IIRC that patch got rejected upstream at the time,
> >>with upstream developers preferring to wait for a
> >>"better solution".
> >>
> >>If such a better solution is not on its way upstream
> >>now (two years later), maybe we should just merge
> >>Gleb's patch upstream for the time being?
> >I think the pv spinlock that is actively discussed currently should
> >address the issue, but I am not sure someone tests it against non-ticket
> >lock in a guest to see which one performs better.
>
> I did see that discussion...seems to have originated from the Xen context.
>
Yes, the problem is the same for both hypervisors.
--
Gleb.
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-17 9:49 ` Gleb Natapov
@ 2012-04-17 13:25 ` Chegu Vinod
2012-04-19 4:44 ` Chegu Vinod
0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-17 13:25 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Rik van Riel, kvm
On 4/17/2012 2:49 AM, Gleb Natapov wrote:
> On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
>> On 4/16/2012 5:18 AM, Gleb Natapov wrote:
>>> On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
>>>> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>>>>> Hello,
>>>>>
>>>>> While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
>>>>> 60-way KVM guest) I noticed pretty bad performance when the guest was booted
>>>>> with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
>>>>> (RHEL6.2) kernel.
>>>>> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
>>>>> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
>>>>> kernel was nearly 12x better !
>>> How many CPUs your host has?
>> 80 Cores on the DL980. (i.e. 8 Westmere sockets).
>>
> So you are not oversubscribing CPUs at all. Are those real cores or including HT?
HT is off.
> Do you have other cpus hogs running on the host while testing the guest?
Nope. Sometimes I do run utilities like "perf", "sar" or "mpstat", but on
NUMA node 0 (where the guest is not running).
>
>> I was using numactl to bind the qemu of the 40-way guests to numa
>> nodes : 4-7 ( or for a 60-way guest
>> binding them to nodes 2-7)
>>
>> /etc/qemu-ifup tap0
>>
>> numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
>> /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
>> -enable-kvm \
>> -m 65536 -smp 40 \
>> -name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
>> \
>> -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
>> -device virtio-blk-pci,scsi=off,bus=pci
>> .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>> -monitor stdio \
>> -net nic,macaddr=<..mac_addr..> \
>> -net tap,ifname=tap0,script=no,downscript=no \
>> -vnc :4
>>
>> /etc/qemu-ifdown tap0
>>
>>
>> I knew that there will be a few additional temporary qemu worker
>> threads created... i.e. some over
>> subscription will be there.
>>
> 4 nodes above have 40 real cores, yes?
Yes.
Other than qemu's related threads and some of the generic per-CPU Linux
kernel threads (e.g. migration etc.), there isn't anything else running
on these NUMA nodes.
> Can you try to run upstream
> kernel without binding at all and check the performance?
I shall re-run and get back to you with this info.
Typically, for the native runs, binding the workload results in better
numbers. Hence I chose to do the binding for the guest too, i.e. on the same
NUMA nodes as the native case, for virtualized-vs-native comparison purposes.
Having said that, in the past I have seen a couple of cases where the
non-bound guest performed better than the native case. I need to re-run and
dig into this further...
>
>> Will have to retry by doing some explicit pinning of the vcpus to
>> native cores (without using virsh).
>>
>>>>> Turned on function tracing and found that there appears to be more time being
>>>>> spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
>>>>> guest.
>>>> Looks like you may be running into the ticket spinlock
>>>> code. During the early RHEL 6 days, Gleb came up with a
>>>> patch to automatically disable ticket spinlocks when
>>>> running inside a KVM guest.
>>>>
>>>> IIRC that patch got rejected upstream at the time,
>>>> with upstream developers preferring to wait for a
>>>> "better solution".
>>>>
>>>> If such a better solution is not on its way upstream
>>>> now (two years later), maybe we should just merge
>>>> Gleb's patch upstream for the time being?
>>> I think the pv spinlock that is actively discussed currently should
>>> address the issue, but I am not sure someone tests it against non-ticket
>>> lock in a guest to see which one performs better.
>> I did see that discussion...seems to have originated from the Xen context.
>>
> Yes, The problem is the same for both hypervisors.
>
> --
> Gleb.
Thanks
Vinod
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-17 13:25 ` Chegu Vinod
@ 2012-04-19 4:44 ` Chegu Vinod
2012-04-19 6:01 ` Gleb Natapov
0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-19 4:44 UTC (permalink / raw)
To: chegu_vinod; +Cc: Gleb Natapov, Rik van Riel, kvm
On 4/17/2012 6:25 AM, Chegu Vinod wrote:
> On 4/17/2012 2:49 AM, Gleb Natapov wrote:
>> On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
>>> On 4/16/2012 5:18 AM, Gleb Natapov wrote:
>>>> On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
>>>>> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>>>>>> Hello,
>>>>>>
>>>>>> While running an AIM7 (workfile.high_systime) in a single 40-way
>>>>>> (or a single
>>>>>> 60-way KVM guest) I noticed pretty bad performance when the guest
>>>>>> was booted
>>>>>> with 3.3.1 kernel when compared to the same guest booted with
>>>>>> 2.6.32-220
>>>>>> (RHEL6.2) kernel.
>>>>>> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x
>>>>>> better than
>>>>>> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run
>>>>>> the older guest
>>>>>> kernel was nearly 12x better !
>>>> How many CPUs your host has?
>>> 80 Cores on the DL980. (i.e. 8 Westmere sockets).
>>>
>> So you are not oversubscribing CPUs at all. Are those real cores or
>> including HT?
>
> HT is off.
>
>> Do you have other cpus hogs running on the host while testing the guest?
>
> Nope. Sometimes I do run the utilities like "perf" or "sar" or
> "mpstat" on the numa node 0 (where
> the guest is not running).
>
>>
>>> I was using numactl to bind the qemu of the 40-way guests to numa
>>> nodes : 4-7 ( or for a 60-way guest
>>> binding them to nodes 2-7)
>>>
>>> /etc/qemu-ifup tap0
>>>
>>> numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
>>> /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu
>>> Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
>>> -enable-kvm \
>>> -m 65536 -smp 40 \
>>> -name vm1 -chardev
>>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
>>> \
>>> -drive
>>> file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
>>> -device virtio-blk-pci,scsi=off,bus=pci
>>> .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>>> -monitor stdio \
>>> -net nic,macaddr=<..mac_addr..> \
>>> -net tap,ifname=tap0,script=no,downscript=no \
>>> -vnc :4
>>>
>>> /etc/qemu-ifdown tap0
>>>
>>>
>>> I knew that there will be a few additional temporary qemu worker
>>> threads created... i.e. some over
>>> subscription will be there.
>>>
>> 4 nodes above have 40 real cores, yes?
>
> Yes .
> Other than the qemu's related threads and some of the generic per-cpu
> Linux kernel threads (e.g. migration etc)
> there isn't anything else running on these Numa nodes.
>
>> Can you try to run upstream
>> kernel without binding at all and check the performance?
>
Re-ran the same workload *without* binding the qemu, still using the
3.3.1 kernel:

20-way guest: performance got much worse compared to the case where we bind the qemu.
40-way guest: about the same as in the case where we bind the qemu.
60-way guest: about the same as in the case where we bind the qemu.

Trying out a couple of other experiments...
FYI
Vinod
* Re: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
2012-04-19 4:44 ` Chegu Vinod
@ 2012-04-19 6:01 ` Gleb Natapov
0 siblings, 0 replies; 9+ messages in thread
From: Gleb Natapov @ 2012-04-19 6:01 UTC (permalink / raw)
To: Chegu Vinod; +Cc: Rik van Riel, kvm
On Wed, Apr 18, 2012 at 09:44:47PM -0700, Chegu Vinod wrote:
> On 4/17/2012 6:25 AM, Chegu Vinod wrote:
> >On 4/17/2012 2:49 AM, Gleb Natapov wrote:
> >>On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
> >>>On 4/16/2012 5:18 AM, Gleb Natapov wrote:
> >>>>On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
> >>>>>On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >>>>>>Hello,
> >>>>>>
> >>>>>>While running an AIM7 (workfile.high_systime) in a
> >>>>>>single 40-way (or a single
> >>>>>>60-way KVM guest) I noticed pretty bad performance when
> >>>>>>the guest was booted
> >>>>>>with 3.3.1 kernel when compared to the same guest booted
> >>>>>>with 2.6.32-220
> >>>>>>(RHEL6.2) kernel.
> >>>>>>For the 40-way Guest-RunA (2.6.32-220 kernel) performed
> >>>>>>nearly 9x better than
> >>>>>>the Guest-RunB (3.3.1 kernel). In the case of 60-way
> >>>>>>guest run the older guest
> >>>>>>kernel was nearly 12x better !
> >>>>How many CPUs your host has?
> >>>80 Cores on the DL980. (i.e. 8 Westmere sockets).
> >>>
> >>So you are not oversubscribing CPUs at all. Are those real cores
> >>or including HT?
> >
> >HT is off.
> >
> >>Do you have other cpus hogs running on the host while testing the guest?
> >
> >Nope. Sometimes I do run the utilities like "perf" or "sar" or
> >"mpstat" on the numa node 0 (where
> >the guest is not running).
> >
> >>
> >>>I was using numactl to bind the qemu of the 40-way guests to numa
> >>>nodes : 4-7 ( or for a 60-way guest
> >>>binding them to nodes 2-7)
> >>>
> >>>/etc/qemu-ifup tap0
> >>>
> >>>numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
> >>>/usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
> >>>-enable-kvm \
> >>>-m 65536 -smp 40 \
> >>>-name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
> >>>\
> >>>-drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
> >>>-device virtio-blk-pci,scsi=off,bus=pci
> >>>.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> >>>-monitor stdio \
> >>>-net nic,macaddr=<..mac_addr..> \
> >>>-net tap,ifname=tap0,script=no,downscript=no \
> >>>-vnc :4
> >>>
> >>>/etc/qemu-ifdown tap0
> >>>
> >>>
> >>>I knew that there will be a few additional temporary qemu worker
> >>>threads created... i.e. some over
> >>>subscription will be there.
> >>>
> >>4 nodes above have 40 real cores, yes?
> >
> >Yes .
> >Other than the qemu's related threads and some of the generic
> >per-cpu Linux kernel threads (e.g. migration etc)
> >there isn't anything else running on these Numa nodes.
> >
> >>Can you try to run upstream
> >>kernel without binding at all and check the performance?
> >
>
> Re-ran the same workload *without* binding the qemu...but using the
> 3.3.1 kernel
>
> 20-way guest: Performance got much worse when compared to the case
> where bind the qemu.
> 40-way guest: about the same as in the case where we bind the qemu
> 60-way guest: about the same as in the case where we bind the qemu
>
> Trying out a couple of other experiments...
>
With 8 sockets the NUMA effects are probably very strong. A couple of things
to try:
1. Run a VM that fits into one NUMA node and bind it to that node. Compare
the performance of the RHEL kernel and upstream.
2. Run a VM bigger than a NUMA node, bind the vcpus to NUMA nodes separately,
and pass the resulting topology to the guest using the -numa flag.
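(A sketch of item 2, assuming the qemu 1.0-era -numa syntax; the node sizes
and cpu ranges below are illustrative, not a tested configuration:)

```shell
# Describe two 20-vcpu / 32G guest NUMA nodes on the command line
# (append the rest of the options used earlier in this thread):
qemu-system-x86_64 -enable-kvm -m 65536 -smp 40 \
    -numa node,mem=32768,cpus=0-19 \
    -numa node,mem=32768,cpus=20-39

# Then bind the vcpu threads for guest node 0 to one host node and those
# for guest node 1 to another, e.g. with "taskset -pc <core> <thread_id>"
# using the thread ids reported by "info cpus" in the qemu monitor.
```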
--
Gleb.