* Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs.  3.3.1 OS
@ 2012-04-11 17:21 Chegu Vinod
  2012-04-12 18:21 ` Rik van Riel
  0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-11 17:21 UTC (permalink / raw)
  To: kvm


Hello,

While running an AIM7 workload (workfile.high_systime) in a single 40-way (or a single 
60-way) KVM guest, I noticed pretty bad performance when the guest was booted 
with the 3.3.1 kernel compared to the same guest booted with the 2.6.32-220 
(RHEL6.2) kernel.

I am still trying to dig into the details here. I am wondering whether some changes in 
the upstream kernel (i.e. since 2.6.32-220) might be causing this to show up in 
a guest environment (especially for this system-intensive workload).

Has anyone else observed this kind of behavior? Is it a known issue with a fix 
in the pipeline? If not, are there any special knobs/tunables that one needs to 
explicitly set/clear when using newer kernels like 3.3.1 in a guest?

I have included some info below.

Also, any pointers on what else I could capture would be helpful.

Thanks!
Vinod

---

Platform used:
DL980 G7 (80 cores + 128G RAM).  Hyper-threading is turned off.

Workload used:
AIM7 (workfile.high_systime) using RAM disks. This is 
primarily a CPU-intensive workload with very little I/O.

Software used :
qemu-system-x86_64   :  1.0.50    (i.e. latest as of about a week or so ago).
Native/Host  OS      :  3.3.1     (SLUB allocator explicitly enabled)
Guest-RunA   OS      :  2.6.32-220 (i.e. RHEL6.2 kernel)
Guest-RunB   OS      :  3.3.1

Guest was pinned on (the host node layout can be checked as sketched below):
numa nodes: 4,5,6,7      ->  40 VCPUs + 64G   (i.e. 40-way guest)
numa nodes: 2,3,4,5,6,7  ->  60 VCPUs + 96G   (i.e. 60-way guest)
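
For reference, a minimal sketch of how the host's node-to-CPU layout can be 
confirmed before choosing nodes to pin to (the exact CPU ranges mentioned in the 
comments are assumptions about this DL980's numbering):

# Show the host NUMA topology: CPUs and memory per node (on an 80-core,
# 8-socket box each node would typically list 10 cores, e.g. node 4: cpus 40-49).
numactl --hardware

# Cross-check a single node's CPU list via sysfs.
cat /sys/devices/system/node/node4/cpulist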

The 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than 
Guest-RunB (3.3.1 kernel). In the case of the 60-way guest run, the older guest 
kernel was nearly 12x better!

For the Guest-RunB (3.3.1) case I ran "mpstat -P ALL 1" on the host and observed 
that a very high % of time was being spent by the CPUs outside guest mode, 
mostly in the host (i.e. %sys). Looking at the "perf" traces, it seemed like 
there were long pauses in the guest, perhaps waiting on the zone->lru_lock in 
release_pages(), and this caused the VT PLE (Pause Loop Exiting) handling to 
kick in on the host.
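
For reference, a minimal sketch of the kind of host-side sampling used here 
(the perf options and sampling window are assumptions, not the literal commands 
from this run):

# Per-CPU utilization on the host at 1-second intervals; the tell-tale sign
# was a high %sys with relatively little time spent in guest mode.
mpstat -P ALL 1

# System-wide call-graph sampling on the host for ~10 seconds, then a summary;
# spinlock/PLE-related paths show up near the top of the report.
perf record -a -g -- sleep 10
perf report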

I turned on function tracing and found that there appears to be more time being 
spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220 
guest. Here is a small sampling of these traces. Notice the timestamp jump 
around "_raw_spin_lock_irqsave <-release_pages" in the Guest-RunB case.
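
A minimal sketch of the function-tracer setup behind traces like the ones below, 
run inside each guest (the filter list is an assumption; the actual filters used 
for these runs may have differed):

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo function > current_tracer
# Trace only the page-release and locking paths of interest.
echo 'release_pages free_pages_and_swap_cache *spin_lock_irqsave*' > set_ftrace_filter
echo 1 > tracing_on
# ... run the AIM7 workload ...
echo 0 > tracing_on
cat trace > /tmp/aim7-ftrace.txt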


1) 40-way Guest-RunA (2.6.32-220 kernel):
-----------------------------------------


#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION

           <...>-32147 [020] 145783.127452: native_flush_tlb <-flush_tlb_mm
           <...>-32147 [020] 145783.127452: free_pages_and_swap_cache <-unmap_region
           <...>-32147 [020] 145783.127452: lru_add_drain <-free_pages_and_swap_cache
           <...>-32147 [020] 145783.127452: release_pages <-free_pages_and_swap_cache
           <...>-32147 [020] 145783.127452: _spin_lock_irqsave <-release_pages
           <...>-32147 [020] 145783.127452: __mod_zone_page_state <-release_pages
           <...>-32147 [020] 145783.127452: mem_cgroup_del_lru_list <-release_pages

...

           <...>-32147 [022] 145783.133536: release_pages <-free_pages_and_swap_cache
           <...>-32147 [022] 145783.133536: _spin_lock_irqsave <-release_pages
           <...>-32147 [022] 145783.133536: __mod_zone_page_state <-release_pages
           <...>-32147 [022] 145783.133536: mem_cgroup_del_lru_list <-release_pages
           <...>-32147 [022] 145783.133537: lookup_page_cgroup <-mem_cgroup_del_lru_list




2) 40-way Guest-RunB (3.3.1):
-----------------------------


#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
           <...>-16459 [009] .... 101757.383125: free_pages_and_swap_cache <-tlb_flush_mmu
           <...>-16459 [009] .... 101757.383125: lru_add_drain <-free_pages_and_swap_cache
           <...>-16459 [009] .... 101757.383125: release_pages <-free_pages_and_swap_cache
           <...>-16459 [009] .... 101757.383125: _raw_spin_lock_irqsave <-release_pages
           <...>-16459 [009] d... 101757.384861: mem_cgroup_lru_del_list <-release_pages
           <...>-16459 [009] d... 101757.384861: lookup_page_cgroup <-mem_cgroup_lru_del_list

....

           <...>-16459 [009] .N.. 101757.390385: release_pages <-free_pages_and_swap_cache
           <...>-16459 [009] .N.. 101757.390385: _raw_spin_lock_irqsave <-release_pages
           <...>-16459 [009] dN.. 101757.392983: mem_cgroup_lru_del_list <-release_pages
           <...>-16459 [009] dN.. 101757.392983: lookup_page_cgroup <-mem_cgroup_lru_del_list
           <...>-16459 [009] dN.. 101757.392983: __mod_zone_page_state <-release_pages






* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs. 3.3.1 OS
  2012-04-11 17:21 Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS Chegu Vinod
@ 2012-04-12 18:21 ` Rik van Riel
  2012-04-16  3:04   ` Chegu Vinod
  2012-04-16 12:18   ` Gleb Natapov
  0 siblings, 2 replies; 9+ messages in thread
From: Rik van Riel @ 2012-04-12 18:21 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: kvm, Gleb Natapov

On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>
> Hello,
>
> While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> 60-way KVM guest) I noticed pretty bad performance when the guest was booted
> with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> (RHEL6.2) kernel.

> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> kernel was nearly 12x better !

> Turned on function tracing and found that there appears to be more time being
> spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> guest.

Looks like you may be running into the ticket spinlock
code. During the early RHEL 6 days, Gleb came up with a
patch to automatically disable ticket spinlocks when
running inside a KVM guest.

IIRC that patch got rejected upstream at the time,
with upstream developers preferring to wait for a
"better solution".

If such a better solution is not on its way upstream
now (two years later), maybe we should just merge
Gleb's patch upstream for the time being?


* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs.  3.3.1 OS
  2012-04-12 18:21 ` Rik van Riel
@ 2012-04-16  3:04   ` Chegu Vinod
  2012-04-16 12:18   ` Gleb Natapov
  1 sibling, 0 replies; 9+ messages in thread
From: Chegu Vinod @ 2012-04-16  3:04 UTC (permalink / raw)
  To: kvm

Rik van Riel <riel@redhat.com> writes:

> 
> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >
> > Hello,
> >
> > While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> > 60-way KVM guest) I noticed pretty bad performance when the guest was booted
> > with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> > (RHEL6.2) kernel.
> 
> > For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> > the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> > kernel was nearly 12x better !
> 
> > Turned on function tracing and found that there appears to be more time being
> > spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> > guest.
> 
> Looks like you may be running into the ticket spinlock
> code. During the early RHEL 6 days, Gleb came up with a
> patch to automatically disable ticket spinlocks when
> running inside a KVM guest.
> 

Thanks for the pointer. 
Perhaps that is the issue.  
I did look up that old discussion thread.


> IIRC that patch got rejected upstream at the time,
> with upstream developers preferring to wait for a
> "better solution".
> 
> If such a better solution is not on its way upstream
> now (two years later), maybe we should just merge
> Gleb's patch upstream for the time being?



I also noticed a recent discussion thread (that originated from the Xen context):

http://article.gmane.org/gmane.linux.kernel.virtualization/15078

I am not yet sure whether this recent discussion is related in some way to
the older one initiated by Gleb.

Thanks
Vinod





* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs. 3.3.1 OS
  2012-04-12 18:21 ` Rik van Riel
  2012-04-16  3:04   ` Chegu Vinod
@ 2012-04-16 12:18   ` Gleb Natapov
  2012-04-16 14:44     ` Chegu Vinod
  1 sibling, 1 reply; 9+ messages in thread
From: Gleb Natapov @ 2012-04-16 12:18 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Chegu Vinod, kvm

On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >
> >Hello,
> >
> >While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> >60-way KVM guest) I noticed pretty bad performance when the guest was booted
> >with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> >(RHEL6.2) kernel.
> 
> >For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> >the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> >kernel was nearly 12x better !
> 
How many CPUs does your host have?

> >Turned on function tracing and found that there appears to be more time being
> >spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> >guest.
> 
> Looks like you may be running into the ticket spinlock
> code. During the early RHEL 6 days, Gleb came up with a
> patch to automatically disable ticket spinlocks when
> running inside a KVM guest.
> 
> IIRC that patch got rejected upstream at the time,
> with upstream developers preferring to wait for a
> "better solution".
> 
> If such a better solution is not on its way upstream
> now (two years later), maybe we should just merge
> Gleb's patch upstream for the time being?
I think the pv spinlock work that is currently under active discussion should
address the issue, but I am not sure anyone has tested it against a non-ticket
lock in a guest to see which one performs better.
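
(As an aside, a minimal way to check whether a given guest kernel was even built
with paravirt spinlock support; the config file path is an assumption:)

# Inspect the running guest kernel's build configuration.
grep -E 'CONFIG_PARAVIRT_SPINLOCKS|CONFIG_PARAVIRT_GUEST' /boot/config-$(uname -r)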

--
			Gleb.


* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs. 3.3.1 OS
  2012-04-16 12:18   ` Gleb Natapov
@ 2012-04-16 14:44     ` Chegu Vinod
  2012-04-17  9:49       ` Gleb Natapov
  0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-16 14:44 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Rik van Riel, kvm

On 4/16/2012 5:18 AM, Gleb Natapov wrote:
> On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
>> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>>> Hello,
>>>
>>> While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
>>> 60-way KVM guest) I noticed pretty bad performance when the guest was booted
>>> with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
>>> (RHEL6.2) kernel.
>>> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
>>> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
>>> kernel was nearly 12x better !
> How many CPUs your host has?

80 Cores on the DL980.  (i.e. 8 Westmere sockets).

I was using numactl to bind the qemu of the 40-way guest to numa nodes 4-7 
(or, for a 60-way guest, binding it to nodes 2-7).

/etc/qemu-ifup tap0

numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 \
/usr/local/bin/qemu-system-x86_64 -enable-kvm \
-cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme \
-enable-kvm \
-m 65536 -smp 40 \
-name vm1 \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait \
-drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-monitor stdio \
-net nic,macaddr=<..mac_addr..> \
-net tap,ifname=tap0,script=no,downscript=no \
-vnc :4

/etc/qemu-ifdown tap0


I knew that there would be a few additional temporary qemu worker threads 
created, i.e. there would be some oversubscription.


I will have to retry by doing some explicit pinning of the vcpus to native 
cores (without using virsh); a rough sketch of that is below.
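
A minimal sketch of one way to do that pinning by hand (the thread ids and CPU 
numbers are purely illustrative; "info cpus" in the QEMU monitor reports the 
host thread id of each vcpu):

# In the QEMU monitor started with "-monitor stdio":
#   (qemu) info cpus
#   * CPU #0: pc=... thread_id=12345
#     CPU #1: pc=... thread_id=12346
#     ...
# Then pin each vcpu thread to one host core on nodes 4-7 (assuming here that
# those nodes cover host cores 40-79):
taskset -pc 40 12345
taskset -pc 41 12346
# ...and so on for the remaining vcpus.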

>>> Turned on function tracing and found that there appears to be more time being
>>> spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
>>> guest.
>> Looks like you may be running into the ticket spinlock
>> code. During the early RHEL 6 days, Gleb came up with a
>> patch to automatically disable ticket spinlocks when
>> running inside a KVM guest.
>>
>> IIRC that patch got rejected upstream at the time,
>> with upstream developers preferring to wait for a
>> "better solution".
>>
>> If such a better solution is not on its way upstream
>> now (two years later), maybe we should just merge
>> Gleb's patch upstream for the time being?
> I think the pv spinlock that is actively discussed currently should
> address the issue, but I am not sure someone tests it against non-ticket
> lock in a guest to see which one performs better.

I did see that discussion...seems to have originated from the Xen context.

Vinod

>
> --
> 			Gleb.
>



* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs. 3.3.1 OS
  2012-04-16 14:44     ` Chegu Vinod
@ 2012-04-17  9:49       ` Gleb Natapov
  2012-04-17 13:25         ` Chegu Vinod
  0 siblings, 1 reply; 9+ messages in thread
From: Gleb Natapov @ 2012-04-17  9:49 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: Rik van Riel, kvm

On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
> On 4/16/2012 5:18 AM, Gleb Natapov wrote:
> >On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
> >>On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >>>Hello,
> >>>
> >>>While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
> >>>60-way KVM guest) I noticed pretty bad performance when the guest was booted
> >>>with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
> >>>(RHEL6.2) kernel.
> >>>For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
> >>>the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
> >>>kernel was nearly 12x better !
> >How many CPUs your host has?
> 
> 80 Cores on the DL980.  (i.e. 8 Westmere sockets).
> 
So you are not oversubscribing CPUs at all. Are those real cores, or does that count include HT?
Do you have any other CPU hogs running on the host while testing the guest?

> I was using numactl to bind the qemu of the 40-way guests to numa
> nodes : 4-7  ( or for a 60-way guest
> binding them to nodes 2-7)
> 
> /etc/qemu-ifup tap0
> 
> numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
> /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
> -enable-kvm \
> -m 65536 -smp 40 \
> -name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
> \
> -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
> -device virtio-blk-pci,scsi=off,bus=pci
> .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> -monitor stdio \
> -net nic,macaddr=<..mac_addr..> \
> -net tap,ifname=tap0,script=no,downscript=no \
> -vnc :4
> 
> /etc/qemu-ifdown tap0
> 
> 
> I knew that there will be a few additional temporary qemu worker
> threads created...  i.e. some over
> subscription  will be there.
> 
The 4 nodes above have 40 real cores, yes? Can you try to run the upstream
kernel without binding at all and check the performance?

> 
> Will have to retry by doing some explicit pinning of the vcpus to
> native cores (without using virsh).
> 
> >>>Turned on function tracing and found that there appears to be more time being
> >>>spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
> >>>guest.
> >>Looks like you may be running into the ticket spinlock
> >>code. During the early RHEL 6 days, Gleb came up with a
> >>patch to automatically disable ticket spinlocks when
> >>running inside a KVM guest.
> >>
> >>IIRC that patch got rejected upstream at the time,
> >>with upstream developers preferring to wait for a
> >>"better solution".
> >>
> >>If such a better solution is not on its way upstream
> >>now (two years later), maybe we should just merge
> >>Gleb's patch upstream for the time being?
> >I think the pv spinlock that is actively discussed currently should
> >address the issue, but I am not sure someone tests it against non-ticket
> >lock in a guest to see which one performs better.
> 
> I did see that discussion...seems to have originated from the Xen context.
> 
Yes, the problem is the same for both hypervisors.

--
			Gleb.


* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs. 3.3.1 OS
  2012-04-17  9:49       ` Gleb Natapov
@ 2012-04-17 13:25         ` Chegu Vinod
  2012-04-19  4:44           ` Chegu Vinod
  0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-17 13:25 UTC (permalink / raw)
  To: Gleb Natapov; +Cc: Rik van Riel, kvm

On 4/17/2012 2:49 AM, Gleb Natapov wrote:
> On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
>> On 4/16/2012 5:18 AM, Gleb Natapov wrote:
>>> On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
>>>> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>>>>> Hello,
>>>>>
>>>>> While running an AIM7 (workfile.high_systime) in a single 40-way (or a single
>>>>> 60-way KVM guest) I noticed pretty bad performance when the guest was booted
>>>>> with 3.3.1 kernel when compared to the same guest booted with 2.6.32-220
>>>>> (RHEL6.2) kernel.
>>>>> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x better than
>>>>> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run the older guest
>>>>> kernel was nearly 12x better !
>>> How many CPUs your host has?
>> 80 Cores on the DL980.  (i.e. 8 Westmere sockets).
>>
> So you are not oversubscribing CPUs at all. Are those real cores or including HT?

HT is off.

> Do you have other cpus hogs running on the host while testing the guest?

Nope.  Sometimes I do run the utilities like "perf" or "sar" or "mpstat" 
on the numa node 0 (where
the guest is not running).

>
>> I was using numactl to bind the qemu of the 40-way guests to numa
>> nodes : 4-7  ( or for a 60-way guest
>> binding them to nodes 2-7)
>>
>> /etc/qemu-ifup tap0
>>
>> numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
>> /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
>> -enable-kvm \
>> -m 65536 -smp 40 \
>> -name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
>> \
>> -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
>> -device virtio-blk-pci,scsi=off,bus=pci
>> .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>> -monitor stdio \
>> -net nic,macaddr=<..mac_addr..>  \
>> -net tap,ifname=tap0,script=no,downscript=no \
>> -vnc :4
>>
>> /etc/qemu-ifdown tap0
>>
>>
>> I knew that there will be a few additional temporary qemu worker
>> threads created...  i.e. some over
>> subscription  will be there.
>>
> 4 nodes above have 40 real cores, yes?

Yes.
Other than the qemu-related threads and some of the generic per-CPU Linux 
kernel threads (e.g. migration, etc.), there isn't anything else running on 
these NUMA nodes.

> Can you try to run upstream
> kernel without binding at all and check the performance?


I shall re-run and get back to you with this info.

Typically, for the native runs, binding the workload results in better 
numbers. Hence I chose to do the binding for the guest too, i.e. on the same 
numa nodes as the native case, for virtualized vs. native comparison purposes. 
Having said that, in the past I have seen a couple of cases where the non-bound 
guest performed better than the native case. I need to re-run and dig into this 
further.

>
>> Will have to retry by doing some explicit pinning of the vcpus to
>> native cores (without using virsh).
>>
>>>>> Turned on function tracing and found that there appears to be more time being
>>>>> spent around the lock code in the 3.3.1 guest when compared to the 2.6.32-220
>>>>> guest.
>>>> Looks like you may be running into the ticket spinlock
>>>> code. During the early RHEL 6 days, Gleb came up with a
>>>> patch to automatically disable ticket spinlocks when
>>>> running inside a KVM guest.
>>>>
>>>> IIRC that patch got rejected upstream at the time,
>>>> with upstream developers preferring to wait for a
>>>> "better solution".
>>>>
>>>> If such a better solution is not on its way upstream
>>>> now (two years later), maybe we should just merge
>>>> Gleb's patch upstream for the time being?
>>> I think the pv spinlock that is actively discussed currently should
>>> address the issue, but I am not sure someone tests it against non-ticket
>>> lock in a guest to see which one performs better.
>> I did see that discussion...seems to have originated from the Xen context.
>>
> Yes, The problem is the same for both hypervisors.
>
> --
> 			Gleb.

Thanks
Vinod



* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs. 3.3.1 OS
  2012-04-17 13:25         ` Chegu Vinod
@ 2012-04-19  4:44           ` Chegu Vinod
  2012-04-19  6:01             ` Gleb Natapov
  0 siblings, 1 reply; 9+ messages in thread
From: Chegu Vinod @ 2012-04-19  4:44 UTC (permalink / raw)
  To: chegu_vinod; +Cc: Gleb Natapov, Rik van Riel, kvm

On 4/17/2012 6:25 AM, Chegu Vinod wrote:
> On 4/17/2012 2:49 AM, Gleb Natapov wrote:
>> On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
>>> On 4/16/2012 5:18 AM, Gleb Natapov wrote:
>>>> On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
>>>>> On 04/11/2012 01:21 PM, Chegu Vinod wrote:
>>>>>> Hello,
>>>>>>
>>>>>> While running an AIM7 (workfile.high_systime) in a single 40-way 
>>>>>> (or a single
>>>>>> 60-way KVM guest) I noticed pretty bad performance when the guest 
>>>>>> was booted
>>>>>> with 3.3.1 kernel when compared to the same guest booted with 
>>>>>> 2.6.32-220
>>>>>> (RHEL6.2) kernel.
>>>>>> For the 40-way Guest-RunA (2.6.32-220 kernel) performed nearly 9x 
>>>>>> better than
>>>>>> the Guest-RunB (3.3.1 kernel). In the case of 60-way guest run 
>>>>>> the older guest
>>>>>> kernel was nearly 12x better !
>>>> How many CPUs your host has?
>>> 80 Cores on the DL980.  (i.e. 8 Westmere sockets).
>>>
>> So you are not oversubscribing CPUs at all. Are those real cores or 
>> including HT?
>
> HT is off.
>
>> Do you have other cpus hogs running on the host while testing the guest?
>
> Nope.  Sometimes I do run the utilities like "perf" or "sar" or 
> "mpstat" on the numa node 0 (where
> the guest is not running).
>
>>
>>> I was using numactl to bind the qemu of the 40-way guests to numa
>>> nodes : 4-7  ( or for a 60-way guest
>>> binding them to nodes 2-7)
>>>
>>> /etc/qemu-ifup tap0
>>>
>>> numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
>>> /usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu 
>>> Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
>>> -enable-kvm \
>>> -m 65536 -smp 40 \
>>> -name vm1 -chardev 
>>> socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
>>> \
>>> -drive 
>>> file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
>>> -device virtio-blk-pci,scsi=off,bus=pci
>>> .0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>>> -monitor stdio \
>>> -net nic,macaddr=<..mac_addr..>  \
>>> -net tap,ifname=tap0,script=no,downscript=no \
>>> -vnc :4
>>>
>>> /etc/qemu-ifdown tap0
>>>
>>>
>>> I knew that there will be a few additional temporary qemu worker
>>> threads created...  i.e. some over
>>> subscription  will be there.
>>>
>> 4 nodes above have 40 real cores, yes?
>
> Yes .
> Other than the qemu's related threads and some of the generic per-cpu 
> Linux kernel threads (e.g. migration  etc)
> there isn't anything else running on these Numa nodes.
>
>> Can you try to run upstream
>> kernel without binding at all and check the performance?
>

I re-ran the same workload *without* binding the qemu, still using the 
3.3.1 kernel:

20-way guest: performance got much worse when compared to the case where we 
bind the qemu.
40-way guest: about the same as in the case where we bind the qemu.
60-way guest: about the same as in the case where we bind the qemu.

Trying out a couple of other experiments...

FYI
Vinod





* Re: Performance of  40-way guest running  2.6.32-220 (RHEL6.2)  vs. 3.3.1 OS
  2012-04-19  4:44           ` Chegu Vinod
@ 2012-04-19  6:01             ` Gleb Natapov
  0 siblings, 0 replies; 9+ messages in thread
From: Gleb Natapov @ 2012-04-19  6:01 UTC (permalink / raw)
  To: Chegu Vinod; +Cc: Rik van Riel, kvm

On Wed, Apr 18, 2012 at 09:44:47PM -0700, Chegu Vinod wrote:
> On 4/17/2012 6:25 AM, Chegu Vinod wrote:
> >On 4/17/2012 2:49 AM, Gleb Natapov wrote:
> >>On Mon, Apr 16, 2012 at 07:44:39AM -0700, Chegu Vinod wrote:
> >>>On 4/16/2012 5:18 AM, Gleb Natapov wrote:
> >>>>On Thu, Apr 12, 2012 at 02:21:06PM -0400, Rik van Riel wrote:
> >>>>>On 04/11/2012 01:21 PM, Chegu Vinod wrote:
> >>>>>>Hello,
> >>>>>>
> >>>>>>While running an AIM7 (workfile.high_systime) in a
> >>>>>>single 40-way (or a single
> >>>>>>60-way KVM guest) I noticed pretty bad performance when
> >>>>>>the guest was booted
> >>>>>>with 3.3.1 kernel when compared to the same guest booted
> >>>>>>with 2.6.32-220
> >>>>>>(RHEL6.2) kernel.
> >>>>>>For the 40-way Guest-RunA (2.6.32-220 kernel) performed
> >>>>>>nearly 9x better than
> >>>>>>the Guest-RunB (3.3.1 kernel). In the case of 60-way
> >>>>>>guest run the older guest
> >>>>>>kernel was nearly 12x better !
> >>>>How many CPUs your host has?
> >>>80 Cores on the DL980.  (i.e. 8 Westmere sockets).
> >>>
> >>So you are not oversubscribing CPUs at all. Are those real cores
> >>or including HT?
> >
> >HT is off.
> >
> >>Do you have other cpus hogs running on the host while testing the guest?
> >
> >Nope.  Sometimes I do run the utilities like "perf" or "sar" or
> >"mpstat" on the numa node 0 (where
> >the guest is not running).
> >
> >>
> >>>I was using numactl to bind the qemu of the 40-way guests to numa
> >>>nodes : 4-7  ( or for a 60-way guest
> >>>binding them to nodes 2-7)
> >>>
> >>>/etc/qemu-ifup tap0
> >>>
> >>>numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7
> >>>/usr/local/bin/qemu-system-x86_64 -enable-kvm -cpu Westmere,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
> >>>-enable-kvm \
> >>>-m 65536 -smp 40 \
> >>>-name vm1 -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vm1.monitor,server,nowait
> >>>\
> >>>-drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none
> >>>-device virtio-blk-pci,scsi=off,bus=pci
> >>>.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> >>>-monitor stdio \
> >>>-net nic,macaddr=<..mac_addr..>  \
> >>>-net tap,ifname=tap0,script=no,downscript=no \
> >>>-vnc :4
> >>>
> >>>/etc/qemu-ifdown tap0
> >>>
> >>>
> >>>I knew that there will be a few additional temporary qemu worker
> >>>threads created...  i.e. some over
> >>>subscription  will be there.
> >>>
> >>4 nodes above have 40 real cores, yes?
> >
> >Yes .
> >Other than the qemu's related threads and some of the generic
> >per-cpu Linux kernel threads (e.g. migration  etc)
> >there isn't anything else running on these Numa nodes.
> >
> >>Can you try to run upstream
> >>kernel without binding at all and check the performance?
> >
> 
> Re-ran the same workload *without* binding the qemu...but using the
> 3.3.1 kernel
> 
> 20-way guest: Performance got much worse when compared to the case
> where bind the qemu.
> 40-way guest: about the same as in the case  where we bind the qemu
> 60-way guest: about the same as in the case  where we bind the qemu
> 
> Trying out a couple of other experiments...
> 
With 8 sockets the NUMA effects are probably very strong. A couple of things to
try:
1. Run a VM that fits into one NUMA node and bind it to that node. Compare the
   performance of the RHEL kernel and upstream.
2. Run a VM bigger than a NUMA node, bind the vcpus to NUMA nodes separately,
   and pass the resulting topology to the guest using the -numa flag (a sketch
   follows below).
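
For item 2, a minimal sketch of what that could look like for the 40-way guest
(memory sizes, CPU ranges, and the disk path are illustrative, carried over from
the earlier command; the per-node vcpu pinning itself would still be done by
hand, e.g. with taskset on the vcpu thread ids from "info cpus"):

# Expose four guest NUMA nodes of 10 vcpus / 16G each, mirroring host nodes 4-7
# (QEMU 1.0 syntax: -numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]).
numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 \
qemu-system-x86_64 -enable-kvm -m 65536 -smp 40 \
  -numa node,mem=16384,cpus=0-9,nodeid=0 \
  -numa node,mem=16384,cpus=10-19,nodeid=1 \
  -numa node,mem=16384,cpus=20-29,nodeid=2 \
  -numa node,mem=16384,cpus=30-39,nodeid=3 \
  -drive file=/var/lib/libvirt/images/vmVinod1/vm1.img,if=virtio,cache=none \
  -monitor stdio -vnc :4
# After boot, pin the vcpu threads for guest node 0 to host node 4's cores,
# guest node 1 to host node 5's cores, and so on (e.g. taskset -pc <core> <tid>).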

--
			Gleb.

