From: Chegu Vinod <chegu_vinod@hp.com>
To: kvm@vger.kernel.org
Subject: Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS
Date: Wed, 11 Apr 2012 17:21:59 +0000 (UTC) [thread overview]
Message-ID: <loom.20120411T183827-108@post.gmane.org> (raw)
Hello,
While running AIM7 (workfile.high_systime) in a single 40-way (or 60-way) KVM
guest, I noticed pretty bad performance when the guest was booted with a 3.3.1
kernel compared to the same guest booted with the 2.6.32-220 (RHEL6.2) kernel.
I am still trying to dig more into the details here. I am wondering if some
changes in the upstream kernel (i.e. since 2.6.32-220) might be causing this to
show up in a guest environment (especially for this system-intensive workload).
Has anyone else observed this kind of behavior? Is it a known issue with a fix
in the pipeline? If not, are there any special knobs/tunables that need to be
explicitly set/cleared when using newer kernels like 3.3.1 in a guest?
I have included some info below. Any pointers on what else I could capture
would also be helpful.
Thanks!
Vinod
---
Platform used:
DL980 G7 (80 cores + 128G RAM). Hyper-threading is turned off.
Workload used:
AIM7 (workfile.high_systime) using RAM disks. This is primarily a
CPU-intensive workload; not much I/O.
Software used :
qemu-system-x86_64 : 1.0.50 (i.e. latest as of about a week or so ago).
Native/Host OS : 3.3.1 (SLUB allocator explicitly enabled)
Guest-RunA OS : 2.6.32-220 (i.e. RHEL6.2 kernel)
Guest-RunB OS : 3.3.1
Guest was pinned on :
numa node: 4,5,6,7 -> 40VCPUs + 64G (i.e. 40-way guest)
numa node: 2,3,4,5,7 -> 60VCPUs + 96G (i.e. 60-way guest)
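The exact pinning commands weren't included in the report; for reference, one common way to pin a QEMU guest's CPUs and memory to a set of NUMA nodes is numactl. This is only a sketch of the 40-way case (the node list matches the one above, but the remaining qemu options are placeholders, not the actual command used):

```shell
# Bind the guest's vCPU threads and memory allocations to nodes 4-7
# (40 vCPUs + 64G, as in the 40-way configuration above).
numactl --cpunodebind=4,5,6,7 --membind=4,5,6,7 \
    qemu-system-x86_64 -smp 40 -m 65536 ... # remaining options omitted
```

Per-vCPU pinning (e.g. via taskset on the individual vCPU threads, or libvirt's vcpupin) gives finer control, but node-level binding like this is enough to keep the guest on the intended sockets.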
For the 40-way guest, Guest-RunA (2.6.32-220 kernel) performed nearly 9x better
than Guest-RunB (3.3.1 kernel). In the case of the 60-way guest, the older
guest kernel was nearly 12x better!
For the Guest-RunB (3.3.1) case I ran "mpstat -P ALL 1" on the host and observed
that a very high % of time was being spent by the CPUs outside guest mode,
mostly in the host kernel (i.e. sys). Looking at the "perf" traces, it seemed
like there were long pauses in the guest, perhaps waiting for the
zone->lru_lock in release_pages(), and this caused the VT PLE (Pause Loop
Exit) related code to kick in on the host.
I turned on function tracing and found that more time appears to be spent
around the lock code in the 3.3.1 guest than in the 2.6.32-220 guest. Here is a
small sampling of these traces. Notice the timestamp jump around
"_raw_spin_lock_irqsave <-release_pages" in the case of Guest-RunB.
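For anyone wanting to reproduce this, the traces below came from the ftrace function tracer. The exact invocation wasn't included in the report; a minimal sketch using the standard debugfs tracing interface (run as root inside the guest, assuming debugfs is mounted at /sys/kernel/debug) would look like:

```shell
cd /sys/kernel/debug/tracing

# Stop any tracing in progress and select the function tracer.
echo 0 > tracing_on
echo function > current_tracer

# Optionally restrict tracing to the functions of interest to cut volume.
echo release_pages > set_ftrace_filter
echo free_pages_and_swap_cache >> set_ftrace_filter

# Capture a few seconds of the workload, then snapshot the ring buffer.
echo 1 > tracing_on
sleep 5
echo 0 > tracing_on
cat trace > /tmp/ftrace.out
```

Without the filter, the full function trace (as sampled below) shows every traced call with its timestamp.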
1) 40-way Guest-RunA (2.6.32-220 kernel):
-----------------------------------------
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
<...>-32147 [020] 145783.127452: native_flush_tlb <-flush_tlb_mm
<...>-32147 [020] 145783.127452: free_pages_and_swap_cache <-unmap_region
<...>-32147 [020] 145783.127452: lru_add_drain <-free_pages_and_swap_cache
<...>-32147 [020] 145783.127452: release_pages <-free_pages_and_swap_cache
<...>-32147 [020] 145783.127452: _spin_lock_irqsave <-release_pages
<...>-32147 [020] 145783.127452: __mod_zone_page_state <-release_pages
<...>-32147 [020] 145783.127452: mem_cgroup_del_lru_list <-release_pages
...
<...>-32147 [022] 145783.133536: release_pages <-free_pages_and_swap_cache
<...>-32147 [022] 145783.133536: _spin_lock_irqsave <-release_pages
<...>-32147 [022] 145783.133536: __mod_zone_page_state <-release_pages
<...>-32147 [022] 145783.133536: mem_cgroup_del_lru_list <-release_pages
<...>-32147 [022] 145783.133537: lookup_page_cgroup <-mem_cgroup_del_lru_list
2) 40-way Guest-RunB (3.3.1):
-----------------------------
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
<...>-16459 [009] .... 101757.383125: free_pages_and_swap_cache <-tlb_flush_mmu
<...>-16459 [009] .... 101757.383125: lru_add_drain <-free_pages_and_swap_cache
<...>-16459 [009] .... 101757.383125: release_pages <-free_pages_and_swap_cache
<...>-16459 [009] .... 101757.383125: _raw_spin_lock_irqsave <-release_pages
<...>-16459 [009] d... 101757.384861: mem_cgroup_lru_del_list <-release_pages
<...>-16459 [009] d... 101757.384861: lookup_page_cgroup <-mem_cgroup_lru_del_list
....
<...>-16459 [009] .N.. 101757.390385: release_pages <-free_pages_and_swap_cache
<...>-16459 [009] .N.. 101757.390385: _raw_spin_lock_irqsave <-release_pages
<...>-16459 [009] dN.. 101757.392983: mem_cgroup_lru_del_list <-release_pages
<...>-16459 [009] dN.. 101757.392983: lookup_page_cgroup <-mem_cgroup_lru_del_list
<...>-16459 [009] dN.. 101757.392983: __mod_zone_page_state <-release_pages
Thread overview: 9+ messages
2012-04-11 17:21 Chegu Vinod [this message]
2012-04-12 18:21 ` Performance of 40-way guest running 2.6.32-220 (RHEL6.2) vs. 3.3.1 OS Rik van Riel
2012-04-16 3:04 ` Chegu Vinod
2012-04-16 12:18 ` Gleb Natapov
2012-04-16 14:44 ` Chegu Vinod
2012-04-17 9:49 ` Gleb Natapov
2012-04-17 13:25 ` Chegu Vinod
2012-04-19 4:44 ` Chegu Vinod
2012-04-19 6:01 ` Gleb Natapov