From: Andrew Theurer <habanero@linux.vnet.ibm.com>
To: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Ingo Molnar <mingo@redhat.com>, Avi Kivity <avi@redhat.com>,
	Rik van Riel <riel@redhat.com>, S390 <linux-s390@vger.kernel.org>,
	Carsten Otte <cotte@de.ibm.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	KVM <kvm@vger.kernel.org>, chegu vinod <chegu_vinod@hp.com>,
	LKML <linux-kernel@vger.kernel.org>, X86 <x86@kernel.org>,
	Gleb Natapov <gleb@redhat.com>,
	linux390@de.ibm.com,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>,
	Joerg Roedel <joerg.roedel@amd.com>
Subject: Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
Date: Mon, 09 Jul 2012 16:47:37 -0500	[thread overview]
Message-ID: <1341870457.2909.27.camel@oc2024037011.ibm.com> (raw)
In-Reply-To: <20120709062012.24030.37154.sendpatchset@codeblue>

On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> Currently the Pause Loop Exit (PLE) handler does a directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.

Hi, Raghu.

> The problem is that, for guests with many vcpus, there is a higher
> probability of yielding to a bad vcpu. We are not able to prevent a
> directed yield to the same vcpu that did a PL exit recently and will
> perhaps spin again and waste CPU.
> 
> Fix that by keeping track of which vcpus have done a PL exit. The
> algorithm in this series gives a chance to a VCPU which has:
> 
>  (a) not done a PLE exit at all (it is probably a preempted lock holder), or
> 
>  (b) been skipped in the last iteration because it did a PL exit, and has
>  probably become eligible now (the next eligible lock holder).
> 
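
For anyone who has not looked at the patches yet, here is a minimal sketch of
what per-vcpu tracking along these lines could look like.  The struct and
helper names below are only illustrative, not the identifiers used in the
actual series:

  #include <stdbool.h>	/* bool; <linux/types.h> in kernel code */

  /* illustrative per-vcpu state for directed-yield filtering */
  struct ple_state {
  	bool pause_loop_exited;	/* set when this vcpu takes a PL exit        */
  	bool dy_eligible;	/* toggled when the vcpu is skipped / chosen */
  };

  /* called from the pause-loop-exit intercept for the exiting vcpu */
  static inline void note_pause_loop_exit(struct ple_state *s)
  {
  	s->pause_loop_exited = true;
  }

  /*
   * Candidate filter while iterating vcpus in the directed-yield loop:
   * a vcpu is a good target if it (a) never took a PL exit, or (b) was
   * skipped last time and has therefore become eligible this time.
   */
  static inline bool eligible_for_directed_yield(struct ple_state *s)
  {
  	bool eligible = !s->pause_loop_exited || s->dy_eligible;

  	/* a vcpu skipped in this round becomes eligible in the next one */
  	if (s->pause_loop_exited)
  		s->dy_eligible = !s->dy_eligible;

  	return eligible;
  }
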
> Future enhancements:
>   (1) Currently we have a boolean to decide the eligibility of a vcpu. It
>     would be nice to get feedback on larger guests (>32 vcpus) as to whether
>     we can do better with an integer counter (with counter = say f(log n)).
> 
>   (2) We have not considered system load while iterating over the vcpus. With
>    that information we could limit the scan and also decide whether schedule()
>    is better. [ I am able to use the number of kicked vcpus to decide on this,
>    but maybe there are better ideas, such as using the global loadavg. ]
> 
>   (3) We can exploit this further with the PV patches, since they also know
>    about the next eligible lock holder.
> 
> Summary: There is a huge improvement for the moderate / no overcommit
>  scenarios for a kvm-based guest on a PLE machine (which is difficult ;) ).
> 
> Result:
> Base : kernel 3.5.0-rc5 with Rik's PLE handler fix
> 
> Machine : Intel(R) Xeon(R) CPU X7560  @ 2.27GHz, 4 numa node, 256GB RAM,
>           32 core machine

Is this with HT enabled, therefore 64 CPU threads?

> Host: enterprise linux  gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
>   with test kernels 
> 
> Guest: fedora 16 with 32 vcpus and 8GB memory.

Can you briefly explain the 1x and 2x configs?  This of course is highly
dependent on whether or not HT is enabled...

FWIW, I started testing what I would call "0.5x", where I have one 40
vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
enabled, no extra load on the system).  For ebizzy, the results are
quite erratic from run to run, so I am inclined to discard it as a
workload, but maybe I should try "1x" and "2x" cpu over-commit as well.

From initial observations, at least for the ebizzy workload, the
percentage of exits that result in a yield_to() is very low, around 1%,
before these patches.  So, I am concerned that, at least for this test,
reducing that number even more has diminishing returns.  I am, however,
still concerned about the scalability problem with yield_to(), which
shows up like this for me (perf):

> 63.56%     282095         qemu-kvm  [kernel.kallsyms]        [k] _raw_spin_lock                  
> 5.42%      24420         qemu-kvm  [kvm]                    [k] kvm_vcpu_yield_to               
> 5.33%      26481         qemu-kvm  [kernel.kallsyms]        [k] get_pid_task                    
> 4.35%      20049         qemu-kvm  [kernel.kallsyms]        [k] yield_to                        
> 2.74%      15652         qemu-kvm  [kvm]                    [k] kvm_apic_present                
> 1.70%       8657         qemu-kvm  [kvm]                    [k] kvm_vcpu_on_spin                
> 1.45%       7889         qemu-kvm  [kvm]                    [k] vcpu_enter_guest                
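
A host-wide profile like the one above can be gathered with plain perf,
along these lines:

  # sample the whole host for 30s while the guest workload runs
  perf record -a -g -- sleep 30
  # per-command / per-DSO symbols, with raw sample counts
  perf report -n --sort comm,dso,symbol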
         
For the cpu threads in the host that are actually active (in this case
half of them), ~50% of their time is in the kernel and ~43% is in the guest.
This is for a no-IO workload, so it is just incredible to see so much cpu
wasted.  I feel the two important areas to tackle are a more scalable
yield_to() and reducing the number of pause exits themselves (hopefully,
for the latter, just by tuning ple_window).
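
For anyone wanting to try that: ple_window (and ple_gap) are kvm_intel
module parameters, and on these kernels they are read-only at runtime, so
they have to be set when the module is loaded.  16384 below is just an
example value; the default at this point is 4096:

  # all guests must be shut down before the module can be removed
  rmmod kvm_intel
  modprobe kvm_intel ple_window=16384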

Honestly, I am not confident that addressing this problem will improve the
ebizzy score.  That workload is so erratic for me that I do not trust
the results at all.  I have, however, seen consistent improvements from
disabling PLE for an http guest workload and a very high IOPS guest
workload, both of which spend much host time in the double runqueue lock
taken by yield_to(), so that is why I still gravitate toward that issue.
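
To make that contention point concrete, the locking pattern is roughly the
following (heavily simplified from kernel/sched/core.c; the retry and
bail-out paths are omitted).  Every directed yield takes both runqueue
spinlocks, so when many vcpus PLE-exit at once they all serialize there,
which is where the _raw_spin_lock time in the profile above comes from:

  /* heavily simplified sketch of yield_to() */
  int yield_to(struct task_struct *p, bool preempt)
  {
  	struct task_struct *curr = current;
  	struct rq *rq = this_rq();
  	struct rq *p_rq = task_rq(p);
  	unsigned long flags;
  	bool yielded = 0;

  	local_irq_save(flags);
  	double_rq_lock(rq, p_rq);	/* both runqueue spinlocks held */

  	/* ... checks that p is runnable, has a yield_to_task hook, etc. ... */
  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);

  	double_rq_unlock(rq, p_rq);
  	local_irq_restore(flags);

  	if (yielded)
  		schedule();

  	return yielded;
  }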


-Andrew Theurer

