From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
To: "Andrew M. Theurer" <habanero@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Ingo Molnar <mingo@redhat.com>, Avi Kivity <avi@redhat.com>,
	Rik van Riel <riel@redhat.com>, S390 <linux-s390@vger.kernel.org>,
	Carsten Otte <cotte@de.ibm.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	KVM <kvm@vger.kernel.org>, chegu vinod <chegu_vinod@hp.com>,
	LKML <linux-kernel@vger.kernel.org>, X86 <x86@kernel.org>,
	Gleb Natapov <gleb@redhat.com>,
	linux390@de.ibm.com,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>,
	Joerg Roedel <joerg.roedel@amd.com>
Subject: Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
Date: Tue, 10 Jul 2012 14:56:12 +0530
Message-ID: <4FFBF534.5040107@linux.vnet.ibm.com>
In-Reply-To: <1341870457.2909.27.camel@oc2024037011.ibm.com>

On 07/10/2012 03:17 AM, Andrew Theurer wrote:
 > On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
 >> Currently the Pause Loop Exit (PLE) handler does a directed yield to a
 >> random VCPU on a PL exit. Though we already have filtering while choosing
 >> the candidate to yield_to, we can do better.
 >
 > Hi, Raghu.
Hi Andrew,
Thank you for your analysis and inputs.

 >
 >> The problem is that for large-vcpu guests, we have a higher probability of
 >> yielding to a bad vcpu. We are not able to prevent a directed yield to the
 >> same vcpu which has done a PL exit recently and which perhaps spins again
 >> and wastes CPU.
 >>
 >> Fix that by keeping track of who has done a PL exit. The algorithm in this
 >> series gives a chance to a VCPU which has:
 >>
 >>   (a) not done a PLE exit at all (probably it is a preempted lock holder)
 >>
 >>   (b) been skipped in the last iteration because it did a PL exit, and has
 >>       probably become eligible now (the next eligible lock holder)
 >>
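
(To make the heuristic above concrete: conceptually the check looks roughly
like the sketch below. The names are made up for this mail and are not
necessarily the exact fields or functions used in the patch.)

#include <stdbool.h>

/*
 * Illustrative only -- not the patch itself.
 * ple_exited:  this vcpu recently did a pause-loop exit.
 * dy_eligible: the vcpu was already skipped once as a yield_to target,
 *              so it gets a chance in the next round.
 */
struct vcpu_ple_state {
	bool ple_exited;
	bool dy_eligible;
};

static bool vcpu_eligible_for_directed_yield(struct vcpu_ple_state *v)
{
	bool eligible;

	/* (a) never did a PLE exit, or (b) was skipped once already */
	eligible = !v->ple_exited || v->dy_eligible;

	/* toggle the flag so a skipped vcpu becomes eligible next time */
	if (v->ple_exited)
		v->dy_eligible = !v->dy_eligible;

	return eligible;
}
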
 >> Future enhancements:
 >>    (1) Currently we have a boolean to decide on the eligibility of a vcpu.
 >>      It would be nice to get feedback on large guests (>32 vcpus) on
 >>      whether we can do better with an integer counter (with counter = say
 >>      f(log n)).
 >>
 >>    (2) We have not considered system load during the iteration over vcpus.
 >>      With that information we can limit the scan and also decide whether
 >>      schedule() is better. [I am able to use the number of kicked vcpus to
 >>      decide on this, but maybe there are better ideas, like using
 >>      information from the global loadavg.]
 >>
 >>    (3) We can exploit this further with the PV patches, since the guest
 >>      also knows about the next eligible lock holder.
 >>
 >> Summary: There is a huge improvement for the moderate / no overcommit
 >>   scenario for KVM based guests on a PLE machine (which is difficult ;) ).
 >>
 >> Result:
 >> Base : kernel 3.5.0-rc5 with Rik's PLE handler fix
 >>
 >> Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 NUMA nodes, 256GB RAM,
 >>            32-core machine
 >
 > Is this with HT enabled, therefore 64 CPU threads?

No. HT was disabled, with 32 online CPUs.

 >
 >> Host: enterprise Linux, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC),
 >>    with test kernels
 >>
 >> Guest: Fedora 16 with 32 vcpus and 8GB of memory.
 >
 > Can you briefly explain the 1x and 2x configs?  This of course is highly
 > dependent on whether or not HT is enabled...

1x config: kernbench/ebizzy/sysbench running on 1 guest (32 vcpus);
  all the benchmarks have 2*#vcpu = 64 threads.

2x config: kernbench/ebizzy/sysbench running on 2 guests (each with 32 vcpus);
  all the benchmarks have 2*#vcpu = 64 threads.

 >
 > FWIW, I started testing what I would call "0.5x", where I have one 40
 > vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
 > enabled, no extra load on the system).  For ebizzy, the results are
 > quite erratic from run to run, so I am inclined to discard it as a

I will be posting the full run details (individual runs) in a reply to this
mail, since they are big. I have also posted the stdev with the results; it
has not shown too much deviation.

 > workload, but maybe I should try "1x" and "2x" cpu over-commit as well.
 >
 > From initial observations, at least for the ebizzy workload, the
 > percentage of exits that result in a yield_to() is very low, around 1%,
 > before these patches.

Hmm, OK.
IMO, for an under-committed workload a low percentage of yield_to was
expected, though I am not sure whether 1% is too low.
More importantly, the number of successful yield_to calls can never by
itself measure the benefit.

What I am trying to address with this patch is to ensure that a successful
yield_to results in a benefit.

 > So, I am concerned that at least for this test,
 > reducing that number even more has diminishing returns.  I am however
 > still concerned about the scalability problem with yield_to(),

So, do you mean that you expect to see more yield_to overhead with
large guests?
As already mentioned under future enhancements, the things I will be trying
in the future are:

a. have a counter instead of a boolean for skipping yield_to
b. scan only about f(log(n)) vcpus for a yield target, and then schedule() /
   return depending on the system load.

With that we would reduce the overall vcpu iteration in the PLE handler from
O(n * n) to O(n log n) (a rough sketch follows below).
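
(A very rough, illustrative sketch of what (a) + (b) could look like; the
names and the exact policy are made up for this mail, not a real patch:)

#include <stdbool.h>

/* Illustrative only: counter-based eligibility with a bounded scan. */
struct vcpu_state {
	unsigned int ple_count;	/* recent pause-loop exits by this vcpu */
	bool kicked;		/* already chosen as a yield_to target  */
};

/* Scan at most ~log2(n) candidates instead of all n vcpus. */
static int pick_yield_target(struct vcpu_state *vcpus, int n, int last)
{
	int budget = 1, i, idx;

	while ((1 << budget) < n)	/* budget ~= log2(n) */
		budget++;

	for (i = 1; i <= n && budget > 0; i++, budget--) {
		idx = (last + i) % n;
		if (vcpus[idx].kicked)
			continue;
		if (vcpus[idx].ple_count == 0) {
			vcpus[idx].kicked = true;
			return idx;	/* likely a preempted lock holder */
		}
		vcpus[idx].ple_count--;	/* decays toward eligibility */
	}
	return -1;	/* nothing found within the budget; schedule() instead */
}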

 > which shows like this for me (perf):
 >
 >> 63.56%     282095         qemu-kvm  [kernel.kallsyms]        [k] _raw_spin_lock
 >> 5.42%      24420         qemu-kvm  [kvm]                    [k] kvm_vcpu_yield_to
 >> 5.33%      26481         qemu-kvm  [kernel.kallsyms]        [k] get_pid_task
 >> 4.35%      20049         qemu-kvm  [kernel.kallsyms]        [k] yield_to
 >> 2.74%      15652         qemu-kvm  [kvm]                    [k] kvm_apic_present
 >> 1.70%       8657         qemu-kvm  [kvm]                    [k] kvm_vcpu_on_spin
 >> 1.45%       7889         qemu-kvm  [kvm]                    [k] vcpu_enter_guest
 >
 > For the cpu threads in the host that are actually active (in this case
 > 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This
 > is for a no-IO workload, so that's just incredible to see so much cpu
 > wasted.  I feel that 2 important areas to tackle are a more scalable
 > yield_to() and reducing the number of pause exits itself (hopefully by
 > just tuning ple_window for the latter).

I think this is a concern, and as you stated, I agree that tuning
ple_window helps here.
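
(For context, my mental model of the hardware PLE trigger is roughly the
sketch below. It is purely conceptual, not KVM or hardware code, but it
shows why raising ple_window trades longer in-guest spinning for fewer
pause exits.)

/*
 * Conceptual model of pause-loop exiting (illustrative only).
 * ple_gap:    max TSC cycles between two PAUSEs for them to count as
 *             part of the same spin loop.
 * ple_window: max TSC cycles one spin loop may accumulate before the
 *             CPU forces a VM exit.
 */
struct ple_state {
	unsigned long long first_pause_tsc;
	unsigned long long last_pause_tsc;
};

static int pause_causes_exit(struct ple_state *s, unsigned long long now,
			     unsigned long long ple_gap,
			     unsigned long long ple_window)
{
	if (now - s->last_pause_tsc > ple_gap) {
		/* gap too large: treat this PAUSE as the start of a new loop */
		s->first_pause_tsc = now;
		s->last_pause_tsc = now;
		return 0;
	}
	s->last_pause_tsc = now;
	/* same loop: exit once it has been spinning for > ple_window */
	return (now - s->first_pause_tsc) > ple_window;
}
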

 >
 > Honestly, I am not confident addressing this problem will improve the
 > ebizzy score. That workload is so erratic for me that I do not trust
 > the results at all.  I have however seen consistent improvements from
 > disabling PLE for an http guest workload and a very high IOPS guest
 > workload, both with much time spent in host in the double runqueue lock
 > for yield_to(), so that's why I still gravitate toward that issue.

The problem starts (with PLE disabled) when the workload goes just beyond 1x;
we start burning so much CPU.

IIRC, in 2x overcommit, a kernel compilation that takes 10 hours on non-PLE
used to take just 1 hour after the PV patches (and it should be the same with
PLE enabled).

If we leave aside the PLE-disabled case, I do not expect any degradation even
in the 0.5x scenario, though you say the results are erratic.

Could you please let me know whether, with PLE enabled, you saw any
degradation for 0.5x before and after the patch?

 > -Andrew Theurer
 >
 >

