From: Andrew Theurer <habanero@linux.vnet.ibm.com>
To: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>,
	Marcelo Tosatti <mtosatti@redhat.com>, Ingo Molnar <mingo@redhat.com>,
	Avi Kivity <avi@redhat.com>, Rik van Riel <riel@redhat.com>,
	S390 <linux-s390@vger.kernel.org>, Carsten Otte <cotte@de.ibm.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>, KVM <kvm@vger.kernel.org>,
	chegu vinod <chegu_vinod@hp.com>, LKML <linux-kernel@vger.kernel.org>,
	X86 <x86@kernel.org>, Gleb Natapov <gleb@redhat.com>, linux390@de.ibm.com,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>,
	Joerg Roedel <joerg.roedel@amd.com>
Subject: Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
Date: Mon, 09 Jul 2012 16:47:37 -0500
Message-ID: <1341870457.2909.27.camel@oc2024037011.ibm.com>
In-Reply-To: <20120709062012.24030.37154.sendpatchset@codeblue>

On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> Currently the Pause Loop Exit (PLE) handler does a directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.

Hi, Raghu.

> The problem is that, for large-vcpu guests, we have a higher probability of
> yielding to a bad vcpu. We are not able to prevent a directed yield to the
> same vcpu that did a PL exit recently and will probably just spin again and
> waste CPU.
>
> Fix that by keeping track of who has done a PL exit. The algorithm in this
> series gives a chance to a VCPU which has:
>
> (a) not done a PLE exit at all (it is probably a preempted lock holder)
>
> (b) been skipped in the last iteration because it did a PL exit, and has
>     probably become eligible now (the next eligible lock holder)
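Just to make sure I am reading the eligibility idea in (a)/(b) correctly,
here is a rough sketch of how I understand it. The struct, field, and
function names below are invented for illustration; they are not the ones
used in the patch series:

#include <stdbool.h>

/* Illustrative sketch only -- names are made up, not taken from the series. */
struct vcpu_yield_state {
	bool did_pl_exit;     /* has this vcpu pause-loop exited recently?     */
	bool eligible_again;  /* was it skipped last round and now due a turn? */
};

/*
 * A candidate is attractive if it (a) never PL-exited (likely a preempted
 * lock holder), or (b) PL-exited but was skipped in the last iteration and
 * is therefore considered eligible this time around.
 */
static bool good_yield_candidate(struct vcpu_yield_state *s)
{
	bool ok = !s->did_pl_exit || s->eligible_again;

	/* Toggle, so a spinner skipped this round gets a chance next round. */
	if (s->did_pl_exit)
		s->eligible_again = !s->eligible_again;

	return ok;
}

Is that roughly the intent?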
>
> Future enhancements:
> (1) Currently we have a boolean to decide on the eligibility of a vcpu. It
>     would be nice to get feedback on guests (>32 vcpus) on whether we can do
>     better with an integer counter (with counter = say f(log n)).
>
> (2) We have not considered system load during the iteration over vcpus. With
>     that information we can limit the scan and also decide whether schedule()
>     is better. [I am able to use the number of kicked vcpus to decide on
>     this, but maybe there are better ideas, such as information from the
>     global loadavg.]
>
> (3) We can exploit this further with the PV patches, since they also know
>     about the next eligible lock holder.
>
> Summary: There is a huge improvement for the moderate / no-overcommit
> scenario for a kvm-based guest on a PLE machine (which is difficult ;) ).
>
> Result:
> Base: kernel 3.5.0-rc5 with Rik's PLE handler fix
>
> Machine: Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 NUMA nodes, 256GB RAM,
> 32-core machine

Is this with HT enabled, therefore 64 CPU threads?

> Host: enterprise linux, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC),
> with test kernels
>
> Guest: Fedora 16 with 32 vcpus, 8GB memory.

Can you briefly explain the 1x and 2x configs? This is of course highly
dependent on whether or not HT is enabled...

FWIW, I started testing what I would call "0.5x", where I have one 40-vcpu
guest running on a host with 40 cores and 80 CPU threads total (HT enabled,
no extra load on the system). For ebizzy, the results are quite erratic from
run to run, so I am inclined to discard it as a workload, but maybe I should
try "1x" and "2x" cpu over-commit as well.

From initial observations, at least for the ebizzy workload, the percentage
of exits that result in a yield_to() is very low, around 1%, before these
patches. So I am concerned that, at least for this test, reducing that number
even further has diminishing returns. I am, however, still concerned about
the scalability problem with yield_to(), which shows up like this for me
(perf):

> 63.56%  282095  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>  5.42%   24420  qemu-kvm  [kvm]              [k] kvm_vcpu_yield_to
>  5.33%   26481  qemu-kvm  [kernel.kallsyms]  [k] get_pid_task
>  4.35%   20049  qemu-kvm  [kernel.kallsyms]  [k] yield_to
>  2.74%   15652  qemu-kvm  [kvm]              [k] kvm_apic_present
>  1.70%    8657  qemu-kvm  [kvm]              [k] kvm_vcpu_on_spin
>  1.45%    7889  qemu-kvm  [kvm]              [k] vcpu_enter_guest

For the cpu threads in the host that are actually active (in this case half
of them), ~50% of their time is spent in the kernel and ~43% in the guest.
This is for a no-IO workload, so it is just incredible to see so much cpu
wasted. I feel that the two important areas to tackle are a more scalable
yield_to() and reducing the number of pause exits itself (hopefully by just
tuning ple_window for the latter).
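(As an aside, for anyone wanting to experiment with that last point:
ple_window is a kvm_intel module parameter. I believe it is read-only at
runtime, so changing it means reloading the module with all guests shut
down first; the value below is only an example, not a recommendation.)

cat /sys/module/kvm_intel/parameters/ple_window    # current value (default 4096, IIRC)
rmmod kvm_intel
modprobe kvm_intel ple_window=16384                # larger window => fewer pause exits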
Honestly, I am not confident that addressing this problem will improve the
ebizzy score. That workload is so erratic for me that I do not trust the
results at all. I have, however, seen consistent improvements from disabling
PLE for an http guest workload and a very high-IOPS guest workload, both with
much time spent in the host in the double runqueue lock for yield_to(), so
that's why I still gravitate toward that issue.

-Andrew Theurer