From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1753113AbdFVLWq (ORCPT ); Thu, 22 Jun 2017 07:22:46 -0400
Received: from mail-pf0-f195.google.com ([209.85.192.195]:34991 "EHLO
 mail-pf0-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1752719AbdFVLWm (ORCPT ); Thu, 22 Jun 2017 07:22:42 -0400
From: root
X-Google-Original-From: root
To: tglx@linutronix.de, mingo@redhat.com, hpa@zytor.com, pbonzini@redhat.com
Cc: x86@kernel.org, corbet@lwn.net, tony.luck@intel.com, bp@alien8.de,
 peterz@infradead.org, mchehab@kernel.org, akpm@linux-foundation.org,
 krzk@kernel.org, jpoimboe@redhat.com, luto@kernel.org,
 borntraeger@de.ibm.com, thgarnie@google.com, rgerst@gmail.com,
 minipli@googlemail.com, douly.fnst@cn.fujitsu.com, nicstange@gmail.com,
 fweisbec@gmail.com, dvlasenk@redhat.com, bristot@redhat.com,
 yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com,
 yu.c.chen@intel.com, aaron.lu@intel.com, rostedt@goodmis.org,
 me@kylehuey.com, len.brown@intel.com, prarit@redhat.com,
 hidehiro.kawai.ez@hitachi.com, fengtiantian@huawei.com, pmladek@suse.com,
 jeyu@redhat.com, Larry.Finger@lwfinger.net, zijun_hu@htc.com,
 luisbg@osg.samsung.com, johannes.berg@intel.com,
 niklas.soderlund+renesas@ragnatech.se, zlpnobody@gmail.com,
 adobriyan@gmail.com, fgao@48lvckh6395k16k5.yundunddos.com,
 ebiederm@xmission.com, subashab@codeaurora.org, arnd@arndb.de,
 matt@codeblueprint.co.uk, mgorman@techsingularity.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-edac@vger.kernel.org, kvm@vger.kernel.org, Yang Zhang
Subject: [PATCH 1/2] x86/idle: add halt poll for halt idle
Date: Thu, 22 Jun 2017 11:22:13 +0000
Message-Id: <1498130534-26568-2-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1498130534-26568-1-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal>
References:
 <1498130534-26568-1-git-send-email-root@ip-172-31-39-62.us-west-2.compute.internal>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Yang Zhang <yang.zhang.wz@gmail.com>

This patch introduces a new mechanism to poll for a while before
entering the idle state.

David gave a talk at KVM Forum describing the problem that current KVM
VMs have when running some message-passing workloads. There has also
been work to improve performance in KVM, such as halt polling. But we
still have 4 MSR writes and an HLT vmexit when going into halt idle,
which introduces a lot of latency.

Halt polling in KVM provides the capability to not schedule out a VCPU
when it is the only task on its pCPU. Unlike that, this patch lets the
VCPU poll for a while when there is no work inside the VCPU, to
eliminate the heavy vmexits when entering and leaving idle. The
potential impact is that it costs more CPU cycles, since we are polling,
and it may affect other tasks waiting on the same physical CPU in the
host.

Here is the data I get when running the contextswitch benchmark
(https://github.com/tsuna/contextswitch):

before patch:
2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)

after patch:
2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)

Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
---
 Documentation/sysctl/kernel.txt | 10 ++++++++++
 arch/x86/kernel/process.c       | 21 +++++++++++++++++++++
 include/linux/kernel.h          |  3 +++
 kernel/sched/idle.c             |  3 +++
 kernel/sysctl.c                 |  9 +++++++++
 5 files changed, 46 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..4e71bfe 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
 - perf_event_max_stack
 - perf_event_max_contexts_per_stack
 - pid_max
+- poll_threshold_ns        [ X86 only ]
 - powersave-nap               [ PPC only ]
 - printk
 - printk_delay
@@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
 
 ==============================================================
 
+poll_threshold_ns: (X86 only)
+
+This parameter is used to control the max wait time to poll before going
+into real idle state. By default, the value is 0, which means don't poll.
+It is recommended to change the value to non-zero if running latency-bound
+workloads in a VM.
+
+==============================================================
+
 powersave-nap: (PPC only)
 
 If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0bb8842..6361783 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -39,6 +39,10 @@
 #include <asm/desc.h>
 #include <asm/prctl.h>
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+unsigned long poll_threshold_ns;
+#endif
+
 /*
  * per-CPU TSS segments. Threads are completely 'soft' on Linux,
  * no more per-task TSS's. The TSS size is kept cacheline-aligned
@@ -313,6 +317,23 @@ static inline void play_dead(void)
 }
 #endif
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+void arch_cpu_idle_poll(void)
+{
+	ktime_t cur, stop;
+
+	if (poll_threshold_ns) {
+		cur = ktime_get();
+		stop = ktime_add_ns(cur, poll_threshold_ns);
+		do {
+			if (need_resched())
+				break;
+			cur = ktime_get();
+		} while (ktime_before(cur, stop));
+	}
+}
+#endif
+
 void arch_cpu_idle_enter(void)
 {
 	tsc_verify_tsc_adjust(false);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 13bc08a..04cf774 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -460,6 +460,9 @@ extern __scanf(2, 0)
 extern int sysctl_panic_on_stackoverflow;
 
 extern bool crash_kexec_post_notifiers;
+#ifdef CONFIG_HYPERVISOR_GUEST
+extern unsigned long poll_threshold_ns;
+#endif
 
 /*
  * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2a25a9e..e789f99 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }
 void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,8 @@ static void do_idle(void)
 	 */
 
 	__current_set_polling();
+	arch_cpu_idle_poll();
+
 	tick_nohz_idle_enter();
 
 	while (!need_resched()) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4dfba1a..9174d57 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1203,6 +1203,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.extra2		= &one,
 	},
 #endif
+#ifdef CONFIG_HYPERVISOR_GUEST
+	{
+		.procname	= "halt_poll_threshold",
+		.data		= &poll_threshold_ns,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= proc_doulongvec_minmax,
+	},
+#endif
 	{ }
 };
 
-- 
1.8.3.1
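
For anyone who wants to experiment with the idea outside the kernel, the
bounded poll done by arch_cpu_idle_poll() can be sketched in userspace C.
This is only an illustration under stated assumptions, not the code from
the patch: now_ns(), poll_before_idle() and the work_pending flag are
hypothetical names. The flag stands in for need_resched(), and
clock_gettime(CLOCK_MONOTONIC) stands in for ktime_get().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock in nanoseconds; a userspace stand-in for ktime_get(). */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Spin for at most threshold_ns waiting for *work_pending to become true.
 * work_pending is a hypothetical stand-in for need_resched().
 *
 * Returns true if work showed up inside the poll window, false if the
 * window expired or polling is disabled, in which case the caller would
 * go on to the real idle path.
 */
static bool poll_before_idle(uint64_t threshold_ns, atomic_bool *work_pending)
{
	uint64_t stop;

	assert(work_pending != NULL);

	if (threshold_ns == 0)
		return false;	/* 0 means "do not poll", as in the patch */

	stop = now_ns() + threshold_ns;
	do {
		/* Check for work first, then re-read the clock. */
		if (atomic_load_explicit(work_pending, memory_order_acquire))
			return true;
	} while (now_ns() < stop);

	return false;
}
```

A caller would invoke poll_before_idle(threshold, &flag) right before
blocking and skip the blocking call when it returns true. The sketch checks
the flag before re-reading the clock, in the same order as the loop in the
patch, so pending work is noticed without paying for an extra clock read.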