From: "Luis R. Rodriguez" <mcgrof@kernel.org> To: Yang Zhang <yang.zhang.wz@gmail.com> Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, wanpeng.li@hotmail.com, mst@redhat.com, pbonzini@redhat.com, tglx@linutronix.de, rkrcmar@redhat.com, dmatlack@google.com, agraf@suse.de, peterz@infradead.org, linux-doc@vger.kernel.org, Quan Xu <quan.xu0@gmail.com>, Jonathan Corbet <corbet@lwn.net>, Juergen Gross <jgross@suse.com>, Jeremy Fitzhardinge <jeremy@goop.org>, Chris Wright <chrisw@sous-sol.org>, Alok Kataria <akataria@vmware.com>, Rusty Russell <rusty@rustcorp.com.au>, Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, x86@kernel.org, "Luis R. Rodriguez" <mcgrof@kernel.org>, Kees Cook <keescook@chromium.org>, Mauro Carvalho Chehab <mchehab@kernel.org>, Krzysztof Kozlowski <krzk@kernel.org>, Josh Poimboeuf <jpoimboe@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, Petr Mladek <pmladek@suse.com>, Jessica Yu <jeyu@redhat.com>, Larry Finger <Larry.Finger@lwfinger.net>, zijun_hu <zijun_hu@htc.com>, Baoquan He <bhe@redhat.com>, Johannes Berg <johannes.berg@intel.com>, Ian Abbott <abbotti@mev.co.uk>, virtualization@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org Subject: Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll Date: Tue, 29 Aug 2017 19:20:41 +0200 [thread overview] Message-ID: <20170829172041.GH27873@wotan.suse.de> (raw) In-Reply-To: <1504007201-12904-6-git-send-email-yang.zhang.wz@gmail.com> On Tue, Aug 29, 2017 at 11:46:39AM +0000, Yang Zhang wrote: > To reduce the cost of poll, we introduce three sysctl to control the > poll time. This commit does not describe in any way the fact that these knobs are all for and only for PARAVIRT. > diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt > index bac23c1..67447b8 100644 > --- a/Documentation/sysctl/kernel.txt > +++ b/Documentation/sysctl/kernel.txt > @@ -63,6 +63,9 @@ show up in /proc/sys/kernel: > - perf_event_max_stack > - perf_event_max_contexts_per_stack > - pid_max > +- poll_grow [ X86 only ] > +- poll_shrink [ X86 only ] > +- poll_threshold_ns [ X86 only ] How about paravirt_ prefix? > - powersave-nap [ PPC only ] > - printk > - printk_delay > @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one. > > ============================================================== > > +poll_grow: (X86 only) > + > +This parameter is multiplied in the grow_poll_ns() to increase the poll time. > +By default, the values is 2. > + > +============================================================== > +poll_shrink: (X86 only) > + > +This parameter is divided in the shrink_poll_ns() to reduce the poll time. > +By default, the values is 2. These don't say anything about this being related to paravirt. > + > +============================================================== > +poll_threshold_ns: (X86 only) > + > +This parameter controls the max poll time before entering real idle path. > +This parameter is expected to take effect only when running inside a VM. > +It would make no sense to turn on it in bare mental. turn it on? Or modify, or use it? > +By default, the values is 0 means don't poll. It is recommended to change > +the value to non-zero if running latency-bound workloads inside VM. > + > +============================================================== > + > powersave-nap: (PPC only) > > If set, Linux-PPC will use the 'nap' mode of powersaving, > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > index a11b2c2..0b92f8f 100644 > --- a/arch/x86/kernel/paravirt.c > +++ b/arch/x86/kernel/paravirt.c > @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = { > .update = paravirt_nop, > }; > > +unsigned long poll_threshold_ns; > +unsigned int poll_shrink = 2; > +unsigned int poll_grow = 2; > + > __visible struct pv_irq_ops pv_irq_ops = { > .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl), > .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl), > diff --git a/include/linux/kernel.h b/include/linux/kernel.h > index bd6d96c..6cb2820 100644 > --- a/include/linux/kernel.h > +++ b/include/linux/kernel.h > @@ -462,6 +462,12 @@ extern __scanf(2, 0) > > extern bool crash_kexec_post_notifiers; > > +#ifdef CONFIG_PARAVIRT > +extern unsigned long poll_threshold_ns; > +extern unsigned int poll_shrink; > +extern unsigned int poll_grow; > +#endif What are these if we are on bare metal but support paravirt? The docs are not clear in any way about it. > + > /* > * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It > * holds a CPU number which is executing panic() currently. A value of > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 6648fbb..9b86a8f 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, > .extra2 = &one, > }, > #endif > +#ifdef CONFIG_PARAVIRT > + { > + .procname = "halt_poll_threshold", > + .data = &poll_threshold_ns, > + .maxlen = sizeof(unsigned long), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, proc_doulongvec_minmax() may be more appropriate here? What is the range? Also what user did you see used proc_dointvec() but had unsigned long? If the test below reveal this is invalid we would need to go fix them as well. > + { > + .procname = "halt_poll_grow", > + .data = &poll_grow, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, > + { > + .procname = "halt_poll_shrink", > + .data = &poll_shrink, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec, We have now a much more appropriate proc_douintvec() proc_douintvec_minmax(). It seems you want to support only unsigned int for two of these so that would be more appropriate. To test things out you can replicate your values using or extending the test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of values and then fuzz testing them with arbitrary values to ensure you get only as input what you really think these values should get. Once done please extend the script: tools/testing/selftests/sysctl/sysctl.sh to cover the different tests you have run, we want tests to be generic. Luis
WARNING: multiple messages have this Message-ID (diff)
From: "Luis R. Rodriguez" <mcgrof@kernel.org> To: Yang Zhang <yang.zhang.wz@gmail.com> Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, wanpeng.li@hotmail.com, mst@redhat.com, pbonzini@redhat.com, tglx@linutronix.de, rkrcmar@redhat.com, dmatlack@google.com, agraf@suse.de, peterz@infradead.org, linux-doc@vger.kernel.org, Quan Xu <quan.xu0@gmail.com>, Jonathan Corbet <corbet@lwn.net>, Juergen Gross <jgross@suse.com>, Jeremy Fitzhardinge <jeremy@goop.org>, Chris Wright <chrisw@sous-sol.org>, Alok Kataria <akataria@vmware.com>, Rusty Russell <rusty@rustcorp.com.au>, Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>, x86@kernel.org, "Luis R. Rodriguez" <mcgrof@kernel.org>, Kees Cook <keescook@chromium.org>, Mauro Carvalho Chehab <mchehab@kernel.org>, Krzysztof Kozlowski <krzk@kernel.org>, Josh Poimboeuf <jpoimboe@redhat.com>, Subject: Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll Date: Tue, 29 Aug 2017 19:20:41 +0200 [thread overview] Message-ID: <20170829172041.GH27873@wotan.suse.de> (raw) In-Reply-To: <1504007201-12904-6-git-send-email-yang.zhang.wz@gmail.com> On Tue, Aug 29, 2017 at 11:46:39AM +0000, Yang Zhang wrote: > To reduce the cost of poll, we introduce three sysctl to control the > poll time. This commit does not describe in any way the fact that these knobs are all for and only for PARAVIRT. > diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt > index bac23c1..67447b8 100644 > --- a/Documentation/sysctl/kernel.txt > +++ b/Documentation/sysctl/kernel.txt > @@ -63,6 +63,9 @@ show up in /proc/sys/kernel: > - perf_event_max_stack > - perf_event_max_contexts_per_stack > - pid_max > +- poll_grow [ X86 only ] > +- poll_shrink [ X86 only ] > +- poll_threshold_ns [ X86 only ] How about paravirt_ prefix? > - powersave-nap [ PPC only ] > - printk > - printk_delay > @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one. > > ============================================================== > > +poll_grow: (X86 only) > + > +This parameter is multiplied in the grow_poll_ns() to increase the poll time. > +By default, the values is 2. > + > +============================================================== > +poll_shrink: (X86 only) > + > +This parameter is divided in the shrink_poll_ns() to reduce the poll time. > +By default, the values is 2. These don't say anything about this being related to paravirt. > + > +============================================================== > +poll_threshold_ns: (X86 only) > + > +This parameter controls the max poll time before entering real idle path. > +This parameter is expected to take effect only when running inside a VM. > +It would make no sense to turn on it in bare mental. turn it on? Or modify, or use it? > +By default, the values is 0 means don't poll. It is recommended to change > +the value to non-zero if running latency-bound workloads inside VM. > + > +============================================================== > + > powersave-nap: (PPC only) > > If set, Linux-PPC will use the 'nap' mode of powersaving, > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > index a11b2c2..0b92f8f 100644 > --- a/arch/x86/kernel/paravirt.c > +++ b/arch/x86/kernel/paravirt.c > @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = { > .update = paravirt_nop, > }; > > +unsigned long poll_threshold_ns; > +unsigned int poll_shrink = 2; > +unsigned int poll_grow = 2; > + > __visible struct pv_irq_ops pv_irq_ops = { > .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl), > .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl), > diff --git a/include/linux/kernel.h b/include/linux/kernel.h > index bd6d96c..6cb2820 100644 > --- a/include/linux/kernel.h > +++ b/include/linux/kernel.h > @@ -462,6 +462,12 @@ extern __scanf(2, 0) > > extern bool crash_kexec_post_notifiers; > > +#ifdef CONFIG_PARAVIRT > +extern unsigned long poll_threshold_ns; > +extern unsigned int poll_shrink; > +extern unsigned int poll_grow; > +#endif What are these if we are on bare metal but support paravirt? The docs are not clear in any way about it. > + > /* > * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It > * holds a CPU number which is executing panic() currently. A value of > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 6648fbb..9b86a8f 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, > .extra2 = &one, > }, > #endif > +#ifdef CONFIG_PARAVIRT > + { > + .procname = "halt_poll_threshold", > + .data = &poll_threshold_ns, > + .maxlen = sizeof(unsigned long), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, proc_doulongvec_minmax() may be more appropriate here? What is the range? Also what user did you see used proc_dointvec() but had unsigned long? If the test below reveal this is invalid we would need to go fix them as well. > + { > + .procname = "halt_poll_grow", > + .data = &poll_grow, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec, > + }, > + { > + .procname = "halt_poll_shrink", > + .data = &poll_shrink, > + .maxlen = sizeof(unsigned int), > + .mode = 0644, > + .proc_handler = proc_dointvec, We have now a much more appropriate proc_douintvec() proc_douintvec_minmax(). It seems you want to support only unsigned int for two of these so that would be more appropriate. To test things out you can replicate your values using or extending the test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of values and then fuzz testing them with arbitrary values to ensure you get only as input what you really think these values should get. Once done please extend the script: tools/testing/selftests/sysctl/sysctl.sh to cover the different tests you have run, we want tests to be generic. Luis
next prev parent reply other threads:[~2017-08-29 17:20 UTC|newest] Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top 2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang 2017-08-29 11:46 ` [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops Yang Zhang 2017-08-29 11:46 ` Yang Zhang 2017-08-29 13:55 ` Konrad Rzeszutek Wilk 2017-08-29 13:55 ` Konrad Rzeszutek Wilk 2017-08-30 7:33 ` Juergen Gross 2017-08-30 7:33 ` Juergen Gross 2017-09-01 6:50 ` Yang Zhang 2017-09-01 6:50 ` Yang Zhang 2017-08-29 11:46 ` [RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops Yang Zhang 2017-08-29 11:46 ` [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path Yang Zhang 2017-08-29 12:45 ` Peter Zijlstra 2017-09-01 5:57 ` Quan Xu 2017-09-14 8:41 ` Quan Xu 2017-09-14 9:18 ` Borislav Petkov 2017-08-29 14:39 ` Borislav Petkov 2017-09-01 6:49 ` Quan Xu 2017-09-29 10:39 ` Quan Xu 2017-08-29 11:46 ` [RFC PATCH v2 4/7] x86/paravirt: Add update in x86/paravirt pv_idle_ops Yang Zhang 2017-08-29 11:46 ` Yang Zhang 2017-08-29 11:46 ` [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll Yang Zhang 2017-08-29 11:46 ` Yang Zhang 2017-08-29 11:46 ` Yang Zhang 2017-08-29 17:20 ` Luis R. Rodriguez 2017-08-29 17:20 ` Luis R. Rodriguez [this message] 2017-08-29 17:20 ` Luis R. Rodriguez 2017-08-29 11:46 ` [RFC PATCH v2 6/7] KVM guest: introduce smart idle poll algorithm Yang Zhang 2017-08-29 11:46 ` [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle Yang Zhang 2017-08-29 12:46 ` Peter Zijlstra 2017-09-01 7:30 ` Yang Zhang 2017-09-29 10:29 ` Quan Xu 2017-08-29 11:58 ` [RFC PATCH v2 0/7] x86/idle: add halt poll support Alexander Graf 2017-09-01 6:21 ` Yang Zhang 2017-08-29 13:03 ` Andi Kleen 2017-08-29 14:02 ` Wanpeng Li 2017-08-29 14:27 ` Konrad Rzeszutek Wilk 2017-08-29 14:36 ` Michael S. Tsirkin 2017-09-01 6:32 ` Yang Zhang 2017-09-01 6:52 ` Wanpeng Li 2017-09-01 6:44 ` Yang Zhang 2017-09-01 6:58 ` Wanpeng Li 2017-09-01 7:53 ` Yang Zhang 2017-08-29 14:56 ` Michael S. Tsirkin 2017-09-13 11:56 ` Yang Zhang 2017-09-14 8:36 ` Quan Xu 2017-09-14 9:19 ` Wanpeng Li 2017-09-14 9:40 ` Quan Xu
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=20170829172041.GH27873@wotan.suse.de \ --to=mcgrof@kernel.org \ --cc=Larry.Finger@lwfinger.net \ --cc=abbotti@mev.co.uk \ --cc=agraf@suse.de \ --cc=akataria@vmware.com \ --cc=akpm@linux-foundation.org \ --cc=bhe@redhat.com \ --cc=chrisw@sous-sol.org \ --cc=corbet@lwn.net \ --cc=dmatlack@google.com \ --cc=hpa@zytor.com \ --cc=jeremy@goop.org \ --cc=jeyu@redhat.com \ --cc=jgross@suse.com \ --cc=johannes.berg@intel.com \ --cc=jpoimboe@redhat.com \ --cc=keescook@chromium.org \ --cc=krzk@kernel.org \ --cc=kvm@vger.kernel.org \ --cc=linux-doc@vger.kernel.org \ --cc=linux-fsdevel@vger.kernel.org \ --cc=linux-kernel@vger.kernel.org \ --cc=mchehab@kernel.org \ --cc=mingo@redhat.com \ --cc=mst@redhat.com \ --cc=pbonzini@redhat.com \ --cc=peterz@infradead.org \ --cc=pmladek@suse.com \ --cc=quan.xu0@gmail.com \ --cc=rkrcmar@redhat.com \ --cc=rusty@rustcorp.com.au \ --cc=tglx@linutronix.de \ --cc=virtualization@lists.linux-foundation.org \ --cc=wanpeng.li@hotmail.com \ --cc=x86@kernel.org \ --cc=yang.zhang.wz@gmail.com \ --cc=zijun_hu@htc.com \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.