All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Luis R. Rodriguez" <mcgrof@kernel.org>
To: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	wanpeng.li@hotmail.com, mst@redhat.com, pbonzini@redhat.com,
	tglx@linutronix.de, rkrcmar@redhat.com, dmatlack@google.com,
	agraf@suse.de, peterz@infradead.org, linux-doc@vger.kernel.org,
	Quan Xu <quan.xu0@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
	Juergen Gross <jgross@suse.com>,
	Jeremy Fitzhardinge <jeremy@goop.org>,
	Chris Wright <chrisw@sous-sol.org>,
	Alok Kataria <akataria@vmware.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, "Luis R. Rodriguez" <mcgrof@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Krzysztof Kozlowski <krzk@kernel.org>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Petr Mladek <pmladek@suse.com>, Jessica Yu <jeyu@redhat.com>,
	Larry Finger <Larry.Finger@lwfinger.net>,
	zijun_hu <zijun_hu@htc.com>, Baoquan He <bhe@redhat.com>,
	Johannes Berg <johannes.berg@intel.com>,
	Ian Abbott <abbotti@mev.co.uk>,
	virtualization@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll
Date: Tue, 29 Aug 2017 19:20:41 +0200	[thread overview]
Message-ID: <20170829172041.GH27873@wotan.suse.de> (raw)
In-Reply-To: <1504007201-12904-6-git-send-email-yang.zhang.wz@gmail.com>

On Tue, Aug 29, 2017 at 11:46:39AM +0000, Yang Zhang wrote:
> To reduce the cost of poll, we introduce three sysctl to control the
> poll time.

This commit does not describe in any way the fact that these knobs are
all for and only for PARAVIRT.

> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..67447b8 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_grow                   [ X86 only ]
> +- poll_shrink                 [ X86 only ]
> +- poll_threshold_ns           [ X86 only ]

How about paravirt_ prefix?

>  - powersave-nap               [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
>  
>  ==============================================================
>  
> +poll_grow: (X86 only)
> +
> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
> +By default, the values is 2.
> +
> +==============================================================
> +poll_shrink: (X86 only)
> +
> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
> +By default, the values is 2.

These don't say anything about this being related to paravirt.

> +
> +==============================================================
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the max poll time before entering real idle path.
> +This parameter is expected to take effect only when running inside a VM.
> +It would make no sense to turn on it in bare mental.

turn it on? Or modify, or use it?

> +By default, the values is 0 means don't poll. It is recommended to change
> +the value to non-zero if running latency-bound workloads inside VM.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index a11b2c2..0b92f8f 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
>  	.update = paravirt_nop,
>  };
>  
> +unsigned long poll_threshold_ns;
> +unsigned int poll_shrink = 2;
> +unsigned int poll_grow = 2;
> +
>  __visible struct pv_irq_ops pv_irq_ops = {
>  	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>  	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index bd6d96c..6cb2820 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -462,6 +462,12 @@ extern __scanf(2, 0)
>  
>  extern bool crash_kexec_post_notifiers;
>  
> +#ifdef CONFIG_PARAVIRT
> +extern unsigned long poll_threshold_ns;
> +extern unsigned int poll_shrink;
> +extern unsigned int poll_grow;
> +#endif

What are these if we are on bare metal but support
paravirt? The docs are not clear in any way about it.

> +
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
>   * holds a CPU number which is executing panic() currently. A value of
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..9b86a8f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>  		.extra2		= &one,
>  	},
>  #endif
> +#ifdef CONFIG_PARAVIRT
> +	{
> +		.procname       = "halt_poll_threshold",
> +		.data           = &poll_threshold_ns,
> +		.maxlen         = sizeof(unsigned long),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},

proc_doulongvec_minmax() may be more appropriate here? What is the range?
Also what user did you see used proc_dointvec() but had unsigned long?
If the test below reveal this is invalid we would need to go fix them
as well.

> +	{
> +		.procname       = "halt_poll_grow",
> +		.data           = &poll_grow,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},
> +	{
> +		.procname       = "halt_poll_shrink",
> +		.data           = &poll_shrink,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,

We have now a much more appropriate proc_douintvec() proc_douintvec_minmax().
It seems you want to support only unsigned int for two of these so that would
be more appropriate.

To test things out you can replicate your values using or extending the
test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of
values and then fuzz testing them with arbitrary values to ensure you get
only as input what you really think these values should get.

Once done please extend the script:

tools/testing/selftests/sysctl/sysctl.sh

to cover the different tests you have run, we want tests to be generic.

  Luis

WARNING: multiple messages have this Message-ID (diff)
From: "Luis R. Rodriguez" <mcgrof@kernel.org>
To: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	wanpeng.li@hotmail.com, mst@redhat.com, pbonzini@redhat.com,
	tglx@linutronix.de, rkrcmar@redhat.com, dmatlack@google.com,
	agraf@suse.de, peterz@infradead.org, linux-doc@vger.kernel.org,
	Quan Xu <quan.xu0@gmail.com>, Jonathan Corbet <corbet@lwn.net>,
	Juergen Gross <jgross@suse.com>,
	Jeremy Fitzhardinge <jeremy@goop.org>,
	Chris Wright <chrisw@sous-sol.org>,
	Alok Kataria <akataria@vmware.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, "Luis R. Rodriguez" <mcgrof@kernel.org>,
	Kees Cook <keescook@chromium.org>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Krzysztof Kozlowski <krzk@kernel.org>,
	Josh Poimboeuf <jpoimboe@redhat.com>,
Subject: Re: [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll
Date: Tue, 29 Aug 2017 19:20:41 +0200	[thread overview]
Message-ID: <20170829172041.GH27873@wotan.suse.de> (raw)
In-Reply-To: <1504007201-12904-6-git-send-email-yang.zhang.wz@gmail.com>

On Tue, Aug 29, 2017 at 11:46:39AM +0000, Yang Zhang wrote:
> To reduce the cost of poll, we introduce three sysctl to control the
> poll time.

This commit does not describe in any way the fact that these knobs are
all for and only for PARAVIRT.

> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..67447b8 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_grow                   [ X86 only ]
> +- poll_shrink                 [ X86 only ]
> +- poll_threshold_ns           [ X86 only ]

How about paravirt_ prefix?

>  - powersave-nap               [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
>  
>  ==============================================================
>  
> +poll_grow: (X86 only)
> +
> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
> +By default, the values is 2.
> +
> +==============================================================
> +poll_shrink: (X86 only)
> +
> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
> +By default, the values is 2.

These don't say anything about this being related to paravirt.

> +
> +==============================================================
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the max poll time before entering real idle path.
> +This parameter is expected to take effect only when running inside a VM.
> +It would make no sense to turn on it in bare mental.

turn it on? Or modify, or use it?

> +By default, the values is 0 means don't poll. It is recommended to change
> +the value to non-zero if running latency-bound workloads inside VM.
> +
> +==============================================================
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index a11b2c2..0b92f8f 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
>  	.update = paravirt_nop,
>  };
>  
> +unsigned long poll_threshold_ns;
> +unsigned int poll_shrink = 2;
> +unsigned int poll_grow = 2;
> +
>  __visible struct pv_irq_ops pv_irq_ops = {
>  	.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>  	.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index bd6d96c..6cb2820 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -462,6 +462,12 @@ extern __scanf(2, 0)
>  
>  extern bool crash_kexec_post_notifiers;
>  
> +#ifdef CONFIG_PARAVIRT
> +extern unsigned long poll_threshold_ns;
> +extern unsigned int poll_shrink;
> +extern unsigned int poll_grow;
> +#endif

What are these if we are on bare metal but support
paravirt? The docs are not clear in any way about it.

> +
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
>   * holds a CPU number which is executing panic() currently. A value of
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..9b86a8f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
>  		.extra2		= &one,
>  	},
>  #endif
> +#ifdef CONFIG_PARAVIRT
> +	{
> +		.procname       = "halt_poll_threshold",
> +		.data           = &poll_threshold_ns,
> +		.maxlen         = sizeof(unsigned long),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},

proc_doulongvec_minmax() may be more appropriate here? What is the range?
Also what user did you see used proc_dointvec() but had unsigned long?
If the test below reveal this is invalid we would need to go fix them
as well.

> +	{
> +		.procname       = "halt_poll_grow",
> +		.data           = &poll_grow,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,
> +	},
> +	{
> +		.procname       = "halt_poll_shrink",
> +		.data           = &poll_shrink,
> +		.maxlen         = sizeof(unsigned int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec,

We have now a much more appropriate proc_douintvec() proc_douintvec_minmax().
It seems you want to support only unsigned int for two of these so that would
be more appropriate.

To test things out you can replicate your values using or extending the
test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of
values and then fuzz testing them with arbitrary values to ensure you get
only as input what you really think these values should get.

Once done please extend the script:

tools/testing/selftests/sysctl/sysctl.sh

to cover the different tests you have run, we want tests to be generic.

  Luis

  parent reply	other threads:[~2017-08-29 17:20 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-29 11:46 [RFC PATCH v2 0/7] x86/idle: add halt poll support Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 1/7] x86/paravirt: Add pv_idle_ops to paravirt ops Yang Zhang
2017-08-29 11:46   ` Yang Zhang
2017-08-29 13:55   ` Konrad Rzeszutek Wilk
2017-08-29 13:55     ` Konrad Rzeszutek Wilk
2017-08-30  7:33     ` Juergen Gross
2017-08-30  7:33     ` Juergen Gross
2017-09-01  6:50     ` Yang Zhang
2017-09-01  6:50     ` Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 2/7] KVM guest: register kvm_idle_poll for pv_idle_ops Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path Yang Zhang
2017-08-29 12:45   ` Peter Zijlstra
2017-09-01  5:57     ` Quan Xu
2017-09-14  8:41       ` Quan Xu
2017-09-14  9:18         ` Borislav Petkov
2017-08-29 14:39   ` Borislav Petkov
2017-09-01  6:49     ` Quan Xu
2017-09-29 10:39       ` Quan Xu
2017-08-29 11:46 ` [RFC PATCH v2 4/7] x86/paravirt: Add update in x86/paravirt pv_idle_ops Yang Zhang
2017-08-29 11:46 ` Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 5/7] Documentation: Add three sysctls for smart idle poll Yang Zhang
2017-08-29 11:46 ` Yang Zhang
2017-08-29 11:46   ` Yang Zhang
2017-08-29 17:20   ` Luis R. Rodriguez
2017-08-29 17:20   ` Luis R. Rodriguez [this message]
2017-08-29 17:20     ` Luis R. Rodriguez
2017-08-29 11:46 ` [RFC PATCH v2 6/7] KVM guest: introduce smart idle poll algorithm Yang Zhang
2017-08-29 11:46 ` [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle Yang Zhang
2017-08-29 12:46   ` Peter Zijlstra
2017-09-01  7:30     ` Yang Zhang
2017-09-29 10:29     ` Quan Xu
2017-08-29 11:58 ` [RFC PATCH v2 0/7] x86/idle: add halt poll support Alexander Graf
2017-09-01  6:21   ` Yang Zhang
2017-08-29 13:03 ` Andi Kleen
2017-08-29 14:02 ` Wanpeng Li
2017-08-29 14:27   ` Konrad Rzeszutek Wilk
2017-08-29 14:36   ` Michael S. Tsirkin
2017-09-01  6:32   ` Yang Zhang
2017-09-01  6:52     ` Wanpeng Li
2017-09-01  6:44   ` Yang Zhang
2017-09-01  6:58     ` Wanpeng Li
2017-09-01  7:53       ` Yang Zhang
2017-08-29 14:56 ` Michael S. Tsirkin
2017-09-13 11:56   ` Yang Zhang
2017-09-14  8:36     ` Quan Xu
2017-09-14  9:19       ` Wanpeng Li
2017-09-14  9:40         ` Quan Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170829172041.GH27873@wotan.suse.de \
    --to=mcgrof@kernel.org \
    --cc=Larry.Finger@lwfinger.net \
    --cc=abbotti@mev.co.uk \
    --cc=agraf@suse.de \
    --cc=akataria@vmware.com \
    --cc=akpm@linux-foundation.org \
    --cc=bhe@redhat.com \
    --cc=chrisw@sous-sol.org \
    --cc=corbet@lwn.net \
    --cc=dmatlack@google.com \
    --cc=hpa@zytor.com \
    --cc=jeremy@goop.org \
    --cc=jeyu@redhat.com \
    --cc=jgross@suse.com \
    --cc=johannes.berg@intel.com \
    --cc=jpoimboe@redhat.com \
    --cc=keescook@chromium.org \
    --cc=krzk@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=mingo@redhat.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pmladek@suse.com \
    --cc=quan.xu0@gmail.com \
    --cc=rkrcmar@redhat.com \
    --cc=rusty@rustcorp.com.au \
    --cc=tglx@linutronix.de \
    --cc=virtualization@lists.linux-foundation.org \
    --cc=wanpeng.li@hotmail.com \
    --cc=x86@kernel.org \
    --cc=yang.zhang.wz@gmail.com \
    --cc=zijun_hu@htc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.