linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
@ 2014-01-31 19:53 Zoran Markovic
  2014-01-31 20:10 ` Zoran Markovic
  2014-02-10 10:08 ` Lai Jiangshan
  0 siblings, 2 replies; 25+ messages in thread
From: Zoran Markovic @ 2014-01-31 19:53 UTC (permalink / raw)
  To: linux-kernel
  Cc: Shaibal Dutta, Lai Jiangshan, Paul E. McKenney, Dipankar Sarma

From: Shaibal Dutta <shaibal.dutta@broadcom.com>

For better use of CPU idle time, allow the scheduler to select the CPU
on which the SRCU grace period work would be scheduled. This improves
idle residency time and conserves power.

This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.

Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
[zoran.markovic@linaro.org: Rebased to latest kernel version. Added commit
message. Fixed code alignment.]
---
 kernel/rcu/srcu.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index 3318d82..a1ebe6d 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
 	rcu_batch_queue(&sp->batch_queue, head);
 	if (!sp->running) {
 		sp->running = true;
-		schedule_delayed_work(&sp->work, 0);
+		queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
 	}
 	spin_unlock_irqrestore(&sp->queue_lock, flags);
 }
@@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
 	}
 
 	if (pending)
-		schedule_delayed_work(&sp->work, SRCU_INTERVAL);
+		queue_delayed_work(system_power_efficient_wq,
+				   &sp->work, SRCU_INTERVAL);
 }
 
 /*
-- 
1.7.9.5
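
For context, the behavioural difference above comes from how kernel/workqueue.c wires up the power-efficient queue. Roughly (a paraphrased sketch of that era's workqueue code, not part of this patch): the WQ_POWER_EFFICIENT flag is promoted to WQ_UNBOUND when power-efficient mode is enabled, so queued work may be placed on any allowed CPU rather than the local one; when the mode is off, system_power_efficient_wq behaves much like the per-CPU system_wq that schedule_delayed_work() uses.

	/* approximate excerpt from kernel/workqueue.c, for illustration only */
	static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
	module_param_named(power_efficient, wq_power_efficient, bool, 0444);

	/* in __alloc_workqueue_key(): power-efficient queues become unbound
	 * when the mode is enabled; otherwise the flag is a no-op */
	if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
		flags |= WQ_UNBOUND;

	/* at boot: */
	system_power_efficient_wq = alloc_workqueue("events_power_efficient",
						    WQ_POWER_EFFICIENT, 0);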



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-01-31 19:53 [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue Zoran Markovic
@ 2014-01-31 20:10 ` Zoran Markovic
  2014-02-10 10:08 ` Lai Jiangshan
  1 sibling, 0 replies; 25+ messages in thread
From: Zoran Markovic @ 2014-01-31 20:10 UTC (permalink / raw)
  To: lkml; +Cc: Shaibal Dutta, Lai Jiangshan, Paul E. McKenney, Dipankar Sarma

Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org>

On 31 January 2014 11:53, Zoran Markovic <zoran.markovic@linaro.org> wrote:
> From: Shaibal Dutta <shaibal.dutta@broadcom.com>
>
> For better use of CPU idle time, allow the scheduler to select the CPU
> on which the SRCU grace period work would be scheduled. This improves
> idle residency time and conserves power.
>
> This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Dipankar Sarma <dipankar@in.ibm.com>
> Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
> [zoran.markovic@linaro.org: Rebased to latest kernel version. Added commit
> message. Fixed code alignment.]
> ---
>  kernel/rcu/srcu.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 3318d82..a1ebe6d 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
>         rcu_batch_queue(&sp->batch_queue, head);
>         if (!sp->running) {
>                 sp->running = true;
> -               schedule_delayed_work(&sp->work, 0);
> +               queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
>         }
>         spin_unlock_irqrestore(&sp->queue_lock, flags);
>  }
> @@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
>         }
>
>         if (pending)
> -               schedule_delayed_work(&sp->work, SRCU_INTERVAL);
> +               queue_delayed_work(system_power_efficient_wq,
> +                                  &sp->work, SRCU_INTERVAL);
>  }
>
>  /*
> --
> 1.7.9.5
>


* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-01-31 19:53 [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue Zoran Markovic
  2014-01-31 20:10 ` Zoran Markovic
@ 2014-02-10 10:08 ` Lai Jiangshan
  2014-02-10 18:47   ` Paul E. McKenney
  1 sibling, 1 reply; 25+ messages in thread
From: Lai Jiangshan @ 2014-02-10 10:08 UTC (permalink / raw)
  To: Zoran Markovic
  Cc: linux-kernel, Shaibal Dutta, Paul E. McKenney, Dipankar Sarma

Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>

On 02/01/2014 03:53 AM, Zoran Markovic wrote:
> From: Shaibal Dutta <shaibal.dutta@broadcom.com>
> 
> For better use of CPU idle time, allow the scheduler to select the CPU
> on which the SRCU grace period work would be scheduled. This improves
> idle residency time and conserves power.
> 
> This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
> 
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Dipankar Sarma <dipankar@in.ibm.com>
> Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
> [zoran.markovic@linaro.org: Rebased to latest kernel version. Added commit
> message. Fixed code alignment.]
> ---
>  kernel/rcu/srcu.c |    5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> index 3318d82..a1ebe6d 100644
> --- a/kernel/rcu/srcu.c
> +++ b/kernel/rcu/srcu.c
> @@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
>  	rcu_batch_queue(&sp->batch_queue, head);
>  	if (!sp->running) {
>  		sp->running = true;
> -		schedule_delayed_work(&sp->work, 0);
> +		queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
>  	}
>  	spin_unlock_irqrestore(&sp->queue_lock, flags);
>  }
> @@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
>  	}
>  
>  	if (pending)
> -		schedule_delayed_work(&sp->work, SRCU_INTERVAL);
> +		queue_delayed_work(system_power_efficient_wq,
> +				   &sp->work, SRCU_INTERVAL);
>  }
>  
>  /*



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-10 10:08 ` Lai Jiangshan
@ 2014-02-10 18:47   ` Paul E. McKenney
  2014-02-12 18:23     ` Frederic Weisbecker
  0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2014-02-10 18:47 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma,
	fweisbec, tj

On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>

Thank you all, queued for 3.15.

We should also have some facility for moving the SRCU workqueues to
housekeeping/timekeeping kthreads in the NO_HZ_FULL case.  Or does
this patch already have that effect?

							Thanx, Paul

> On 02/01/2014 03:53 AM, Zoran Markovic wrote:
> > From: Shaibal Dutta <shaibal.dutta@broadcom.com>
> > 
> > For better use of CPU idle time, allow the scheduler to select the CPU
> > on which the SRCU grace period work would be scheduled. This improves
> > idle residency time and conserves power.
> > 
> > This functionality is enabled when CONFIG_WQ_POWER_EFFICIENT is selected.
> > 
> > Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Dipankar Sarma <dipankar@in.ibm.com>
> > Signed-off-by: Shaibal Dutta <shaibal.dutta@broadcom.com>
> > [zoran.markovic@linaro.org: Rebased to latest kernel version. Added commit
> > message. Fixed code alignment.]
> > ---
> >  kernel/rcu/srcu.c |    5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
> > index 3318d82..a1ebe6d 100644
> > --- a/kernel/rcu/srcu.c
> > +++ b/kernel/rcu/srcu.c
> > @@ -398,7 +398,7 @@ void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
> >  	rcu_batch_queue(&sp->batch_queue, head);
> >  	if (!sp->running) {
> >  		sp->running = true;
> > -		schedule_delayed_work(&sp->work, 0);
> > +		queue_delayed_work(system_power_efficient_wq, &sp->work, 0);
> >  	}
> >  	spin_unlock_irqrestore(&sp->queue_lock, flags);
> >  }
> > @@ -674,7 +674,8 @@ static void srcu_reschedule(struct srcu_struct *sp)
> >  	}
> >  
> >  	if (pending)
> > -		schedule_delayed_work(&sp->work, SRCU_INTERVAL);
> > +		queue_delayed_work(system_power_efficient_wq,
> > +				   &sp->work, SRCU_INTERVAL);
> >  }
> >  
> >  /*
> 



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-10 18:47   ` Paul E. McKenney
@ 2014-02-12 18:23     ` Frederic Weisbecker
  2014-02-12 19:02       ` Paul E. McKenney
  2014-02-17  5:26       ` Mike Galbraith
  0 siblings, 2 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2014-02-12 18:23 UTC (permalink / raw)
  To: Paul E. McKenney, Kevin Hilman, Tejun Heo
  Cc: Lai Jiangshan, Zoran Markovic, linux-kernel, Shaibal Dutta,
	Dipankar Sarma, tj

On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> 
> Thank you all, queued for 3.15.
> 
> We should also have some facility for moving the SRCU workqueues to
> housekeeping/timekeeping kthreads in the NO_HZ_FULL case.  Or does
> this patch already have that effect?

Kevin Hilman and I plan to try to bring in a new Kconfig option that could let
us control the unbound workqueues' affinity through sysfs.

The feature actually exists currently but is only enabled for workqueues that
have WQ_SYSFS. Writeback and raid5 are the only current users.

See for example: /sys/devices/virtual/workqueue/writeback/cpumask
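
For reference, the opt-in is a single flag at allocation time; a minimal sketch (the queue name here is hypothetical):

	#include <linux/init.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *my_wq;	/* hypothetical example queue */

	static int __init my_wq_init(void)
	{
		/* WQ_SYSFS exposes /sys/devices/virtual/workqueue/my_wq/ with
		 * writable cpumask, nice and numa attributes for unbound queues */
		my_wq = alloc_workqueue("my_wq", WQ_UNBOUND | WQ_SYSFS, 0);
		return my_wq ? 0 : -ENOMEM;
	}
	device_initcall(my_wq_init);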


* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 18:23     ` Frederic Weisbecker
@ 2014-02-12 19:02       ` Paul E. McKenney
  2014-02-12 19:23         ` Tejun Heo
  2014-02-17  5:26       ` Mike Galbraith
  1 sibling, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2014-02-12 19:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Kevin Hilman, Tejun Heo, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On Wed, Feb 12, 2014 at 07:23:38PM +0100, Frederic Weisbecker wrote:
> On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > 
> > Thank you all, queued for 3.15.
> > 
> > We should also have some facility for moving the SRCU workqueues to
> > housekeeping/timekeeping kthreads in the NO_HZ_FULL case.  Or does
> > this patch already have that effect?
> 
> Kevin Hilman and me plan to try to bring a new Kconfig option that could let
> us control the unbound workqueues affinity through sysfs.

Please CC me or feel free to update Documentation/kernel-per-CPU-kthreads.txt
as part of this upcoming series.

> The feature actually exist currently but is only enabled for workqueues that
> have WQ_SYSFS. Writeback and raid5 are the only current users.
> 
> See for example: /sys/devices/virtual/workqueue/writeback/cpumask

Ah, news to me!  I have queued the following patch, seem reasonable?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 827104fb9364..09f28841ee3e 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -162,7 +162,11 @@ Purpose: Execute workqueue requests
 To reduce its OS jitter, do any of the following:
 1.	Run your workload at a real-time priority, which will allow
 	preempting the kworker daemons.
-2.	Do any of the following needed to avoid jitter that your
+2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
+	to force the WQ_SYSFS workqueues to run on the specified set
+	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
+	"ls sys/devices/virtual/workqueue".
+3.	Do any of the following needed to avoid jitter that your
 	application cannot tolerate:
 	a.	Build your kernel with CONFIG_SLUB=y rather than
 		CONFIG_SLAB=y, thus avoiding the slab allocator's periodic



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 19:02       ` Paul E. McKenney
@ 2014-02-12 19:23         ` Tejun Heo
  2014-02-12 19:59           ` Paul E. McKenney
  2014-02-14 23:24           ` Kevin Hilman
  0 siblings, 2 replies; 25+ messages in thread
From: Tejun Heo @ 2014-02-12 19:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Kevin Hilman, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

Hello,

On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> +	to force the WQ_SYSFS workqueues to run on the specified set
> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> +	"ls sys/devices/virtual/workqueue".

One thing to be careful about is that once published, it becomes part
of userland visible interface.  Maybe adding some words warning
against sprinkling WQ_SYSFS willy-nilly is a good idea?

Thanks.

-- 
tejun


* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 19:23         ` Tejun Heo
@ 2014-02-12 19:59           ` Paul E. McKenney
  2014-02-12 20:13             ` Tejun Heo
  2014-02-12 23:04             ` Frederic Weisbecker
  2014-02-14 23:24           ` Kevin Hilman
  1 sibling, 2 replies; 25+ messages in thread
From: Paul E. McKenney @ 2014-02-12 19:59 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Kevin Hilman, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > +	to force the WQ_SYSFS workqueues to run on the specified set
> > +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> > +	"ls sys/devices/virtual/workqueue".
> 
> One thing to be careful about is that once published, it becomes part
> of userland visible interface.  Maybe adding some words warning
> against sprinkling WQ_SYSFS willy-nilly is a good idea?

Good point!  How about the following?

							Thanx, Paul

------------------------------------------------------------------------

Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity

This commit documents the ability to apply CPU affinity to WQ_SYSFS
workqueues, thus offloading them from the desired worker CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tejun Heo <tj@kernel.org>

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 827104fb9364..214da3a47a68 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
 To reduce its OS jitter, do any of the following:
 1.	Run your workload at a real-time priority, which will allow
 	preempting the kworker daemons.
-2.	Do any of the following needed to avoid jitter that your
+2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
+	to force the WQ_SYSFS workqueues to run on the specified set
+	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
+	"ls sys/devices/virtual/workqueue".  That said, the workqueues
+	maintainer would like to caution people against indiscriminately
+	sprinkling WQ_SYSFS across all the workqueues.  The reason for
+	caution is that it is easy to add WQ_SYSFS, but because sysfs
+	is part of the formal user/kernel API, it can be nearly impossible
+	to remove it, even if its addition was a mistake.
+3.	Do any of the following needed to avoid jitter that your
 	application cannot tolerate:
 	a.	Build your kernel with CONFIG_SLUB=y rather than
 		CONFIG_SLAB=y, thus avoiding the slab allocator's periodic



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 19:59           ` Paul E. McKenney
@ 2014-02-12 20:13             ` Tejun Heo
  2014-02-12 23:04             ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Tejun Heo @ 2014-02-12 20:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Kevin Hilman, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
> 
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>

Reviewed-by: Tejun Heo <tj@kernel.org>

Thanks!

-- 
tejun


* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 19:59           ` Paul E. McKenney
  2014-02-12 20:13             ` Tejun Heo
@ 2014-02-12 23:04             ` Frederic Weisbecker
  2014-02-13  0:33               ` Paul E. McKenney
  1 sibling, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2014-02-12 23:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, Kevin Hilman, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
> On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
> > Hello,
> > 
> > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > > +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > > +	to force the WQ_SYSFS workqueues to run on the specified set
> > > +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> > > +	"ls sys/devices/virtual/workqueue".
> > 
> > One thing to be careful about is that once published, it becomes part
> > of userland visible interface.  Maybe adding some words warning
> > against sprinkling WQ_SYSFS willy-nilly is a good idea?
> 
> Good point!  How about the following?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
> 
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Tejun Heo <tj@kernel.org>
> 
> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> index 827104fb9364..214da3a47a68 100644
> --- a/Documentation/kernel-per-CPU-kthreads.txt
> +++ b/Documentation/kernel-per-CPU-kthreads.txt
> @@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
>  To reduce its OS jitter, do any of the following:
>  1.	Run your workload at a real-time priority, which will allow
>  	preempting the kworker daemons.
> -2.	Do any of the following needed to avoid jitter that your
> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> +	to force the WQ_SYSFS workqueues to run on the specified set
> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> +	"ls sys/devices/virtual/workqueue".  That said, the workqueues
> +	maintainer would like to caution people against indiscriminately
> +	sprinkling WQ_SYSFS across all the workqueues.  The reason for
> +	caution is that it is easy to add WQ_SYSFS, but because sysfs
> +	is part of the formal user/kernel API, it can be nearly impossible
> +	to remove it, even if its addition was a mistake.
> +3.	Do any of the following needed to avoid jitter that your

Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

I just suggest we append a small explanation about what WQ_SYSFS is about.
Like:

+2.
+       The workqueues that want to be visible on the sysfs hierarchy
+	set the WQ_SYSFS flag.
+  	For those who have this flag set, you can use the
+	/sys/devices/virtual/workqueue/*/cpumask sysfs files
+	to force the workqueues to run on the specified set
+	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
+	"ls sys/devices/virtual/workqueue".  That said, the workqueues
+	maintainer would like to caution people against indiscriminately
+	sprinkling WQ_SYSFS across all the workqueues.  The reason for
+	caution is that it is easy to add WQ_SYSFS, but because sysfs
+	is part of the formal user/kernel API, it can be nearly impossible
+	to remove it, even if its addition was a mistake.
+3.	Do any of the following needed to avoid jitter that your


* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 23:04             ` Frederic Weisbecker
@ 2014-02-13  0:33               ` Paul E. McKenney
  2014-02-13  1:30                 ` Lai Jiangshan
  2014-02-13 14:05                 ` Frederic Weisbecker
  0 siblings, 2 replies; 25+ messages in thread
From: Paul E. McKenney @ 2014-02-13  0:33 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Tejun Heo, Kevin Hilman, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On Thu, Feb 13, 2014 at 12:04:57AM +0100, Frederic Weisbecker wrote:
> On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > > > +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > > > +	to force the WQ_SYSFS workqueues to run on the specified set
> > > > +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> > > > +	"ls sys/devices/virtual/workqueue".
> > > 
> > > One thing to be careful about is that once published, it becomes part
> > > of userland visible interface.  Maybe adding some words warning
> > > against sprinkling WQ_SYSFS willy-nilly is a good idea?
> > 
> > Good point!  How about the following?
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
> > 
> > This commit documents the ability to apply CPU affinity to WQ_SYSFS
> > workqueues, thus offloading them from the desired worker CPUs.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > Cc: Tejun Heo <tj@kernel.org>
> > 
> > diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> > index 827104fb9364..214da3a47a68 100644
> > --- a/Documentation/kernel-per-CPU-kthreads.txt
> > +++ b/Documentation/kernel-per-CPU-kthreads.txt
> > @@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
> >  To reduce its OS jitter, do any of the following:
> >  1.	Run your workload at a real-time priority, which will allow
> >  	preempting the kworker daemons.
> > -2.	Do any of the following needed to avoid jitter that your
> > +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > +	to force the WQ_SYSFS workqueues to run on the specified set
> > +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> > +	"ls sys/devices/virtual/workqueue".  That said, the workqueues
> > +	maintainer would like to caution people against indiscriminately
> > +	sprinkling WQ_SYSFS across all the workqueues.  The reason for
> > +	caution is that it is easy to add WQ_SYSFS, but because sysfs
> > +	is part of the formal user/kernel API, it can be nearly impossible
> > +	to remove it, even if its addition was a mistake.
> > +3.	Do any of the following needed to avoid jitter that your
> 
> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
> 
> I just suggest we append a small explanation about what WQ_SYSFS is about.
> Like:

Fair point!  I wordsmithed it into the following.  Seem reasonable?

							Thanx, Paul

------------------------------------------------------------------------

Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity

This commit documents the ability to apply CPU affinity to WQ_SYSFS
workqueues, thus offloading them from the desired worker CPUs.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>

diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
index 827104fb9364..f3cd299fcc41 100644
--- a/Documentation/kernel-per-CPU-kthreads.txt
+++ b/Documentation/kernel-per-CPU-kthreads.txt
@@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
 To reduce its OS jitter, do any of the following:
 1.	Run your workload at a real-time priority, which will allow
 	preempting the kworker daemons.
-2.	Do any of the following needed to avoid jitter that your
+2.	A given workqueue can be made visible in the sysfs filesystem
+	by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
+	Such a workqueue can be confined to a given subset of the
+	CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
+	files.	The set of WQ_SYSFS workqueues can be displayed using
+	"ls sys/devices/virtual/workqueue".  That said, the workqueues
+	maintainer would like to caution people against indiscriminately
+	sprinkling WQ_SYSFS across all the workqueues.	The reason for
+	caution is that it is easy to add WQ_SYSFS, but because sysfs is
+	part of the formal user/kernel API, it can be nearly impossible
+	to remove it, even if its addition was a mistake.
+3.	Do any of the following needed to avoid jitter that your
 	application cannot tolerate:
 	a.	Build your kernel with CONFIG_SLUB=y rather than
 		CONFIG_SLAB=y, thus avoiding the slab allocator's periodic



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-13  0:33               ` Paul E. McKenney
@ 2014-02-13  1:30                 ` Lai Jiangshan
  2014-02-13 14:05                 ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Lai Jiangshan @ 2014-02-13  1:30 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, Tejun Heo, Kevin Hilman, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On 02/13/2014 08:33 AM, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 12:04:57AM +0100, Frederic Weisbecker wrote:
>> On Wed, Feb 12, 2014 at 11:59:22AM -0800, Paul E. McKenney wrote:
>>> On Wed, Feb 12, 2014 at 02:23:54PM -0500, Tejun Heo wrote:
>>>> Hello,
>>>>
>>>> On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
>>>>> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
>>>>> +	to force the WQ_SYSFS workqueues to run on the specified set
>>>>> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
>>>>> +	"ls sys/devices/virtual/workqueue".
>>>>
>>>> One thing to be careful about is that once published, it becomes part
>>>> of userland visible interface.  Maybe adding some words warning
>>>> against sprinkling WQ_SYSFS willy-nilly is a good idea?
>>>
>>> Good point!  How about the following?
>>>
>>> 							Thanx, Paul
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
>>>
>>> This commit documents the ability to apply CPU affinity to WQ_SYSFS
>>> workqueues, thus offloading them from the desired worker CPUs.
>>>
>>> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>>> Cc: Frederic Weisbecker <fweisbec@gmail.com>
>>> Cc: Tejun Heo <tj@kernel.org>
>>>
>>> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
>>> index 827104fb9364..214da3a47a68 100644
>>> --- a/Documentation/kernel-per-CPU-kthreads.txt
>>> +++ b/Documentation/kernel-per-CPU-kthreads.txt
>>> @@ -162,7 +162,16 @@ Purpose: Execute workqueue requests
>>>  To reduce its OS jitter, do any of the following:
>>>  1.	Run your workload at a real-time priority, which will allow
>>>  	preempting the kworker daemons.
>>> -2.	Do any of the following needed to avoid jitter that your
>>> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
>>> +	to force the WQ_SYSFS workqueues to run on the specified set
>>> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
>>> +	"ls sys/devices/virtual/workqueue".  That said, the workqueues
>>> +	maintainer would like to caution people against indiscriminately
>>> +	sprinkling WQ_SYSFS across all the workqueues.  The reason for
>>> +	caution is that it is easy to add WQ_SYSFS, but because sysfs
>>> +	is part of the formal user/kernel API, it can be nearly impossible
>>> +	to remove it, even if its addition was a mistake.
>>> +3.	Do any of the following needed to avoid jitter that your
>>
>> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
>>
>> I just suggest we append a small explanation about what WQ_SYSFS is about.
>> Like:
> 
> Fair point!  I wordsmithed it into the following.  Seem reasonable?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
> 
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Reviewed-by: Tejun Heo <tj@kernel.org>
> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
> 
> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> index 827104fb9364..f3cd299fcc41 100644
> --- a/Documentation/kernel-per-CPU-kthreads.txt
> +++ b/Documentation/kernel-per-CPU-kthreads.txt
> @@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
>  To reduce its OS jitter, do any of the following:
>  1.	Run your workload at a real-time priority, which will allow
>  	preempting the kworker daemons.
> -2.	Do any of the following needed to avoid jitter that your
> +2.	A given workqueue can be made visible in the sysfs filesystem
> +	by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
> +	Such a workqueue can be confined to a given subset of the
> +	CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
> +	files.	The set of WQ_SYSFS workqueues can be displayed using
> +	"ls sys/devices/virtual/workqueue".  That said, the workqueues
> +	maintainer would like to caution people against indiscriminately
> +	sprinkling WQ_SYSFS across all the workqueues.	The reason for
> +	caution is that it is easy to add WQ_SYSFS, but because sysfs is
> +	part of the formal user/kernel API, it can be nearly impossible
> +	to remove it, even if its addition was a mistake.
> +3.	Do any of the following needed to avoid jitter that your
>  	application cannot tolerate:
>  	a.	Build your kernel with CONFIG_SLUB=y rather than
>  		CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
> 
> 


Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>


* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-13  0:33               ` Paul E. McKenney
  2014-02-13  1:30                 ` Lai Jiangshan
@ 2014-02-13 14:05                 ` Frederic Weisbecker
  1 sibling, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2014-02-13 14:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, Kevin Hilman, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On Wed, Feb 12, 2014 at 04:33:11PM -0800, Paul E. McKenney wrote:
> 
> Fair point!  I wordsmithed it into the following.  Seem reasonable?
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> Documentation/kernel-per-CPU-kthreads.txt: Workqueue affinity
> 
> This commit documents the ability to apply CPU affinity to WQ_SYSFS
> workqueues, thus offloading them from the desired worker CPUs.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Reviewed-by: Tejun Heo <tj@kernel.org>
> Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
> 
> diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt
> index 827104fb9364..f3cd299fcc41 100644
> --- a/Documentation/kernel-per-CPU-kthreads.txt
> +++ b/Documentation/kernel-per-CPU-kthreads.txt
> @@ -162,7 +162,18 @@ Purpose: Execute workqueue requests
>  To reduce its OS jitter, do any of the following:
>  1.	Run your workload at a real-time priority, which will allow
>  	preempting the kworker daemons.
> -2.	Do any of the following needed to avoid jitter that your
> +2.	A given workqueue can be made visible in the sysfs filesystem
> +	by passing the WQ_SYSFS to that workqueue's alloc_workqueue().
> +	Such a workqueue can be confined to a given subset of the
> +	CPUs using the /sys/devices/virtual/workqueue/*/cpumask sysfs
> +	files.	The set of WQ_SYSFS workqueues can be displayed using
> +	"ls sys/devices/virtual/workqueue".  That said, the workqueues
> +	maintainer would like to caution people against indiscriminately
> +	sprinkling WQ_SYSFS across all the workqueues.	The reason for
> +	caution is that it is easy to add WQ_SYSFS, but because sysfs is
> +	part of the formal user/kernel API, it can be nearly impossible
> +	to remove it, even if its addition was a mistake.
> +3.	Do any of the following needed to avoid jitter that your
>  	application cannot tolerate:
>  	a.	Build your kernel with CONFIG_SLUB=y rather than
>  		CONFIG_SLAB=y, thus avoiding the slab allocator's periodic
> 

Perfect!!


* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 19:23         ` Tejun Heo
  2014-02-12 19:59           ` Paul E. McKenney
@ 2014-02-14 23:24           ` Kevin Hilman
  2014-02-15  7:36             ` Mike Galbraith
                               ` (2 more replies)
  1 sibling, 3 replies; 25+ messages in thread
From: Kevin Hilman @ 2014-02-14 23:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E. McKenney, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

Tejun Heo <tj@kernel.org> writes:

> Hello,
>
> On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
>> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
>> +	to force the WQ_SYSFS workqueues to run on the specified set
>> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
>> +	"ls sys/devices/virtual/workqueue".
>
> One thing to be careful about is that once published, it becomes part
> of userland visible interface.  Maybe adding some words warning
> against sprinkling WQ_SYSFS willy-nilly is a good idea?

In the NO_HZ_FULL case, it seems to me we'd always want all unbound
workqueues to have their affinity set to the housekeeping CPUs.

Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
the affinity can be controlled?  I guess the main reason would be that 
all of these workqueue names would become permanent ABI.

At least for NO_HZ_FULL, maybe this should be automatic.  The cpumask of
unbound workqueues should default to !tick_nohz_full_mask?  Any WQ_SYSFS
workqueues could still be overridden from userspace, but at least the
default would be sane, and help keep full dynticks CPUs isolated.

Example patch below, only boot-tested on a 4-CPU ARM system with
CONFIG_NO_HZ_FULL_ALL=y; I verified that 'cat
/sys/devices/virtual/workqueue/writeback/cpumask' looked sane.  If this
looks OK, I can maybe clean it up a bit and make it a runtime check
instead of a compile-time check.

Kevin



From 902a2b58d61a51415457ea6768d687cdb7532eff Mon Sep 17 00:00:00 2001
From: Kevin Hilman <khilman@linaro.org>
Date: Fri, 14 Feb 2014 15:10:58 -0800
Subject: [PATCH] workqueue: for NO_HZ_FULL, set default cpumask to
 !tick_nohz_full_mask

To help in keeping NO_HZ_FULL CPUs isolated, keep unbound workqueues
from running on full dynticks CPUs.  To do this, set the default
workqueue cpumask to be the set of "housekeeping" CPUs instead of all
possible CPUs.

This is just the starting/default cpumask, and it can be overridden in
all the normal ways (NUMA settings, apply_workqueue_attrs() and via
sysfs for workqueues with the WQ_SYSFS attribute).

Cc: Tejun Heo <tj@kernel.org>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Kevin Hilman <khilman@linaro.org>
---
 kernel/workqueue.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 987293d03ebc..9a9d9b0eaf6d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -48,6 +48,7 @@
 #include <linux/nodemask.h>
 #include <linux/moduleparam.h>
 #include <linux/uaccess.h>
+#include <linux/tick.h>
 
 #include "workqueue_internal.h"
 
@@ -3436,7 +3437,11 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
 	if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
 		goto fail;
 
+#ifdef CONFIG_NO_HZ_FULL
+	cpumask_complement(attrs->cpumask, tick_nohz_full_mask);
+#else
 	cpumask_copy(attrs->cpumask, cpu_possible_mask);
+#endif
 	return attrs;
 fail:
 	free_workqueue_attrs(attrs);
-- 
1.8.3
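
The runtime variant Kevin mentions might look roughly like the untested sketch below; it still needs CONFIG_NO_HZ_FULL built in (tick_nohz_full_mask only exists then), but it consults the mask only when nohz_full is actually active on a given boot:

	if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
		goto fail;

	/* keep unbound work off full-dynticks CPUs only when nohz_full is
	 * in effect; otherwise fall back to all possible CPUs as before */
	if (tick_nohz_full_enabled())
		cpumask_complement(attrs->cpumask, tick_nohz_full_mask);
	else
		cpumask_copy(attrs->cpumask, cpu_possible_mask);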



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-14 23:24           ` Kevin Hilman
@ 2014-02-15  7:36             ` Mike Galbraith
  2014-02-16 16:41               ` Paul E. McKenney
  2014-02-27 15:08             ` Frederic Weisbecker
  2014-03-10  9:52             ` Viresh Kumar
  2 siblings, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2014-02-15  7:36 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Tejun Heo, Paul E. McKenney, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Fri, 2014-02-14 at 15:24 -0800, Kevin Hilman wrote: 
> Tejun Heo <tj@kernel.org> writes:
> 
> > Hello,
> >
> > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> >> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> >> +	to force the WQ_SYSFS workqueues to run on the specified set
> >> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> >> +	"ls sys/devices/virtual/workqueue".
> >
> > One thing to be careful about is that once published, it becomes part
> > of userland visible interface.  Maybe adding some words warning
> > against sprinkling WQ_SYSFS willy-nilly is a good idea?
> 
> In the NO_HZ_FULL case, it seems to me we'd always want all unbound
> workqueues to have their affinity set to the housekeeping CPUs.
> 
> Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
> the affinity can be controlled?  I guess the main reason would be that 
> all of these workqueue names would become permanent ABI.
> 
> At least for NO_HZ_FULL, maybe this should be automatic.  The cpumask of
> unbound workqueues should default to !tick_nohz_full_mask?  Any WQ_SYSFS
> workqueues could still be overridden from userspace, but at least the
> default would be sane, and help keep full dyntics CPUs isolated.

What I'm thinking is that it should be automatic, but not necessarily
based upon the nohz full mask, rather maybe based upon whether sched
domains exist, or perhaps a generic exclusive cpuset property, though
some really don't want anything to do with cpusets.

Why?  Because there are jitter-intolerant loads where nohz_full isn't all
that useful: you'll be constantly restarting and stopping the
tick, and eating the increased accounting overhead to no gain, because
there are frequently multiple realtime tasks running.  For these loads
(I have a user with such a fairly hefty 80-core rt load), dynamically
turning the tick _on_ is currently a better choice than nohz_full.
Point being, control of where unbound workqueues are allowed to run
isn't only desirable for single-task HPC loads; other loads exist.

For my particular fairly critical 80 core load, workqueues aren't a real
big hairy deal, because its jitter tolerance isn't _all_ that tight (30
us max is easy enough to meet with room to spare).  The load can slice
through workers well enough to meet requirements, but it would certainly
be a win to be able to keep them at bay.  (gonna measure it, less jitter
is better even if it's only a little bit better.. eventually somebody
will demand what's currently impossible to deliver)

-Mike



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-15  7:36             ` Mike Galbraith
@ 2014-02-16 16:41               ` Paul E. McKenney
  2014-02-17  4:50                 ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Paul E. McKenney @ 2014-02-16 16:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Kevin Hilman, Tejun Heo, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Sat, Feb 15, 2014 at 08:36:44AM +0100, Mike Galbraith wrote:
> On Fri, 2014-02-14 at 15:24 -0800, Kevin Hilman wrote: 
> > Tejun Heo <tj@kernel.org> writes:
> > 
> > > Hello,
> > >
> > > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> > >> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> > >> +	to force the WQ_SYSFS workqueues to run on the specified set
> > >> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> > >> +	"ls sys/devices/virtual/workqueue".
> > >
> > > One thing to be careful about is that once published, it becomes part
> > > of userland visible interface.  Maybe adding some words warning
> > > against sprinkling WQ_SYSFS willy-nilly is a good idea?
> > 
> > In the NO_HZ_FULL case, it seems to me we'd always want all unbound
> > workqueues to have their affinity set to the housekeeping CPUs.
> > 
> > Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
> > the affinity can be controlled?  I guess the main reason would be that 
> > all of these workqueue names would become permanent ABI.
> > 
> > At least for NO_HZ_FULL, maybe this should be automatic.  The cpumask of
> > unbound workqueues should default to !tick_nohz_full_mask?  Any WQ_SYSFS
> > workqueues could still be overridden from userspace, but at least the
> > default would be sane, and help keep full dyntics CPUs isolated.
> 
> What I'm thinking is that it should be automatic, but not necessarily
> based upon the nohz full mask, rather maybe based upon whether sched
> domains exist, or perhaps a generic exclusive cpuset property, though
> some really don't want anything to do with cpusets.
> 
> Why? Because there are jitter intolerant loads where nohz full isn't all
> that useful, because you'll be constantly restarting and stopping the
> tick, and eating the increased accounting overhead to no gain because
> there are frequently multiple realtime tasks running.  For these loads
> (I have a user with such a fairly hefty 80 core rt load), dynamically
> turning the tick _on_ is currently a better choice than nohz_full.
> Point being, control of where unbound workqueues are allowed to run
> isn't only desirable for single task HPC loads, other loads exist.
> 
> For my particular fairly critical 80 core load, workqueues aren't a real
> big hairy deal, because its jitter tolerance isn't _all_ that tight (30
> us max is easy enough to meet with room to spare).  The load can slice
> through workers well enough to meet requirements, but it would certainly
> be a win to be able to keep them at bay.  (gonna measure it, less jitter
> is better even if it's only a little bit better.. eventually somebody
> will demand what's currently impossible to deliver)

So if there is NO_HZ_FULL, you have no objection to binding workqueues to
the timekeeping CPUs, but that you would also like some form of automatic
binding in the !NO_HZ_FULL case.  Of course, if a common mechanism could
serve both cases, that would be good.  And yes, cpusets are frowned upon
for some workloads.

So maybe start with Kevin's patch, but augment with something else for
the !NO_HZ_FULL case?

							Thanx, Paul



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-16 16:41               ` Paul E. McKenney
@ 2014-02-17  4:50                 ` Mike Galbraith
  2014-02-19  7:00                   ` Mike Galbraith
  2014-02-24 17:55                   ` Kevin Hilman
  0 siblings, 2 replies; 25+ messages in thread
From: Mike Galbraith @ 2014-02-17  4:50 UTC (permalink / raw)
  To: paulmck
  Cc: Kevin Hilman, Tejun Heo, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Sun, 2014-02-16 at 08:41 -0800, Paul E. McKenney wrote:

> So if there is NO_HZ_FULL, you have no objection to binding workqueues to
> the timekeeping CPUs, but that you would also like some form of automatic
> binding in the !NO_HZ_FULL case.  Of course, if a common mechanism could
> serve both cases, that would be good.  And yes, cpusets are frowned upon
> for some workloads.

I'm not _objecting_, I'm not driving, Frederic's doing that ;-)

That said, isolation seems to be turning into a property of nohz mode,
but as I see it, nohz_full is an extension to generic isolation.

> So maybe start with Kevin's patch, but augment with something else for
> the !NO_HZ_FULL case?

Sure (hm, does it work without workqueue.disable_numa ?).

It just seems to me that tying it to sched domain construction would be
a better fit.  That way, it doesn't matter what your isolation-requiring
load is; whether you run a gaggle of realtime tasks or one HPC task is your
business, the generic requirement is isolation, not tick mode.  For one
HPC task per core you want no tick; if you're running all SCHED_FIFO,
maybe you want that too, depending on the impact of nohz_full mode.  All
sensitive loads want the isolation, but they may not like the price.

I personally like the cpuset way.  Being able to partition boxen on the
fly makes them very flexible.  In a perfect world, you'd be able to
quiesce and configure offloading and nohz_full on the fly too, and not
end up with some hodgepodge like this needs boot option foo, that
happens invisibly because of config option bar, the other thing you have
to do manually.. and you get to eat 937 kthreads and tons of overhead on
all CPUs if you want the ability to _maybe_ run a critical task or two.

-Mike



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-12 18:23     ` Frederic Weisbecker
  2014-02-12 19:02       ` Paul E. McKenney
@ 2014-02-17  5:26       ` Mike Galbraith
  2014-02-27 14:43         ` Frederic Weisbecker
  1 sibling, 1 reply; 25+ messages in thread
From: Mike Galbraith @ 2014-02-17  5:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Kevin Hilman, Tejun Heo, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Wed, 2014-02-12 at 19:23 +0100, Frederic Weisbecker wrote: 
> On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > 
> > Thank you all, queued for 3.15.
> > 
> > We should also have some facility for moving the SRCU workqueues to
> > housekeeping/timekeeping kthreads in the NO_HZ_FULL case.  Or does
> > this patch already have that effect?
> 
> Kevin Hilman and me plan to try to bring a new Kconfig option that could let
> us control the unbound workqueues affinity through sysfs.

Handing control to the user seemed like a fine thing, so I started
making a boot option to enable it.  Forcing WQ_SYSFS on at the sysfs
decision spot doesn't go well; init order matters :)  Post-init frobbing
is required if you want to see/frob all unbound workqueues.

-Mike
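
For illustration, the post-init frobbing above could look roughly like the untested sketch below, assuming it lives in kernel/workqueue.c where the workqueues list, wq_pool_mutex and workqueue_sysfs_register() are visible; ordered queues are skipped, since they must not be exposed:

	static int __init wq_expose_all_unbound(void)
	{
		struct workqueue_struct *wq;

		/* late_initcall: the workqueue sysfs class itself is only
		 * registered at core_initcall time, hence the init ordering */
		mutex_lock(&wq_pool_mutex);
		list_for_each_entry(wq, &workqueues, list) {
			/* WQ_SYSFS queues were already registered at alloc time */
			if ((wq->flags & WQ_UNBOUND) &&
			    !(wq->flags & (WQ_SYSFS | __WQ_ORDERED)))
				workqueue_sysfs_register(wq);
		}
		mutex_unlock(&wq_pool_mutex);
		return 0;
	}
	late_initcall(wq_expose_all_unbound);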



* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-17  4:50                 ` Mike Galbraith
@ 2014-02-19  7:00                   ` Mike Galbraith
  2014-02-24 17:55                   ` Kevin Hilman
  1 sibling, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2014-02-19  7:00 UTC (permalink / raw)
  To: paulmck
  Cc: Kevin Hilman, Tejun Heo, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Mon, 2014-02-17 at 05:50 +0100, Mike Galbraith wrote: 
> On Sun, 2014-02-16 at 08:41 -0800, Paul E. McKenney wrote:

> > So maybe start with Kevin's patch, but augment with something else for
> > the !NO_HZ_FULL case?
> 
> Sure (hm, does it work without workqueue.disable_numa ?).

I took the patch out for a spin on a 40-core box +SMT, with CPUs 4-79
isolated via an exclusive cpuset with load balancing off.  Worker bees
ignored the patch either way.

-Mike

Perturbation measurement hog pinned to CPU4.

With patch:

#           TASK-PID   CPU#  ||||||  TIMESTAMP  FUNCTION
#              | |       |   ||||||     |         |
            pert-9949  [004] ....113   405.120164: workqueue_queue_work: work struct=ffff880a5c4ecc08 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=4
            pert-9949  [004] ....113   405.120166: workqueue_activate_work: work struct ffff880a5c4ecc08
            pert-9949  [004] d.L.313   405.120169: sched_wakeup: comm=kworker/4:2 pid=2119 prio=120 success=1 target_cpu=004
            pert-9949  [004] d.Lh213   405.120170: tick_stop: success=no msg=more than 1 task in runqueue
            pert-9949  [004] d.L.213   405.120172: tick_stop: success=no msg=more than 1 task in runqueue
            pert-9949  [004] d...3..   405.120173: sched_switch: prev_comm=pert prev_pid=9949 prev_prio=120 prev_state=R+ ==> next_comm=kworker/4:2 next_pid=2119 next_prio=120
     kworker/4:2-2119  [004] ....1..   405.120174: workqueue_execute_start: work struct ffff880a5c4ecc08: function flush_to_ldisc
     kworker/4:2-2119  [004] d...311   405.120176: sched_wakeup: comm=sshd pid=6620 prio=120 success=1 target_cpu=000
     kworker/4:2-2119  [004] ....1..   405.120176: workqueue_execute_end: work struct ffff880a5c4ecc08
     kworker/4:2-2119  [004] d...3..   405.120177: sched_switch: prev_comm=kworker/4:2 prev_pid=2119 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=9949 next_prio=120
            pert-9949  [004] ....113   405.120178: workqueue_queue_work: work struct=ffff880a5c4ecc08 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=4
            pert-9949  [004] ....113   405.120179: workqueue_activate_work: work struct ffff880a5c4ecc08
            pert-9949  [004] d.L.313   405.120179: sched_wakeup: comm=kworker/4:2 pid=2119 prio=120 success=1 target_cpu=004
            pert-9949  [004] d.L.213   405.120181: tick_stop: success=no msg=more than 1 task in runqueue
            pert-9949  [004] d...3..   405.120181: sched_switch: prev_comm=pert prev_pid=9949 prev_prio=120 prev_state=R+ ==> next_comm=kworker/4:2 next_pid=2119 next_prio=120
     kworker/4:2-2119  [004] ....1..   405.120182: workqueue_execute_start: work struct ffff880a5c4ecc08: function flush_to_ldisc
     kworker/4:2-2119  [004] ....1..   405.120183: workqueue_execute_end: work struct ffff880a5c4ecc08
     kworker/4:2-2119  [004] d...3..   405.120183: sched_switch: prev_comm=kworker/4:2 prev_pid=2119 prev_prio=120 prev_state=S ==> next_comm=pert next_pid=9949 next_prio=120
            pert-9949  [004] d...1..   405.120736: tick_stop: success=yes msg= 
            pert-9949  [004] ....113   410.121082: workqueue_queue_work: work struct=ffff880a5c4ecc08 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=4
            pert-9949  [004] ....113   410.121082: workqueue_activate_work: work struct ffff880a5c4ecc08
            pert-9949  [004] d.L.313   410.121084: sched_wakeup: comm=kworker/4:2 pid=2119 prio=120 success=1 target_cpu=004
            pert-9949  [004] d.Lh213   410.121085: tick_stop: success=no msg=more than 1 task in runqueue
            pert-9949  [004] d.L.213   410.121087: tick_stop: success=no msg=more than 1 task in runqueue
            ...and so on until tick time


(extra cheezy) hack kinda sorta works iff workqueue.disable_numa:

---
 kernel/workqueue.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1328,8 +1328,11 @@ static void __queue_work(int cpu, struct
 
 	rcu_read_lock();
 retry:
-	if (req_cpu == WORK_CPU_UNBOUND)
+	if (req_cpu == WORK_CPU_UNBOUND) {
 		cpu = raw_smp_processor_id();
+		if (runqueue_is_isolated(cpu))
+			cpu = 0;
+	}
 
 	/* pwq which will be used unless @work is executing elsewhere */
 	if (!(wq->flags & WQ_UNBOUND))


#           TASK-PID   CPU#  ||||||  TIMESTAMP  FUNCTION
#              | |       |   ||||||     |         |
           <...>-33824 [004] ....113  5555.889694: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
           <...>-33824 [004] ....113  5555.889695: workqueue_activate_work: work struct ffff880a596eb008
           <...>-33824 [004] d...313  5555.889697: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
           <...>-33824 [004] ....113  5560.890594: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
           <...>-33824 [004] ....113  5560.890595: workqueue_activate_work: work struct ffff880a596eb008
           <...>-33824 [004] d...313  5560.890596: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
           <...>-33824 [004] ....113  5565.891493: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
           <...>-33824 [004] ....113  5565.891493: workqueue_activate_work: work struct ffff880a596eb008
           <...>-33824 [004] d...313  5565.891494: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
           <...>-33824 [004] ....113  5570.892401: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
           <...>-33824 [004] ....113  5570.892401: workqueue_activate_work: work struct ffff880a596eb008
           <...>-33824 [004] d...313  5570.892403: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
           <...>-33824 [004] ....113  5575.893300: workqueue_queue_work: work struct=ffff880a596eb008 function=flush_to_ldisc workqueue=ffff88046f40ba00 req_cpu=256 cpu=0
           <...>-33824 [004] ....113  5575.893301: workqueue_activate_work: work struct ffff880a596eb008
           <...>-33824 [004] d...313  5575.893302: sched_wakeup: comm=kworker/0:2 pid=2105 prio=120 success=1 target_cpu=000
           <...>-33824 [004] d..h1..  5578.854979: softirq_raise: vec=1 [action=TIMER]
           <...>-33824 [004] dN..3..  5578.854981: sched_wakeup: comm=sirq-timer/4 pid=319 prio=69 success=1 target_cpu=004
           <...>-33824 [004] dN..1..  5578.854982: tick_stop: success=no msg=more than 1 task in runqueue
           <...>-33824 [004] dN.h1..  5578.854983: tick_stop: success=no msg=more than 1 task in runqueue
           <...>-33824 [004] dN..1..  5578.854985: tick_stop: success=no msg=more than 1 task in runqueue
           <...>-33824 [004] d...3..  5578.854986: sched_switch: prev_comm=pert prev_pid=33824 prev_prio=120 prev_state=R+ ==> next_comm=sirq-timer/4 next_pid=319 next_prio=69
    sirq-timer/4-319   [004] d..h3..  5578.854987: softirq_raise: vec=1 [action=TIMER]
    sirq-timer/4-319   [004] d...3..  5578.854989: tick_stop: success=no msg=more than 1 task in runqueue
    sirq-timer/4-319   [004] ....111  5578.854990: softirq_entry: vec=1 [action=TIMER]
    sirq-timer/4-319   [004] ....111  5578.855194: softirq_exit: vec=1 [action=TIMER]     <== 204us tick, not good... to_stare_at++
    sirq-timer/4-319   [004] d...3..  5578.855196: sched_switch: prev_comm=sirq-timer/4 prev_pid=319 prev_prio=69 prev_state=S ==> next_comm=pert next_pid=33824 next_prio=120
           <...>-33824 [004] d...1..  5578.855987: tick_stop: success=yes msg= 
           ...etc



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-17  4:50                 ` Mike Galbraith
  2014-02-19  7:00                   ` Mike Galbraith
@ 2014-02-24 17:55                   ` Kevin Hilman
  2014-02-24 18:25                     ` Mike Galbraith
  1 sibling, 1 reply; 25+ messages in thread
From: Kevin Hilman @ 2014-02-24 17:55 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: paulmck, Tejun Heo, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

Mike Galbraith <bitbucket@online.de> writes:

> On Sun, 2014-02-16 at 08:41 -0800, Paul E. McKenney wrote:
>
>> So if there is NO_HZ_FULL, you have no objection to binding workqueues to
>> the timekeeping CPUs, but that you would also like some form of automatic
>> binding in the !NO_HZ_FULL case.  Of course, if a common mechanism could
>> serve both cases, that would be good.  And yes, cpusets are frowned upon
>> for some workloads.
>
> I'm not _objecting_, I'm not driving, Frederic's doing that ;-)
>
> That said, isolation seems to be turning into a property of nohz mode,
> but as I see it, nohz_full is an extension to generic isolation.
>
>> So maybe start with Kevin's patch, but augment with something else for
>> the !NO_HZ_FULL case?
>
> Sure (hm, does it work without workqueue.disable_numa?).

[ /me returns from vacation ]

Yes, since it happens for every alloc_workqueue_attrs().

> It just seems to me that tying it to sched domain construction would be
> a better fit.  That way, it doesn't matter what your isolation-requiring
> load is; whether you run a gaggle of realtime tasks or one HPC task is
> your business, the generic requirement is isolation, not tick mode.  For
> one HPC task per core you want no tick; if you're running all SCHED_FIFO,
> maybe you want that too, depending on the impact of nohz_full mode.  All
> sensitive loads want the isolation, but they may not like the price.
>
> I personally like the cpuset way.  Being able to partition boxen on the
> fly makes them very flexible.  In a perfect world, you'd be able to
> quiesce and configure offloading and nohz_full on the fly too, and not
> end up with some hodgepodge like: this needs boot option foo, that
> happens invisibly because of config option bar, the other thing you have
> to do manually... and you get to eat 937 kthreads and tons of overhead on
> all CPUs if you want the ability to _maybe_ run a critical task or two.

Yeah, my patch only addresses the nohz_full case, but since there
doesn't seem to be any general agreement about the generic case, it
seems that exposing all unbound workqueues via WQ_SYSFS is the way to
go.

Mike, looks like you may have started on that.  Did it get any further?

Kevin

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-24 17:55                   ` Kevin Hilman
@ 2014-02-24 18:25                     ` Mike Galbraith
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2014-02-24 18:25 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: paulmck, Tejun Heo, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Mon, 2014-02-24 at 09:55 -0800, Kevin Hilman wrote:

> Yeah, my patch only addresses the nohz_full case, but since there
> doesn't seem to be any general agreement about the generic case, it
> seems that exposing all unbound workqueues via WQ_SYSFS is the way to
> go.
> 
> Mike, looks like you may have started on that.  Did it get any further?

No, work gets in the way of tinker time.  Not sure which way I want to
explore anyway.  On the one hand, an auto-quiesce thingy tied to domain
construction/destruction sounds good, but on the other, just handing
control to the user is highly attractive... and likely lots simpler too.

-Mike


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-17  5:26       ` Mike Galbraith
@ 2014-02-27 14:43         ` Frederic Weisbecker
  2014-02-27 15:22           ` Mike Galbraith
  0 siblings, 1 reply; 25+ messages in thread
From: Frederic Weisbecker @ 2014-02-27 14:43 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Paul E. McKenney, Kevin Hilman, Tejun Heo, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Mon, Feb 17, 2014 at 06:26:41AM +0100, Mike Galbraith wrote:
> On Wed, 2014-02-12 at 19:23 +0100, Frederic Weisbecker wrote: 
> > On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > > Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > > 
> > > Thank you all, queued for 3.15.
> > > 
> > > We should also have some facility for moving the SRCU workqueues to
> > > housekeeping/timekeeping kthreads in the NO_HZ_FULL case.  Or does
> > > this patch already have that effect?
> > 
> > Kevin Hilman and I plan to try to bring in a new Kconfig option that could let
> > us control the unbound workqueues' affinity through sysfs.
> 
> Handing control to the user seemed like a fine thing, so I started
> making a boot option to enable it.  Forcing WQ_SYSFS on at the sysfs
> decision spot doesn't go well; init order matters :)  Post-init frobbing
> is required if you want to see/frob all unbound workqueues.

I'm curious about the details. Is that because some workqueues are registered
before sysfs is even initialized?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-14 23:24           ` Kevin Hilman
  2014-02-15  7:36             ` Mike Galbraith
@ 2014-02-27 15:08             ` Frederic Weisbecker
  2014-03-10  9:52             ` Viresh Kumar
  2 siblings, 0 replies; 25+ messages in thread
From: Frederic Weisbecker @ 2014-02-27 15:08 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Tejun Heo, Paul E. McKenney, Lai Jiangshan, Zoran Markovic,
	linux-kernel, Shaibal Dutta, Dipankar Sarma

On Fri, Feb 14, 2014 at 03:24:35PM -0800, Kevin Hilman wrote:
> Tejun Heo <tj@kernel.org> writes:
> 
> > Hello,
> >
> > On Wed, Feb 12, 2014 at 11:02:41AM -0800, Paul E. McKenney wrote:
> >> +2.	Use the /sys/devices/virtual/workqueue/*/cpumask sysfs files
> >> +	to force the WQ_SYSFS workqueues to run on the specified set
> >> +	of CPUs.  The set of WQ_SYSFS workqueues can be displayed using
> >> +	"ls sys/devices/virtual/workqueue".
> >
> > One thing to be careful about is that once published, it becomes part
> > of the userland-visible interface.  Maybe adding some words warning
> > against sprinkling WQ_SYSFS willy-nilly is a good idea?
> 
> In the NO_HZ_FULL case, it seems to me we'd always want all unbound
> workqueues to have their affinity set to the housekeeping CPUs.
> 
> Is there any reason not to enable WQ_SYSFS whenever WQ_UNBOUND is set so
> the affinity can be controlled?  I guess the main reason would be that 
> all of these workqueue names would become permanent ABI.

Right. It's a legitimate worry, but couldn't we treat workqueue names
just like kernel thread names? I.e., something that can be renamed or
can disappear at any time from one kernel version to another?

Or does sysfs have really strict rules about that and I'm just daydreaming?

I've been thinking we could also have a pseudo-workqueue directory in
/sys/devices/virtual/workqueue/unbounds with only a cpumask file.

Writing to it would set the affinity of all unbound workqueues, at
least those that don't have WQ_SYSFS.

This would solve the ABI issue and keep a single consistent interface
for workqueue affinity.
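
To make that a bit more concrete, here is a rough sketch of what the store
method behind such an "unbounds" cpumask file might look like.  This is
illustrative only: unbounds_cpumask_store() is a made-up name, WQ_SYSFS
workqueues are skipped as suggested above, and ordered workqueues, error
unwinding and the locking needed to walk the workqueues list are glossed
over (apply_workqueue_attrs() takes wq_pool_mutex itself, so a real patch
would have to restructure that path).

static ssize_t unbounds_cpumask_store(struct device *dev,
				      struct device_attribute *attr,
				      const char *buf, size_t count)
{
	struct workqueue_struct *wq;
	struct workqueue_attrs *attrs;
	cpumask_var_t cpumask;
	int ret;

	if (!zalloc_cpumask_var(&cpumask, GFP_KERNEL))
		return -ENOMEM;

	ret = cpumask_parse(buf, cpumask);
	if (ret)
		goto out_free;

	list_for_each_entry(wq, &workqueues, list) {
		/* WQ_SYSFS workqueues keep their own per-wq cpumask files */
		if (!(wq->flags & WQ_UNBOUND) || (wq->flags & WQ_SYSFS))
			continue;

		/* re-apply the wq's current attrs with the new cpumask */
		attrs = alloc_workqueue_attrs(GFP_KERNEL);
		if (!attrs) {
			ret = -ENOMEM;
			break;
		}
		copy_workqueue_attrs(attrs, wq->unbound_attrs);
		cpumask_copy(attrs->cpumask, cpumask);
		ret = apply_workqueue_attrs(wq, attrs);
		free_workqueue_attrs(attrs);
		if (ret)
			break;
	}

out_free:
	free_cpumask_var(cpumask);
	return ret ?: count;
}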

> 
> At least for NO_HZ_FULL, maybe this should be automatic.  The cpumask of
> unbound workqueues should default to !tick_nohz_full_mask?  Any WQ_SYSFS
> workqueues could still be overridden from userspace, but at least the
> default would be sane, and help keep full dynticks CPUs isolated.
> 
> Example patch below, only boot-tested on a 4-CPU ARM system with
> CONFIG_NO_HZ_FULL_ALL=y and verified that 'cat
> /sys/devices/virtual/workqueue/writeback/cpumask' looked sane.  If this
> looks OK, I can maybe clean it up a bit and make it a runtime check
> instead of a compile-time check.

It could work too, yeah. But I think I prefer the idea of keeping a sysfs
interface for all workqueues (whether we use a pseudo "unbounds" dir or not),
because then the workqueue core stays unaware of dynticks details and doesn't
end up fiddling with timer core internals like the full dynticks cpumask.
Otherwise the result is more interdependency and possible headaches between
the timer and workqueue init ordering.

Moreover, people may forget to change WQ_SYSFS workqueues if all other UNBOUND
workqueues are known to be handled automatically. Or do we handle WQ_SYSFS as
well along the way? But a WQ_SYSFS cpumask may still be modified by other user
programs, so those masks still need a round of setting before doing any
isolation work.

So I have mixed feelings about the trade-offs between code complexity, simplicity for users, etc...

What do you guys think?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-27 14:43         ` Frederic Weisbecker
@ 2014-02-27 15:22           ` Mike Galbraith
  0 siblings, 0 replies; 25+ messages in thread
From: Mike Galbraith @ 2014-02-27 15:22 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Kevin Hilman, Tejun Heo, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Thu, 2014-02-27 at 15:43 +0100, Frederic Weisbecker wrote: 
> On Mon, Feb 17, 2014 at 06:26:41AM +0100, Mike Galbraith wrote:
> > On Wed, 2014-02-12 at 19:23 +0100, Frederic Weisbecker wrote: 
> > > On Mon, Feb 10, 2014 at 10:47:29AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 10, 2014 at 06:08:31PM +0800, Lai Jiangshan wrote:
> > > > > Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
> > > > 
> > > > Thank you all, queued for 3.15.
> > > > 
> > > > We should also have some facility for moving the SRCU workqueues to
> > > > housekeeping/timekeeping kthreads in the NO_HZ_FULL case.  Or does
> > > > this patch already have that effect?
> > > 
> > > Kevin Hilman and I plan to try to bring in a new Kconfig option that could let
> > > us control the unbound workqueues' affinity through sysfs.
> > 
> > Handing control to the user seemed like a fine thing, so I started
> > making a boot option to enable it.  Forcing WQ_SYSFS on at the sysfs
> > decision spot doesn't go well; init order matters :)  Post-init frobbing
> > is required if you want to see/frob all unbound workqueues.
> 
> I'm curious about the details. Is that because some workqueues are registered
> before sysfs is even initialized?

Yeah.  I put in a test, and told it: if not ready, go get ready and flag
yourself as having BTDT (one registration being plenty), but that only
made for a different explosion.  A post-init rescan is definitely a better
plan; it can't be worse :)

-Mike


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue
  2014-02-14 23:24           ` Kevin Hilman
  2014-02-15  7:36             ` Mike Galbraith
  2014-02-27 15:08             ` Frederic Weisbecker
@ 2014-03-10  9:52             ` Viresh Kumar
  2 siblings, 0 replies; 25+ messages in thread
From: Viresh Kumar @ 2014-03-10  9:52 UTC (permalink / raw)
  To: Kevin Hilman
  Cc: Tejun Heo, Paul E. McKenney, Frederic Weisbecker, Lai Jiangshan,
	Zoran Markovic, linux-kernel, Shaibal Dutta, Dipankar Sarma

On Sat, Feb 15, 2014 at 7:24 AM, Kevin Hilman <khilman@linaro.org> wrote:
> From 902a2b58d61a51415457ea6768d687cdb7532eff Mon Sep 17 00:00:00 2001
> From: Kevin Hilman <khilman@linaro.org>
> Date: Fri, 14 Feb 2014 15:10:58 -0800
> Subject: [PATCH] workqueue: for NO_HZ_FULL, set default cpumask to
>  !tick_nohz_full_mask
>
> To help in keeping NO_HZ_FULL CPUs isolated, keep unbound workqueues
> from running on full dynticks CPUs.  To do this, set the default
> workqueue cpumask to be the set of "housekeeping" CPUs instead of all
> possible CPUs.
>
> This is just the starting/default cpumask, and can be overridden
> in all the normal ways (NUMA settings, apply_workqueue_attrs() and, via
> sysfs, for workqueues with the WQ_SYSFS attribute).
>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Signed-off-by: Kevin Hilman <khilman@linaro.org>
> ---
>  kernel/workqueue.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 987293d03ebc..9a9d9b0eaf6d 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -48,6 +48,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/moduleparam.h>
>  #include <linux/uaccess.h>
> +#include <linux/tick.h>
>
>  #include "workqueue_internal.h"
>
> @@ -3436,7 +3437,11 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
>         if (!alloc_cpumask_var(&attrs->cpumask, gfp_mask))
>                 goto fail;
>
> +#ifdef CONFIG_NO_HZ_FULL
> +       cpumask_complement(attrs->cpumask, tick_nohz_full_mask);
> +#else
>         cpumask_copy(attrs->cpumask, cpu_possible_mask);
> +#endif
>         return attrs;
>  fail:
>         free_workqueue_attrs(attrs);

Can we play with this mask at runtime? I thought a better idea would be
to keep this mask as the mask of all CPUs initially, and once any CPU
enters NO_HZ_FULL mode we could remove it from the mask. And once it
leaves that mode we could add it back again.
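
A very rough sketch of that runtime idea, with invented names (neither
workqueue_nohz_full_enter()/exit() nor wq_unbound_default_mask exist in
the kernel) and with no serialization against concurrent attrs
allocation, just to show the shape of it:

/* starts out as a copy of cpu_possible_mask at workqueue init time */
static cpumask_var_t wq_unbound_default_mask;

/* called when @cpu goes full dynticks: new unbound attrs avoid it */
void workqueue_nohz_full_enter(int cpu)
{
	cpumask_clear_cpu(cpu, wq_unbound_default_mask);
}

/* called when @cpu starts ticking again: make it eligible once more */
void workqueue_nohz_full_exit(int cpu)
{
	cpumask_set_cpu(cpu, wq_unbound_default_mask);
}

/*
 * alloc_workqueue_attrs() would then do
 *	cpumask_copy(attrs->cpumask, wq_unbound_default_mask);
 * instead of choosing between cpu_possible_mask and the complement of
 * tick_nohz_full_mask at compile time.
 */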

I am looking to use a similar concept for un-pinned timers in my
activity around the cpuset.quiesce option.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2014-03-10  9:52 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-31 19:53 [RFC PATCH] rcu: move SRCU grace period work to power efficient workqueue Zoran Markovic
2014-01-31 20:10 ` Zoran Markovic
2014-02-10 10:08 ` Lai Jiangshan
2014-02-10 18:47   ` Paul E. McKenney
2014-02-12 18:23     ` Frederic Weisbecker
2014-02-12 19:02       ` Paul E. McKenney
2014-02-12 19:23         ` Tejun Heo
2014-02-12 19:59           ` Paul E. McKenney
2014-02-12 20:13             ` Tejun Heo
2014-02-12 23:04             ` Frederic Weisbecker
2014-02-13  0:33               ` Paul E. McKenney
2014-02-13  1:30                 ` Lai Jiangshan
2014-02-13 14:05                 ` Frederic Weisbecker
2014-02-14 23:24           ` Kevin Hilman
2014-02-15  7:36             ` Mike Galbraith
2014-02-16 16:41               ` Paul E. McKenney
2014-02-17  4:50                 ` Mike Galbraith
2014-02-19  7:00                   ` Mike Galbraith
2014-02-24 17:55                   ` Kevin Hilman
2014-02-24 18:25                     ` Mike Galbraith
2014-02-27 15:08             ` Frederic Weisbecker
2014-03-10  9:52             ` Viresh Kumar
2014-02-17  5:26       ` Mike Galbraith
2014-02-27 14:43         ` Frederic Weisbecker
2014-02-27 15:22           ` Mike Galbraith

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).