All of lore.kernel.org
* [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
@ 2009-05-07  9:59 KOSAKI Motohiro
  2009-05-07 10:50 ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-07  9:59 UTC (permalink / raw)
  To: LKML, Bharata B Rao, Balaji Rao, Dhaval Giani, KAMEZAWA Hiroyuki,
	Peter Zijlstra, Balbir Singh, Ingo Molnar, Martin Schwidefsky
  Cc: kosaki.motohiro


Changelog:
  since v2
  - revert using percpu_counter_sum()

  since v1
  - use percpu_counter_sum() instead of percpu_counter_read()


-------------------------------------
Subject: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count

cpuacct_update_stats() is called on every tick update, and it uses a
percpu_counter to avoid performance degradation.

For archs which define VIRT_CPU_ACCOUNTING, every tick results in
>1000 units of cputime updates, and since this is much greater than
percpu_counter_batch, we end up taking the spinlock on every tick.

This patch changes the batch rule: any cpu can now store up to
"jiffies_to_cputime(percpu_counter_batch)" cputime in its per-cpu cache.
This means there is no behavior change when VIRT_CPU_ACCOUNTING=n.

Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 kernel/sched.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c	2009-04-30 11:37:47.000000000 +0900
+++ b/kernel/sched.c	2009-05-07 16:34:00.000000000 +0900
@@ -10221,6 +10221,7 @@ struct cpuacct {
 };
 
 struct cgroup_subsys cpuacct_subsys;
+static s32 cpuacct_batch;
 
 /* return cpu accounting group corresponding to this container */
 static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
@@ -10250,6 +10251,9 @@ static struct cgroup_subsys_state *cpuac
 	if (!ca->cpuusage)
 		goto out_free_ca;
 
+	if (!cpuacct_batch)
+		cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
 		if (percpu_counter_init(&ca->cpustat[i], 0))
 			goto out_free_counters;
@@ -10446,7 +10450,7 @@ static void cpuacct_update_stats(struct 
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();




* Re: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
  2009-05-07  9:59 [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count KOSAKI Motohiro
@ 2009-05-07 10:50 ` Ingo Molnar
  2009-05-09  9:26   ` KOSAKI Motohiro
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-07 10:50 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: LKML, Bharata B Rao, Balaji Rao, Dhaval Giani, KAMEZAWA Hiroyuki,
	Peter Zijlstra, Balbir Singh, Martin Schwidefsky


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> 
> Changelog:
>   since v2
>   - revert using percpu_counter_sum()
> 
>   since v1
>   - use percpu_counter_sum() instead of percpu_counter_read()
> 
> 
> -------------------------------------
> Subject: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
> 
> cpuacct_update_stats() is called on every tick update, and it uses a
> percpu_counter to avoid performance degradation.
> 
> For archs which define VIRT_CPU_ACCOUNTING, every tick results in
> >1000 units of cputime updates, and since this is much greater than
> percpu_counter_batch, we end up taking the spinlock on every tick.
> 
> This patch changes the batch rule: any cpu can now store up to
> "jiffies_to_cputime(percpu_counter_batch)" cputime in its per-cpu cache.
> This means there is no behavior change when VIRT_CPU_ACCOUNTING=n.
> 
> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Cc: Balaji Rao <balajirrao@gmail.com>
> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  kernel/sched.c |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c	2009-04-30 11:37:47.000000000 +0900
> +++ b/kernel/sched.c	2009-05-07 16:34:00.000000000 +0900
> @@ -10221,6 +10221,7 @@ struct cpuacct {
>  };
>  
>  struct cgroup_subsys cpuacct_subsys;
> +static s32 cpuacct_batch;

should be __read_mostly, I guess.

>  
>  /* return cpu accounting group corresponding to this container */
>  static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> @@ -10250,6 +10251,9 @@ static struct cgroup_subsys_state *cpuac
>  	if (!ca->cpuusage)
>  		goto out_free_ca;
>  
> +	if (!cpuacct_batch)
> +		cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);

So this is a once per boot condition? Why not initialize it in 
sched_init() or so? (instead of incurring this ugly check all the 
time)

	Ingo


* Re: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
  2009-05-07 10:50 ` Ingo Molnar
@ 2009-05-09  9:26   ` KOSAKI Motohiro
  2009-05-09  9:38     ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-09  9:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, LKML, Bharata B Rao, Balaji Rao, Dhaval Giani,
	KAMEZAWA Hiroyuki, Peter Zijlstra, Balbir Singh,
	Martin Schwidefsky

> 
> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > 
> > Changelog:
> >   since v2
> >   - revert using percpu_counter_sum()
> > 
> >   since v1
> >   - use percpu_counter_sum() instead of percpu_counter_read()
> > 
> > 
> > -------------------------------------
> > Subject: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
> > 
> > cpuacct_update_stats() is called on every tick update, and it uses a
> > percpu_counter to avoid performance degradation.
> > 
> > For archs which define VIRT_CPU_ACCOUNTING, every tick results in
> > >1000 units of cputime updates, and since this is much greater than
> > percpu_counter_batch, we end up taking the spinlock on every tick.
> > 
> > This patch changes the batch rule: any cpu can now store up to
> > "jiffies_to_cputime(percpu_counter_batch)" cputime in its per-cpu cache.
> > This means there is no behavior change when VIRT_CPU_ACCOUNTING=n.
> > 
> > Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > Cc: Balaji Rao <balajirrao@gmail.com>
> > Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Cc: Ingo Molnar <mingo@elte.hu>
> > Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  kernel/sched.c |    6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > Index: b/kernel/sched.c
> > ===================================================================
> > --- a/kernel/sched.c	2009-04-30 11:37:47.000000000 +0900
> > +++ b/kernel/sched.c	2009-05-07 16:34:00.000000000 +0900
> > @@ -10221,6 +10221,7 @@ struct cpuacct {
> >  };
> >  
> >  struct cgroup_subsys cpuacct_subsys;
> > +static s32 cpuacct_batch;
> 
> should be __read_mostly, I guess.
> 
> >  
> >  /* return cpu accounting group corresponding to this container */
> >  static inline struct cpuacct *cgroup_ca(struct cgroup *cgrp)
> > @@ -10250,6 +10251,9 @@ static struct cgroup_subsys_state *cpuac
> >  	if (!ca->cpuusage)
> >  		goto out_free_ca;
> >  
> > +	if (!cpuacct_batch)
> > +		cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> 
> So this is a once per boot condition? Why not initialize it in 
> sched_init() or so? (instead of incurring this ugly check all the 
> time)

Thanks! fixed.



Changelog:
  since V3
  - rewrite patch description (thanks Bharata!)
  - append read_mostly to cpuacct_batch
  - cpuacct_batch is initialized by sched_init_debug()

  since v2
  - revert using percpu_counter_sum()

  since v1
  - use percpu_counter_sum() instead of percpu_counter_read()


---------------------------------------------------------
Subject: [PATCH v4] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y

percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.

Let those architectures have a bigger batch threshold so that the percpu
counter spinlock isn't taken on every tick. This change doesn't affect archs
which don't define VIRT_CPU_ACCOUNTING; they continue to have the default
percpu counter batch value.

Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 kernel/sched.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c	2009-05-09 16:48:08.000000000 +0900
+++ b/kernel/sched.c	2009-05-09 18:16:00.000000000 +0900
@@ -824,8 +824,12 @@ static struct file_operations sched_feat
 	.release	= single_release,
 };
 
+static __read_mostly s32 cpuacct_batch;
+
 static __init int sched_init_debug(void)
 {
+	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
 	debugfs_create_file("sched_features", 0644, NULL, NULL,
 			&sched_feat_fops);
 
@@ -10457,7 +10461,8 @@ static void cpuacct_update_stats(struct 
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();




* Re: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
  2009-05-09  9:26   ` KOSAKI Motohiro
@ 2009-05-09  9:38     ` Peter Zijlstra
  2009-05-09  9:52       ` KOSAKI Motohiro
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2009-05-09  9:38 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, LKML, Bharata B Rao, Balaji Rao, Dhaval Giani,
	KAMEZAWA Hiroyuki, Balbir Singh, Martin Schwidefsky

On Sat, 2009-05-09 at 18:26 +0900, KOSAKI Motohiro wrote:
> 
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c    2009-05-09 16:48:08.000000000 +0900
> +++ b/kernel/sched.c    2009-05-09 18:16:00.000000000 +0900
> @@ -824,8 +824,12 @@ static struct file_operations sched_feat
>         .release        = single_release,
>  };
>  
> +static __read_mostly s32 cpuacct_batch;
> +
>  static __init int sched_init_debug(void)
>  {
> +       cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> +
>         debugfs_create_file("sched_features", 0644, NULL, NULL,
>                         &sched_feat_fops);

I'm pretty sure you can select cpu accounting without SCHED_DEBUG...

> @@ -10457,7 +10461,8 @@ static void cpuacct_update_stats(struct 
>         ca = task_ca(tsk);
>  
>         do {
> -               percpu_counter_add(&ca->cpustat[idx], val);
> +               __percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
> +
>                 ca = ca->parent;
>         } while (ca);
>         rcu_read_unlock();
> 



* Re: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
  2009-05-09  9:38     ` Peter Zijlstra
@ 2009-05-09  9:52       ` KOSAKI Motohiro
  2009-05-09 10:14         ` KOSAKI Motohiro
  0 siblings, 1 reply; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-09  9:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Ingo Molnar, LKML, Bharata B Rao, Balaji Rao,
	Dhaval Giani, KAMEZAWA Hiroyuki, Balbir Singh,
	Martin Schwidefsky

> On Sat, 2009-05-09 at 18:26 +0900, KOSAKI Motohiro wrote:
> > 
> > Index: b/kernel/sched.c
> > ===================================================================
> > --- a/kernel/sched.c    2009-05-09 16:48:08.000000000 +0900
> > +++ b/kernel/sched.c    2009-05-09 18:16:00.000000000 +0900
> > @@ -824,8 +824,12 @@ static struct file_operations sched_feat
> >         .release        = single_release,
> >  };
> >  
> > +static __read_mostly s32 cpuacct_batch;
> > +
> >  static __init int sched_init_debug(void)
> >  {
> > +       cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > +
> >         debugfs_create_file("sched_features", 0644, NULL, NULL,
> >                         &sched_feat_fops);
> 
> I'm pretty sure you can select cpu accounting without SCHED_DEBUG...

Grr, yes, that was stupid of me ;)
I'll fix it soon.






* Re: [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count
  2009-05-09  9:52       ` KOSAKI Motohiro
@ 2009-05-09 10:14         ` KOSAKI Motohiro
  2009-05-11 12:33           ` [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters tip-bot for KOSAKI Motohiro
  2009-05-11 12:36           ` tip-bot for KOSAKI Motohiro
  0 siblings, 2 replies; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-09 10:14 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Peter Zijlstra, Ingo Molnar, LKML,
	Bharata B Rao, Balaji Rao, Dhaval Giani, KAMEZAWA Hiroyuki,
	Balbir Singh, Martin Schwidefsky

> > On Sat, 2009-05-09 at 18:26 +0900, KOSAKI Motohiro wrote:
> > > 
> > > Index: b/kernel/sched.c
> > > ===================================================================
> > > --- a/kernel/sched.c    2009-05-09 16:48:08.000000000 +0900
> > > +++ b/kernel/sched.c    2009-05-09 18:16:00.000000000 +0900
> > > @@ -824,8 +824,12 @@ static struct file_operations sched_feat
> > >         .release        = single_release,
> > >  };
> > >  
> > > +static __read_mostly s32 cpuacct_batch;
> > > +
> > >  static __init int sched_init_debug(void)
> > >  {
> > > +       cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > > +
> > >         debugfs_create_file("sched_features", 0644, NULL, NULL,
> > >                         &sched_feat_fops);
> > 
> > I'm pretty sure you can select cpu accounting without SCHED_DEBUG...
> 
> Grr, yes, that was stupid of me ;)
> I'll fix it soon.


Done.


Changelog:
  since V4
  - move cpuacct_batch initialization into sched_init()

  since V3
  - rewrite patch description (thanks Bharata!)
  - append read_mostly to cpuacct_batch
  - cpuacct_batch is initialized by sched_init_debug()

  since v2
  - revert using percpu_counter_sum()

  since v1
  - use percpu_counter_sum() instead of percpu_counter_read()


---------------------------------------------------------
Subject: [PATCH v5] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y

percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.

Let those architectures have a bigger batch threshold so that the percpu
counter spinlock isn't taken on every tick. This change doesn't affect archs
which don't define VIRT_CPU_ACCOUNTING; they continue to have the default
percpu counter batch value.

Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 kernel/sched.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c	2009-05-09 16:48:08.000000000 +0900
+++ b/kernel/sched.c	2009-05-09 18:57:35.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
  */
 int sysctl_sched_rt_runtime = 950000;
 
+static __read_mostly s32 cpuacct_batch;
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,8 @@ void __init sched_init(void)
 
 	perf_counter_init();
 
+	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
 	scheduler_running = 1;
 }
 
@@ -10457,7 +10461,8 @@ static void cpuacct_update_stats(struct 
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();




* [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-09 10:14         ` KOSAKI Motohiro
@ 2009-05-11 12:33           ` tip-bot for KOSAKI Motohiro
  2009-05-11 12:36           ` tip-bot for KOSAKI Motohiro
  1 sibling, 0 replies; 62+ messages in thread
From: tip-bot for KOSAKI Motohiro @ 2009-05-11 12:33 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, schwidefsky, dhaval,
	balajirrao, balbir, bharata, tglx, kamezawa.hiroyu,
	kosaki.motohiro, mingo

Commit-ID:  0719318fea31d54d13ed8ead7f4a277038bd75a2
Gitweb:     http://git.kernel.org/tip/0719318fea31d54d13ed8ead7f4a277038bd75a2
Author:     KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
AuthorDate: Sat, 9 May 2009 19:14:58 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 11 May 2009 14:21:32 +0200

sched: cpuacct: Use bigger percpu counter batch values for stats counters

percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.

Let those architectures have a bigger batch threshold so that the percpu
counter spinlock isn't taken on every tick. This change doesn't affect archs
which don't define VIRT_CPU_ACCOUNTING; they continue to have the default
percpu counter batch value.

 v5:
  - move cpuacct_batch initialization into sched_init()

 v4:
  - rewrite patch description (thanks Bharata!)
  - append read_mostly to cpuacct_batch
  - cpuacct_batch is initialized by sched_init_debug()

 v3:
  - revert using percpu_counter_sum()

 v2:
  - use percpu_counter_sum() instead of percpu_counter_read()

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20090509191430.3AD5.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 kernel/sched.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8908d19..beadb82 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -872,6 +872,8 @@ static __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+static __read_mostly s32 cpuacct_batch;
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9181,6 +9183,8 @@ void __init sched_init(void)
 	alloc_bootmem_cpumask_var(&cpu_isolated_map);
 #endif /* SMP */
 
+	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
 	scheduler_running = 1;
 }
 
@@ -10354,7 +10358,8 @@ static void cpuacct_update_stats(struct task_struct *tsk,
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();


* [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-09 10:14         ` KOSAKI Motohiro
  2009-05-11 12:33           ` [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters tip-bot for KOSAKI Motohiro
@ 2009-05-11 12:36           ` tip-bot for KOSAKI Motohiro
  2009-05-11 15:07             ` Ingo Molnar
  1 sibling, 1 reply; 62+ messages in thread
From: tip-bot for KOSAKI Motohiro @ 2009-05-11 12:36 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, a.p.zijlstra, schwidefsky, dhaval,
	balajirrao, balbir, bharata, tglx, kamezawa.hiroyu,
	kosaki.motohiro, mingo

Commit-ID:  50fbed3bb72cbe38a665512771a91a96d028de46
Gitweb:     http://git.kernel.org/tip/50fbed3bb72cbe38a665512771a91a96d028de46
Author:     KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
AuthorDate: Sat, 9 May 2009 19:14:58 +0900
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Mon, 11 May 2009 14:32:57 +0200

sched: cpuacct: Use bigger percpu counter batch values for stats counters

percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.

Let those architectures have a bigger batch threshold so that the percpu
counter spinlock isn't taken on every tick. This change doesn't affect archs
which don't define VIRT_CPU_ACCOUNTING; they continue to have the default
percpu counter batch value.

 v5:
  - move cpuacct_batch initialization into sched_init()

 v4:
  - rewrite patch description (thanks Bharata!)
  - append read_mostly to cpuacct_batch
  - cpuacct_batch is initialized by sched_init_debug()

 v3:
  - revert using percpu_counter_sum()

 v2:
  - use percpu_counter_sum() instead of percpu_counter_read()

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
LKML-Reference: <20090509191430.3AD5.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 kernel/sched.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8908d19..873bc3b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -872,6 +872,8 @@ static __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 950000;
 
+static __read_mostly s32 cpuacct_batch;
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9171,6 +9173,8 @@ void __init sched_init(void)
 	 */
 	current->sched_class = &fair_sched_class;
 
+	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+
 	/* Allocate the nohz_cpu_mask if CONFIG_CPUMASK_OFFSTACK */
 	alloc_bootmem_cpumask_var(&nohz_cpu_mask);
 #ifdef CONFIG_SMP
@@ -10354,7 +10358,8 @@ static void cpuacct_update_stats(struct task_struct *tsk,
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-11 12:36           ` tip-bot for KOSAKI Motohiro
@ 2009-05-11 15:07             ` Ingo Molnar
  2009-05-12 10:01               ` KOSAKI Motohiro
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-11 15:07 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, a.p.zijlstra, schwidefsky, balajirrao,
	dhaval, balbir, bharata, tglx, kosaki.motohiro, kamezawa.hiroyu
  Cc: linux-tip-commits


* tip-bot for KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> Commit-ID:  50fbed3bb72cbe38a665512771a91a96d028de46
> Gitweb:     http://git.kernel.org/tip/50fbed3bb72cbe38a665512771a91a96d028de46
> Author:     KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> AuthorDate: Sat, 9 May 2009 19:14:58 +0900
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Mon, 11 May 2009 14:32:57 +0200
> 
> sched: cpuacct: Use bigger percpu counter batch values for stats counters

-tip testing found a build failure with this patch:

kernel/sched.c:9278: error: ‘percpu_counter_batch’ undeclared (first use in this function)
kernel/sched.c:9278: error: (Each undeclared identifier is reported only once
kernel/sched.c:9278: error: for each function it appears in.)

	Ingo


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-11 15:07             ` Ingo Molnar
@ 2009-05-12 10:01               ` KOSAKI Motohiro
  2009-05-12 10:02                 ` Peter Zijlstra
  2009-05-12 10:10                 ` Balbir Singh
  0 siblings, 2 replies; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 10:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, mingo, hpa, linux-kernel, a.p.zijlstra,
	schwidefsky, balajirrao, dhaval, balbir, bharata, tglx,
	kamezawa.hiroyu, linux-tip-commits

> 
> * tip-bot for KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> 
> > Commit-ID:  50fbed3bb72cbe38a665512771a91a96d028de46
> > Gitweb:     http://git.kernel.org/tip/50fbed3bb72cbe38a665512771a91a96d028de46
> > Author:     KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > AuthorDate: Sat, 9 May 2009 19:14:58 +0900
> > Committer:  Ingo Molnar <mingo@elte.hu>
> > CommitDate: Mon, 11 May 2009 14:32:57 +0200
> > 
> > sched: cpuacct: Use bigger percpu counter batch values for stats counters
> 
> -tip testing found a build failure with this patch:
> 
> kernel/sched.c:9278: error: ‘percpu_counter_batch’ undeclared (first use in this function)
> kernel/sched.c:9278: error: (Each undeclared identifier is reported only once
> kernel/sched.c:9278: error: for each function it appears in.)

Oh, my fault again. Good catch!



Changelog:
  since V5
  - fix build error on UP (!CONFIG_SMP) builds

  since V4
  - move cpuacct_batch initialization into sched_init()

  since V3
  - rewrite patch description (thanks Bharata!)
  - append read_mostly to cpuacct_batch
  - cpuacct_batch is initialized by sched_init_debug()

  since v2
  - revert using percpu_counter_sum()

  since v1
  - use percpu_counter_sum() instead of percpu_counter_read()


---------------------------------------------------------
Subject: [PATCH v6] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y

percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.

Let those architectures have a bigger batch threshold so that the percpu
counter spinlock isn't taken on every tick. This change doesn't affect archs
which don't define VIRT_CPU_ACCOUNTING; they continue to have the default
percpu counter batch value.

Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 kernel/sched.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c	2009-05-12 13:12:59.000000000 +0900
+++ b/kernel/sched.c	2009-05-12 13:28:53.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
  */
 int sysctl_sched_rt_runtime = 950000;
 
+static __read_mostly s32 cpuacct_batch;
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,10 @@ void __init sched_init(void)
 
 	perf_counter_init();
 
+#ifdef CONFIGCONFIG_SMP
+	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+#endif
+
 	scheduler_running = 1;
 }
 
@@ -10457,7 +10463,8 @@ static void cpuacct_update_stats(struct 
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();




* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:01               ` KOSAKI Motohiro
@ 2009-05-12 10:02                 ` Peter Zijlstra
  2009-05-12 10:06                   ` KOSAKI Motohiro
  2009-05-12 10:10                 ` Balbir Singh
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2009-05-12 10:02 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, mingo, hpa, linux-kernel, schwidefsky, balajirrao,
	dhaval, balbir, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits

On Tue, 2009-05-12 at 19:01 +0900, KOSAKI Motohiro wrote:
> +#ifdef CONFIGCONFIG_SMP

typo ?



* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:02                 ` Peter Zijlstra
@ 2009-05-12 10:06                   ` KOSAKI Motohiro
  2009-05-12 10:22                     ` KOSAKI Motohiro
  0 siblings, 1 reply; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: kosaki.motohiro, Ingo Molnar, mingo, hpa, linux-kernel,
	schwidefsky, balajirrao, dhaval, balbir, bharata, tglx,
	kamezawa.hiroyu, linux-tip-commits

> On Tue, 2009-05-12 at 19:01 +0900, KOSAKI Motohiro wrote:
> > +#ifdef CONFIGCONFIG_SMP
> 
> typo ?

yes, this week is my memorial stupid one ;)





* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:01               ` KOSAKI Motohiro
  2009-05-12 10:02                 ` Peter Zijlstra
@ 2009-05-12 10:10                 ` Balbir Singh
  2009-05-12 10:13                   ` KOSAKI Motohiro
  1 sibling, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2009-05-12 10:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, mingo, hpa, linux-kernel, a.p.zijlstra, schwidefsky,
	balajirrao, dhaval, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-05-12 19:01:25]:

> > 
> > * tip-bot for KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> > 
> > > Commit-ID:  50fbed3bb72cbe38a665512771a91a96d028de46
> > > Gitweb:     http://git.kernel.org/tip/50fbed3bb72cbe38a665512771a91a96d028de46
> > > Author:     KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > > AuthorDate: Sat, 9 May 2009 19:14:58 +0900
> > > Committer:  Ingo Molnar <mingo@elte.hu>
> > > CommitDate: Mon, 11 May 2009 14:32:57 +0200
> > > 
> > > sched: cpuacct: Use bigger percpu counter batch values for stats counters
> > 
> > -tip testing found a build failure with this patch:
> > 
> > kernel/sched.c:9278: error: ‘percpu_counter_batch’ undeclared (first use in this function)
> > kernel/sched.c:9278: error: (Each undeclared identifier is reported only once
> > kernel/sched.c:9278: error: for each function it appears in.)
> 
> Oh, my fault again. Good catch!
> 
> 
> 
> Changelog:
>   since V5
>   - fix build error when UP
> 
>   since V4
>   - move cpuacct_batch initialization into sched_init()
> 
>   since V3
>   - rewrite patch description (thanks Bharata!)
>   - append read_mostly to cpuacct_batch
>   - cpuacct_batch is initialized by sched_init()
> 
>   since v2
>   - revert using percpu_counter_sum()
> 
>   since v1
>   - use percpu_counter_sum() instead of percpu_counter_read()
> 
> 
> ---------------------------------------------------------
> Subject: [PATCH v6] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y
> 
> percpu counters used to accumulate statistics in cpuacct controller use
> the default batch value [max(2*nr_cpus, 32)] which can be too small for
> archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
> cputime updates in the range of thousands. As a result, cpuacct_update_stats()
> would end up acquiring the percpu counter spinlock on every tick which
> is not good for performance.
> 
> Let those architectures have a bigger batch threshold so that the percpu counter
> spinlock isn't taken on every tick. This change doesn't affect the archs which
> don't define VIRT_CPU_ACCOUNTING; they continue to have the default
> percpu counter batch value.
> 
> Cc: Balaji Rao <balajirrao@gmail.com>
> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  kernel/sched.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c	2009-05-12 13:12:59.000000000 +0900
> +++ b/kernel/sched.c	2009-05-12 13:28:53.000000000 +0900
> @@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
>   */
>  int sysctl_sched_rt_runtime = 950000;
> 
> +static __read_mostly s32 cpuacct_batch;
> +
>  static inline u64 global_rt_period(void)
>  {
>  	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> @@ -9284,6 +9286,10 @@ void __init sched_init(void)
> 
>  	perf_counter_init();
> 
> +#ifdef CONFIGCONFIG_SMP
> +	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> +#endif

Slow down and compile patches before sending them out.. please. That
is a basic expectation if you expect it to be merged.

-- 
	Balbir


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:10                 ` Balbir Singh
@ 2009-05-12 10:13                   ` KOSAKI Motohiro
  2009-05-12 10:24                     ` Balbir Singh
  0 siblings, 1 reply; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 10:13 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, Ingo Molnar, mingo, hpa, linux-kernel,
	a.p.zijlstra, schwidefsky, balajirrao, dhaval, bharata, tglx,
	kamezawa.hiroyu, linux-tip-commits

> > +#ifdef CONFIGCONFIG_SMP
> > +	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > +#endif
> 
> Slow down and compile patches before sending them out.. please. That
> is a basic expectation if you expect it to be merged.

Unfortunately, this mistake passed testing successfully ;)
That's because cpuacct_batch=0 works even on SMP.





* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:06                   ` KOSAKI Motohiro
@ 2009-05-12 10:22                     ` KOSAKI Motohiro
  2009-05-12 10:27                       ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 10:22 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: kosaki.motohiro, Peter Zijlstra, Ingo Molnar, mingo, hpa,
	linux-kernel, schwidefsky, balajirrao, dhaval, balbir, bharata,
	tglx, kamezawa.hiroyu, linux-tip-commits

> > On Tue, 2009-05-12 at 19:01 +0900, KOSAKI Motohiro wrote:
> > > +#ifdef CONFIGCONFIG_SMP
> > 
> > typo ?
> 
> yes, this week is my memorial stupid one ;)

ok, the disassembly listing indicates the current patch has no typo ;)

ffffffff814ae920:       48 c7 40 30 10 eb 2a    movq   $0xffffffff812aeb10,0x30(%rax)
ffffffff814ae927:       81
ffffffff814ae928:       c7 05 3e 37 fe ff 01    movl   $0x1,-0x1c8c2(%rip)        # ffffffff81492070 <scheduler_running>
ffffffff814ae92f:       00 00 00
ffffffff814ae932:       8b 05 28 52 fe ff       mov    -0x1add8(%rip),%eax        # ffffffff81493b60 <percpu_counter_batch>
ffffffff814ae938:       89 05 36 37 fe ff       mov    %eax,-0x1c8ca(%rip)        # ffffffff81492074 <cpuacct_batch>
ffffffff814ae93e:       41 5c                   pop    %r12



---------------------------------------------------------
Subject: [PATCH] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y

percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.

Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING; they continue to have the default
percpu counter batch value.
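The contention described above is easy to reproduce with a small user-space model of the batching logic. This is a simplified sketch only: the struct, field names, and the acquisition counter are illustrative stand-ins for the kernel's percpu_counter, which protects the shared count with a real spinlock.

```c
#include <assert.h>

#define NR_CPUS 4

/* Illustrative stand-in for the kernel's percpu_counter. */
struct pc {
	long count;             /* shared count; spinlock-protected in the kernel */
	long slot[NR_CPUS];     /* per-CPU deltas, updated without the lock */
	long lock_acquisitions; /* counts how often the locked path is taken */
};

/*
 * Mirrors the shape of __percpu_counter_add(): accumulate into the local
 * per-CPU slot and fold into the shared count only once the local delta
 * reaches 'batch'.
 */
static void pc_add(struct pc *c, int cpu, long amount, long batch)
{
	long v = c->slot[cpu] + amount;

	if (v >= batch || v <= -batch) {
		c->lock_acquisitions++;	/* spin_lock() in the real code */
		c->count += v;
		c->slot[cpu] = 0;
	} else {
		c->slot[cpu] = v;
	}
}

/* Approximate read, like percpu_counter_read(): ignores unflushed slots. */
static long pc_read(const struct pc *c)
{
	return c->count;
}
```

With VIRT_CPU_ACCOUNTING, each tick adds thousands of cputime units, so a default batch of ~32 makes every single update fall through to the locked path; scaling the batch by jiffies_to_cputime() restores the amortization.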

Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 kernel/sched.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c	2009-05-12 13:12:59.000000000 +0900
+++ b/kernel/sched.c	2009-05-12 19:04:49.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
  */
 int sysctl_sched_rt_runtime = 950000;
 
+static __read_mostly s32 cpuacct_batch;
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,10 @@ void __init sched_init(void)
 
 	perf_counter_init();
 
+#ifdef CONFIG_SMP
+	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+#endif
+
 	scheduler_running = 1;
 }
 
@@ -10457,7 +10463,8 @@ static void cpuacct_update_stats(struct 
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();
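Stripped of the percpu machinery, the do/while in the hunk above simply charges the value to every ancestor, so parent groups report aggregated child time. A toy model (not the kernel's struct cpuacct):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cpuacct group in the cgroup hierarchy. */
struct group {
	long cpustat;
	struct group *parent;
};

/* Same walk as cpuacct_update_stats(): charge this group and all ancestors. */
static void charge(struct group *ca, long val)
{
	do {
		ca->cpustat += val;	/* __percpu_counter_add() in the kernel */
		ca = ca->parent;
	} while (ca);
}
```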




* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:13                   ` KOSAKI Motohiro
@ 2009-05-12 10:24                     ` Balbir Singh
  2009-05-12 10:29                       ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2009-05-12 10:24 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, mingo, hpa, linux-kernel, a.p.zijlstra, schwidefsky,
	balajirrao, dhaval, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits

* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-05-12 19:13:42]:

> > > +#ifdef CONFIGCONFIG_SMP
> > > +	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > > +#endif
> > 
> > Slow down and compile patches before sending them out.. please. That
> > is a basic expectation if you expect it to be merged.
> 
> Unfortunately, this mistake passed testing successfully ;)
> That's because cpuacct_batch=0 works even on SMP.
>

OK, BTW, using an #ifdef right in the middle of a function makes the
code harder to read, can't we use an inline function to abstract out
SMP? 

-- 
	Balbir


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:22                     ` KOSAKI Motohiro
@ 2009-05-12 10:27                       ` Ingo Molnar
  2009-05-12 10:42                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-12 10:27 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Peter Zijlstra, mingo, hpa, linux-kernel, schwidefsky,
	balajirrao, dhaval, balbir, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > > On Tue, 2009-05-12 at 19:01 +0900, KOSAKI Motohiro wrote:
> > > > +#ifdef CONFIGCONFIG_SMP
> > > 
> > > typo ?
> > 
> > yes, this week is my memorial stupid one ;)
> 
> ok, the disassembly listing indicates the current patch has no typo ;)
> 
> ffffffff814ae920:       48 c7 40 30 10 eb 2a    movq   $0xffffffff812aeb10,0x30(%rax)
> ffffffff814ae927:       81
> ffffffff814ae928:       c7 05 3e 37 fe ff 01    movl   $0x1,-0x1c8c2(%rip)        # ffffffff81492070 <scheduler_running>
> ffffffff814ae92f:       00 00 00
> ffffffff814ae932:       8b 05 28 52 fe ff       mov    -0x1add8(%rip),%eax        # ffffffff81493b60 <percpu_counter_batch>
> ffffffff814ae938:       89 05 36 37 fe ff       mov    %eax,-0x1c8ca(%rip)        # ffffffff81492074 <cpuacct_batch>
> ffffffff814ae93e:       41 5c                   pop    %r12
> 
> 
> 
> ---------------------------------------------------------
> Subject: [PATCH] cpuacct: Use bigger percpu counter batch values for stats counters on archs that have VIRT_CPU_ACCOUNTING=y
> 
> percpu counters used to accumulate statistics in cpuacct controller use
> the default batch value [max(2*nr_cpus, 32)] which can be too small for
> archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
> cputime updates in the range of thousands. As a result, cpuacct_update_stats()
> would end up acquiring the percpu counter spinlock on every tick which
> is not good for performance.
> 
> Let those architectures have a bigger batch threshold so that the percpu counter
> spinlock isn't taken on every tick. This change doesn't affect the archs which
> don't define VIRT_CPU_ACCOUNTING; they continue to have the default
> percpu counter batch value.
> 
> Cc: Balaji Rao <balajirrao@gmail.com>
> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  kernel/sched.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)

Please reuse the commit log formatting fixes i did when i applied 
your patch. (you should still have that commit notification email)

	Ingo


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:24                     ` Balbir Singh
@ 2009-05-12 10:29                       ` Ingo Molnar
  2009-05-12 10:44                         ` KOSAKI Motohiro
  0 siblings, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-12 10:29 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KOSAKI Motohiro, mingo, hpa, linux-kernel, a.p.zijlstra,
	schwidefsky, balajirrao, dhaval, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits


* Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-05-12 19:13:42]:
> 
> > > > +#ifdef CONFIGCONFIG_SMP
> > > > +	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > > > +#endif
> > > 
> > > Slow down and compile patches before sending them out.. please. That
> > > is a basic expectation if you expect it to be merged.
> > 
> > Unfortunately, this mistake passed testing successfully ;)
> > That's because cpuacct_batch=0 works even on SMP.
> >
> 
> OK, BTW, using an #ifdef right in the middle of a function makes 
> the code harder to read, can't we use an inline function to 
> abstract out SMP?

or rather, to make cpuacct_batch have a sane value on UP too. (1? 
0?)

	Ingo


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:27                       ` Ingo Molnar
@ 2009-05-12 10:42                         ` KOSAKI Motohiro
  0 siblings, 0 replies; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 10:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Peter Zijlstra, mingo, hpa, linux-kernel,
	schwidefsky, balajirrao, dhaval, balbir, bharata, tglx,
	kamezawa.hiroyu, linux-tip-commits

> > Cc: Balaji Rao <balajirrao@gmail.com>
> > Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Cc: Ingo Molnar <mingo@elte.hu>
> > Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> > Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> > Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> > ---
> >  kernel/sched.c |    9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Please reuse the commit log formatting fixes i did when i applied 
> your patch. (you should still have that commit notification email)

Oh, I didn't notice this change. I'll resend it. thanks




* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:29                       ` Ingo Molnar
@ 2009-05-12 10:44                         ` KOSAKI Motohiro
  2009-05-12 13:28                           ` Ingo Molnar
  2009-07-16  8:10                           ` Bharata B Rao
  0 siblings, 2 replies; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 10:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: kosaki.motohiro, Balbir Singh, mingo, hpa, linux-kernel,
	a.p.zijlstra, schwidefsky, balajirrao, dhaval, bharata, tglx,
	kamezawa.hiroyu, linux-tip-commits

> 
> * Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> [2009-05-12 19:13:42]:
> > 
> > > > > +#ifdef CONFIGCONFIG_SMP
> > > > > +	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> > > > > +#endif
> > > > 
> > > > Slow down and compile patches before sending them out.. please. That
> > > > is a basic expectation if you expect it to be merged.
> > > 
> > > Unfortunately, this mistake passed testing successfully ;)
> > > That's because cpuacct_batch=0 works even on SMP.
> > >
> > 
> > OK, BTW, using an #ifdef right in the middle of a function makes 
> > the code harder to read, can't we use an inline function to 
> > abstract out SMP?
> 
> or rather, to make cpuacct_batch have a sane value on UP too. (1? 
> 0?)

umm..

I've reviewed my patch again.
but sched_init() already has multiple #ifdef SMP. Thus I don't think
this cosmetic change improves readability much.



------------------------------------
Subject: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters

percpu counters used to accumulate statistics in cpuacct controller use
the default batch value [max(2*nr_cpus, 32)] which can be too small for
archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
cputime updates in the range of thousands. As a result, cpuacct_update_stats()
would end up acquiring the percpu counter spinlock on every tick which
is not good for performance.

Let those architectures have a bigger batch threshold so that the percpu counter
spinlock isn't taken on every tick. This change doesn't affect the archs which
don't define VIRT_CPU_ACCOUNTING; they continue to have the default
percpu counter batch value.

 v7:
 - fix typo and changelog

 v6:
 - fix build error when UP

 v5:
 - move cpuacct_batch initialization into sched_init()

 v4:
  - rewrite patch description (thanks Bharata!)
  - append read_mostly to cpuacct_batch
  - cpuacct_batch is initialized by sched_init_debug()

 v3:
  - revert using percpu_counter_sum()

 v2:
  - use percpu_counter_sum() instead of percpu_counter_read()

Cc: Balaji Rao <balajirrao@gmail.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 kernel/sched.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c	2009-05-12 13:12:59.000000000 +0900
+++ b/kernel/sched.c	2009-05-12 19:04:49.000000000 +0900
@@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
  */
 int sysctl_sched_rt_runtime = 950000;
 
+static __read_mostly s32 cpuacct_batch;
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
@@ -9284,6 +9286,10 @@ void __init sched_init(void)
 
 	perf_counter_init();
 
+#ifdef CONFIG_SMP
+	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
+#endif
+
 	scheduler_running = 1;
 }
 
@@ -10457,7 +10463,8 @@ static void cpuacct_update_stats(struct 
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, cpuacct_batch);
+
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();




* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:44                         ` KOSAKI Motohiro
@ 2009-05-12 13:28                           ` Ingo Molnar
  2009-05-12 13:40                             ` KOSAKI Motohiro
  2009-07-16  8:10                           ` Bharata B Rao
  1 sibling, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-05-12 13:28 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Balbir Singh, mingo, hpa, linux-kernel, a.p.zijlstra,
	schwidefsky, balajirrao, dhaval, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> > or rather, to make cpuacct_batch have a sane value on UP too. 
> > (1? 0?)
> 
> umm..
> 
> I've reviewed my patch again.
> but sched_init() already has multiple #ifdef SMP. [...]

Patches are welcome to remove more of them.

> [...] Thus I don't think this cosmetic change improves readability
> much.

an avoidable #ifdef should always be avoided.

	Ingo
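One way to satisfy that rule is to confine the SMP/UP difference to a single helper so sched_init() itself carries no #ifdef. The sketch below is hypothetical (not the patch that was eventually merged), and the numeric values are user-space stand-ins for the real kernel symbols:

```c
#include <assert.h>

#define CONFIG_SMP 1	/* pretend we are building an SMP kernel */

/* User-space stand-ins for the kernel symbols involved. */
static const int percpu_counter_batch = 32;
static long jiffies_to_cputime(long j)
{
	return j * 10000;	/* illustrative cputime-per-jiffy scaling */
}

/* The #ifdef lives here, not in the middle of sched_init(). */
#ifdef CONFIG_SMP
static long default_cpuacct_batch(void)
{
	return jiffies_to_cputime(percpu_counter_batch);
}
#else
static long default_cpuacct_batch(void)
{
	return 0;	/* 0 means "flush immediately", a sane UP value */
}
#endif
```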


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch  values for stats counters
  2009-05-12 13:28                           ` Ingo Molnar
@ 2009-05-12 13:40                             ` KOSAKI Motohiro
  2009-05-12 13:42                               ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: KOSAKI Motohiro @ 2009-05-12 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Balbir Singh, mingo, hpa, linux-kernel, a.p.zijlstra,
	schwidefsky, balajirrao, dhaval, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits

2009/5/12 Ingo Molnar <mingo@elte.hu>:
>
> * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
>
>> > or rather, to make cpuacct_batch have a sane value on UP too.
>> > (1? 0?)
>>
>> umm..
>>
>> I've reviewed my patch again.
>> but sched_init() already has multiple #ifdef SMP. [...]
>
> Patches are welcome to remove more of them.
>
>> [...] Thus I don't think this cosmetic change improves readability
>> much.
>
> an avoidable #ifdef should aways be avoided.

ok. I'll fix it tomorrow.


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 13:40                             ` KOSAKI Motohiro
@ 2009-05-12 13:42                               ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-05-12 13:42 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Balbir Singh, mingo, hpa, linux-kernel, a.p.zijlstra,
	schwidefsky, balajirrao, dhaval, bharata, tglx, kamezawa.hiroyu,
	linux-tip-commits


* KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:

> 2009/5/12 Ingo Molnar <mingo@elte.hu>:
> >
> > * KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> wrote:
> >
> >> > or rather, to make cpuacct_batch have a sane value on UP too.
> >> > (1? 0?)
> >>
> >> umm..
> >>
> >> I've reviewed my patch again.
> >> but sched_init() already has multiple #ifdef SMP. [...]
> >
> > Patches are welcome to remove more of them.
> >
> >> [...] Thus I don't think this cosmetic change improves readability
> >> much.
> >
> > an avoidable #ifdef should always be avoided.
> 
> ok. I'll fix it tomorrow.

With the caveat that you should do it only if i'm right and if it 
results in better code in general.

	Ingo


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-05-12 10:44                         ` KOSAKI Motohiro
  2009-05-12 13:28                           ` Ingo Molnar
@ 2009-07-16  8:10                           ` Bharata B Rao
  2009-07-16  8:39                             ` Anton Blanchard
  1 sibling, 1 reply; 62+ messages in thread
From: Bharata B Rao @ 2009-07-16  8:10 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Ingo Molnar, Balbir Singh, mingo, hpa, linux-kernel,
	a.p.zijlstra, schwidefsky, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, linux-tip-commits, Anton Blanchard

On Tue, May 12, 2009 at 07:44:37PM +0900, KOSAKI Motohiro wrote:
> ------------------------------------
> Subject: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
> 
> percpu counters used to accumulate statistics in cpuacct controller use
> the default batch value [max(2*nr_cpus, 32)] which can be too small for
> archs that define VIRT_CPU_ACCOUNTING. In such archs, a tick could result in
> cputime updates in the range of thousands. As a result, cpuacct_update_stats()
> would end up acquiring the percpu counter spinlock on every tick which
> is not good for performance.
> 
> Let those architectures have a bigger batch threshold so that the percpu counter
> spinlock isn't taken on every tick. This change doesn't affect the archs which
> don't define VIRT_CPU_ACCOUNTING; they continue to have the default
> percpu counter batch value.
> 
>  v7:
>  - fix typo and changelog
> 
>  v6:
>  - fix build error when UP
> 
>  v5:
>  - move cpuacct_batch initialization into sched_init()
> 
>  v4:
>   - rewrite patch description (thanks Bharata!)
>   - append read_mostly to cpuacct_batch
>   - cpuacct_batch is initialized by sched_init_debug()
> 
>  v3:
>   - revert using percpu_counter_sum()
> 
>  v2:
>   - use percpu_counter_sum() instead of percpu_counter_read()
> 
> Cc: Balaji Rao <balajirrao@gmail.com>
> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> ---
>  kernel/sched.c |    9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c	2009-05-12 13:12:59.000000000 +0900
> +++ b/kernel/sched.c	2009-05-12 19:04:49.000000000 +0900
> @@ -870,6 +870,8 @@ static __read_mostly int scheduler_runni
>   */
>  int sysctl_sched_rt_runtime = 950000;
> 
> +static __read_mostly s32 cpuacct_batch;
> +
>  static inline u64 global_rt_period(void)
>  {
>  	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> @@ -9284,6 +9286,10 @@ void __init sched_init(void)
> 
>  	perf_counter_init();
> 
> +#ifdef CONFIG_SMP
> +	cpuacct_batch = jiffies_to_cputime(percpu_counter_batch);
> +#endif

On ppc64, calling jiffies_to_cputime() from sched_init() is too early because
jiffies_to_cputime() needs tb_ticks_per_sec which gets initialized only
later in time_init(). Because of this I see that cpuacct_batch will always
be zero effectively negating what this patch is trying to do.

As explained by you earlier, we too are finding the default batch value to
be too low for ppc64 with VIRT_CPU_ACCOUNTING turned on. Hence I guess
> if this patch is taken in (of course with the above issue fixed), it will
benefit ppc64 also.

Regards,
Bharata.
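The ordering bug Bharata reports reduces to a few lines: a value derived in one init function silently becomes 0 because its input is only set by a later init function. All names and numbers below are user-space stand-ins, not real kernel code:

```c
#include <assert.h>

static long tb_ticks_per_sec;	/* 0 until time_init() runs (ppc64) */
static long cpuacct_batch;

/* Stand-in conversion: depends on the not-yet-initialized scaling factor. */
static long jiffies_to_cputime(long j)
{
	return j * (tb_ticks_per_sec / 100);
}

static void sched_init(void)
{
	cpuacct_batch = jiffies_to_cputime(32);	/* too early on ppc64 */
}

static void time_init(void)
{
	tb_ticks_per_sec = 512000000;	/* illustrative timebase frequency */
}
```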


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-07-16  8:10                           ` Bharata B Rao
@ 2009-07-16  8:39                             ` Anton Blanchard
  2009-08-20  5:10                               ` Anton Blanchard
  0 siblings, 1 reply; 62+ messages in thread
From: Anton Blanchard @ 2009-07-16  8:39 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo, hpa,
	linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, linux-tip-commits

 
Hi,

> On ppc64, calling jiffies_to_cputime() from sched_init() is too early because
> jiffies_to_cputime() needs tb_ticks_per_sec which gets initialized only
> later in time_init(). Because of this I see that cpuacct_batch will always
> be zero effectively negating what this patch is trying to do.
> 
> As explained by you earlier, we too are finding the default batch value to
> be too low for ppc64 with VIRT_CPU_ACCOUNTING turned on. Hence I guess
> if this patch is taken in (ofcourse with the above issue fixed), it will
> benefit ppc64 also.

I created this patch earlier today when I hit the problem. Thoughts?

Anton
--

When CONFIG_VIRT_CPU_ACCOUNTING is enabled we can call cpuacct_update_stats
with values much larger than percpu_counter_batch. This means the
call to percpu_counter_add will always add to the global count which is
protected by a spinlock.

Since reading of the CPU accounting cgroup counters is not performance
critical, we can use a maximum size batch of INT_MAX and use
percpu_counter_sum on the read side which will add all the percpu
counters.

With this patch an 8 core POWER6 with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT shows an improvement in aggregate context switch rate of
397k/sec to 3.9M/sec, a 10x improvement.

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c	2009-07-16 10:11:02.000000000 +1000
+++ linux.trees.git/kernel/sched.c	2009-07-16 10:16:41.000000000 +1000
@@ -10551,7 +10551,7 @@
 	int i;
 
 	for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
-		s64 val = percpu_counter_read(&ca->cpustat[i]);
+		s64 val = percpu_counter_sum(&ca->cpustat[i]);
 		val = cputime64_to_clock_t(val);
 		cb->fill(cb, cpuacct_stat_desc[i], val);
 	}
@@ -10621,7 +10621,7 @@
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();
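The trade-off in this patch can be sketched the same way in user space. This is a simplified model: in the real kernel, percpu_counter_sum() takes the counter's lock and walks all online CPUs, so it is O(nr_cpus) but exact.

```c
#include <assert.h>

#define NR_CPUS 4

/* Illustrative stand-in for a percpu_counter with unflushed deltas. */
struct pc {
	long count;         /* shared count */
	long slot[NR_CPUS]; /* per-CPU deltas; never flushed with batch = INT_MAX */
};

/* Like percpu_counter_read(): fast but ignores the per-CPU deltas. */
static long pc_read(const struct pc *c)
{
	return c->count;
}

/* Like percpu_counter_sum(): slower but folds in every CPU's delta. */
static long pc_sum(const struct pc *c)
{
	long v = c->count;

	for (int i = 0; i < NR_CPUS; i++)
		v += c->slot[i];
	return v;
}
```

With batch = INT_MAX the update side never touches the shared count, so only the sum-based read returns the full total; that is why the read side is switched to percpu_counter_sum() in the same patch.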


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-07-16  8:39                             ` Anton Blanchard
@ 2009-08-20  5:10                               ` Anton Blanchard
  2009-08-20  5:16                                 ` KAMEZAWA Hiroyuki
                                                   ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Anton Blanchard @ 2009-08-20  5:10 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo, hpa,
	linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, akpm


Hi,

Looks like this issue is still present. I tested on a 32 core box and 
the patch improved maximum context switch rate from 76k/sec to 9.5M/sec.
That's over 100x faster, or 50x per line of code. That's got to be some sort of
record :)

Any chance we can get a fix in for 2.6.31? Don't make me find an even bigger
box so I can break the 200x mark :)

Anton

> --
> 
> When CONFIG_VIRT_CPU_ACCOUNTING is enabled we can call cpuacct_update_stats
> with values much larger than percpu_counter_batch. This means the
> call to percpu_counter_add will always add to the global count which is
> protected by a spinlock.
> 
> Since reading of the CPU accounting cgroup counters is not performance
> critical, we can use a maximum size batch of INT_MAX and use
> percpu_counter_sum on the read side which will add all the percpu
> counters.
> 
> With this patch an 8 core POWER6 with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT shows an improvement in aggregate context switch rate of
> 397k/sec to 3.9M/sec, a 10x improvement.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
> 
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c	2009-07-16 10:11:02.000000000 +1000
> +++ linux.trees.git/kernel/sched.c	2009-07-16 10:16:41.000000000 +1000
> @@ -10551,7 +10551,7 @@
>  	int i;
>  
>  	for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> -		s64 val = percpu_counter_read(&ca->cpustat[i]);
> +		s64 val = percpu_counter_sum(&ca->cpustat[i]);
>  		val = cputime64_to_clock_t(val);
>  		cb->fill(cb, cpuacct_stat_desc[i], val);
>  	}
> @@ -10621,7 +10621,7 @@
>  	ca = task_ca(tsk);
>  
>  	do {
> -		percpu_counter_add(&ca->cpustat[idx], val);
> +		__percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
>  		ca = ca->parent;
>  	} while (ca);
>  	rcu_read_unlock();


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  5:10                               ` Anton Blanchard
@ 2009-08-20  5:16                                 ` KAMEZAWA Hiroyuki
  2009-08-20  5:26                                 ` Balbir Singh
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
  2 siblings, 0 replies; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-20  5:16 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, akpm

On Thu, 20 Aug 2009 15:10:38 +1000
Anton Blanchard <anton@samba.org> wrote:

> 
> Hi,
> 
> Looks like this issue is still present. I tested on a 32 core box and 
> the patch improved maximum context switch rate from from 76k/sec to 9.5M/sec.
> Thats over 100x faster, or 50x per line of code. That's got to be some sort of
> record :)
> 
> Any chance we can get a fix in for 2.6.31? Don't make me find an even bigger
> box so I can break the 200x mark :)
> 
I like the patch itself; I think it is a correct fix.

Hmm, but IIRC there was a discussion along the lines of "what happens
if this counter is read once per 1ms?" or so.

What is the read overhead on a big system? Can you show it?

Thanks,
-Kame


> Anton
> 
> > --
> > 
> > When CONFIG_VIRT_CPU_ACCOUNTING is enabled we can call cpuacct_update_stats
> > with values much larger than percpu_counter_batch. This means the
> > call to percpu_counter_add will always add to the global count which is
> > protected by a spinlock.
> > 
> > Since reading of the CPU accounting cgroup counters is not performance
> > critical, we can use a maximum size batch of INT_MAX and use
> > percpu_counter_sum on the read side which will add all the percpu
> > counters.
> > 
> > With this patch an 8 core POWER6 with CONFIG_VIRT_CPU_ACCOUNTING and
> > CONFIG_CGROUP_CPUACCT shows an improvement in aggregate context switch rate of
> > 397k/sec to 3.9M/sec, a 10x improvement.
> > 
> > Signed-off-by: Anton Blanchard <anton@samba.org>
> > ---
> > 
> > Index: linux.trees.git/kernel/sched.c
> > ===================================================================
> > --- linux.trees.git.orig/kernel/sched.c	2009-07-16 10:11:02.000000000 +1000
> > +++ linux.trees.git/kernel/sched.c	2009-07-16 10:16:41.000000000 +1000
> > @@ -10551,7 +10551,7 @@
> >  	int i;
> >  
> >  	for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> > -		s64 val = percpu_counter_read(&ca->cpustat[i]);
> > +		s64 val = percpu_counter_sum(&ca->cpustat[i]);
> >  		val = cputime64_to_clock_t(val);
> >  		cb->fill(cb, cpuacct_stat_desc[i], val);
> >  	}
> > @@ -10621,7 +10621,7 @@
> >  	ca = task_ca(tsk);
> >  
> >  	do {
> > -		percpu_counter_add(&ca->cpustat[idx], val);
> > +		__percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
> >  		ca = ca->parent;
> >  	} while (ca);
> >  	rcu_read_unlock();
> 



* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  5:10                               ` Anton Blanchard
  2009-08-20  5:16                                 ` KAMEZAWA Hiroyuki
@ 2009-08-20  5:26                                 ` Balbir Singh
  2009-08-20  5:52                                   ` Peter Zijlstra
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
  2 siblings, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2009-08-20  5:26 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, mingo, hpa,
	linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, akpm

* Anton Blanchard <anton@samba.org> [2009-08-20 15:10:38]:

> 
> Hi,
> 
> Looks like this issue is still present. I tested on a 32 core box and 
> the patch improved maximum context switch rate from from 76k/sec to 9.5M/sec.
> Thats over 100x faster, or 50x per line of code. That's got to be some sort of
> record :)
> 
> Any chance we can get a fix in for 2.6.31? Don't make me find an even bigger
> box so I can break the 200x mark :)
> 
> Anton
> 
> > --
> > 
> > When CONFIG_VIRT_CPU_ACCOUNTING is enabled we can call cpuacct_update_stats
> > with values much larger than percpu_counter_batch. This means the
> > call to percpu_counter_add will always add to the global count which is
> > protected by a spinlock.
> > 
> > Since reading of the CPU accounting cgroup counters is not performance
> > critical, we can use a maximum size batch of INT_MAX and use
> > percpu_counter_sum on the read side which will add all the percpu
> > counters.
> > 
> > With this patch an 8 core POWER6 with CONFIG_VIRT_CPU_ACCOUNTING and
> > CONFIG_CGROUP_CPUACCT shows an improvement in aggregate context switch rate of
> > 397k/sec to 3.9M/sec, a 10x improvement.
> > 

Looks good overall, but why not keep the batch size conditional on
CONFIG_VIRT_CPU_ACCOUNTING? I'd still like to stick with
percpu_counter_read() on the read side; my concern is that a
bad user space application could read cpuacct.stat and bring the kernel
to its knees.

> > Signed-off-by: Anton Blanchard <anton@samba.org>
> > ---
> > 
> > Index: linux.trees.git/kernel/sched.c
> > ===================================================================
> > --- linux.trees.git.orig/kernel/sched.c	2009-07-16 10:11:02.000000000 +1000
> > +++ linux.trees.git/kernel/sched.c	2009-07-16 10:16:41.000000000 +1000
> > @@ -10551,7 +10551,7 @@
> >  	int i;
> >  
> >  	for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> > -		s64 val = percpu_counter_read(&ca->cpustat[i]);
> > +		s64 val = percpu_counter_sum(&ca->cpustat[i]);
> >  		val = cputime64_to_clock_t(val);
> >  		cb->fill(cb, cpuacct_stat_desc[i], val);
> >  	}
> > @@ -10621,7 +10621,7 @@
> >  	ca = task_ca(tsk);
> >  
> >  	do {
> > -		percpu_counter_add(&ca->cpustat[idx], val);
> > +		__percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
> >  		ca = ca->parent;
> >  	} while (ca);
> >  	rcu_read_unlock();

-- 
	Balbir


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  5:26                                 ` Balbir Singh
@ 2009-08-20  5:52                                   ` Peter Zijlstra
  2009-08-20  6:05                                     ` Anton Blanchard
  0 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2009-08-20  5:52 UTC (permalink / raw)
  To: balbir
  Cc: Anton Blanchard, Bharata B Rao, KOSAKI Motohiro, Ingo Molnar,
	mingo, hpa, linux-kernel, schwidefsky, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm

On Thu, 2009-08-20 at 10:56 +0530, Balbir Singh wrote:

> Looks good overall, why not keep the batch size conditional on
> CONFIG_VIRT_CPU_ACCOUNTING? I'd still like to stick with
> percpu_counter_read() on the read side because My concern is that a
> bad user space application can read cpuacct.stat and bring the kernel
> to its knees.

Agreed, the first hunk is dangerous, on the second hunk, I'm not sure
about the INT_MAX thing, that's a 4s gap per cpu, that might be a tad
much.

> > > Signed-off-by: Anton Blanchard <anton@samba.org>
> > > ---
> > > 
> > > Index: linux.trees.git/kernel/sched.c
> > > ===================================================================
> > > --- linux.trees.git.orig/kernel/sched.c	2009-07-16 10:11:02.000000000 +1000
> > > +++ linux.trees.git/kernel/sched.c	2009-07-16 10:16:41.000000000 +1000
> > > @@ -10551,7 +10551,7 @@
> > >  	int i;
> > >  
> > >  	for (i = 0; i < CPUACCT_STAT_NSTATS; i++) {
> > > -		s64 val = percpu_counter_read(&ca->cpustat[i]);
> > > +		s64 val = percpu_counter_sum(&ca->cpustat[i]);
> > >  		val = cputime64_to_clock_t(val);
> > >  		cb->fill(cb, cpuacct_stat_desc[i], val);
> > >  	}
> > > @@ -10621,7 +10621,7 @@
> > >  	ca = task_ca(tsk);
> > >  
> > >  	do {
> > > -		percpu_counter_add(&ca->cpustat[idx], val);
> > > +		__percpu_counter_add(&ca->cpustat[idx], val, INT_MAX);
> > >  		ca = ca->parent;
> > >  	} while (ca);
> > >  	rcu_read_unlock();
> 



* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  5:52                                   ` Peter Zijlstra
@ 2009-08-20  6:05                                     ` Anton Blanchard
  2009-08-20  6:10                                       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 62+ messages in thread
From: Anton Blanchard @ 2009-08-20  6:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: balbir, Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, mingo, hpa,
	linux-kernel, schwidefsky, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm


Hi,

> Agreed, the first hunk is dangerous, on the second hunk, I'm not sure
> about the INT_MAX thing, that's a 4s gap per cpu, that might be a tad
> much.

Yeah I somehow missed the fact we read it every timer tick. I was hoping
for an easy fix but I guess we just need to set a reasonable batch size
at runtime.

Anton


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  6:05                                     ` Anton Blanchard
@ 2009-08-20  6:10                                       ` KAMEZAWA Hiroyuki
  2009-08-20  6:24                                         ` Anton Blanchard
  0 siblings, 1 reply; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-20  6:10 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Peter Zijlstra, balbir, Bharata B Rao, KOSAKI Motohiro,
	Ingo Molnar, mingo, hpa, linux-kernel, schwidefsky, balajirrao,
	dhaval, tglx, akpm

On Thu, 20 Aug 2009 16:05:28 +1000
Anton Blanchard <anton@samba.org> wrote:

> 
> Hi,
> 
> > Agreed, the first hunk is dangerous, on the second hunk, I'm not sure
> > about the INT_MAX thing, that's a 4s gap per cpu, that might be a tad
> > much.
> 
> Yeah I somehow missed the fact we read it every timer tick. I was hoping
> for an easy fix but I guess we just need to set a reasonable batch size
> at runtime.
> 

Could you share the context-switch-test program?
I'd like to play with it to find out what I can do about the percpu counter.

Thanks,
-Kame

> Anton
> 



* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  6:10                                       ` KAMEZAWA Hiroyuki
@ 2009-08-20  6:24                                         ` Anton Blanchard
  2009-08-20  8:41                                           ` [PATCH] better align percpu counter (Was " KAMEZAWA Hiroyuki
                                                             ` (2 more replies)
  0 siblings, 3 replies; 62+ messages in thread
From: Anton Blanchard @ 2009-08-20  6:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, balbir, Bharata B Rao, KOSAKI Motohiro,
	Ingo Molnar, mingo, hpa, linux-kernel, schwidefsky, balajirrao,
	dhaval, tglx, akpm

 
Hi,

> Could you share contex-switch-test program ?
> I'd like to play with it to find out what I can do against percpu counter.

Sure:

http://ozlabs.org/~anton/junkcode/context_switch.c

Very simple, just run it once per core:

for i in `seq 0 31`
do
	taskset -c $i ./context_switch &
done

Then look at the context switch rates in vmstat.

Anton


* [PATCH] better align percpu counter (Was Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  6:24                                         ` Anton Blanchard
@ 2009-08-20  8:41                                           ` KAMEZAWA Hiroyuki
  2009-08-20 10:04                                             ` Ingo Molnar
  2009-08-21  3:56                                             ` KAMEZAWA Hiroyuki
  2009-08-20 11:46                                           ` Balbir Singh
  2009-08-21  6:21                                           ` KAMEZAWA Hiroyuki
  2 siblings, 2 replies; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-20  8:41 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Peter Zijlstra, balbir, Bharata B Rao, KOSAKI Motohiro,
	Ingo Molnar, mingo, hpa, linux-kernel, schwidefsky, balajirrao,
	dhaval, tglx, akpm

On Thu, 20 Aug 2009 16:24:51 +1000
Anton Blanchard <anton@samba.org> wrote:

>  
> Hi,
> 
> > Could you share contex-switch-test program ?
> > I'd like to play with it to find out what I can do against percpu counter.
> 
> Sure:
> 
> http://ozlabs.org/~anton/junkcode/context_switch.c
> 
> Very simple, just run it once per core:
> 
> for i in `seq 0 31`
> do
> 	taskset -c $i ./context_switch &
> done
> 
> Then look at the context switch rates in vmstat.
> 
Thank you for the test program.

Before adjusting the batch counter (I think you should modify it),
could you try this?

I only have an 8-cpu (2-socket) host, but it works well.
(But... my host is x86-64 and does not have virt-cpu-accounting.)

with your program
before patch.
cpuacct off : 414000-416000 ctsw per sec.
cpuacct on  : 401000-404000 ctsw per sec.

after patch
cpuacct on  : 412000-413000 ctsw per sec.

Maybe I should check the cache-miss rate ;)
==
It's bad to place the pointer to the per-cpu data array on the
same cache line as the spinlock. This patch rearranges percpu_counter's
fields across cache lines to reduce false sharing.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/percpu_counter.h |   12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6.31-rc6/include/linux/percpu_counter.h
===================================================================
--- linux-2.6.31-rc6.orig/include/linux/percpu_counter.h	2009-08-20 12:09:27.000000000 +0900
+++ linux-2.6.31-rc6/include/linux/percpu_counter.h	2009-08-20 17:31:13.000000000 +0900
@@ -14,14 +14,24 @@
 #include <linux/types.h>
 
 #ifdef CONFIG_SMP
+struct __percpu_counter_padding {
+	char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define CACHELINE_PADDING(name)  struct __percpu_counter_padding name
 
 struct percpu_counter {
+	/*
+	 * This pointer is persistent and is accessed first;
+	 * it should not be evicted by lock traffic from other cpus.
+	 */
+	s32 *counters;
+	CACHELINE_PADDING(pad);
 	spinlock_t lock;
 	s64 count;
 #ifdef CONFIG_HOTPLUG_CPU
+	/* rarely accessed field */
 	struct list_head list;	/* All percpu_counters are on a list */
 #endif
-	s32 *counters;
 };
 
 extern int percpu_counter_batch;




* Re: [PATCH] better align percpu counter (Was Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  8:41                                           ` [PATCH] better align percpu counter (Was " KAMEZAWA Hiroyuki
@ 2009-08-20 10:04                                             ` Ingo Molnar
  2009-08-21  2:20                                               ` KAMEZAWA Hiroyuki
  2009-08-21  3:56                                             ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 62+ messages in thread
From: Ingo Molnar @ 2009-08-20 10:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Anton Blanchard, Peter Zijlstra, balbir, Bharata B Rao,
	KOSAKI Motohiro, mingo, hpa, linux-kernel, schwidefsky,
	balajirrao, dhaval, tglx, akpm


* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 20 Aug 2009 16:24:51 +1000
> Anton Blanchard <anton@samba.org> wrote:
> 
> >  
> > Hi,
> > 
> > > Could you share contex-switch-test program ?
> > > I'd like to play with it to find out what I can do against percpu counter.
> > 
> > Sure:
> > 
> > http://ozlabs.org/~anton/junkcode/context_switch.c
> > 
> > Very simple, just run it once per core:
> > 
> > for i in `seq 0 31`
> > do
> > 	taskset -c $i ./context_switch &
> > done
> > 
> > Then look at the context switch rates in vmstat.
> > 
> Thank you for test program.
> 
> Before adjusting batch counter (I think you should modify it),
> Could you try this ?
> 
> I only have 8cpu(2socket) host but works well.
> (But...my host is x86-64 and has not virt-cpu-accouting.)
> 
> with your program
> before patch.
> cpuacct off : 414000-416000 ctsw per sec.
> cpuacct on  : 401000-404000 ctsw per sec.
> 
> after patch
> cpuacct on  : 412000-413000 ctsw per sec.
> 
> Maybe I should check cache-miss late ;)

Btw., in latest upstream you can do that via:

  cd tools/perf/
  make -j install

  perf stat --repeat 5 -- taskset -c 1 ./context_switch

there will be cache-miss and other stats, error bars so 
you can compare the before/after better, etc.

	Ingo


* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch  values for stats counters
  2009-08-20  6:24                                         ` Anton Blanchard
  2009-08-20  8:41                                           ` [PATCH] better align percpu counter (Was " KAMEZAWA Hiroyuki
@ 2009-08-20 11:46                                           ` Balbir Singh
  2009-08-21  6:21                                           ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 62+ messages in thread
From: Balbir Singh @ 2009-08-20 11:46 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: KAMEZAWA Hiroyuki, Peter Zijlstra, Bharata B Rao,
	KOSAKI Motohiro, Ingo Molnar, mingo, hpa, linux-kernel,
	schwidefsky, balajirrao, dhaval, tglx, akpm

Hi, Anton,

Do you have any data on what level of batching is good for
VIRT_CPU_ACCOUNTING? If it is going to be as high as INT_MAX, then
your patch seems to be the best way to go forward. We'll need to herd
away bad users who read cpuacct.stat frequently enough to create a
problem.

Balbir Singh.


* Re: [PATCH] better align percpu counter (Was Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20 10:04                                             ` Ingo Molnar
@ 2009-08-21  2:20                                               ` KAMEZAWA Hiroyuki
  2009-08-21 11:29                                                 ` Ingo Molnar
  0 siblings, 1 reply; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-21  2:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Anton Blanchard, Peter Zijlstra, balbir, Bharata B Rao,
	KOSAKI Motohiro, mingo, hpa, linux-kernel, schwidefsky,
	balajirrao, dhaval, tglx, akpm

On Thu, 20 Aug 2009 12:04:03 +0200
Ingo Molnar <mingo@elte.hu> wrote:
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > with your program
> > before patch.
> > cpuacct off : 414000-416000 ctsw per sec.
> > cpuacct on  : 401000-404000 ctsw per sec.
> > 
> > after patch
> > cpuacct on  : 412000-413000 ctsw per sec.
> > 
> > Maybe I should check cache-miss late ;)
> 
> Btw., in latest upstream you can do that via:
> 
>   cd tools/perf/
>   make -j install
> 
>   perf stat --repeat 5 -- taskset -c 1 ./context_switch
> 

Tried (on an 8-cpu/2-socket host). It seems cache misses decrease,
but IPC...?
==
/root/bin/perf stat --repeat 5 -a -e cycles,instructions,cache-misses,L1-dcache-load-misses,L1-dcache-store-misses -- ./ctxt_sw.sh
[Before] patch
 Performance counter stats for './ctxt_sw.sh' (5 runs):

  1511260148530  cycles                     ( +-   0.025% )  (scaled from 63.49%             )
   470690642181  instructions             #      0.311 IPC     ( +-   0.098% )(scaled from 79.49%)
     1210051728  cache-misses               ( +-   0.629% )  (scaled from 79.00%             )
     3202978828  L1-dcache-load-misses      ( +-   1.118% )  (scaled from 78.00%             )
     1803963907  L1-dcache-store-misses     ( +-   0.728% )  (scaled from 42.99%             )

   60.161941918  seconds time elapsed   ( +-   0.029% )

[After] patch
 Performance counter stats for './ctxt_sw.sh' (5 runs):

  1511961867506  cycles                     ( +-   0.018% )  (scaled from 71.50%)
   448724406149  instructions             #      0.297 IPC     ( +-   0.133% )  (scaled from 75.49%)
     1184548041  cache-misses               ( +-   0.136% )  (scaled from 75.50%)
     3086357048  L1-dcache-load-misses      ( +-   0.822% )  (scaled from 77.50%)
     1708375493  L1-dcache-store-misses     ( +-   0.328% )  (scaled from 47.00%)

   60.179814774  seconds time elapsed   ( +-   0.052% )

Thanks,
-Kame



* Re: [PATCH] better align percpu counter (Was Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  8:41                                           ` [PATCH] better align percpu counter (Was " KAMEZAWA Hiroyuki
  2009-08-20 10:04                                             ` Ingo Molnar
@ 2009-08-21  3:56                                             ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-21  3:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Anton Blanchard, Peter Zijlstra, balbir, Bharata B Rao,
	KOSAKI Motohiro, Ingo Molnar, mingo, hpa, linux-kernel,
	schwidefsky, balajirrao, dhaval, tglx, akpm

On Thu, 20 Aug 2009 17:41:23 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/percpu_counter.h |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.31-rc6/include/linux/percpu_counter.h
> ===================================================================
> --- linux-2.6.31-rc6.orig/include/linux/percpu_counter.h	2009-08-20 12:09:27.000000000 +0900
> +++ linux-2.6.31-rc6/include/linux/percpu_counter.h	2009-08-20 17:31:13.000000000 +0900
> @@ -14,14 +14,24 @@
>  #include <linux/types.h>
>  
>  #ifdef CONFIG_SMP
> +struct __percpu_counter_padding {
> +	char x[0];
> +} ____cacheline_internodealigned_in_smp;

The above alignment is too big... please ignore this patch :(

-Kame


> +#define CACHELINE_PADDING(name)  struct __percpu_counter_padding name
>  
>  struct percpu_counter {
> +	/*
> +	 * This pointer is persistent and accessed firstly.
> +	 * Then, should not be purged by locking in other cpus.
> +	 */
> +	s32 *counters;
> +	CACHELINE_PADDING(pad);
>  	spinlock_t lock;
>  	s64 count;
>  #ifdef CONFIG_HOTPLUG_CPU
> +	/* rarely accessed field */
>  	struct list_head list;	/* All percpu_counters are on a list */
>  #endif
> -	s32 *counters;
>  };
>  
>  extern int percpu_counter_batch;
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 



* Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  6:24                                         ` Anton Blanchard
  2009-08-20  8:41                                           ` [PATCH] better align percpu counter (Was " KAMEZAWA Hiroyuki
  2009-08-20 11:46                                           ` Balbir Singh
@ 2009-08-21  6:21                                           ` KAMEZAWA Hiroyuki
  2 siblings, 0 replies; 62+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-08-21  6:21 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Peter Zijlstra, balbir, Bharata B Rao, KOSAKI Motohiro,
	Ingo Molnar, mingo, hpa, linux-kernel, schwidefsky, balajirrao,
	dhaval, tglx, akpm

On Thu, 20 Aug 2009 16:24:51 +1000
Anton Blanchard <anton@samba.org> wrote:

>  
> Hi,
> 
> > Could you share contex-switch-test program ?
> > I'd like to play with it to find out what I can do against percpu counter.
> 
> Sure:
> 
> http://ozlabs.org/~anton/junkcode/context_switch.c
> 
> Very simple, just run it once per core:
> 
> for i in `seq 0 31`
> do
> 	taskset -c $i ./context_switch &
> done
> 
> Then look at the context switch rates in vmstat.
> 

OK, I reproduced it on an ia64 box (ia64 supports VIRT_CPU_ACCOUNTING).

On 8cpu
-- config=on--
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
11  0      0 7185152  66048 284416    0    0     0     0 5918 12354  0 100  0  0  0
--config=off--
11  0      0 7194496  66304 283392    0    0     0    20 8023 1424022 32 68  0  0  0

Wow, yes. 100x regression.

I'll dig this more. 

IIUC, if CONFIG_VIRT_CPU_ACCOUNTING=y, a process's stime/utime is recorded
in nanoseconds. Then batch=32 is far too small. How about batching once per msec?
Thanks,
-Kame



* Re: [PATCH] better align percpu counter (Was Re: [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-21  2:20                                               ` KAMEZAWA Hiroyuki
@ 2009-08-21 11:29                                                 ` Ingo Molnar
  0 siblings, 0 replies; 62+ messages in thread
From: Ingo Molnar @ 2009-08-21 11:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Anton Blanchard, Peter Zijlstra, balbir, Bharata B Rao,
	KOSAKI Motohiro, mingo, hpa, linux-kernel, schwidefsky,
	balajirrao, dhaval, tglx, akpm


* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 20 Aug 2009 12:04:03 +0200
> Ingo Molnar <mingo@elte.hu> wrote:
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > with your program
> > > before patch.
> > > cpuacct off : 414000-416000 ctsw per sec.
> > > cpuacct on  : 401000-404000 ctsw per sec.
> > > 
> > > after patch
> > > cpuacct on  : 412000-413000 ctsw per sec.
> > > 
> > > Maybe I should check cache-miss late ;)
> > 
> > Btw., in latest upstream you can do that via:
> > 
> >   cd tools/perf/
> >   make -j install
> > 
> >   perf stat --repeat 5 -- taskset -c 1 ./context_switch
> > 
> 
> tried. (on 8cpu/2socket host). It seems cache-miss decreases. But 
> IPC ..?

All the numbers have gone down - about the same amount of cycles but 
fewer instructions executed, and fewer cache-misses. That's good.

The Instructions Per Cycle metric got worse because cycles stayed 
constant. One thing is that you have triggered counter-over-commit 
(the 'scaled from' messages) - this means that more counters are 
used than the hardware has space for - so we round-robin schedule 
them.

If you want to get to the bottom of that, to get the most precise 
result try something like:

    perf stat --repeat 5 -a -e \
        cycles,instructions,L1-dcache-load-misses,L1-dcache-store-misses \
        -- ./ctxt_sw.sh

( this is almost the same as the command line you used, but without 
  the 'cache-misses' counter. Your CPU should be able to 
  simultaneously activate all these counters and they should count 
  100% of the events. )

	Ingo


* [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2009-08-20  5:10                               ` Anton Blanchard
  2009-08-20  5:16                                 ` KAMEZAWA Hiroyuki
  2009-08-20  5:26                                 ` Balbir Singh
@ 2010-01-18  4:41                                 ` Anton Blanchard
  2010-01-18  4:55                                   ` KOSAKI Motohiro
                                                     ` (5 more replies)
  2 siblings, 6 replies; 62+ messages in thread
From: Anton Blanchard @ 2010-01-18  4:41 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo, hpa,
	linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu,
	Heiko Carstens, linux390


Hi,

Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
and CONFIG_CGROUP_CPUACCT enabled. Thoughts?

--

When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
call cpuacct_update_stats with values much larger than percpu_counter_batch.
This means the call to percpu_counter_add will always add to the global count
which is protected by a spinlock and we end up with a global spinlock in
the scheduler.

Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.

This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.

For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and 
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:

CONFIG_CGROUP_CPUACCT disabled:		16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled:		   61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch:		16663217 ctx switches/sec

Tested with:

wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Note: ccing ia64 and s390 who have not yet added code to statically
initialise cputime_one_jiffy at boot. 
See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
jiffies_to_cputime(1) for details). Adding this would help optimise not only
this patch but many other areas of the scheduler when
CONFIG_VIRT_CPU_ACCOUNTING is enabled.

Index: linux.trees.git/kernel/sched.c
===================================================================
--- linux.trees.git.orig/kernel/sched.c	2010-01-18 14:27:12.000000000 +1100
+++ linux.trees.git/kernel/sched.c	2010-01-18 15:21:59.000000000 +1100
@@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct 
 		enum cpuacct_stat_index idx, cputime_t val)
 {
 	struct cpuacct *ca;
+	int batch;
 
 	if (unlikely(!cpuacct_subsys.active))
 		return;
@@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct 
 	rcu_read_lock();
 	ca = task_ca(tsk);
 
+	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, batch);
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
@ 2010-01-18  4:55                                   ` KOSAKI Motohiro
  2010-01-18  7:51                                   ` Peter Zijlstra
                                                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 62+ messages in thread
From: KOSAKI Motohiro @ 2010-01-18  4:55 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: kosaki.motohiro, Bharata B Rao, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu,
	Heiko Carstens, linux390

> 
> Hi,
> 
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
> 
> --
> 
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
> 
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would
> if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu add/remove.
> 
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
> 
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to
> optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
> at least on x86 and PowerPC. So there is no need to add an #ifdef.
> 
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and 
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
> and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
> 
> CONFIG_CGROUP_CPUACCT disabled:		16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled:		   61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch:		16663217 ctx switches/sec
> 
> Tested with:
> 
> wget http://ozlabs.org/~anton/junkcode/context_switch.c
> make context_switch
> for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> vmstat 1
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>

Looks fine to me. Thanks Anton!
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


> ---
> 
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot. 
> See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
> jiffies_to_cputime(1) for details). Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
> 
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c	2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c	2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct 
>  		enum cpuacct_stat_index idx, cputime_t val)
>  {
>  	struct cpuacct *ca;
> +	int batch;
>  
>  	if (unlikely(!cpuacct_subsys.active))
>  		return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct 
>  	rcu_read_lock();
>  	ca = task_ca(tsk);
>  
> +	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
>  	do {
> -		percpu_counter_add(&ca->cpustat[idx], val);
> +		__percpu_counter_add(&ca->cpustat[idx], val, batch);
>  		ca = ca->parent;
>  	} while (ca);
>  	rcu_read_unlock();


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
  2010-01-18  4:55                                   ` KOSAKI Motohiro
@ 2010-01-18  7:51                                   ` Peter Zijlstra
  2010-01-18  8:21                                     ` Anton Blanchard
  2010-01-18  8:35                                   ` Balbir Singh
                                                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2010-01-18  7:51 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, schwidefsky, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390

On Mon, 2010-01-18 at 15:41 +1100, Anton Blanchard wrote:
> Hi,
> 
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?

Seems like a good idea, but isn't that batch number rather static? If
so, computing it in some init path would save that multiply on the
actual accounting path.

> 
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c	2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c	2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct 
>  		enum cpuacct_stat_index idx, cputime_t val)
>  {
>  	struct cpuacct *ca;
> +	int batch;
>  
>  	if (unlikely(!cpuacct_subsys.active))
>  		return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct 
>  	rcu_read_lock();
>  	ca = task_ca(tsk);
>  
> +	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
>  	do {
> -		percpu_counter_add(&ca->cpustat[idx], val);
> +		__percpu_counter_add(&ca->cpustat[idx], val, batch);
>  		ca = ca->parent;
>  	} while (ca);
>  	rcu_read_unlock();


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  7:51                                   ` Peter Zijlstra
@ 2010-01-18  8:21                                     ` Anton Blanchard
  2010-01-18  8:34                                       ` Peter Zijlstra
  0 siblings, 1 reply; 62+ messages in thread
From: Anton Blanchard @ 2010-01-18  8:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, schwidefsky, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390


Hi Peter,

> Seems like a good idea, but isn't that batch number rather static? If
> so, computing it in some init path would save that multiply on the
> actual accounting path.

It is mostly static, but percpu_counter_batch does change with hotplug
operations. Adding a hotplug notifier felt like a lot of work but I was
worried we have issues if we didn't scale with hotplug add operations.

Anton


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  8:21                                     ` Anton Blanchard
@ 2010-01-18  8:34                                       ` Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2010-01-18  8:34 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, schwidefsky, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390

On Mon, 2010-01-18 at 19:21 +1100, Anton Blanchard wrote:
> Hi Peter,
> 
> > Seems like a good idea, but isn't that batch number rather static? If
> > so, computing it in some init path would save that multiply on the
> > actual accounting path.
> 
> It is mostly static, but percpu_counter_batch does change with hotplug
> operations. Adding a hotplug notifier felt like a lot of work but I was
> worried we have issues if we didn't scale with hotplug add operations.

OK, trading a notifier for a mult might be worth it, we'll see if
someone complains ;-)

Let's go with this for now.


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
  2010-01-18  4:55                                   ` KOSAKI Motohiro
  2010-01-18  7:51                                   ` Peter Zijlstra
@ 2010-01-18  8:35                                   ` Balbir Singh
  2010-01-18  9:42                                   ` Martin Schwidefsky
                                                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 62+ messages in thread
From: Balbir Singh @ 2010-01-18  8:35 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, mingo, hpa,
	linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu,
	Heiko Carstens, linux390

On Monday 18 January 2010 10:11 AM, Anton Blanchard wrote:
> 
> Hi,
> 
> Another try at this percpu_counter batch issue with CONFIG_VIRT_CPU_ACCOUNTING
> and CONFIG_CGROUP_CPUACCT enabled. Thoughts?
> 
> --
> 
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.
> 
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would
> if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu add/remove.
> 
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
> 
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to
> optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
> at least on x86 and PowerPC. So there is no need to add an #ifdef.
> 
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and 
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
> and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
> 
> CONFIG_CGROUP_CPUACCT disabled:		16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled:		   61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch:		16663217 ctx switches/sec
> 
> Tested with:
> 
> wget http://ozlabs.org/~anton/junkcode/context_switch.c
> make context_switch
> for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> vmstat 1
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> ---
> 
> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot. 
> See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
> jiffies_to_cputime(1) for details). Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.
> 
> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c	2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c	2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct 
>  		enum cpuacct_stat_index idx, cputime_t val)
>  {
>  	struct cpuacct *ca;
> +	int batch;
> 
>  	if (unlikely(!cpuacct_subsys.active))
>  		return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct 
>  	rcu_read_lock();
>  	ca = task_ca(tsk);
> 
> +	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
>  	do {
> -		percpu_counter_add(&ca->cpustat[idx], val);
> +		__percpu_counter_add(&ca->cpustat[idx], val, batch);
>  		ca = ca->parent;
>  	} while (ca);
>  	rcu_read_unlock();

Looks good to me, but I'll test it as well and revert back. I think we
might need to look at the call side where we do the percpu_counter_read().

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Balbir


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
                                                     ` (2 preceding siblings ...)
  2010-01-18  8:35                                   ` Balbir Singh
@ 2010-01-18  9:42                                   ` Martin Schwidefsky
  2010-01-18  9:55                                     ` Anton Blanchard
  2010-01-25 23:14                                   ` Andrew Morton
  2010-01-27 13:15                                   ` [tip:sched/urgent] " tip-bot for Anton Blanchard
  5 siblings, 1 reply; 62+ messages in thread
From: Martin Schwidefsky @ 2010-01-18  9:42 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, a.p.zijlstra, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390

Hi Anton,

On Mon, 18 Jan 2010 15:41:42 +1100
Anton Blanchard <anton@samba.org> wrote:

> Note: ccing ia64 and s390 who have not yet added code to statically
> initialise cputime_one_jiffy at boot. 
> See a42548a18866e87092db93b771e6c5b060d78401 (cputime: Optimize
> jiffies_to_cputime(1) for details). Adding this would help optimise not only
> this patch but many other areas of the scheduler when
> CONFIG_VIRT_CPU_ACCOUNTING is enabled.

For s390, jiffies_to_cputime is a compile-time constant. No need to
initialize it at runtime, no?

> Index: linux.trees.git/kernel/sched.c
> ===================================================================
> --- linux.trees.git.orig/kernel/sched.c	2010-01-18 14:27:12.000000000 +1100
> +++ linux.trees.git/kernel/sched.c	2010-01-18 15:21:59.000000000 +1100
> @@ -10894,6 +10894,7 @@ static void cpuacct_update_stats(struct 
>  		enum cpuacct_stat_index idx, cputime_t val)
>  {
>  	struct cpuacct *ca;
> +	int batch;
> 
>  	if (unlikely(!cpuacct_subsys.active))
>  		return;
> @@ -10901,8 +10902,9 @@ static void cpuacct_update_stats(struct 
>  	rcu_read_lock();
>  	ca = task_ca(tsk);
> 
> +	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
>  	do {
> -		percpu_counter_add(&ca->cpustat[idx], val);
> +		__percpu_counter_add(&ca->cpustat[idx], val, batch);
>  		ca = ca->parent;
>  	} while (ca);
>  	rcu_read_unlock();

The patch itself trades some accuracy (larger cpu accounting value that
are stored per-cpu) against runtime overhead (spinlock to transfer the
value to the global variable in __percpu_counter_add). Did you
calculate how big the loss in accuracy is?

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  9:42                                   ` Martin Schwidefsky
@ 2010-01-18  9:55                                     ` Anton Blanchard
  2010-01-18 11:00                                       ` Balbir Singh
  2010-01-21 15:25                                       ` Balbir Singh
  0 siblings, 2 replies; 62+ messages in thread
From: Anton Blanchard @ 2010-01-18  9:55 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, a.p.zijlstra, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390


Hi Martin,

> For s390 the jiffies_to_cputime is a compile time constant. No need to
> initialize it at runtime, no? 

Indeed it is, I didn't look closely enough. Same with ia64 so no work to
do on either arch :)

> The patch itself trades some accuracy (larger cpu accounting value that
> are stored per-cpu) against runtime overhead (spinlock to transfer the
> value to the global variable in __percpu_counter_add). Did you
> calculate how big the loss in accuracy is?

The idea is we are already batching percpu_counter_batch jiffies, so
with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
cputime.

Anton


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  9:55                                     ` Anton Blanchard
@ 2010-01-18 11:00                                       ` Balbir Singh
  2010-01-21 15:25                                       ` Balbir Singh
  1 sibling, 0 replies; 62+ messages in thread
From: Balbir Singh @ 2010-01-18 11:00 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Martin Schwidefsky, Bharata B Rao, KOSAKI Motohiro, Ingo Molnar,
	mingo, hpa, linux-kernel, a.p.zijlstra, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390

On Monday 18 January 2010 03:25 PM, Anton Blanchard wrote:
> 
> Hi Martin,
> 
>> For s390 the jiffies_to_cputime is a compile time constant. No need to
>> initialize it at runtime, no? 
> 
> Indeed it is, I didn't look closely enough. Same with ia64 so no work to
> do on either arch :)
> 
>> The patch itself trades some accuracy (larger cpu accounting value that
>> are stored per-cpu) against runtime overhead (spinlock to transfer the
>> value to the global variable in __percpu_counter_add). Did you
>> calculate how big the loss in accuracy is?
> 
> The idea is we are already batching percpu_counter_batch jiffies, so
> with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
> cputime.
> 

This patch worked well for me.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Balbir Singh.


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  9:55                                     ` Anton Blanchard
  2010-01-18 11:00                                       ` Balbir Singh
@ 2010-01-21 15:25                                       ` Balbir Singh
  2010-01-21 15:36                                         ` Peter Zijlstra
  1 sibling, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2010-01-21 15:25 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Martin Schwidefsky, Bharata B Rao, KOSAKI Motohiro, Ingo Molnar,
	mingo, hpa, linux-kernel, a.p.zijlstra, balajirrao, dhaval, tglx,
	kamezawa.hiroyu, akpm, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390

On Monday 18 January 2010 03:25 PM, Anton Blanchard wrote:
> 
> Hi Martin,
> 
>> For s390 the jiffies_to_cputime is a compile time constant. No need to
>> initialize it at runtime, no? 
> 
> Indeed it is, I didn't look closely enough. Same with ia64 so no work to
> do on either arch :)
> 
>> The patch itself trades some accuracy (larger cpu accounting value that
>> are stored per-cpu) against runtime overhead (spinlock to transfer the
>> value to the global variable in __percpu_counter_add). Did you
>> calculate how big the loss in accuracy is?
> 
> The idea is we are already batching percpu_counter_batch jiffies, so
> with CONFIG_VIRT_CPU_ACCOUNTING we batch the equivalent amount in
> cputime.
> 

Hi, Peter, Ingo

Could we please pick up the patch for -tip?

-- 
Three Cheers,
Balbir Singh


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-21 15:25                                       ` Balbir Singh
@ 2010-01-21 15:36                                         ` Peter Zijlstra
  0 siblings, 0 replies; 62+ messages in thread
From: Peter Zijlstra @ 2010-01-21 15:36 UTC (permalink / raw)
  To: balbir
  Cc: Anton Blanchard, Martin Schwidefsky, Bharata B Rao,
	KOSAKI Motohiro, Ingo Molnar, mingo, hpa, linux-kernel,
	balajirrao, dhaval, tglx, kamezawa.hiroyu, akpm, Tony Luck,
	Fenghua Yu, Heiko Carstens, linux390

On Thu, 2010-01-21 at 20:55 +0530, Balbir Singh wrote:

> Could we please pick up the patch for -tip?

OK, will queue it up if Ingo doesn't beat me to it.


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
                                                     ` (3 preceding siblings ...)
  2010-01-18  9:42                                   ` Martin Schwidefsky
@ 2010-01-25 23:14                                   ` Andrew Morton
  2010-01-25 23:44                                     ` KOSAKI Motohiro
  2010-01-26  6:17                                     ` Balbir Singh
  2010-01-27 13:15                                   ` [tip:sched/urgent] " tip-bot for Anton Blanchard
  5 siblings, 2 replies; 62+ messages in thread
From: Andrew Morton @ 2010-01-25 23:14 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, Balbir Singh, mingo,
	hpa, linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390

On Mon, 18 Jan 2010 15:41:42 +1100
Anton Blanchard <anton@samba.org> wrote:

> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> This means the call to percpu_counter_add will always add to the global count
> which is protected by a spinlock and we end up with a global spinlock in
> the scheduler.

When one looks at the end result:

: static void cpuacct_update_stats(struct task_struct *tsk,
: 		enum cpuacct_stat_index idx, cputime_t val)
: {
: 	struct cpuacct *ca;
: 	int batch;
: 
: 	if (unlikely(!cpuacct_subsys.active))
: 		return;
: 
: 	rcu_read_lock();
: 	ca = task_ca(tsk);
: 
: 	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
: 	do {
: 		__percpu_counter_add(&ca->cpustat[idx], val, batch);
: 		ca = ca->parent;
: 	} while (ca);
: 	rcu_read_unlock();
: }

the code (which used to be quite obvious) becomes pretty unobvious.  In
fact it looks quite wrong.

Shouldn't there be a comment there explaining wtf is going on?


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-25 23:14                                   ` Andrew Morton
@ 2010-01-25 23:44                                     ` KOSAKI Motohiro
  2010-01-26  6:17                                     ` Balbir Singh
  1 sibling, 0 replies; 62+ messages in thread
From: KOSAKI Motohiro @ 2010-01-25 23:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: kosaki.motohiro, Anton Blanchard, Bharata B Rao, Ingo Molnar,
	Balbir Singh, mingo, hpa, linux-kernel, a.p.zijlstra,
	schwidefsky, balajirrao, dhaval, tglx, kamezawa.hiroyu,
	Tony Luck, Fenghua Yu, Heiko Carstens, linux390

> On Mon, 18 Jan 2010 15:41:42 +1100
> Anton Blanchard <anton@samba.org> wrote:
> 
> > When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> > call cpuacct_update_stats with values much larger than percpu_counter_batch.
> > This means the call to percpu_counter_add will always add to the global count
> > which is protected by a spinlock and we end up with a global spinlock in
> > the scheduler.
> 
> When one looks at the end result:

We have a batch of about 32 jiffies' worth both with and without
CONFIG_VIRT_CPU_ACCOUNTING, so the end user may see the counters lag by
a few jiffies.


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-25 23:14                                   ` Andrew Morton
  2010-01-25 23:44                                     ` KOSAKI Motohiro
@ 2010-01-26  6:17                                     ` Balbir Singh
  2010-01-26  6:35                                       ` Andrew Morton
  1 sibling, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2010-01-26  6:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Anton Blanchard, Bharata B Rao, KOSAKI Motohiro, Ingo Molnar,
	mingo, hpa, linux-kernel, a.p.zijlstra, schwidefsky, balajirrao,
	dhaval, tglx, kamezawa.hiroyu, Tony Luck, Fenghua Yu,
	Heiko Carstens, linux390

On Tuesday 26 January 2010 04:44 AM, Andrew Morton wrote:
> On Mon, 18 Jan 2010 15:41:42 +1100
> Anton Blanchard <anton@samba.org> wrote:
> 
>> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
>> call cpuacct_update_stats with values much larger than percpu_counter_batch.
>> This means the call to percpu_counter_add will always add to the global count
>> which is protected by a spinlock and we end up with a global spinlock in
>> the scheduler.
> 
> When one looks at the end result:
> 
> : static void cpuacct_update_stats(struct task_struct *tsk,
> : 		enum cpuacct_stat_index idx, cputime_t val)
> : {
> : 	struct cpuacct *ca;
> : 	int batch;
> : 
> : 	if (unlikely(!cpuacct_subsys.active))
> : 		return;
> : 
> : 	rcu_read_lock();
> : 	ca = task_ca(tsk);
> : 
> : 	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> : 	do {
> : 		__percpu_counter_add(&ca->cpustat[idx], val, batch);
> : 		ca = ca->parent;
> : 	} while (ca);
> : 	rcu_read_unlock();
> : }
> 
> the code (which used to be quite obvious) becomes pretty unobvious.  In
> fact it looks quite wrong.
> 
> Shouldn't there be a comment there explaining wtf is going on?

Andrew,

I guess a lot of the changelog and comments are in the email history,
but your point on the comment is valid. Why does it look quite wrong to you?

cputime_one_jiffy tells us how many cputime_t units make up one jiffy.
If virtual accounting is enabled this number is quite large; if not, it
is 1. So the batch comes to 32 on systems without virtual accounting
and 32*cputime_one_jiffy on systems with it, meaning the per-cpu
counter syncs up roughly once every 32 jiffies in either case, assuming
cpuacct_update_stats is called once per jiffy.

If the above comment pleases you, I'll polish it and send it across.
Anton, could you please confirm that what I've said above is indeed correct?



-- 
Three Cheers,
Balbir Singh


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-26  6:17                                     ` Balbir Singh
@ 2010-01-26  6:35                                       ` Andrew Morton
  2010-01-26  8:07                                         ` Anton Blanchard
  0 siblings, 1 reply; 62+ messages in thread
From: Andrew Morton @ 2010-01-26  6:35 UTC (permalink / raw)
  To: balbir
  Cc: Anton Blanchard, Bharata B Rao, KOSAKI Motohiro, Ingo Molnar,
	mingo, hpa, linux-kernel, a.p.zijlstra, schwidefsky, balajirrao,
	dhaval, tglx, kamezawa.hiroyu, Tony Luck, Fenghua Yu,
	Heiko Carstens, linux390

On Tue, 26 Jan 2010 11:47:15 +0530 Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> On Tuesday 26 January 2010 04:44 AM, Andrew Morton wrote:
> > On Mon, 18 Jan 2010 15:41:42 +1100
> > Anton Blanchard <anton@samba.org> wrote:
> > 
> >> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
> >> call cpuacct_update_stats with values much larger than percpu_counter_batch.
> >> This means the call to percpu_counter_add will always add to the global count
> >> which is protected by a spinlock and we end up with a global spinlock in
> >> the scheduler.
> > 
> > When one looks at the end result:
> > 
> > : static void cpuacct_update_stats(struct task_struct *tsk,
> > : 		enum cpuacct_stat_index idx, cputime_t val)
> > : {
> > : 	struct cpuacct *ca;
> > : 	int batch;
> > : 
> > : 	if (unlikely(!cpuacct_subsys.active))
> > : 		return;
> > : 
> > : 	rcu_read_lock();
> > : 	ca = task_ca(tsk);
> > : 
> > : 	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> > : 	do {
> > : 		__percpu_counter_add(&ca->cpustat[idx], val, batch);
> > : 		ca = ca->parent;
> > : 	} while (ca);
> > : 	rcu_read_unlock();
> > : }
> > 
> > the code (which used to be quite obvious) becomes pretty unobvious.  In
> > fact it looks quite wrong.
> > 
> > Shouldn't there be a comment there explaining wtf is going on?
> 
> Andrew,
> 
> I guess a lot of the changelog and comments are in the email history,

Not a very useful location for it!

> Why does it look quite wrong to you?

Because it computes the correct value and then if it's larger than
INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!


Does that code actually work, btw?  percpu_counter_batch has type `int'
and cputime_one_jiffy has type `int' so their product has type `int'. 
So by the time min_t performs its comparison, the upper 32 bits of the
product are already lost.


* Re: [PATCH] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-26  6:35                                       ` Andrew Morton
@ 2010-01-26  8:07                                         ` Anton Blanchard
  0 siblings, 0 replies; 62+ messages in thread
From: Anton Blanchard @ 2010-01-26  8:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: balbir, Bharata B Rao, KOSAKI Motohiro, Ingo Molnar, mingo, hpa,
	linux-kernel, a.p.zijlstra, schwidefsky, balajirrao, dhaval,
	tglx, kamezawa.hiroyu, Tony Luck, Fenghua Yu, Heiko Carstens,
	linux390


Hi Andrew,

> > I guess a lot of the changelog and comments are in the email history,
> 
> Not a very useful location for it!

Good point, I'll work on a useful comment.

> > Why does it look quite wrong to you?
> 
> Because it computes the correct value and then if it's larger than
> INT_MAX, it inexplicably assigns INT_MAX to it, giving a wrong result!
>
> Does that code actually work, btw?  percpu_counter_batch has type `int'
> and cputime_one_jiffy has type `int' so their product has type `int'. 
> So by the time min_t performs its comparison, the upper 32 bits of the
> product are already lost.

On ppc64, s390 and ia64 cputime_one_jiffy is 64bit and I want to prevent us
creating too large a batch value:

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)

If we pass in something bigger than INT_MAX we could end up with 0 after
truncation which will turn the percpu counter into a single spinlock global
counter.

Anton


* [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
                                                     ` (4 preceding siblings ...)
  2010-01-25 23:14                                   ` Andrew Morton
@ 2010-01-27 13:15                                   ` tip-bot for Anton Blanchard
  2010-01-27 13:34                                     ` Balbir Singh
  5 siblings, 1 reply; 62+ messages in thread
From: tip-bot for Anton Blanchard @ 2010-01-27 13:15 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, anton, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  43f85eab1411905afe5db510fbf9841b516e7e6a
Gitweb:     http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
Author:     Anton Blanchard <anton@samba.org>
AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 27 Jan 2010 08:34:38 +0100

sched: cpuacct: Use bigger percpu counter batch values for stats counters

When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
can call cpuacct_update_stats with values much larger than
percpu_counter_batch. This means the call to percpu_counter_add will
always add to the global count which is protected by a spinlock and we
end up with a global spinlock in the scheduler.

Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would if
CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu
add/remove.

This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.

For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
least on x86 and PowerPC. So there is no need to add an #ifdef.

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:

CONFIG_CGROUP_CPUACCT disabled:		16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled:		   61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch:		16663217 ctx switches/sec

Tested with:

 wget http://ozlabs.org/~anton/junkcode/context_switch.c
 make context_switch
 for i in `seq 0 63`; do taskset -c $i ./context_switch & done
 vmstat 1

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100118044142.GS12666@kryten>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3a8fb30..8f94138 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val)
 {
 	struct cpuacct *ca;
+	int batch;
 
 	if (unlikely(!cpuacct_subsys.active))
 		return;
@@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
 	rcu_read_lock();
 	ca = task_ca(tsk);
 
+	batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, batch);
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter  batch values for stats counters
  2010-01-27 13:15                                   ` [tip:sched/urgent] " tip-bot for Anton Blanchard
@ 2010-01-27 13:34                                     ` Balbir Singh
  2010-01-27 21:22                                       ` Andrew Morton
  0 siblings, 1 reply; 62+ messages in thread
From: Balbir Singh @ 2010-01-27 13:34 UTC (permalink / raw)
  To: mingo, hpa, anton, linux-kernel, a.p.zijlstra, tglx, mingo, akpm
  Cc: linux-tip-commits

On Wed, Jan 27, 2010 at 6:45 PM, tip-bot for Anton Blanchard
<anton@samba.org> wrote:
> Commit-ID:  43f85eab1411905afe5db510fbf9841b516e7e6a
> Gitweb:     http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
> Author:     Anton Blanchard <anton@samba.org>
> AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
> Committer:  Ingo Molnar <mingo@elte.hu>
> CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
>
> sched: cpuacct: Use bigger percpu counter batch values for stats counters
>
> When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
> can call cpuacct_update_stats with values much larger than
> percpu_counter_batch. This means the call to percpu_counter_add will
> always add to the global count which is protected by a spinlock and we
> end up with a global spinlock in the scheduler.
>
> Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> cputime_one_jiffy such that we have the same batch limit as we would if
> CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> but that initialisation happened too early on PowerPC (before time_init)
> and it was never updated at runtime as a result of a hotplug cpu
> add/remove.
>
> This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> runtime, which keeps the batch correct even after cpu hotplug operations.
> We cap it at INT_MAX in case of overflow.
>
> For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
> min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
> least on x86 and PowerPC. So there is no need to add an #ifdef.
>
> On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
> faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
>
> CONFIG_CGROUP_CPUACCT disabled:         16906698 ctx switches/sec
> CONFIG_CGROUP_CPUACCT enabled:             61720 ctx switches/sec
> CONFIG_CGROUP_CPUACCT + patch:          16663217 ctx switches/sec
>
> Tested with:
>
>  wget http://ozlabs.org/~anton/junkcode/context_switch.c
>  make context_switch
>  for i in `seq 0 63`; do taskset -c $i ./context_switch & done
>  vmstat 1
>
> Signed-off-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> LKML-Reference: <20100118044142.GS12666@kryten>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  kernel/sched.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 3a8fb30..8f94138 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
>                enum cpuacct_stat_index idx, cputime_t val)
>  {
>        struct cpuacct *ca;
> +       int batch;
>
>        if (unlikely(!cpuacct_subsys.active))
>                return;
> @@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
>        rcu_read_lock();
>        ca = task_ca(tsk);
>
> +       batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
>        do {
> -               percpu_counter_add(&ca->cpustat[idx], val);
> +               __percpu_counter_add(&ca->cpustat[idx], val, batch);
>                ca = ca->parent;
>        } while (ca);
>        rcu_read_unlock();

IIRC, Andrew picked up this patch as well and applied some checkpatch
fixes too.

Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter  batch values for stats counters
  2010-01-27 13:34                                     ` Balbir Singh
@ 2010-01-27 21:22                                       ` Andrew Morton
  2010-01-27 21:43                                         ` Peter Zijlstra
  2010-01-28  5:21                                         ` [tip:sched/urgent] " Balbir Singh
  0 siblings, 2 replies; 62+ messages in thread
From: Andrew Morton @ 2010-01-27 21:22 UTC (permalink / raw)
  To: Balbir Singh
  Cc: mingo, hpa, anton, linux-kernel, a.p.zijlstra, tglx, mingo,
	linux-tip-commits

On Wed, 27 Jan 2010 19:04:55 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> On Wed, Jan 27, 2010 at 6:45 PM, tip-bot for Anton Blanchard
> <anton@samba.org> wrote:
> > Commit-ID: __43f85eb1411905afe5db510fbf9841b516e7e6a
> > Gitweb: __ __ http://git.kernel.org/tip/43f85eab1411905afe5db510fbf9841b516e7e6a
> > Author: __ __ Anton Blanchard <anton@samba.org>
> > AuthorDate: Mon, 18 Jan 2010 15:41:42 +1100
> > Committer: __Ingo Molnar <mingo@elte.hu>
> > CommitDate: Wed, 27 Jan 2010 08:34:38 +0100
> >
> > sched: cpuacct: Use bigger percpu counter batch values for stats counters
> >
> > When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we
> > can call cpuacct_update_stats with values much larger than
> > percpu_counter_batch. This means the call to percpu_counter_add will
> > always add to the global count which is protected by a spinlock and we
> > end up with a global spinlock in the scheduler.
> >
> > Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
> > cputime_one_jiffy such that we have the same batch limit as we would if
> > CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
> > but that initialisation happened too early on PowerPC (before time_init)
> > and it was never updated at runtime as a result of a hotplug cpu
> > add/remove.
> >
> > This patch instead scales percpu_counter_batch by cputime_one_jiffy at
> > runtime, which keeps the batch correct even after cpu hotplug operations.
> > We cap it at INT_MAX in case of overflow.
> >
> > For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
> > cputime_one_jiffy is the constant 1 and gcc is smart enough to optimise
> > min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
> > least on x86 and PowerPC. So there is no need to add an #ifdef.
> >
> > On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
> > CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x
> > faster and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:
> >
> > CONFIG_CGROUP_CPUACCT disabled: __ __ __ __ 16906698 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT enabled: __ __ __ __ __ __ 61720 ctx switches/sec
> > CONFIG_CGROUP_CPUACCT + patch: __ __ __ __ __16663217 ctx switches/sec
> >
> > Tested with:
> >
> > __wget http://ozlabs.org/~anton/junkcode/context_switch.c
> > __make context_switch
> > __for i in `seq 0 63`; do taskset -c $i ./context_switch & done
> > __vmstat 1
> >
> > Signed-off-by: Anton Blanchard <anton@samba.org>
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > LKML-Reference: <20100118044142.GS12666@kryten>
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > ---
> > __kernel/sched.c | __ __4 +++-
> > __1 files changed, 3 insertions(+), 1 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 3a8fb30..8f94138 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -10906,6 +10906,7 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> > __ __ __ __ __ __ __ __enum cpuacct_stat_index idx, cputime_t val)
> > __{
> > __ __ __ __struct cpuacct *ca;
> > + __ __ __ int batch;
> >
> > __ __ __ __if (unlikely(!cpuacct_subsys.active))
> > __ __ __ __ __ __ __ __return;
> > @@ -10913,8 +10914,9 @@ static void cpuacct_update_stats(struct task_struct *tsk,
> > __ __ __ __rcu_read_lock();
> > __ __ __ __ca = task_ca(tsk);
> >
> > + __ __ __ batch = min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX);
> > __ __ __ __do {
> > - __ __ __ __ __ __ __ percpu_counter_add(&ca->cpustat[idx], val);
> > + __ __ __ __ __ __ __ __percpu_counter_add(&ca->cpustat[idx], val, batch);
> > __ __ __ __ __ __ __ __ca = ca->parent;
> > __ __ __ __} while (ca);
> > __ __ __ __rcu_read_unlock();

^^ your email client inexplicably fills emails with 0xa0

> IIRC, Andrew picked up this patch as well and applied some checkpatch
> fixes too..

No, I have no changes.

Last I heard, Anton was "working on a useful comment" and will be
redoing the patch.


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter  batch values for stats counters
  2010-01-27 21:22                                       ` Andrew Morton
@ 2010-01-27 21:43                                         ` Peter Zijlstra
  2010-01-28 12:48                                           ` [PATCH 1/2] percpu_counter: Make __percpu_counter_add an inline function on UP Anton Blanchard
  2010-01-28  5:21                                         ` [tip:sched/urgent] " Balbir Singh
  1 sibling, 1 reply; 62+ messages in thread
From: Peter Zijlstra @ 2010-01-27 21:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Balbir Singh, mingo, hpa, anton, linux-kernel, tglx, mingo,
	linux-tip-commits

On Wed, 2010-01-27 at 13:22 -0800, Andrew Morton wrote:

> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.

If Anton is working on a new version, would Anton then also care to post
one that compiles on UP? :-)


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [tip:sched/urgent] sched: cpuacct: Use bigger percpu counter  batch values for stats counters
  2010-01-27 21:22                                       ` Andrew Morton
  2010-01-27 21:43                                         ` Peter Zijlstra
@ 2010-01-28  5:21                                         ` Balbir Singh
  1 sibling, 0 replies; 62+ messages in thread
From: Balbir Singh @ 2010-01-28  5:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: mingo, hpa, anton, linux-kernel, a.p.zijlstra, tglx, mingo,
	linux-tip-commits

>> > __ __ __ __ __ __ __ __ca = ca->parent;
>> > __ __ __ __} while (ca);
>> > __ __ __ __rcu_read_unlock();
>
> ^^ your email client inexplicably fills emails with 0xa0
>

I am going to be changing my email client soon, back to the good old
text client.

>> IIRC, Andrew picked up this patch as well and applied some checkpatch
>> fixes too..
>
> No, I have no changes.
>

Sorry, I was confused; the changes you had were for getdelays.c.

> Last I heard, Anton was "working on a useful comment" and will be
> redoing the patch.

Yep, let's wait on him.

Three Cheers,
Balbir

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [PATCH 1/2] percpu_counter: Make __percpu_counter_add an inline function on UP
  2010-01-27 21:43                                         ` Peter Zijlstra
@ 2010-01-28 12:48                                           ` Anton Blanchard
  2010-01-28 12:52                                             ` [PATCH 2/2] sched: cpuacct: Use bigger percpu counter batch values for stats counters Anton Blanchard
  0 siblings, 1 reply; 62+ messages in thread
From: Anton Blanchard @ 2010-01-28 12:48 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton
  Cc: Balbir Singh, mingo, hpa, linux-kernel, tglx, mingo, linux-tip-commits


Even though batch isn't used on UP, we may want to pass one in to keep the
SMP and UP code paths similar. Convert __percpu_counter_add to an inline
function so we won't get unused-variable warnings if we do.
 
Signed-off-by: Anton Blanchard <anton@samba.org>
---

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index a7684a5..794662b 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -98,9 +98,6 @@ static inline void percpu_counter_set(struct percpu_counter *fbc, s64 amount)
 	fbc->count = amount;
 }
 
-#define __percpu_counter_add(fbc, amount, batch) \
-	percpu_counter_add(fbc, amount)
-
 static inline void
 percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 {
@@ -109,6 +106,12 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 	preempt_enable();
 }
 
+static inline void
+__percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+	percpu_counter_add(fbc, amount);
+}
+
 static inline s64 percpu_counter_read(struct percpu_counter *fbc)
 {
 	return fbc->count;

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 2/2] sched: cpuacct: Use bigger percpu counter batch values for stats counters
  2010-01-28 12:48                                           ` [PATCH 1/2] percpu_counter: Make __percpu_counter_add an inline function on UP Anton Blanchard
@ 2010-01-28 12:52                                             ` Anton Blanchard
  0 siblings, 0 replies; 62+ messages in thread
From: Anton Blanchard @ 2010-01-28 12:52 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton
  Cc: Balbir Singh, mingo, hpa, linux-kernel, tglx, mingo, linux-tip-commits


When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are enabled we can
call cpuacct_update_stats with values much larger than percpu_counter_batch.
This means the call to percpu_counter_add will always add to the global count
which is protected by a spinlock and we end up with a global spinlock in
the scheduler.

Based on an idea by KOSAKI Motohiro, this patch scales the batch value by
cputime_one_jiffy such that we have the same batch limit as we would
if CONFIG_VIRT_CPU_ACCOUNTING was disabled. His patch did this once at boot
but that initialisation happened too early on PowerPC (before time_init)
and it was never updated at runtime as a result of a hotplug cpu add/remove.

This patch instead scales percpu_counter_batch by cputime_one_jiffy at
runtime, which keeps the batch correct even after cpu hotplug operations.
We cap it at INT_MAX in case of overflow.

For architectures that do not support CONFIG_VIRT_CPU_ACCOUNTING,
cputime_one_jiffy is the constant 1 and gcc is smart enough to
optimise min(s32 percpu_counter_batch, INT_MAX) to just percpu_counter_batch
at least on x86 and PowerPC. So there is no need to add an #ifdef.

On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and 
CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark is 234x faster
and almost matches a CONFIG_CGROUP_CPUACCT disabled kernel:

CONFIG_CGROUP_CPUACCT disabled:         16906698 ctx switches/sec
CONFIG_CGROUP_CPUACCT enabled:             61720 ctx switches/sec
CONFIG_CGROUP_CPUACCT + patch:          16663217 ctx switches/sec

Tested with:

wget http://ozlabs.org/~anton/junkcode/context_switch.c
make context_switch
for i in `seq 0 63`; do taskset -c $i ./context_switch & done
vmstat 1

Signed-off-by: Anton Blanchard <anton@samba.org>
---

v2: Added a comment and fixed the UP build.

Index: linux-cpumask/kernel/sched.c
===================================================================
--- linux-cpumask.orig/kernel/sched.c	2010-01-22 18:23:43.377212514 +1100
+++ linux-cpumask/kernel/sched.c	2010-01-28 23:24:02.677233753 +1100
@@ -10885,12 +10885,30 @@ static void cpuacct_charge(struct task_s
 }
 
 /*
+ * When CONFIG_VIRT_CPU_ACCOUNTING is enabled one jiffy can be very large
+ * in cputime_t units. As a result, cpuacct_update_stats calls
+ * percpu_counter_add with values large enough to always overflow the
+ * per cpu batch limit causing bad SMP scalability.
+ *
+ * To fix this we scale percpu_counter_batch by cputime_one_jiffy so we
+ * batch the same amount of time with CONFIG_VIRT_CPU_ACCOUNTING disabled
+ * and enabled. We cap it at INT_MAX which is the largest allowed batch value.
+ */
+#ifdef CONFIG_SMP
+#define CPUACCT_BATCH	\
+	min_t(long, percpu_counter_batch * cputime_one_jiffy, INT_MAX)
+#else
+#define CPUACCT_BATCH	0
+#endif
+
+/*
  * Charge the system/user time to the task's accounting group.
  */
 static void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val)
 {
 	struct cpuacct *ca;
+	int batch = CPUACCT_BATCH;
 
 	if (unlikely(!cpuacct_subsys.active))
 		return;
@@ -10899,7 +10917,7 @@ static void cpuacct_update_stats(struct 
 	ca = task_ca(tsk);
 
 	do {
-		percpu_counter_add(&ca->cpustat[idx], val);
+		__percpu_counter_add(&ca->cpustat[idx], val, batch);
 		ca = ca->parent;
 	} while (ca);
 	rcu_read_unlock();

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2010-01-28 12:56 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-07  9:59 [PATCH v3] cpuacct: VIRT_CPU_ACCOUNTING don't prevent percpu cputime count KOSAKI Motohiro
2009-05-07 10:50 ` Ingo Molnar
2009-05-09  9:26   ` KOSAKI Motohiro
2009-05-09  9:38     ` Peter Zijlstra
2009-05-09  9:52       ` KOSAKI Motohiro
2009-05-09 10:14         ` KOSAKI Motohiro
2009-05-11 12:33           ` [tip:sched/core] sched: cpuacct: Use bigger percpu counter batch values for stats counters tip-bot for KOSAKI Motohiro
2009-05-11 12:36           ` tip-bot for KOSAKI Motohiro
2009-05-11 15:07             ` Ingo Molnar
2009-05-12 10:01               ` KOSAKI Motohiro
2009-05-12 10:02                 ` Peter Zijlstra
2009-05-12 10:06                   ` KOSAKI Motohiro
2009-05-12 10:22                     ` KOSAKI Motohiro
2009-05-12 10:27                       ` Ingo Molnar
2009-05-12 10:42                         ` KOSAKI Motohiro
2009-05-12 10:10                 ` Balbir Singh
2009-05-12 10:13                   ` KOSAKI Motohiro
2009-05-12 10:24                     ` Balbir Singh
2009-05-12 10:29                       ` Ingo Molnar
2009-05-12 10:44                         ` KOSAKI Motohiro
2009-05-12 13:28                           ` Ingo Molnar
2009-05-12 13:40                             ` KOSAKI Motohiro
2009-05-12 13:42                               ` Ingo Molnar
2009-07-16  8:10                           ` Bharata B Rao
2009-07-16  8:39                             ` Anton Blanchard
2009-08-20  5:10                               ` Anton Blanchard
2009-08-20  5:16                                 ` KAMEZAWA Hiroyuki
2009-08-20  5:26                                 ` Balbir Singh
2009-08-20  5:52                                   ` Peter Zijlstra
2009-08-20  6:05                                     ` Anton Blanchard
2009-08-20  6:10                                       ` KAMEZAWA Hiroyuki
2009-08-20  6:24                                         ` Anton Blanchard
2009-08-20  8:41                                           ` [PATCH] better align percpu counter (Was " KAMEZAWA Hiroyuki
2009-08-20 10:04                                             ` Ingo Molnar
2009-08-21  2:20                                               ` KAMEZAWA Hiroyuki
2009-08-21 11:29                                                 ` Ingo Molnar
2009-08-21  3:56                                             ` KAMEZAWA Hiroyuki
2009-08-20 11:46                                           ` Balbir Singh
2009-08-21  6:21                                           ` KAMEZAWA Hiroyuki
2010-01-18  4:41                                 ` [PATCH] " Anton Blanchard
2010-01-18  4:55                                   ` KOSAKI Motohiro
2010-01-18  7:51                                   ` Peter Zijlstra
2010-01-18  8:21                                     ` Anton Blanchard
2010-01-18  8:34                                       ` Peter Zijlstra
2010-01-18  8:35                                   ` Balbir Singh
2010-01-18  9:42                                   ` Martin Schwidefsky
2010-01-18  9:55                                     ` Anton Blanchard
2010-01-18 11:00                                       ` Balbir Singh
2010-01-21 15:25                                       ` Balbir Singh
2010-01-21 15:36                                         ` Peter Zijlstra
2010-01-25 23:14                                   ` Andrew Morton
2010-01-25 23:44                                     ` KOSAKI Motohiro
2010-01-26  6:17                                     ` Balbir Singh
2010-01-26  6:35                                       ` Andrew Morton
2010-01-26  8:07                                         ` Anton Blanchard
2010-01-27 13:15                                   ` [tip:sched/urgent] " tip-bot for Anton Blanchard
2010-01-27 13:34                                     ` Balbir Singh
2010-01-27 21:22                                       ` Andrew Morton
2010-01-27 21:43                                         ` Peter Zijlstra
2010-01-28 12:48                                           ` [PATCH 1/2] percpu_counter: Make __percpu_counter_add an inline function on UP Anton Blanchard
2010-01-28 12:52                                             ` [PATCH 2/2] sched: cpuacct: Use bigger percpu counter batch values for stats counters Anton Blanchard
2010-01-28  5:21                                         ` [tip:sched/urgent] " Balbir Singh
