linux-kernel.vger.kernel.org archive mirror
* [PATCH] cpuacct: add a branch prediction
@ 2009-02-26  7:40 Li Zefan
  2009-02-26  8:07 ` KAMEZAWA Hiroyuki
  2009-02-26  8:37 ` [PATCH] cpuacct: add a branch prediction Balbir Singh
  0 siblings, 2 replies; 42+ messages in thread
From: Li Zefan @ 2009-02-26  7:40 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Menage, Balbir Singh, LKML

cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
always evaluates to false after cpuacct has been initialized at system boot.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 410eec4..fd2f7fc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
 	struct cpuacct *ca;
 	int cpu;
 
-	if (!cpuacct_subsys.active)
+	if (unlikely(!cpuacct_subsys.active))
 		return;
 
 	cpu = task_cpu(tsk);
-- 
1.5.4.rc3
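
For reference, unlikely() (and its sibling likely()) are thin wrappers around
GCC's __builtin_expect(), which lets the compiler lay out the code so that the
expected path falls through without a taken branch. The definitions, from
include/linux/compiler.h (simplified; the exact form varies by kernel version):

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

The patch thus tells the compiler that cpuacct_charge() almost always gets past
the check, matching the observation that ->active is false only before the
subsystem is initialized at boot.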


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  7:40 [PATCH] cpuacct: add a branch prediction Li Zefan
@ 2009-02-26  8:07 ` KAMEZAWA Hiroyuki
  2009-02-26  8:17   ` Li Zefan
  2009-02-26  8:37 ` [PATCH] cpuacct: add a branch prediction Balbir Singh
  1 sibling, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26  8:07 UTC (permalink / raw)
  To: Li Zefan; +Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, Balbir Singh, LKML

On Thu, 26 Feb 2009 15:40:15 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
> always evaluates to false after cpuacct has been initialized at system boot.
> 
> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
> ---
>  kernel/sched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 410eec4..fd2f7fc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
>  	struct cpuacct *ca;
>  	int cpu;
>  
> -	if (!cpuacct_subsys.active)
> +	if (unlikely(!cpuacct_subsys.active))
>  		return;
>  
(Just curious)
I wonder whether "ca = task_ca(tsk)" will return NULL if the cpuacct subsys is not initialized.
Then, can we just remove this check?


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:07 ` KAMEZAWA Hiroyuki
@ 2009-02-26  8:17   ` Li Zefan
  2009-02-26  8:22     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Li Zefan @ 2009-02-26  8:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, Balbir Singh, LKML

KAMEZAWA Hiroyuki wrote:
> On Thu, 26 Feb 2009 15:40:15 +0800
> Li Zefan <lizf@cn.fujitsu.com> wrote:
> 
>> cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
>> always evaluates to false after cpuacct has been initialized at system boot.
>>
>> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
>> ---
>>  kernel/sched.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 410eec4..fd2f7fc 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
>>  	struct cpuacct *ca;
>>  	int cpu;
>>  
>> -	if (!cpuacct_subsys.active)
>> +	if (unlikely(!cpuacct_subsys.active))
>>  		return;
>>  
> (Just curious)
> I wonder whether "ca = task_ca(tsk)" will return NULL if the cpuacct subsys is not initialized.

Yes, it will be NULL, and that's why we need this check.

> Then, can we just remove this check?
> 

cpuacct_charge() can be called before cpuacct is initialized, so we have to check this
case here.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:17   ` Li Zefan
@ 2009-02-26  8:22     ` KAMEZAWA Hiroyuki
  2009-02-26  8:35       ` Li Zefan
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26  8:22 UTC (permalink / raw)
  To: Li Zefan; +Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, Balbir Singh, LKML

On Thu, 26 Feb 2009 16:17:31 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Thu, 26 Feb 2009 15:40:15 +0800
> > Li Zefan <lizf@cn.fujitsu.com> wrote:
> > 
> >> cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
> >> always evaluates to false after cpuacct has been initialized at system boot.
> >>
> >> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
> >> ---
> >>  kernel/sched.c |    2 +-
> >>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> index 410eec4..fd2f7fc 100644
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> >>  	struct cpuacct *ca;
> >>  	int cpu;
> >>  
> >> -	if (!cpuacct_subsys.active)
> >> +	if (unlikely(!cpuacct_subsys.active))
> >>  		return;
> >>  
> > (Just curious)
> > I wonder whether "ca = task_ca(tsk)" will return NULL if the cpuacct subsys is not initialized.
> 
> Yes, it will be NULL, and that's why we need this check.
> 
> > Then, can we just remove this check?
> > 
> 
> cpuacct_charge() can be called before cpuacct is initialized, so we have to check this
> case here.
> 
My point is:

ca = task_ca(tsk)
for (; ca; ca = ca->parent) {
   ...
}

What is the problem even if ca is NULL?

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:22     ` KAMEZAWA Hiroyuki
@ 2009-02-26  8:35       ` Li Zefan
  2009-02-26  8:40         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Li Zefan @ 2009-02-26  8:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, Balbir Singh, LKML

KAMEZAWA Hiroyuki wrote:
> On Thu, 26 Feb 2009 16:17:31 +0800
> Li Zefan <lizf@cn.fujitsu.com> wrote:
> 
>> KAMEZAWA Hiroyuki wrote:
>>> On Thu, 26 Feb 2009 15:40:15 +0800
>>> Li Zefan <lizf@cn.fujitsu.com> wrote:
>>>
>>>> cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
>>>> always evaluates to false after cpuacct has been initialized at system boot.
>>>>
>>>> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
>>>> ---
>>>>  kernel/sched.c |    2 +-
>>>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/kernel/sched.c b/kernel/sched.c
>>>> index 410eec4..fd2f7fc 100644
>>>> --- a/kernel/sched.c
>>>> +++ b/kernel/sched.c
>>>> @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
>>>>  	struct cpuacct *ca;
>>>>  	int cpu;
>>>>  
>>>> -	if (!cpuacct_subsys.active)
>>>> +	if (unlikely(!cpuacct_subsys.active))
>>>>  		return;
>>>>  
>>> (Just curious)
>>> I wonder whether "ca = task_ca(tsk)" will return NULL if the cpuacct subsys is not initialized.
>> Yes, it will be NULL, and that's why we need this check.
>>
>>> Then, can we just remove this check?
>>>
>> cpuacct_charge() can be called before cpuacct is initialized, so we have to check this
>> case here.
>>
> My point is:
> 

Ah, I see.

> ca = task_ca(tsk)
> for (; ca; ca = ca->parent) {
>    ...
> }
> 

ca was not checked before hierarchy support was added; checking it now is just a side effect of the loop.

Before cpuacct is initialized, css == task->cgroups->subsys[cpuacct_subsys] == NULL,
but ca = task_ca(tsk) is not necessarily NULL, unless struct cgroup_subsys_state is the
first member of struct cpuacct.

And the above code actually should be:

do {
	...
} while (ca->parent);

> What is the problem even if ca is NULL?
> 
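
The first-member point above follows from how task_ca() is built on
container_of(); a sketch of the 2.6.29-era helper from kernel/sched.c:

/*
 * container_of() subtracts the offset of 'css' within struct cpuacct,
 * so a NULL css maps to a NULL ca only if 'css' sits at offset 0.
 */
static inline struct cpuacct *task_ca(struct task_struct *tsk)
{
	return container_of(task_subsys_state(tsk, cpuacct_subsys_id),
			    struct cpuacct, css);
}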

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  7:40 [PATCH] cpuacct: add a branch prediction Li Zefan
  2009-02-26  8:07 ` KAMEZAWA Hiroyuki
@ 2009-02-26  8:37 ` Balbir Singh
  2009-02-26  8:41   ` Li Zefan
  2009-02-26  8:43   ` KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 42+ messages in thread
From: Balbir Singh @ 2009-02-26  8:37 UTC (permalink / raw)
  To: Li Zefan; +Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, LKML

* Li Zefan <lizf@cn.fujitsu.com> [2009-02-26 15:40:15]:

> cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
> always evaluates to false after cpuacct has been initialized at system boot.
> 
> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
> ---
>  kernel/sched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 410eec4..fd2f7fc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
>  	struct cpuacct *ca;
>  	int cpu;
> 
> -	if (!cpuacct_subsys.active)
> +	if (unlikely(!cpuacct_subsys.active))

What happens if most distros make cpuacct_subsys active? Can we decide
on unlikely depending on how the system is configured?


-- 
	Balbir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:35       ` Li Zefan
@ 2009-02-26  8:40         ` KAMEZAWA Hiroyuki
  2009-02-26 10:10           ` Bharata B Rao
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26  8:40 UTC (permalink / raw)
  To: Li Zefan; +Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, Balbir Singh, LKML

On Thu, 26 Feb 2009 16:35:33 +0800
Li Zefan <lizf@cn.fujitsu.com> wrote:

> ca was not checked before hierarchy support was added; checking it now is just a side effect of the loop.
> 
> Before cpuacct is initialized, css == task->cgroups->subsys[cpuacct_subsys] == NULL,
> but ca = task_ca(tsk) is not necessarily NULL, unless struct cgroup_subsys_state is the
> first member of struct cpuacct.
> 
> And the above code actually should be:
> 
> do {
> 	...
> } while (ca->parent);
> 
I'll send no more objections to this patch itself.

But IMHO, this loop is too bad. The hierarchy information should be gathered on
the read side, and the total code should be:

void charge_statistics(struct task_struct *tsk, u64 cputime)
{
	struct cpuacct *ca = task_ca(tsk);

	/* ca can be NULL while cpuacct is initializing */
	if (likely(ca)) {
		u64 *cpuusage = percpu_ptr(ca->cpuusage, task_cpu(tsk));
		*cpuusage += cputime;
	}
}
(This needs changes to the read side and to create/destroy of cpuacct cgroups.)

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:37 ` [PATCH] cpuacct: add a branch prediction Balbir Singh
@ 2009-02-26  8:41   ` Li Zefan
  2009-02-26 10:40     ` Balbir Singh
  2009-02-26  8:43   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 42+ messages in thread
From: Li Zefan @ 2009-02-26  8:41 UTC (permalink / raw)
  To: balbir; +Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, LKML

Balbir Singh wrote:
> * Li Zefan <lizf@cn.fujitsu.com> [2009-02-26 15:40:15]:
> 
>> cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
>> always evaluates to false after cpuacct has been initialized at system boot.
>>
>> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
>> ---
>>  kernel/sched.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/kernel/sched.c b/kernel/sched.c
>> index 410eec4..fd2f7fc 100644
>> --- a/kernel/sched.c
>> +++ b/kernel/sched.c
>> @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
>>  	struct cpuacct *ca;
>>  	int cpu;
>>
>> -	if (!cpuacct_subsys.active)
>> +	if (unlikely(!cpuacct_subsys.active))
> 
> What happens if most distros make cpuacct_subsys active? Can we decide
> on unlikely depending on how the system is configured?
> 

Note, it's ->active but not ->disabled. ;)

And note, I made it "unlikely(!cpuacct_subsys.active)", not "unlikely(cpuacct_subsys.active)".

->active can't be configured by users. If CONFIG_CGROUP_CPUACCT is enabled,
cpuacct_subsys.active will be set to true after cpuacct is initialized.
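
For context, ->active is a one-shot latch set by the cgroup core at boot; a
heavily simplified sketch (based on 2.6.29-era kernel/cgroup.c, illustrative
only):

static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
{
	/* ... allocate and attach the root css for this subsystem ... */
	ss->active = 1;	/* from here on, !cpuacct_subsys.active is false */
}

So the !active window exists only during early boot, before the subsystem has
been initialized.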


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:37 ` [PATCH] cpuacct: add a branch prediction Balbir Singh
  2009-02-26  8:41   ` Li Zefan
@ 2009-02-26  8:43   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26  8:43 UTC (permalink / raw)
  To: balbir; +Cc: Li Zefan, Ingo Molnar, Peter Zijlstra, Paul Menage, LKML

On Thu, 26 Feb 2009 14:07:02 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Li Zefan <lizf@cn.fujitsu.com> [2009-02-26 15:40:15]:
> 
> > cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
> > always evaluates to false after cpuacct has been initialized at system boot.
> > 
> > Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
> > ---
> >  kernel/sched.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 410eec4..fd2f7fc 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> >  	struct cpuacct *ca;
> >  	int cpu;
> > 
> > -	if (!cpuacct_subsys.active)
> > +	if (unlikely(!cpuacct_subsys.active))
> 
> What happens if most distros make cpuacct_subsys active? Can we decide
> on unlikely depending on how the system is configured?
>
"active" just guarantees task_ca(tsk) is not NULL and subsystem is initialized.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:40         ` KAMEZAWA Hiroyuki
@ 2009-02-26 10:10           ` Bharata B Rao
  2009-02-26 10:28             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Bharata B Rao @ 2009-02-26 10:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Li Zefan, Ingo Molnar, Peter Zijlstra, Paul Menage, Balbir Singh, LKML

On Thu, Feb 26, 2009 at 2:10 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Thu, 26 Feb 2009 16:35:33 +0800
> Li Zefan <lizf@cn.fujitsu.com> wrote:
>
>> ca was not checked before hierarchy support was added; checking it now is just a side effect of the loop.
>>
>> Before cpuacct is initialized, css == task->cgroups->subsys[cpuacct_subsys] == NULL,
>> but ca = task_ca(tsk) is not necessarily NULL, unless struct cgroup_subsys_state is the
>> first member of struct cpuacct.
>>
>> And the above code actually should be:
>>
>> do {
>>       ...
>> } while (ca->parent);
>>
> I'll send no more objections to this patch itself.
>
> But IMHO, this loop is too bad. The hierarchy information should be gathered on
> the read side, and the total code should be:

How do you take care of accounting when a group gets deleted?
Wouldn't you lose some information due to deleted groups if you want
to accumulate hierarchical stats only during read?

Regards,
Bharata.
-- 
http://bharata.sulekha.com/blog/posts.htm

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 10:10           ` Bharata B Rao
@ 2009-02-26 10:28             ` KAMEZAWA Hiroyuki
  2009-02-26 10:44               ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26 10:28 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: KAMEZAWA Hiroyuki, Li Zefan, Ingo Molnar, Peter Zijlstra,
	Paul Menage, Balbir Singh, LKML

Bharata B Rao:
> On Thu, Feb 26, 2009 at 2:10 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> On Thu, 26 Feb 2009 16:35:33 +0800
>> Li Zefan <lizf@cn.fujitsu.com> wrote:
>>
>>> ca was not checked before hierarchy support was added; checking it now is just a side effect of the loop.
>>>
>>> Before cpuacct is initialized, css ==
>>> task->cgroups->subsys[cpuacct_subsys] == NULL,
>>> but ca = task_ca(tsk) is not necessarily NULL, unless struct
>>> cgroup_subsys_state is the
>>> first member of struct cpuacct.
>>>
>>> And the above code actually should be:
>>>
>>> do {
>>>       ...
>>> } while (ca->parent);
>>>
>> I'll send no more objections to this patch itself.
>>
>> But IMHO, this loop is too bad. The hierarchy information should be
>> gathered on the read side, and the total code should be:
>
> How do you take care of accounting when a group gets deleted?
Propagate the accounting to the parent or root group when the group is deleted.

> Wouldn't you lose some information due to deleted groups if you want
> to accumulate hierarchical stats only during read?
>
Taking hierarchy mutex while reading will make read-side stable.
I haven't considered it in detail, but I wrote the CSS ID and hierarchy-scan
code to do this kind of "visit all css objects under a hierarchy" job.
I think you can see what I mean by reading what the memory.stat file of the
memory cgroup does. (mmotm)

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26  8:41   ` Li Zefan
@ 2009-02-26 10:40     ` Balbir Singh
  2009-02-26 10:43       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: Balbir Singh @ 2009-02-26 10:40 UTC (permalink / raw)
  To: Li Zefan; +Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, LKML

* Li Zefan <lizf@cn.fujitsu.com> [2009-02-26 16:41:20]:

> Balbir Singh wrote:
> > * Li Zefan <lizf@cn.fujitsu.com> [2009-02-26 15:40:15]:
> > 
> >> cpuacct_charge() is in the fast path, and the condition !cpuacct_subsys.active
> >> always evaluates to false after cpuacct has been initialized at system boot.
> >>
> >> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
> >> ---
> >>  kernel/sched.c |    2 +-
> >>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/kernel/sched.c b/kernel/sched.c
> >> index 410eec4..fd2f7fc 100644
> >> --- a/kernel/sched.c
> >> +++ b/kernel/sched.c
> >> @@ -9589,7 +9589,7 @@ static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
> >>  	struct cpuacct *ca;
> >>  	int cpu;
> >>
> >> -	if (!cpuacct_subsys.active)
> >> +	if (unlikely(!cpuacct_subsys.active))
> > 
> > What happens if most distros make cpuacct_subsys active? Can we decide
> > on unlikely depending on how the system is configured?
> > 
> 
> Note, it's ->active but not ->disabled. ;)
>

I've seen that
 
> And note, I make it "unlikely(!cpuacct_subsys.active)", but not "unlikely(cpuacct_subsys.active)"
>
> ->active can't be configured by users. If CONFIG_CGROUP_CPUACCT is enabled,
> cpuacct_subsys.active will be set to true after cpuacct is initialized.
>

My point is that whether likely/unlikely is right in this case depends on
turning CONFIG_CGROUP_CPUACCT on/off. Am I missing something?


-- 
	Balbir

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 10:40     ` Balbir Singh
@ 2009-02-26 10:43       ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2009-02-26 10:43 UTC (permalink / raw)
  To: balbir; +Cc: Li Zefan, Ingo Molnar, Paul Menage, LKML

On Thu, 2009-02-26 at 16:10 +0530, Balbir Singh wrote:

> My point is that whether likely/unlikely is right in this case depends on
> turning CONFIG_CGROUP_CPUACCT on/off. Am I missing something?

Yeah, that whole function is a stub for !CPUACCT

#ifdef CONFIG_CGROUP_CPUACCT
static void cpuacct_charge(struct task_struct *tsk, u64 cputime);
#else
static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
#endif



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 10:28             ` KAMEZAWA Hiroyuki
@ 2009-02-26 10:44               ` Peter Zijlstra
  2009-02-26 10:55                 ` KAMEZAWA Hiroyuki
  2009-02-26 11:17                 ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 42+ messages in thread
From: Peter Zijlstra @ 2009-02-26 10:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage, Balbir Singh, LKML

On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:

> Taking hierarchy mutex while reading will make read-side stable.

We're talking about scheduling here; taking a mutex to stop scheduling
won't work, nor will it be acceptable to use anything that will.




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 10:44               ` Peter Zijlstra
@ 2009-02-26 10:55                 ` KAMEZAWA Hiroyuki
  2009-02-26 11:22                   ` Peter Zijlstra
  2009-02-26 11:17                 ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26 10:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

Peter Zijlstra wrote:
> On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
>
>> Taking hierarchy mutex while reading will make read-side stable.
>
> We're talking about scheduling here; taking a mutex to stop scheduling
> won't work, nor will it be acceptable to use anything that will.
>
I'm sorry, but I thought we were talking about the cpuacct subsystem and not
the cpu subsystem.... is the cpuacct subsystem's information used for scheduling?

Is there a user of the information gathered by the cpuacct subsystem in the kernel?
I thought the only user (reader) is the one who reads the cpuacct.usage file.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 10:44               ` Peter Zijlstra
  2009-02-26 10:55                 ` KAMEZAWA Hiroyuki
@ 2009-02-26 11:17                 ` KAMEZAWA Hiroyuki
  2009-02-26 11:28                   ` Peter Zijlstra
  1 sibling, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

Peter Zijlstra wrote:
> On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
>
>> Taking hierarchy mutex while reading will make read-side stable.
>
> We're talking about scheduling here; taking a mutex to stop scheduling
> won't work, nor will it be acceptable to use anything that will.
>
No mutex is necessary, anyway.
A hierarchy-walker function works perfectly well under the rcu read lock,
if small jitter is allowed.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 10:55                 ` KAMEZAWA Hiroyuki
@ 2009-02-26 11:22                   ` Peter Zijlstra
  0 siblings, 0 replies; 42+ messages in thread
From: Peter Zijlstra @ 2009-02-26 11:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage, Balbir Singh, LKML

On Thu, 2009-02-26 at 19:55 +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
> >
> >> Taking hierarchy mutex while reading will make read-side stable.
> >
> > We're talking about scheduling here; taking a mutex to stop scheduling
> > won't work, nor will it be acceptable to use anything that will.
> >
> I'm sorry, but I thought we were talking about the cpuacct subsystem and not
> the cpu subsystem.... is the cpuacct subsystem's information used for scheduling?

Not used for scheduling, but directly generated by it.



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 11:17                 ` KAMEZAWA Hiroyuki
@ 2009-02-26 11:28                   ` Peter Zijlstra
  2009-02-26 12:06                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-02-26 11:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage, Balbir Singh, LKML

On Thu, 2009-02-26 at 20:17 +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
> >
> >> Taking hierarchy mutex while reading will make read-side stable.
> >
> > We're talking about scheduling here; taking a mutex to stop scheduling
> > won't work, nor will it be acceptable to use anything that will.
> >
> No mutex is necessary, anyway.
> A hierarchy-walker function works perfectly well under the rcu read lock,
> if small jitter is allowed.

Right, should be doable -- and looking at the code, we have this
horrible 32 bit exception in there that locks the rq in order to read
the 64bit value.

Would be grand to get rid of that. How bad would it be for userspace to
get the occasionally fubarred value?

But aside from that, the cpu controller itself is also summing directly
up the hierarchy, so cpuacct doing the same doesn't seem odd.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 11:28                   ` Peter Zijlstra
@ 2009-02-26 12:06                     ` KAMEZAWA Hiroyuki
  2009-02-26 12:20                       ` Peter Zijlstra
  2009-02-26 16:45                       ` Paul E. McKenney
  0 siblings, 2 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-26 12:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

Peter Zijlstra wrote:
> On Thu, 2009-02-26 at 20:17 +0900, KAMEZAWA Hiroyuki wrote:
>> Peter Zijlstra wrote:
>> > On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
>> >
>> >> Taking hierarchy mutex while reading will make read-side stable.
>> >
>> > We're talking about scheduling here; taking a mutex to stop scheduling
>> > won't work, nor will it be acceptable to use anything that will.
>> >
>> No mutex is necessary, anyway.
>> A hierarchy-walker function works perfectly well under the rcu read lock,
>> if small jitter is allowed.
>
> Right, should be doable -- and looking at the code, we have this
> horrible 32 bit exception in there that locks the rq in order to read
> the 64bit value.
>
> Would be grand to get rid of that. How bad would it be for userspace to
> get the occasionally fubarred value?
>
From the point of view of user support, if a terribly broken value is
reported, it will become a user incident and annoy me (us) ;)

I'd like to get rid of rq->lock, too..Hmm.. some routine like
atomic64_read() can help this ? (But I don't want to use atomic_t here..)

> But aside from that, the cpu controller itself is also summing directly
> up the hierarchy, so cpuacct doing the same doesn't seem odd.
>
I'll post some ideas if I can think of something reasonable.
But I tend to hesitate to modify sched.c ;)

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 12:06                     ` KAMEZAWA Hiroyuki
@ 2009-02-26 12:20                       ` Peter Zijlstra
  2009-02-26 12:26                         ` Ingo Molnar
  2009-02-26 16:45                       ` Paul E. McKenney
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-02-26 12:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage, Balbir Singh, LKML

On Thu, 2009-02-26 at 21:06 +0900, KAMEZAWA Hiroyuki wrote:
> Hmm.. some routine like
> atomic64_read() can help this ? (But I don't want to use atomic_t here..)

Yeah, atomic64_t has been proposed numerous times, and x86 could
actually implement that using cmpxchg8b, just not sure about all the
other 32bit archs, and if we start using it in the scheduler, they'd
better have it implemented.

Furthermore, it would require all the ops to be atomic, which would
require them all to use cmpxchg8b, and it's a horridly expensive op to
use :-(
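
For reference, the read side of the cmpxchg8b trick is cheap to sketch: a
compare-and-exchange of a value with itself never modifies memory but always
returns the current contents atomically (a sketch; cmpxchg64() as on 32-bit
x86 of that era, illustrative only):

static u64 a64_read(volatile u64 *p)
{
	/*
	 * If *p == 0 this stores 0, a no-op; otherwise the compare fails.
	 * Either way the current 64-bit value comes back atomically.
	 */
	return cmpxchg64(p, 0, 0);
}

It is the update side that forces every operation into a cmpxchg8b loop, which
is the expense objected to above.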


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 12:20                       ` Peter Zijlstra
@ 2009-02-26 12:26                         ` Ingo Molnar
  2009-02-26 12:40                           ` Arnd Bergmann
  2009-02-27  4:25                           ` Paul Mackerras
  0 siblings, 2 replies; 42+ messages in thread
From: Ingo Molnar @ 2009-02-26 12:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, Bharata B Rao, Li Zefan, Paul Menage,
	Balbir Singh, LKML


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Thu, 2009-02-26 at 21:06 +0900, KAMEZAWA Hiroyuki wrote:
> > Hmm.. some routine like
> > atomic64_read() can help this ? (But I don't want to use atomic_t here..)
> 
> Yeah, atomic64_t has been proposed numerous times, and x86 
> could actually implement that using cmpxchg8b, just not sure 
> about all the other 32bit archs, and if we start using it in 
> the scheduler, they'd better have it implemented.

I have written a working atomic64_t implementation for 
tip:perfcounters/core, for 32-bit x86.

	Ingo

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 12:26                         ` Ingo Molnar
@ 2009-02-26 12:40                           ` Arnd Bergmann
  2009-02-27  4:25                           ` Paul Mackerras
  1 sibling, 0 replies; 42+ messages in thread
From: Arnd Bergmann @ 2009-02-26 12:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, Bharata B Rao, Li Zefan,
	Paul Menage, Balbir Singh, LKML

On Thursday 26 February 2009, Ingo Molnar wrote:
>
> > Yeah, atomic64_t has been proposed numerous times, and x86 
> > could actually implement that using cmpxchg8b, just not sure 
> > about all the other 32bit archs, and if we start using it in 
> > the scheduler, they'd better have it implemented.
> 
> I have written a working atomic64_t implementation for 
> tip:perfcounters/core, for 32-bit x86.

It should also be possible to write an asm-generic atomic64_t
implementation based on the parisc atomic_t hashlock code.
For non-SMP configurations, it falls back on local_irq_save(),
which is already the method used for atomic_t on half the
embedded architectures.

	Arnd <><
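
The hashlock scheme Arnd mentions keeps a small array of spinlocks and picks
one by hashing the variable's address; a minimal sketch (illustrative, not the
parisc code itself; the locks would be spin_lock_init()'d at boot):

#define A64_NR_LOCKS	16
static spinlock_t a64_lock[A64_NR_LOCKS];

static spinlock_t *a64_lock_for(const volatile void *addr)
{
	/* hash the address so unrelated counters rarely share a lock */
	return &a64_lock[((unsigned long)addr >> 4) % A64_NR_LOCKS];
}

static u64 a64_read(const volatile u64 *v)
{
	unsigned long flags;
	spinlock_t *lock = a64_lock_for(v);
	u64 val;

	spin_lock_irqsave(lock, flags);
	val = *v;
	spin_unlock_irqrestore(lock, flags);
	return val;
}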

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 12:06                     ` KAMEZAWA Hiroyuki
  2009-02-26 12:20                       ` Peter Zijlstra
@ 2009-02-26 16:45                       ` Paul E. McKenney
  2009-02-27  0:58                         ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2009-02-26 16:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

On Thu, Feb 26, 2009 at 09:06:24PM +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Thu, 2009-02-26 at 20:17 +0900, KAMEZAWA Hiroyuki wrote:
> >> Peter Zijlstra wrote:
> >> > On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
> >> >
> >> >> Taking hierarchy mutex while reading will make read-side stable.
> >> >
> >> > We're talking about scheduling here; taking a mutex to stop scheduling
> >> > won't work, nor will it be acceptable to use anything that will.
> >> >
> >> No mutex is necessary, anyway.
> >> A hierarchy-walker function works perfectly well under the rcu read lock,
> >> if small jitter is allowed.
> >
> > Right, should be doable -- and looking at the code, we have this
> > horrible 32 bit exception in there that locks the rq in order to read
> > the 64bit value.
> >
> > Would be grand to get rid of that. How bad would it be for userspace to
> > get the occasionally fubarred value?
> >
> From the point of view of user support, if a terribly broken value is
> reported, it will become a user incident and annoy me (us) ;)
> 
> I'd like to get rid of rq->lock, too..Hmm.. some routine like
> atomic64_read() can help this ? (But I don't want to use atomic_t here..)

atomic64_read() will not help you on a 32-bit machine.  Here is the
sequence of events that will cause the aforementioned user incidents and
consequent annoyance:

o	The value of the counter is (2^32)-1, or 0xffffffff.

o	CPU 0 reads the high-order 32 bits of the counter, getting zero.

o	CPU 1 increments the low-order 32 bits of the counter, resulting
	in zero, but notes that there is a carry out of this field.

o	CPU 0 reads the low-order 32 bits of the counter, getting zero.

o	CPU 1 increments the high-order 32 bits of the counter, so that
	the new value of the counter is 2^32, or 0x100000000.

So CPU 0 gets a value that is -way- off.

The usual trick is something like the following for counter read:

1.	Read the high-order 32 bits of the counter.

2.	Do a memory barrier, smp_mb().

3.	Read the low-order 32 bits of the counter.

4.	Do another memory barrier, again smp_mb().

5.	Read the high-order 32 bits of the counter again.

	If it is the same as the value obtained in step 1 (or the previous
	execution of step 5), then we are done.  (This works even in case
	of complete 64-bit overflow, though we should be very lucky to
	live that long!)  Otherwise, go to step 2.

But it is also necessary to modify the counter update:

1.	Increment the low-order 32 bits of the counter.  If no overflow
	occurred, we are done, otherwise, continue through this sequence
	of steps.

2.	Do a memory barrier, smp_mb().

3.	Increment the high-order 32 bits of the counter.

How to detect overflow in step 1?  Well, if we are incrementing, we can
just test for the new value being zero.  Otherwise, if we are adding
a 32-bit number, if the new value of the low-order 32 bits of counter
is less than the old value, overflow occurred (but make sure that the
comparison is unsigned!).

This all assumes that you are adding a 32-bit quantity to the counter.
Adding 64-bit values is not much harder.

Does this approach work for you?

							Thanx, Paul

> > But aside from that, the cpu controller itself is also summing directly
> > up the hierarchy, so cpuacct doing the same doesn't seem odd.
> >
> I'll post some ideas if I can think of something reasonable.
> But I tend to hesitate to modify sched.c ;)
> 
> Thanks,
> -Kame
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
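
Paul's read protocol, spelled out as code (a sketch; the split into two u32
halves and the helper name are illustrative, and, as the follow-ups note, it
additionally requires the update side to be atomic):

static u64 split_counter_read(u32 *hi, u32 *lo)
{
	u32 h0, h1, l;

	h1 = *hi;			/* step 1 */
	do {
		h0 = h1;
		smp_mb();		/* step 2 */
		l = *lo;		/* step 3 */
		smp_mb();		/* step 4 */
		h1 = *hi;		/* step 5 */
	} while (h0 != h1);		/* retry until the high word is stable */

	return ((u64)h1 << 32) | l;
}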

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 16:45                       ` Paul E. McKenney
@ 2009-02-27  0:58                         ` KAMEZAWA Hiroyuki
  2009-02-27  1:29                           ` Paul E. McKenney
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-27  0:58 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

On Thu, 26 Feb 2009 08:45:09 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Thu, Feb 26, 2009 at 09:06:24PM +0900, KAMEZAWA Hiroyuki wrote:
> > Peter Zijlstra wrote:
> > > On Thu, 2009-02-26 at 20:17 +0900, KAMEZAWA Hiroyuki wrote:
> > >> Peter Zijlstra wrote:
> > >> > On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
> > >> >
> > >> >> Taking hierarchy mutex while reading will make read-side stable.
> > >> >
> > >> > We're talking about scheduling here; taking a mutex to stop scheduling
> > >> > won't work, nor will it be acceptable to use anything that will.
> > >> >
> > >> No mutex is necessary, anyway.
> > >> A hierarchy-walker function works perfectly well under the rcu read lock,
> > >> if small jitter is allowed.
> > >
> > > Right, should be doable -- and looking at the code, we have this
> > > horrible 32 bit exception in there that locks the rq in order to read
> > > the 64bit value.
> > >
> > > Would be grand to get rid of that. How bad would it be for userspace to
> > > get the occasionally fubarred value?
> > >
> > From the point of view of user support, if a terribly broken value is
> > reported, it will become a user incident and annoy me (us) ;)
> > 
> > I'd like to get rid of rq->lock, too..Hmm.. some routine like
> > atomic64_read() can help this ? (But I don't want to use atomic_t here..)
> 
> atomic64_read() will not help you on a 32-bit machine.  Here is the
> sequence of events that will cause the aforementioned user incidents and
> consequent annoyance:
> 
> o	The value of the counter is (2^32)-1, or 0xffffffff.
> 
> o	CPU 0 reads the high-order 32 bits of the counter, getting zero.
> 
> o	CPU 1 increments the low-order 32 bits of the counter, resulting
> 	in zero, but notes that there is a carry out of this field.
> 
> o	CPU 0 reads the low-order 32 bits of the counter, getting zero.
> 
> o	CPU 1 increments the high-order 32 bits of the counter, so that
> 	the new value of the counter is 2^32, or 0x100000000.
> 
> So CPU 0 gets a value that is -way- off.
> 
> The usual trick is something like the following for counter read:
> 
> 1.	Read the high-order 32 bits of the counter.
> 
> 2.	Do a memory barrier, smp_mb().
> 
> 3.	Read the low-order 32 bits of the counter.
> 
> 4.	Do another memory barrier, again smp_mb().
> 
> 5.	Read the high-order 32 bits of the counter again.
> 
> 	If it is the same as the value obtained in step 1 (or the previous
> 	execution of step 5), then we are done.  (This works even in case
> 	of complete 64-bit overflow, though we should be very lucky to
> 	live that long!)  Otherwise, go to step 2.
> 
> But it is also necessary to modify the counter update:
> 
> 1.	Increment the low-order 32 bits of the counter.  If no overflow
> 	occurred, we are done, otherwise, continue through this sequence
> 	of steps.
> 
> 2.	Do a memory barrier, smp_mb().
> 
> 3.	Increment the high-order 32 bits of the counter.
> 
> How to detect overflow in step 1?  Well, if we are incrementing, we can
> just test for the new value being zero.  Otherwise, if we are adding
> a 32-bit number, if the new value of the low-order 32 bits of counter
> is less than the old value, overflow occurred (but make sure that the
> comparison is unsigned!).
> 
> This all assumes that you are adding a 32-bit quantity to the counter.
> Adding 64-bit values is not much harder.
> 
> Does this approach work for you?
> 

Thank you. I'll try something and post it if it seems easy to read/merge.
Hmm, but in your approach, can't we see the counter go backward?
(if the reader sees only the low 32 bits incremented.)

Can't we use the seqcount in include/linux/seqlock.h?
There is only one writer, so we don't need a write-side lock.

Thanks,
-Kame
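
For reference, the single-writer pattern being suggested, as a minimal sketch
on top of the include/linux/seqlock.h primitives:

static seqcount_t cnt_seq = SEQCNT_ZERO;
static u64 cnt;

static void cnt_add(u64 delta)		/* one writer, preemption off */
{
	write_seqcount_begin(&cnt_seq);
	cnt += delta;
	write_seqcount_end(&cnt_seq);
}

static u64 cnt_get(void)		/* lock-free readers, any number */
{
	unsigned seq;
	u64 val;

	do {
		seq = read_seqcount_begin(&cnt_seq);
		val = cnt;
	} while (read_seqcount_retry(&cnt_seq, seq));

	return val;
}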


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-27  0:58                         ` KAMEZAWA Hiroyuki
@ 2009-02-27  1:29                           ` Paul E. McKenney
  2009-02-27  3:22                             ` [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was " KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Paul E. McKenney @ 2009-02-27  1:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

On Fri, Feb 27, 2009 at 09:58:56AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 26 Feb 2009 08:45:09 -0800
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Thu, Feb 26, 2009 at 09:06:24PM +0900, KAMEZAWA Hiroyuki wrote:
> > > Peter Zijlstra wrote:
> > > > On Thu, 2009-02-26 at 20:17 +0900, KAMEZAWA Hiroyuki wrote:
> > > >> Peter Zijlstra wrote:
> > > >> > On Thu, 2009-02-26 at 19:28 +0900, KAMEZAWA Hiroyuki wrote:
> > > >> >
> > > >> >> Taking hierarchy mutex while reading will make read-side stable.
> > > >> >
> > > >> > We're talking about scheduling here; taking a mutex to stop scheduling
> > > >> > won't work, nor will it be acceptable to use anything that will.
> > > >> >
> > > >> No mutex is necessary, anyway.
> > > >> A hierarchy-walker function works perfectly well under the rcu read lock,
> > > >> if small jitter is allowed.
> > > >
> > > > Right, should be doable -- and looking at the code, we have this
> > > > horrible 32 bit exception in there that locks the rq in order to read
> > > > the 64bit value.
> > > >
> > > > Would be grand to get rid of that. How bad would it be for userspace to
> > > > get the occasionally fubarred value?
> > > >
> > > From the point of view of user support, if a terribly broken value is
> > > reported, it will become a user incident and annoy me (us) ;)
> > > 
> > > I'd like to get rid of rq->lock, too..Hmm.. some routine like
> > > atomic64_read() can help this ? (But I don't want to use atomic_t here..)
> > 
> > atomic64_read() will not help you on a 32-bit machine.  Here is the
> > sequence of events that will cause the aforementioned user incidents and
> > consequent annoyance:
> > 
> > o	The value of the counter is (2^32)-1, or 0xffffffff.
> > 
> > o	CPU 0 reads the high-order 32 bits of the counter, getting zero.
> > 
> > o	CPU 1 increments the low-order 32 bits of the counter, resulting
> > 	in zero, but notes that there is a carry out of this field.
> > 
> > o	CPU 0 reads the low-order 32 bits of the counter, getting zero.
> > 
> > o	CPU 1 increments the high-order 32 bits of the counter, so that
> > 	the new value of the counter is 2^32, or 0x100000000.
> > 
> > So CPU 0 gets a value that is -way- off.
> > 
> > The usual trick is something like the following for counter read:
> > 
> > 1.	Read the high-order 32 bits of the counter.
> > 
> > 2.	Do a memory barrier, smp_mb().
> > 
> > 3.	Read the low-order 32 bits of the counter.
> > 
> > 4.	Do another memory barrier, again smp_mb().
> > 
> > 5.	Read the high-order 32 bits of the counter again.
> > 
> > 	If it is the same as the value obtained in step 1 (or the previous
> > 	execution of step 5), then we are done.  (This works even in case
> > 	of complete 64-bit overflow, though we should be very lucky to
> > 	live that long!)  Otherwise, go to step 2.
> > 
> > But it is also necessary to modify the counter update:
> > 
> > 1.	Increment the low-order 32 bits of the counter.  If no overflow
> > 	occurred, we are done, otherwise, continue through this sequence
> > 	of steps.
> > 
> > 2.	Do a memory barrier, smp_mb().
> > 
> > 3.	Increment the high-order 32 bits of the counter.
> > 
> > How to detect overflow in step 1?  Well, if we are incrementing, we can
> > just test for the new value being zero.  Otherwise, if we are adding
> > a 32-bit number, if the new value of the low-order 32 bits of counter
> > is less than the old value, overflow occurred (but make sure that the
> > comparison is unsigned!).
> > 
> > This all assumes that you are adding a 32-bit quantity to the counter.
> > Adding 64-bit values is not much harder.
> > 
> > Does this approach work for you?
> > 
> 
> Thank you. I'll try something and post it if it seems easy to read/merge.
> Hmm, but in your approach, can't we see the counter go backward?
> (if the reader sees only the low 32 bits incremented.)

Ouch, indeed!  The update would need to be atomic for my approach to
work.  My apologies for my confusion!

> Can't we use the seqcount in include/linux/seqlock.h?
> There is only one writer, so we don't need a write-side lock.

Yes, seqlock should work fine, good point!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-02-27  1:29                           ` Paul E. McKenney
@ 2009-02-27  3:22                             ` KAMEZAWA Hiroyuki
  2009-03-02 14:56                               ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-02-27  3:22 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

On Thu, 26 Feb 2009 17:29:15 -0800
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > Can't we use the seqcount in include/linux/seqlock.h?
> > There is only one writer, so we don't need a write-side lock.
> 
> Yes, seqlock should work fine, good point!
> 

This is a trial version (it may not apply cleanly against the latest mmotm).
Does it seem overkill?

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The cgroup/cpuacct subsystem counts cpu usage in a 64bit counter in a
per-cpu object. On the read side (via the cpuacct.usage file), it takes
the rq->lock of (other) cpus to read the 64bit value in a safe manner.

In general, taking the rq->lock of other cpus from code outside the
scheduler is not good. This patch tries to remove the rq->lock used on
the read side.

To read the 64bit value safely, this patch uses a seqcounter.

Pros.
  - rq->lock is not necessary.
Cons.
  - When updating the counter, the sequence number must be updated too.
    (I hope this per-cpu sequence number stays in cache...)
  - not simple.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 kernel/sched.c |  141 ++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 105 insertions(+), 36 deletions(-)

Index: mmotm-2.6.29-Feb24/kernel/sched.c
===================================================================
--- mmotm-2.6.29-Feb24.orig/kernel/sched.c
+++ mmotm-2.6.29-Feb24/kernel/sched.c
@@ -9581,6 +9581,67 @@ struct cgroup_subsys cpu_cgroup_subsys =
 
 #ifdef CONFIG_CGROUP_CPUACCT
 
+#ifndef CONFIG_64BIT
+DEFINE_PER_CPU(struct seqcount, cpuacct_cgroup_seq);
+
+static inline void cpuacct_start_counter_update(void)
+{
+	/* This is called under rq->lock and IRQ is off */
+	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
+
+	write_seqcount_begin(s);
+	put_cpu_var(cpuacct_cgroup_seq);
+}
+
+static inline void cpuacct_end_counter_update(void)
+{
+	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
+
+	write_seqcount_end(s);
+	put_cpu_var(cpuacct_cgroup_seq);
+}
+
+static inline u64
+cpuacct_read_counter(u64 *val, int cpu)
+{
+	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
+	unsigned int seq;
+	u64 data;
+
+	do {
+		seq = read_seqcount_begin(s);
+		data = *val;
+	} while (read_seqcount_retry(s, seq));
+	return data;
+}
+/* This is a special function called against "offline" cpus. */
+static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
+{
+	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
+
+	preempt_disable();
+	write_seqcount_begin(s);
+	*val = 0;
+	write_seqcount_end(s);
+	preempt_enable();
+}
+#else
+static inline void cpuacct_start_counter_update(void)
+{
+}
+static inline void cpuacct_end_counter_update(void)
+{
+}
+static inline u64 cpuacct_read_counter(u64 *val, int cpu)
+{
+	return *val;
+}
+static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
+{
+	*val = 0;
+}
+#endif
+
 /*
  * CPU accounting code for task groups.
  *
@@ -9596,6 +9657,11 @@ struct cpuacct {
 	struct cpuacct *parent;
 };
 
+struct cpuacct_work {
+	struct work_struct work;
+	struct cpuacct *cpuacct;
+};
+
 struct cgroup_subsys cpuacct_subsys;
 
 /* return cpu accounting group corresponding to this container */
@@ -9643,39 +9709,29 @@ cpuacct_destroy(struct cgroup_subsys *ss
 	kfree(ca);
 }
 
+/* In a 32bit environment, a seqcounter is used to read the 64bit value safely */
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
 {
 	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
 	u64 data;
 
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
-	 */
-	spin_lock_irq(&cpu_rq(cpu)->lock);
-	data = *cpuusage;
-	spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	data = *cpuusage;
-#endif
+	data = cpuacct_read_counter(cpuusage, cpu);
 
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+/* called by per-cpu workqueue */
+static void cpuacct_cpuusage_reset_cpu(struct work_struct *work)
 {
+	struct cpuacct_work *cw = container_of(work, struct cpuacct_work, work);
+	struct cpuacct *ca = cw->cpuacct;
+	int cpu = get_cpu();
 	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
 
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
-	 */
-	spin_lock_irq(&cpu_rq(cpu)->lock);
-	*cpuusage = val;
-	spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	*cpuusage = val;
-#endif
+	cpuacct_start_counter_update();
+	*cpuusage = 0;
+	cpuacct_end_counter_update();
+	put_cpu();
 }
 
 /* return total cpu usage (in nanoseconds) of a group */
@@ -9691,23 +9747,34 @@ static u64 cpuusage_read(struct cgroup *
 	return totalcpuusage;
 }
 
-static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
-								u64 reset)
+static int cpuacct_cpuusage_reset(struct cgroup *cgrp, unsigned int event)
 {
 	struct cpuacct *ca = cgroup_ca(cgrp);
-	int err = 0;
-	int i;
-
-	if (reset) {
-		err = -EINVAL;
-		goto out;
+	int cpu;
+	/*
+	 * Reset All counters....doesn't need to be fast.
+	 * "ca" will be stable while doing this. We are in write() syscall.
+	 */
+	get_online_cpus();
+	/*
+	 * Because we use alloc_percpu() for allocating counter, we have
+	 * a counter per a possible cpu. Reset all online's by workqueue and
+	 * reset offline cpu's directly.
+	 */
+	for_each_possible_cpu(cpu) {
+		if (cpu_online(cpu)) {
+			struct cpuacct_work cw;
+			INIT_WORK(&cw.work, cpuacct_cpuusage_reset_cpu);
+			cw.cpuacct = ca;
+			schedule_work_on(cpu, &cw.work);
+			flush_work(&cw.work);
+		} else {
+			u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
+			cpuacct_reset_offline_counter(cpuusage, cpu);
+		}
 	}
-
-	for_each_present_cpu(i)
-		cpuacct_cpuusage_write(ca, i, 0);
-
-out:
-	return err;
+	put_online_cpus();
+	return 0;
 }
 
 static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
@@ -9729,7 +9796,7 @@ static struct cftype files[] = {
 	{
 		.name = "usage",
 		.read_u64 = cpuusage_read,
-		.write_u64 = cpuusage_write,
+		.trigger = cpuacct_cpuusage_reset,
 	},
 	{
 		.name = "usage_percpu",
@@ -9756,10 +9823,12 @@ static void cpuacct_charge(struct task_s
 	cpu = task_cpu(tsk);
 	ca = task_ca(tsk);
 
+	cpuacct_start_counter_update();
 	for (; ca; ca = ca->parent) {
 		u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
 		*cpuusage += cputime;
 	}
+	cpuacct_end_counter_update();
 }
 
 struct cgroup_subsys cpuacct_subsys = {


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] cpuacct: add a branch prediction
  2009-02-26 12:26                         ` Ingo Molnar
  2009-02-26 12:40                           ` Arnd Bergmann
@ 2009-02-27  4:25                           ` Paul Mackerras
  1 sibling, 0 replies; 42+ messages in thread
From: Paul Mackerras @ 2009-02-27  4:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, Bharata B Rao, Li Zefan,
	Paul Menage, Balbir Singh, LKML

Ingo Molnar writes:

> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Thu, 2009-02-26 at 21:06 +0900, KAMEZAWA Hiroyuki wrote:
> > > Hmm.. some routine like
> > > atomic64_read() can help this ? (But I don't want to use atomic_t here..)
> > 
> > Yeah, atomic64_t has been proposed numerous times, and x86 
> > could actually implement that using cmpxchg8b, just not sure 
> > about all the other 32bit archs, and if we start using it in 
> > the scheduler, they'd better have it implemented.
> 
> I have written a working atomic64_t implementation for 
> tip:perfcounters/core, for 32-bit x86.

atomic64_t would be a big problem for 32-bit powerpc.  We'd have to
use an array of spinlocks, or make atomic64_t actually be 12 bytes so
we have a lock word to use.

Paul.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-02-27  3:22                             ` [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was " KAMEZAWA Hiroyuki
@ 2009-03-02 14:56                               ` Peter Zijlstra
  2009-03-02 23:42                                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-02 14:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: paulmck, Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage,
	Balbir Singh, LKML

On Fri, 2009-02-27 at 12:22 +0900, KAMEZAWA Hiroyuki wrote:

Comments below..

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> The cgroup/cpuacct subsystem counts cpu usage in a 64bit counter in a
> per-cpu object. On the read side (via the cpuacct.usage file), it takes
> the rq->lock of (other) cpus to read the 64bit value in a safe manner.
> 
> In general, taking the rq->lock of other cpus from code outside the
> scheduler is not good. This patch tries to remove the rq->lock used on
> the read side.
>
> To read the 64bit value safely, this patch uses a seqcounter.
> 
> Pros.
>   - rq->lock is not necessary.
> Cons.
>   - When updating the counter, the sequence number must be updated too.
>     (I hope this per-cpu sequence number stays in cache...)
>   - not simple.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  kernel/sched.c |  141 ++++++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 105 insertions(+), 36 deletions(-)
> 
> Index: mmotm-2.6.29-Feb24/kernel/sched.c
> ===================================================================
> --- mmotm-2.6.29-Feb24.orig/kernel/sched.c
> +++ mmotm-2.6.29-Feb24/kernel/sched.c
> @@ -9581,6 +9581,67 @@ struct cgroup_subsys cpu_cgroup_subsys =
>  
>  #ifdef CONFIG_CGROUP_CPUACCT
>  
> +#ifndef CONFIG_64BIT
> +DEFINE_PER_CPU(struct seqcount, cpuacct_cgroup_seq);
> +
> +static inline void cpuacct_start_counter_update(void)
> +{
> +	/* This is called under rq->lock and IRQ is off */
> +	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> +
> +	write_seqcount_begin(s);
> +	put_cpu_var(cpuacct_cgroup_seq);
> +}
> +
> +static inline void cpuacct_end_counter_update(void)
> +{
> +	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> +
> +	write_seqcount_end(s);
> +	put_cpu_var(cpuacct_cgroup_seq);
> +}

It seems odd that we disable/enable preemption in both; I would expect
start to disable preemption and end to enable it again, or to use
__get_cpu_var() and assume preemption is already disabled (the callsites
are under rq->lock, right?)
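
A variant along those lines might look like this (a sketch, assuming the
callsites really do hold rq->lock with IRQs disabled):

static inline void cpuacct_start_counter_update(void)
{
	/* rq->lock is held and IRQs are off: no get_cpu_var() dance needed */
	write_seqcount_begin(&__get_cpu_var(cpuacct_cgroup_seq));
}

static inline void cpuacct_end_counter_update(void)
{
	write_seqcount_end(&__get_cpu_var(cpuacct_cgroup_seq));
}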

> +static inline u64
> +cpuacct_read_counter(u64 *val, int cpu)
> +{
> +	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> +	unsigned int seq;
> +	u64 data;
> +
> +	do {
> +		seq = read_seqcount_begin(s);
> +		data = *val;
> +	} while (read_seqcount_retry(s, seq));
> +	return data;
> +}
> +/* This is a special function called against "offline" cpus. */
> +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> +{
> +	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> +
> +	preempt_disable();
> +	write_seqcount_begin(s);
> +	*val = 0;
> +	write_seqcount_end(s);
> +	preempt_enable();
> +}

And here you disable preemption twice, which is quite useless if you are
taking a remote cpu's per-cpu data.

> +#else
> +static inline void cpuacct_start_counter_update(void)
> +{
> +}
> +static inline void cpuacct_end_counter_update(void)
> +{
> +}
> +static inline u64 cpuacct_read_counter(u64 *val, int cpu)
> +{
> +	return *val;
> +}
> +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> +{
> +	*val = 0;
> +}
> +#endif
> +
>  /*
>   * CPU accounting code for task groups.
>   *
> @@ -9596,6 +9657,11 @@ struct cpuacct {
>  	struct cpuacct *parent;
>  };
>  
> +struct cpuacct_work {
> +	struct work_struct work;
> +	struct cpuacct *cpuacct;
> +};
> +
>  struct cgroup_subsys cpuacct_subsys;
>  
>  /* return cpu accounting group corresponding to this container */
> @@ -9643,39 +9709,29 @@ cpuacct_destroy(struct cgroup_subsys *ss
>  	kfree(ca);
>  }
>  
> +/* In a 32bit environment, a seqcounter is used to read the 64bit value safely */
>  static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
>  {
>  	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
>  	u64 data;
>  
> -#ifndef CONFIG_64BIT
> -	/*
> -	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
> -	 */
> -	spin_lock_irq(&cpu_rq(cpu)->lock);
> -	data = *cpuusage;
> -	spin_unlock_irq(&cpu_rq(cpu)->lock);
> -#else
> -	data = *cpuusage;
> -#endif
> +	data = cpuacct_read_counter(cpuusage, cpu);
>  
>  	return data;
>  }
>  
> -static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
> +/* called by per-cpu workqueue */
> +static void cpuacct_cpuusage_reset_cpu(struct work_struct *work)
>  {
> +	struct cpuacct_work *cw = container_of(work, struct cpuacct_work, work);
> +	struct cpuacct *ca = cw->cpuacct;
> +	int cpu = get_cpu();
>  	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
>  
> -#ifndef CONFIG_64BIT
> -	/*
> -	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
> -	 */
> -	spin_lock_irq(&cpu_rq(cpu)->lock);
> -	*cpuusage = val;
> -	spin_unlock_irq(&cpu_rq(cpu)->lock);
> -#else
> -	*cpuusage = val;
> -#endif
> +	cpuacct_start_counter_update();
> +	*cpuusage = 0;
> +	cpuacct_end_counter_update();
> +	put_cpu();
>  }
>  
>  /* return total cpu usage (in nanoseconds) of a group */
> @@ -9691,23 +9747,34 @@ static u64 cpuusage_read(struct cgroup *
>  	return totalcpuusage;
>  }
>  
> -static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
> -								u64 reset)
> +static int cpuacct_cpuusage_reset(struct cgroup *cgrp, unsigned int event)
>  {
>  	struct cpuacct *ca = cgroup_ca(cgrp);
> -	int err = 0;
> -	int i;
> -
> -	if (reset) {
> -		err = -EINVAL;
> -		goto out;
> +	int cpu;
> +	/*
> +	 * Reset all counters. This doesn't need to be fast.
> +	 * "ca" stays stable while doing this; we are in a write() syscall.
> +	 */
> +	get_online_cpus();
> +	/*
> +	 * Because we use alloc_percpu() to allocate the counter, we have
> +	 * a counter per possible cpu. Reset online cpus' counters via the
> +	 * workqueue and offline cpus' counters directly.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		if (cpu_online(cpu)) {
> +			struct cpuacct_work cw;
> +			INIT_WORK(&cw.work, cpuacct_cpuusage_reset_cpu);
> +			cw.cpuacct = ca;
> +			schedule_work_on(cpu, &cw.work);
> +			flush_work(&cw.work);
> +		} else {
> +			u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> +			cpuacct_reset_offline_counter(cpuusage, cpu);
> +		}

I'm not particularly convinced this is the right way; schedule_work_on()
sounds far too expensive for setting a variable to 0.

Furthermore, if you want something like schedule_work_on() for each cpu,
there's schedule_on_each_cpu().
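
(For reference, its signature takes no data pointer:

int schedule_on_each_cpu(work_func_t func);

so there is no way to hand it the cpuacct to reset.)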

>  	}
> -
> -	for_each_present_cpu(i)
> -		cpuacct_cpuusage_write(ca, i, 0);
> -
> -out:
> -	return err;
> +	put_online_cpus();
> +	return 0;
>  }
>  
>  static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
> @@ -9729,7 +9796,7 @@ static struct cftype files[] = {
>  	{
>  		.name = "usage",
>  		.read_u64 = cpuusage_read,
> -		.write_u64 = cpuusage_write,
> +		.trigger = cpuacct_cpuusage_reset,
>  	},
>  	{
>  		.name = "usage_percpu",
> @@ -9756,10 +9823,12 @@ static void cpuacct_charge(struct task_s
>  	cpu = task_cpu(tsk);
>  	ca = task_ca(tsk);
>  
> +	cpuacct_start_counter_update();
>  	for (; ca; ca = ca->parent) {
>  		u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
>  		*cpuusage += cputime;
>  	}
> +	cpuacct_end_counter_update();
>  }
>  
>  struct cgroup_subsys cpuacct_subsys = {
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-03-02 14:56                               ` Peter Zijlstra
@ 2009-03-02 23:42                                 ` KAMEZAWA Hiroyuki
  2009-03-03  7:51                                   ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-02 23:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: paulmck, Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage,
	Balbir Singh, LKML

On Mon, 02 Mar 2009 15:56:10 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2009-02-27 at 12:22 +0900, KAMEZAWA Hiroyuki wrote:
> 
> Comments below..
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > The cgroup/cpuacct subsystem counts cpu usage with a 64-bit counter in a
> > per-cpu object. On the read side (via the cpuacct.usage file), it takes
> > the rq->lock of (other) cpus to read the 64-bit value safely.
> > 
> > In general, taking the rq->lock of other cpus from code outside the
> > scheduler is not good. This patch tries to remove the rq->lock used on
> > the read side.
> > 
> > To read the 64-bit value safely, this patch uses a seqcounter.
> > 
> > Pros.
> >   - rq->lock is not necessary.
> > Cons.
> >   - When updating the counter, the sequence number must be updated as well.
> >     (I hope this per-cpu sequence number stays in cache...)
> >   - not simple.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  kernel/sched.c |  141 ++++++++++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 105 insertions(+), 36 deletions(-)
> > 
> > Index: mmotm-2.6.29-Feb24/kernel/sched.c
> > ===================================================================
> > --- mmotm-2.6.29-Feb24.orig/kernel/sched.c
> > +++ mmotm-2.6.29-Feb24/kernel/sched.c
> > @@ -9581,6 +9581,67 @@ struct cgroup_subsys cpu_cgroup_subsys =
> >  
> >  #ifdef CONFIG_CGROUP_CPUACCT
> >  
> > +#ifndef CONFIG_64BIT
> > +DEFINE_PER_CPU(struct seqcount, cpuacct_cgroup_seq);
> > +
> > +static inline void cpuacct_start_counter_update(void)
> > +{
> > +	/* This is called under rq->lock and IRQ is off */
> > +	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> > +
> > +	write_seqcount_begin(s);
> > +	put_cpu_var(cpuacct_cgroup_seq);
> > +}
> > +
> > +static inline void cpuacct_end_counter_update(void)
> > +{
> > +	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);
> > +
> > +	write_seqcount_end(s);
> > +	put_cpu_var(cpuacct_cgroup_seq);
> > +}
> 
> It seems odd that we disable/enable preemption in both; I would expect
> start to disable preemption and end to enable it again, or to use
> __get_cpu_var() and assume preemption is already disabled (callsites are
> under rq->lock, right?)
> 
Yes, callers are under rq->lock... OK, you're right.
I'll remove the preemption-disabling code.


> > +static inline u64
> > +cpuacct_read_counter(u64 *val, int cpu)
> > +{
> > +	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> > +	unsigned int seq;
> > +	u64 data;
> > +
> > +	do {
> > +		seq = read_seqcount_begin(s);
> > +		data = *val;
> > +	} while (read_seqcount_retry(s, seq));
> > +	return data;
> > +}
> > +/* This is a special function called for "offline" cpus. */
> > +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> > +{
> > +	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
> > +
> > +	preempt_disable();
> > +	write_seqcount_begin(s);
> > +	*val = 0;
> > +	write_seqcount_end(s);
> > +	preempt_enable();
> > +}
> 
> And here you disable preemption twice, which is quite useless if you are
> touching a remote cpu's per-cpu data.
> 
Ok, maybe this "reset" from a user should be under rq->lock.
(But reading will not be under rq->lock.)


> > +#else
> > +static inline void cpuacct_start_counter_update(void)
> > +{
> > +}
> > +static inline void cpuacct_end_counter_update(void)
> > +{
> > +}
> > +static inline u64 cpuacct_read_counter(u64 *val, int cpu)
> > +{
> > +	return *val;
> > +}
> > +static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
> > +{
> > +	*val = 0;
> > +}
> > +#endif
> > +
> >  /*
> >   * CPU accounting code for task groups.
> >   *
> > @@ -9596,6 +9657,11 @@ struct cpuacct {
> >  	struct cpuacct *parent;
> >  };
> >  
> > +struct cpuacct_work {
> > +	struct work_struct work;
> > +	struct cpuacct *cpuacct;
> > +};
> > +
> >  struct cgroup_subsys cpuacct_subsys;
> >  
> >  /* return cpu accounting group corresponding to this container */
> > @@ -9643,39 +9709,29 @@ cpuacct_destroy(struct cgroup_subsys *ss
> >  	kfree(ca);
> >  }
> >  
> > +/* In a 32-bit environment, a seqcounter is used to read the 64-bit value safely */
> >  static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
> >  {
> >  	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> >  	u64 data;
> >  
> > -#ifndef CONFIG_64BIT
> > -	/*
> > -	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
> > -	 */
> > -	spin_lock_irq(&cpu_rq(cpu)->lock);
> > -	data = *cpuusage;
> > -	spin_unlock_irq(&cpu_rq(cpu)->lock);
> > -#else
> > -	data = *cpuusage;
> > -#endif
> > +	data = cpuacct_read_counter(cpuusage, cpu);
> >  
> >  	return data;
> >  }
> >  
> > -static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
> > +/* called by per-cpu workqueue */
> > +static void cpuacct_cpuusage_reset_cpu(struct work_struct *work)
> >  {
> > +	struct cpuacct_work *cw = container_of(work, struct cpuacct_work, work);
> > +	struct cpuacct *ca = cw->cpuacct;
> > +	int cpu = get_cpu();
> >  	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> >  
> > -#ifndef CONFIG_64BIT
> > -	/*
> > -	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
> > -	 */
> > -	spin_lock_irq(&cpu_rq(cpu)->lock);
> > -	*cpuusage = val;
> > -	spin_unlock_irq(&cpu_rq(cpu)->lock);
> > -#else
> > -	*cpuusage = val;
> > -#endif
> > +	cpuacct_start_counter_update();
> > +	*cpuusage = 0;
> > +	cpuacct_end_counter_update();
> > +	put_cpu();
> >  }
> >  
> >  /* return total cpu usage (in nanoseconds) of a group */
> > @@ -9691,23 +9747,34 @@ static u64 cpuusage_read(struct cgroup *
> >  	return totalcpuusage;
> >  }
> >  
> > -static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
> > -								u64 reset)
> > +static int cpuacct_cpuusage_reset(struct cgroup *cgrp, unsigned int event)
> >  {
> >  	struct cpuacct *ca = cgroup_ca(cgrp);
> > -	int err = 0;
> > -	int i;
> > -
> > -	if (reset) {
> > -		err = -EINVAL;
> > -		goto out;
> > +	int cpu;
> > +	/*
> > +	 * Reset all counters. This doesn't need to be fast.
> > +	 * "ca" stays stable while doing this; we are in a write() syscall.
> > +	 */
> > +	get_online_cpus();
> > +	/*
> > +	 * Because we use alloc_percpu() to allocate the counter, we have
> > +	 * a counter per possible cpu. Reset online cpus' counters via the
> > +	 * workqueue and offline cpus' counters directly.
> > +	 */
> > +	for_each_possible_cpu(cpu) {
> > +		if (cpu_online(cpu)) {
> > +			struct cpuacct_work cw;
> > +			INIT_WORK(&cw.work, cpuacct_cpuusage_reset_cpu);
> > +			cw.cpuacct = ca;
> > +			schedule_work_on(cpu, &cw.work);
> > +			flush_work(&cw.work);
> > +		} else {
> > +			u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
> > +			cpuacct_reset_offline_counter(cpuusage, cpu);
> > +		}
> 
> I'm not particularly convinced this is the right way; schedule_work_on()
> sounds far too expensive for setting a variable to 0.
> 
yes, yes..

> Furthermore, if you want something like schedule_work_on() for each cpu,
> there's schedule_on_each_cpu().
> 
It can't pass arguments...Maybe I should use rq->lock here to reset
other cpu's value.

Thank you for review.
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-03-02 23:42                                 ` KAMEZAWA Hiroyuki
@ 2009-03-03  7:51                                   ` Peter Zijlstra
  2009-03-03  9:04                                     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-03  7:51 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: paulmck, Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage,
	Balbir Singh, LKML

On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:

> > Furthermore, if you want something like schedule_work_on() for each cpu,
> > there's schedule_on_each_cpu().
> > 
> It can't pass arguments...Maybe I should use rq->lock here to reset
> other cpu's value.

Why bother with serializing the reset code at all?

But there's always smp_call_function() I suppose.
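
A sketch of that route (hypothetical callback; note smp_call_function()
skips the calling cpu, so the local counter would still need a direct
clear, or on_each_cpu() instead):

static void cpuacct_reset_one(void *data)
{
	struct cpuacct *ca = data;
	u64 *cpuusage = percpu_ptr(ca->cpuusage, smp_processor_id());

	*cpuusage = 0;
}

	/* from process context, waiting for completion on all other cpus: */
	smp_call_function(cpuacct_reset_one, ca, 1);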


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-03-03  7:51                                   ` Peter Zijlstra
@ 2009-03-03  9:04                                     ` KAMEZAWA Hiroyuki
  2009-03-03  9:40                                       ` Peter Zijlstra
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-03  9:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, paulmck, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

Peter Zijlstra wrote:
> On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
>
>> > Furthermore, if you want something like schedule_work_on() for each
>> cpu,
>> > there's schedule_on_each_cpu().
>> >
>> It can't pass arguments...Maybe I should use rq->lock here to reset
>> other cpu's value.
>
> Why bother with serializing the reset code at all?
>
  I don't think reset vs. read is a problem, but can't reset vs.
  increment (read-modify-write) be one?

> But there's always smp_call_function() I suppose.
>

Hmm, I'll find a simpler way.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-03-03  9:04                                     ` KAMEZAWA Hiroyuki
@ 2009-03-03  9:40                                       ` Peter Zijlstra
  2009-03-03 10:42                                         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-03  9:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: paulmck, Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage,
	Balbir Singh, LKML

On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
> >
> >> > Furthermore, if you want something like schedule_work_on() for each
> >> cpu,
> >> > there's schedule_on_each_cpu().
> >> >
> >> It can't pass arguments...Maybe I should use rq->lock here to reset
> >> other cpu's value.
> >
> > Why bother with serializing the reset code at all?
> >
>   I don't think reset vs. read is a problem, but can't reset vs.
>   increment (read-modify-write) be one?

Sure, can be, do we care?



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-03-03  9:40                                       ` Peter Zijlstra
@ 2009-03-03 10:42                                         ` KAMEZAWA Hiroyuki
  2009-03-03 10:44                                           ` KAMEZAWA Hiroyuki
  2009-03-03 11:54                                           ` Peter Zijlstra
  0 siblings, 2 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-03 10:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, paulmck, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, LKML

Peter Zijlstra  wrote:
> On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
>> Peter Zijlstra wrote:
>> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
>> >
>> >> > Furthermore, if you want something like schedule_work_on() for each
>> >> cpu,
>> >> > there's schedule_on_each_cpu().
>> >> >
>> >> It can't pass arguments...Maybe I should use rq->lock here to reset
>> >> other cpu's value.
>> >
>> > Why bother with serializing the reset code at all?
>> >
>>   I don't think reset vs. read is a problem, but can't reset vs.
>>   increment (read-modify-write) be one?
>
> Sure, can be, do we care?
>
If small/easy code allows us to declare "there isn't any racy case, and
you don't have to check whether the reset succeeded", it's worth doing,
I think.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-03-03 10:42                                         ` KAMEZAWA Hiroyuki
@ 2009-03-03 10:44                                           ` KAMEZAWA Hiroyuki
  2009-03-03 11:54                                           ` Peter Zijlstra
  1 sibling, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-03 10:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, paulmck, Bharata B Rao,
	Li Zefan, Ingo Molnar, Paul Menage, Balbir Singh, LKML

KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra  wrote:
>> On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
>>> Peter Zijlstra wrote:
>>> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
>>> >
>>> >> > Furthermore, if you want something like schedule_work_on() for
>>> each
>>> >> cpu,
>>> >> > there's schedule_on_each_cpu().
>>> >> >
>>> >> It can't pass arguments...Maybe I should use rq->lock here to reset
>>> >> other cpu's value.
>>> >
>>> > Why bother with serializing the reset code at all?
>>> >
>>>   I don't think reset vs. read is a problem, but can't reset vs.
>>>   increment (read-modify-write) be one?
>>
>> Sure, can be, do we care?
>>
> If small/easy code allows us to declare "there are any racy case!
                                                     isn't...
Sorry.
-Kame



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was Re: [PATCH] cpuacct: add a branch prediction
  2009-03-03 10:42                                         ` KAMEZAWA Hiroyuki
  2009-03-03 10:44                                           ` KAMEZAWA Hiroyuki
@ 2009-03-03 11:54                                           ` Peter Zijlstra
  2009-03-04  6:32                                             ` [PATCH] remove rq->lock from cpuacct cgroup v2 KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 42+ messages in thread
From: Peter Zijlstra @ 2009-03-03 11:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: paulmck, Bharata B Rao, Li Zefan, Ingo Molnar, Paul Menage,
	Balbir Singh, LKML

On Tue, 2009-03-03 at 19:42 +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra  wrote:
> > On Tue, 2009-03-03 at 18:04 +0900, KAMEZAWA Hiroyuki wrote:
> >> Peter Zijlstra wrote:
> >> > On Tue, 2009-03-03 at 08:42 +0900, KAMEZAWA Hiroyuki wrote:
> >> >
> >> >> > Furthermore, if you want something like schedule_work_on() for each
> >> >> cpu,
> >> >> > there's schedule_on_each_cpu().
> >> >> >
> >> >> It can't pass arguments...Maybe I should use rq->lock here to reset
> >> >> other cpu's value.
> >> >
> >> > Why bother with serializing the reset code at all?
> >> >
> >>   I don't think reset vs. read is a problem, but can't reset vs.
> >>   increment (read-modify-write) be one?
> >
> > Sure, can be, do we care?
> >
> If small/easy code allows us to declare "there isn't any racy case, and
> you don't have to check whether the reset succeeded", it's worth doing,
> I think.

smp_call_function() it is...


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH] remove rq->lock from cpuacct cgroup v2
  2009-03-03 11:54                                           ` Peter Zijlstra
@ 2009-03-04  6:32                                             ` KAMEZAWA Hiroyuki
  2009-03-04  7:54                                               ` Bharata B Rao
  0 siblings, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-04  6:32 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, paulmck, Bharata B Rao, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, kenchen

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The cgroup/cpuacct subsystem counts cpu usage with a 64-bit counter in a
per-cpu object. On the read side (via the cpuacct.usage file), it takes
the rq->lock of (other) cpus to read the 64-bit value safely.

In general, taking the rq->lock of other cpus from code outside the
scheduler is not good. This patch tries to remove rq->lock from the
read side.

To read the 64-bit value atomically, this patch uses a seqcounter.

Pros.
  - rq->lock is not necessary.
Cons.
  - When updating the counter, the sequence number must be updated as well.
    (I hope this per-cpu sequence number stays in cache...)

Changelog: v1->v2
 - checked the calling context of all calls and avoided unnecessary
   preempt_disable calls.
 - used on_each_cpu() instead of a workqueue for the reset

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Index: mmotm-2.6.29-Mar3/kernel/sched.c
===================================================================
--- mmotm-2.6.29-Mar3.orig/kernel/sched.c
+++ mmotm-2.6.29-Mar3/kernel/sched.c
@@ -9581,6 +9581,71 @@ struct cgroup_subsys cpu_cgroup_subsys =
 
 #ifdef CONFIG_CGROUP_CPUACCT
 
+#ifndef CONFIG_64BIT
+/* seqcount for handling the 64-bit counter on 32-bit systems */
+DEFINE_PER_CPU(struct seqcount, cpuacct_cgroup_seq);
+
+/*
+ * Counter updates happen while rq->lock is held, so we don't need to
+ * disable preemption explicitly.
+ */
+static inline void cpuacct_start_counter_update(void)
+{
+	/* This is called under rq->lock and IRQ is off */
+	struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);
+
+	write_seqcount_begin(s);
+}
+
+static inline void cpuacct_end_counter_update(void)
+{
+	struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);
+
+	write_seqcount_end(s);
+}
+
+static inline u64
+cpuacct_read_counter(u64 *val, int cpu)
+{
+	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
+	unsigned int seq;
+	u64 data;
+
+	do {
+		seq = read_seqcount_begin(s);
+		data = *val;
+	} while (read_seqcount_retry(s, seq));
+	return data;
+}
+/* This is a special function called for "offline" cpus. */
+static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
+{
+	struct seqcount *s = &per_cpu(cpuacct_cgroup_seq, cpu);
+
+	write_seqcount_begin(s);
+	*val = 0;
+	write_seqcount_end(s);
+}
+#else
+static inline void cpuacct_start_counter_update(void)
+{
+}
+
+static inline void cpuacct_end_counter_update(void)
+{
+}
+
+static inline u64 cpuacct_read_counter(u64 *val, int cpu)
+{
+	return *val;
+}
+
+static inline void cpuacct_reset_offline_counter(u64 *val, int cpu)
+{
+	*val = 0;
+}
+#endif
+
 /*
  * CPU accounting code for task groups.
  *
@@ -9643,39 +9708,27 @@ cpuacct_destroy(struct cgroup_subsys *ss
 	kfree(ca);
 }
 
+/* In a 32-bit environment, a seqcounter is used to read the 64-bit value safely */
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu)
 {
 	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
 	u64 data;
 
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit read safe on 32-bit platforms.
-	 */
-	spin_lock_irq(&cpu_rq(cpu)->lock);
-	data = *cpuusage;
-	spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	data = *cpuusage;
-#endif
+	data = cpuacct_read_counter(cpuusage, cpu);
 
 	return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+/* called by per-cpu smp call function (in non-preemptable context) */
+static void cpuacct_cpuusage_reset_cpu(void *data)
 {
+	int cpu = smp_processor_id();
+	struct cpuacct *ca = data;
 	u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
 
-#ifndef CONFIG_64BIT
-	/*
-	 * Take rq->lock to make 64-bit write safe on 32-bit platforms.
-	 */
-	spin_lock_irq(&cpu_rq(cpu)->lock);
-	*cpuusage = val;
-	spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
-	*cpuusage = val;
-#endif
+	cpuacct_start_counter_update();
+	*cpuusage = 0;
+	cpuacct_end_counter_update();
 }
 
 /* return total cpu usage (in nanoseconds) of a group */
@@ -9691,23 +9744,30 @@ static u64 cpuusage_read(struct cgroup *
 	return totalcpuusage;
 }
 
-static int cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
-								u64 reset)
+static int cpuacct_cpuusage_reset(struct cgroup *cgrp, unsigned int event)
 {
 	struct cpuacct *ca = cgroup_ca(cgrp);
-	int err = 0;
-	int i;
+	int cpu;
+	/*
+	 * Prevent cpu hotplug while we do the reset.
+	 */
+	get_online_cpus();
+	/*
+	 * Clear all online cpus' counters (including the local one).
+	 * This reset uses a nowait smp call, so the counters are cleared
+	 * asynchronously.
+	 */
+	on_each_cpu(cpuacct_cpuusage_reset_cpu, ca, 0);
 
-	if (reset) {
-		err = -EINVAL;
-		goto out;
+	/* clear all possible-but-offline cpus' counters */
+	for_each_possible_cpu(cpu) {
+		if (!cpu_online(cpu)) {
+			u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
+			cpuacct_reset_offline_counter(cpuusage, cpu);
+		}
 	}
-
-	for_each_present_cpu(i)
-		cpuacct_cpuusage_write(ca, i, 0);
-
-out:
-	return err;
+	put_online_cpus();
+	return 0;
 }
 
 static int cpuacct_percpu_seq_read(struct cgroup *cgroup, struct cftype *cft,
@@ -9729,7 +9789,7 @@ static struct cftype files[] = {
 	{
 		.name = "usage",
 		.read_u64 = cpuusage_read,
-		.write_u64 = cpuusage_write,
+		.trigger = cpuacct_cpuusage_reset,
 	},
 	{
 		.name = "usage_percpu",
@@ -9759,10 +9819,12 @@ static void cpuacct_charge(struct task_s
 	cpu = task_cpu(tsk);
 	ca = task_ca(tsk);
 
+	cpuacct_start_counter_update();
 	for (; ca; ca = ca->parent) {
 		u64 *cpuusage = percpu_ptr(ca->cpuusage, cpu);
 		*cpuusage += cputime;
 	}
+	cpuacct_end_counter_update();
 }
 
 struct cgroup_subsys cpuacct_subsys = {


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] remove rq->lock from cpuacct cgroup v2
  2009-03-04  6:32                                             ` [PATCH] remove rq->lock from cpuacct cgroup v2 KAMEZAWA Hiroyuki
@ 2009-03-04  7:54                                               ` Bharata B Rao
  2009-03-04  8:20                                                 ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 42+ messages in thread
From: Bharata B Rao @ 2009-03-04  7:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: LKML, Peter Zijlstra, paulmck, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, kenchen

On Wed, Mar 4, 2009 at 12:02 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> The cgroup/cpuacct subsystem counts cpu usage with a 64-bit counter in a
> per-cpu object. On the read side (via the cpuacct.usage file), it takes
> the rq->lock of (other) cpus to read the 64-bit value safely.
>
> In general, taking the rq->lock of other cpus from code outside the
> scheduler is not good. This patch tries to remove rq->lock from the
> read side.
>
> To read the 64-bit value atomically, this patch uses a seqcounter.
>
> Pros.
>  - rq->lock is not necessary.
> Cons.
>  - When updating the counter, the sequence number must be updated as well.
>    (I hope this per-cpu sequence number stays in cache...)
>
> Changelog: v1->v2
>  - checked the calling context of all calls and avoided unnecessary
>    preempt_disable calls.
>  - used on_each_cpu() instead of a workqueue for the reset
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

So cpuacct->cpuusage is a 64-bit percpu counter, and I see that the
cpuacct subsystem itself handles the following special cases:
- 32 vs 64 bit update issues
- resetting percpu counters on online and offline cpus

Tomorrow, if I add other counters to the cpuacct subsystem (like stime
and utime), I will need to do what you have done in this patch all over
again for the additional counters.

Instead of subsystems handling all these percpu counter problems
themselves, shouldn't we be using the percpu_counter subsystem and let it
handle all the issues transparently for us? I am not sure whether all of
these problems have been addressed in percpu_counter, but I would like to
know why we are not using percpu_counter for these kinds of things, and
whether we could enhance percpu_counter if it can't handle some of the
issues we are solving here specifically for the cpuacct subsystem.
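
For reference, the lib/percpu_counter interface of that era looks roughly
like this (a sketch only; see include/linux/percpu_counter.h for the
exact signatures):

	struct percpu_counter usage;

	percpu_counter_init(&usage, 0);		/* allocates the per-cpu parts */
	percpu_counter_add(&usage, delta);	/* cheap, batched per-cpu add */
	total = percpu_counter_sum(&usage);	/* accurate (slower) 64-bit read */
	percpu_counter_set(&usage, 0);		/* reset */
	percpu_counter_destroy(&usage);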

Regards,
Bharata.
-- 
http://bharata.sulekha.com/blog/posts.htm

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] remove rq->lock from cpuacct cgroup v2
  2009-03-04  7:54                                               ` Bharata B Rao
@ 2009-03-04  8:20                                                 ` KAMEZAWA Hiroyuki
  2009-03-04  8:46                                                   ` KAMEZAWA Hiroyuki
  2009-03-04 12:11                                                   ` Bharata B Rao
  0 siblings, 2 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-04  8:20 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: LKML, Peter Zijlstra, paulmck, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, kenchen

On Wed, 4 Mar 2009 13:24:43 +0530
Bharata B Rao <bharata.rao@gmail.com> wrote:

> On Wed, Mar 4, 2009 at 12:02 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > The cgroup/cpuacct subsystem counts cpu usage with a 64-bit counter in a
> > per-cpu object. On the read side (via the cpuacct.usage file), it takes
> > the rq->lock of (other) cpus to read the 64-bit value safely.
> >
> > In general, taking the rq->lock of other cpus from code outside the
> > scheduler is not good. This patch tries to remove rq->lock from the
> > read side.
> >
> > To read the 64-bit value atomically, this patch uses a seqcounter.
> >
> > Pros.
> >  - rq->lock is not necessary.
> > Cons.
> >  - When updating the counter, the sequence number must be updated as well.
> >    (I hope this per-cpu sequence number stays in cache...)
> >
> > Changelog: v1->v2
> >  - checked the calling context of all calls and avoided unnecessary
> >    preempt_disable calls.
> >  - used on_each_cpu() instead of a workqueue for the reset
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> So cpuacct->cpuusage is a 64-bit percpu counter, and I see that the
> cpuacct subsystem itself handles the following special cases:
> - 32 vs 64 bit update issues
> - resetting percpu counters on online and offline cpus
> 
> Tomorrow, if I add other counters to the cpuacct subsystem (like stime
> and utime), I will need to do what you have done in this patch all over
> again for the additional counters.
> 
I'm not sure whether the answer is yes or no. Does it need to be per-cpu?
IIUC, stime/utime updates happen at most once per tick and are not as
frequent as cpuacct's.

> Instead of subsystems handling all these percpu counter problems
> themselves, shouldn't we be using the percpu_counter subsystem and let it
> handle all the issues transparently for us? I am not sure whether all of
> these problems have been addressed in percpu_counter, but I would like to
> know why we are not using percpu_counter for these kinds of things, and
> whether we could enhance percpu_counter if it can't handle some of the
> issues we are solving here specifically for the cpuacct subsystem.
> 
At first glance, a generic per-cpu counter sounds interesting, but to be
honest, cpuacct uses some special handling based on its characteristics:

  - Writer works under non-preemptable context.
  - There is only one writer.

Otherwise, using the generic atomic64_t... or res_counter would be
good... maybe. (res_counter updates its 64-bit value under a spinlock.)
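
(For comparison, res_counter's charge path is roughly the following,
paraphrased from kernel/res_counter.c:

int res_counter_charge(struct res_counter *counter, unsigned long val)
{
	int ret;
	unsigned long flags;

	/* every update takes the counter's shared spinlock */
	spin_lock_irqsave(&counter->lock, flags);
	ret = res_counter_charge_locked(counter, val);
	spin_unlock_irqrestore(&counter->lock, flags);
	return ret;
}

so it serializes all writers on one lock.)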

Thanks,
-Kame




> Regards,
> Bharata.
> -- 
> http://bharata.sulekha.com/blog/posts.htm
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] remove rq->lock from cpuacct cgroup v2
  2009-03-04  8:20                                                 ` KAMEZAWA Hiroyuki
@ 2009-03-04  8:46                                                   ` KAMEZAWA Hiroyuki
  2009-03-04 10:35                                                     ` Bharata B Rao
  2009-03-04 12:11                                                   ` Bharata B Rao
  1 sibling, 1 reply; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-04  8:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Bharata B Rao, LKML, Peter Zijlstra, paulmck, Li Zefan,
	Ingo Molnar, Paul Menage, Balbir Singh, kenchen

On Wed, 4 Mar 2009 17:20:05 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 4 Mar 2009 13:24:43 +0530
> Bharata B Rao <bharata.rao@gmail.com> wrote:
> At first glance, a generic per-cpu counter sounds interesting, but to be
> honest, cpuacct uses some special handling based on its characteristics:
> 
>   - Writer works under non-preemptable context.
>   - There is only one writer.
> 
If utime/stime updates work in the above context, using the same code
will be good.

I don't use anything cpuacct-structure-specific in these routines...
If you want me to rewrite it, I will; please request what you want.

Thanks,
-Kame 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] remove rq->lock from cpuacct cgroup v2
  2009-03-04  8:46                                                   ` KAMEZAWA Hiroyuki
@ 2009-03-04 10:35                                                     ` Bharata B Rao
  0 siblings, 0 replies; 42+ messages in thread
From: Bharata B Rao @ 2009-03-04 10:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: LKML, Peter Zijlstra, paulmck, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, kenchen

On Wed, Mar 4, 2009 at 2:16 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 4 Mar 2009 17:20:05 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Wed, 4 Mar 2009 13:24:43 +0530
>> Bharata B Rao <bharata.rao@gmail.com> wrote:
>> At first glance, a generic per-cpu counter sounds interesting, but to be
>> honest, cpuacct uses some special handling based on its characteristics:
>>
>>   - Writer works under non-preemptable context.
>>   - There is only one writer.
>>
> If utime/stime updates work in the above context, using the same code will be good.

IIUC, utime/stime updates also work under the above conditions.

>
> I don't use anything cpuacct-structure-specific in these routines...
> If you want me to rewrite it, I will; please request what you want.

After looking deeper into your patch, I think I could use the same seq
counter you introduced to update stime/utime as well.
I guess I could use most of your code, except there is a slight
difference wrt the preemption-disabled assumption in the write path:
cpuusage updates happen under rq->lock, but stime/utime updates don't.
So I probably can't use cpuacct_start/end_counter_update as is.
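
A preemption-safe variant might look like the following sketch
(hypothetical names; note the seqcount's single-writer rule still has to
hold against the rq->lock update path):

static inline void cpuacct_start_counter_update_any(void)
{
	/* get_cpu_var() disables preemption and pins us to this cpu */
	struct seqcount *s = &get_cpu_var(cpuacct_cgroup_seq);

	write_seqcount_begin(s);
}

static inline void cpuacct_end_counter_update_any(void)
{
	struct seqcount *s = &__get_cpu_var(cpuacct_cgroup_seq);

	write_seqcount_end(s);
	put_cpu_var(cpuacct_cgroup_seq);	/* re-enables preemption */
}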

Regards,
Bharata.
-- 
http://bharata.sulekha.com/blog/posts.htm

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] remove rq->lock from cpuacct cgroup v2
  2009-03-04  8:20                                                 ` KAMEZAWA Hiroyuki
  2009-03-04  8:46                                                   ` KAMEZAWA Hiroyuki
@ 2009-03-04 12:11                                                   ` Bharata B Rao
  2009-03-04 14:17                                                     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 42+ messages in thread
From: Bharata B Rao @ 2009-03-04 12:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: LKML, Peter Zijlstra, paulmck, Li Zefan, Ingo Molnar,
	Paul Menage, Balbir Singh, kenchen

On Wed, Mar 4, 2009 at 1:50 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 4 Mar 2009 13:24:43 +0530
> Bharata B Rao <bharata.rao@gmail.com> wrote:
>
>> Instead of subsystems handling all these percpu counter problems
>> themselves, shouldn't we be using the percpu_counter subsystem and let it
>> handle all the issues transparently for us? I am not sure whether all of
>> these problems have been addressed in percpu_counter, but I would like to
>> know why we are not using percpu_counter for these kinds of things, and
>> whether we could enhance percpu_counter if it can't handle some of the
>> issues we are solving here specifically for the cpuacct subsystem.
>>
> At first glance, a generic per-cpu counter sounds interesting, but to be
> honest, cpuacct uses some special handling based on its characteristics:
>

Just trying to understand this clearly ...

>  - Writer works under non-preemptable context.

Which means the writer is running with preemption disabled.
percpu_counter writes (via percpu_counter_add) don't assume anything and
disable preemption themselves. Is this the problem, or is there a bigger
issue here that prevents percpu_counter from being used?

>  - There is only one writer.

I'm not sure how you have optimized for this case in cpuacct.
percpu_counters use spinlocks to serialize writers. Are you saying that
using spinlocks for this one-writer case is too much? Also note that they
update the 32-bit percpu counters without any lock and take the spinlock
only when they do a batch update of the 64-bit counter.
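
The add path at the time is roughly the following (paraphrased from
lib/percpu_counter.c):

void __percpu_counter_add(struct percpu_counter *fbc, s64 amount, s32 batch)
{
	s64 count;
	s32 *pcount;
	int cpu = get_cpu();

	pcount = per_cpu_ptr(fbc->counters, cpu);
	count = *pcount + amount;
	if (count >= batch || count <= -batch) {
		/* fold the per-cpu delta into the shared 64-bit count */
		spin_lock(&fbc->lock);
		fbc->count += count;
		*pcount = 0;
		spin_unlock(&fbc->lock);
	} else {
		*pcount = count;
	}
	put_cpu();
}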

Regards,
Bharata.
-- 
http://bharata.sulekha.com/blog/posts.htm

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] remove rq->lock from cpuacct cgroup v2
  2009-03-04 12:11                                                   ` Bharata B Rao
@ 2009-03-04 14:17                                                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 42+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-03-04 14:17 UTC (permalink / raw)
  To: Bharata B Rao
  Cc: KAMEZAWA Hiroyuki, LKML, Peter Zijlstra, paulmck, Li Zefan,
	Ingo Molnar, Paul Menage, Balbir Singh, kenchen

Bharata B Rao wrote:
> On Wed, Mar 4, 2009 at 1:50 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> On Wed, 4 Mar 2009 13:24:43 +0530
>> Bharata B Rao <bharata.rao@gmail.com> wrote:
>>
>>> Instead of subsystems handling all these percpu counter problems
>>> themselves, shouldn't we be using the percpu_counter subsystem and let it
>>> handle all the issues transparently for us? I am not sure whether all of
>>> these problems have been addressed in percpu_counter, but I would like to
>>> know why we are not using percpu_counter for these kinds of things, and
>>> whether we could enhance percpu_counter if it can't handle some of the
>>> issues we are solving here specifically for the cpuacct subsystem.
>>>
>> At first glance, a generic per-cpu counter sounds interesting, but to be
>> honest, cpuacct uses some special handling based on its characteristics:
>>
>
> Just trying to understand this clearly ...
>
>>  - Writer works under non-preemptable context.
>
> Which means the writer is running with preemption disabled.
> percpu_counter writes (via percpu_counter_add) don't assume anything and
> disable preemption themselves. Is this the problem, or is there a bigger
> issue here that prevents percpu_counter from being used?
>
Ah, it just means it doesn't call preempt_disable() explicitly, and it
uses __get_cpu_var() (not get_cpu_var()) and smp_processor_id() (not
get_cpu()).


>>  - There is only one writer.
>
> I'm not sure how you have optimized for this case in cpuacct.
Because there is only one writer, I used a seqcount, not a seqlock.
(There are no atomic ops this way.)
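
(For comparison, the bare seqcount_t write side is just two plain
increments with write barriers, roughly as in include/linux/seqlock.h:

static inline void write_seqcount_begin(seqcount_t *s)
{
	s->sequence++;
	smp_wmb();
}

static inline void write_seqcount_end(seqcount_t *s)
{
	smp_wmb();
	s->sequence++;
}

while seqlock_t bundles a spinlock to serialize multiple writers.)
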

> percpu_counters use spinlocks to serialize writers. Are you saying that
> using spinlocks for this one-writer case is too much?
I'm sorry, I only now noticed that lib/percpu_counter exists...
I think lib/percpu_counter does the job properly.

> Also note that they update the 32-bit percpu counters without any lock
> and take the spinlock only when they do a batch update of the 64-bit
> counter.
>
Hmm, cpuacct's counter does a hierarchical update of several counters at
once, so some private code like mine is not so bad, I think.

What I think now is that it's OK to take your patch first and do this
kind of lock-avoiding update later if necessary. This patch is a
micro-optimization and is not in a hurry.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread

Thread overview: 42+ messages
2009-02-26  7:40 [PATCH] cpuacct: add a branch prediction Li Zefan
2009-02-26  8:07 ` KAMEZAWA Hiroyuki
2009-02-26  8:17   ` Li Zefan
2009-02-26  8:22     ` KAMEZAWA Hiroyuki
2009-02-26  8:35       ` Li Zefan
2009-02-26  8:40         ` KAMEZAWA Hiroyuki
2009-02-26 10:10           ` Bharata B Rao
2009-02-26 10:28             ` KAMEZAWA Hiroyuki
2009-02-26 10:44               ` Peter Zijlstra
2009-02-26 10:55                 ` KAMEZAWA Hiroyuki
2009-02-26 11:22                   ` Peter Zijlstra
2009-02-26 11:17                 ` KAMEZAWA Hiroyuki
2009-02-26 11:28                   ` Peter Zijlstra
2009-02-26 12:06                     ` KAMEZAWA Hiroyuki
2009-02-26 12:20                       ` Peter Zijlstra
2009-02-26 12:26                         ` Ingo Molnar
2009-02-26 12:40                           ` Arnd Bergmann
2009-02-27  4:25                           ` Paul Mackerras
2009-02-26 16:45                       ` Paul E. McKenney
2009-02-27  0:58                         ` KAMEZAWA Hiroyuki
2009-02-27  1:29                           ` Paul E. McKenney
2009-02-27  3:22                             ` [RFC][PATCH] remove rq->lock from cpuacct cgroup (Was " KAMEZAWA Hiroyuki
2009-03-02 14:56                               ` Peter Zijlstra
2009-03-02 23:42                                 ` KAMEZAWA Hiroyuki
2009-03-03  7:51                                   ` Peter Zijlstra
2009-03-03  9:04                                     ` KAMEZAWA Hiroyuki
2009-03-03  9:40                                       ` Peter Zijlstra
2009-03-03 10:42                                         ` KAMEZAWA Hiroyuki
2009-03-03 10:44                                           ` KAMEZAWA Hiroyuki
2009-03-03 11:54                                           ` Peter Zijlstra
2009-03-04  6:32                                             ` [PATCH] remove rq->lock from cpuacct cgroup v2 KAMEZAWA Hiroyuki
2009-03-04  7:54                                               ` Bharata B Rao
2009-03-04  8:20                                                 ` KAMEZAWA Hiroyuki
2009-03-04  8:46                                                   ` KAMEZAWA Hiroyuki
2009-03-04 10:35                                                     ` Bharata B Rao
2009-03-04 12:11                                                   ` Bharata B Rao
2009-03-04 14:17                                                     ` KAMEZAWA Hiroyuki
2009-02-26  8:37 ` [PATCH] cpuacct: add a branch prediction Balbir Singh
2009-02-26  8:41   ` Li Zefan
2009-02-26 10:40     ` Balbir Singh
2009-02-26 10:43       ` Peter Zijlstra
2009-02-26  8:43   ` KAMEZAWA Hiroyuki
