* Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
@ 2021-05-12  3:28 hasegawa-hitomi
  2021-05-18  7:59 ` hasegawa-hitomi
  0 siblings, 1 reply; 16+ messages in thread
From: hasegawa-hitomi @ 2021-05-12  3:28 UTC (permalink / raw)
  To: 'fweisbec@gmail.com', 'tglx@linutronix.de',
	'mingo@kernel.org', 'peterz@infradead.org',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hello.

I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU, the utime and stime I get are less than the actual time, unlike when I run getrusage(RUSAGE_SELF) on a single thread.
This problem seems to be caused by the fact that se.sum_exec_runtime is not updated just before getting the information from 'current'.
In the current implementation, task_cputime_adjusted() calls task_cputime() to get the 'current' utime and stime, then calls cputime_adjust() to adjust the sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless CPU, sum_exec_runtime is not updated periodically, so there seems to be a discrepancy with the actual time.
Therefore, I think I should include a process to update se.sum_exec_runtime just before getting the information from 'current' (as in other processes except RUSAGE_THREAD). I'm thinking of the following improvement.

@@ void getrusage(struct task_struct *p, int who, struct rusage *r)
        if (who == RUSAGE_THREAD) {
+               task_sched_runtime(current);
                task_cputime_adjusted(current, &utime, &stime);

Is there any possible problem with this?
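
For reference, the gap can be observed with a minimal user-space check along
the lines below (only a sketch: it assumes CPU 1 is listed in nohz_full= and
is otherwise idle, and the spin time is arbitrary). Since cputime_adjust()
scales the reported utime + stime to the stale sum_exec_runtime, the
getrusage() total comes out smaller than the thread CPU clock.

/* Sketch of a reproducer. Build with: gcc -O2 repro.c -o repro */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

int main(void)
{
	cpu_set_t set;
	struct rusage ru;
	struct timespec ts;
	double rusage_sec, clock_sec;

	CPU_ZERO(&set);
	CPU_SET(1, &set);			/* assumed nohz_full CPU */
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* Burn about three seconds of thread CPU time. */
	do {
		clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
	} while (ts.tv_sec < 3);

	getrusage(RUSAGE_THREAD, &ru);
	rusage_sec = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6 +
		     ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
	clock_sec  = ts.tv_sec + ts.tv_nsec / 1e9;

	/* On the tickless CPU, rusage_sec is visibly smaller. */
	printf("getrusage(RUSAGE_THREAD): %.3f s, thread CPU clock: %.3f s\n",
	       rusage_sec, clock_sec);
	return 0;
}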

Thanks.
Hitomi Hasegawa


* RE: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-12  3:28 Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU hasegawa-hitomi
@ 2021-05-18  7:59 ` hasegawa-hitomi
  2021-05-18  8:23   ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: hasegawa-hitomi @ 2021-05-18  7:59 UTC (permalink / raw)
  To: 'fweisbec@gmail.com', 'tglx@linutronix.de',
	'mingo@kernel.org', 'peterz@infradead.org',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Ingo, Peter, Juri, and Vincent


> I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU, the utime and stime I get are less than the actual time, unlike when I run getrusage(RUSAGE_SELF) on a single thread.
> This problem seems to be caused by the fact that se.sum_exec_runtime is not updated just before getting the information from 'current'.
> In the current implementation, task_cputime_adjusted() calls task_cputime() to get the 'current' utime and stime, then calls cputime_adjust() to adjust the sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless CPU, sum_exec_runtime is not updated periodically, so there seems to be a discrepancy with the actual time.
> Therefore, I think I should include a process to update se.sum_exec_runtime just before getting the information from 'current' (as in other processes except RUSAGE_THREAD). I'm thinking of the following improvement.
> 
> @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
>         if (who == RUSAGE_THREAD) {
> +               task_sched_runtime(current);
>                 task_cputime_adjusted(current, &utime, &stime);
> 
> Is there any possible problem with this?


Any comments?

Thanks.
Hitomi Hasegawa


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-18  7:59 ` hasegawa-hitomi
@ 2021-05-18  8:23   ` Peter Zijlstra
  2021-05-19  6:30     ` hasegawa-hitomi
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-05-18  8:23 UTC (permalink / raw)
  To: hasegawa-hitomi
  Cc: 'fweisbec@gmail.com', 'tglx@linutronix.de',
	'mingo@kernel.org', 'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org',
	'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

On Tue, May 18, 2021 at 07:59:06AM +0000, hasegawa-hitomi@fujitsu.com wrote:
> Hi Ingo, Peter, Juri, and Vincent
> 
> 
> > I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU, the utime and stime I get are less than the actual time, unlike when I run getrusage(RUSAGE_SELF) on a single thread.
> > This problem seems to be caused by the fact that se.sum_exec_runtime is not updated just before getting the information from 'current'.
> > In the current implementation, task_cputime_adjusted() calls task_cputime() to get the 'current' utime and stime, then calls cputime_adjust() to adjust the sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless CPU, sum_exec_runtime is not updated periodically, so there seems to be a discrepancy with the actual time.
> > Therefore, I think I should include a process to update se.sum_exec_runtime just before getting the information from 'current' (as in other processes except RUSAGE_THREAD). I'm thinking of the following improvement.
> > 
> > @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
> >         if (who == RUSAGE_THREAD) {
> > +               task_sched_runtime(current);
> >                 task_cputime_adjusted(current, &utime, &stime);
> > 
> > Is there any possible problem with this?
> 
> 
> Any comments?

Your email is malformed.


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-18  8:23   ` Peter Zijlstra
@ 2021-05-19  6:30     ` hasegawa-hitomi
  2021-05-19  9:24       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: hasegawa-hitomi @ 2021-05-19  6:30 UTC (permalink / raw)
  To: Peter Zijlstra, 'mingo@kernel.org',
	'fweisbec@gmail.com', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Ingo, Peter, Juri, and Vincent.


> Your email is malformed.

I'm sorry, it was sent in the wrong format. I have corrected it and am resending.
Thank you, Peter, for pointing this out.


I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU,
the utime and stime I get are less than the actual time, unlike when I run
getrusage(RUSAGE_SELF) on a single thread.
This problem seems to be caused by the fact that se.sum_exec_runtime is not
updated just before getting the information from 'current'.
In the current implementation, task_cputime_adjusted() calls task_cputime() to
get the 'current' utime and stime, then calls cputime_adjust() to adjust the
sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless
CPU, sum_exec_runtime is not updated periodically, so there seems to be a
discrepancy with the actual time.
Therefore, I think I should include a process to update se.sum_exec_runtime
just before getting the information from 'current' (as in other processes
except RUSAGE_THREAD). I'm thinking of the following improvement.

@@ void getrusage(struct task_struct *p, int who, struct rusage *r)
        if (who == RUSAGE_THREAD) {
+               task_sched_runtime(current);
                task_cputime_adjusted(current, &utime, &stime);

Is there any possible problem with this?


Thanks.
Hitomi Hasegawa


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-19  6:30     ` hasegawa-hitomi
@ 2021-05-19  9:24       ` Peter Zijlstra
  2021-05-19  9:28         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Peter Zijlstra @ 2021-05-19  9:24 UTC (permalink / raw)
  To: hasegawa-hitomi
  Cc: 'mingo@kernel.org', 'fweisbec@gmail.com',
	'tglx@linutronix.de', 'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org',
	'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

On Wed, May 19, 2021 at 06:30:36AM +0000, hasegawa-hitomi@fujitsu.com wrote:
> Hi Ingo, Peter, Juri, and Vincent.
> 
> 
> > Your email is malformed.
> 
> I'm sorry, it was sent in the wrong format. I have corrected it and am resending.
> Thank you, Peter, for pointing this out.
> 
> 
> I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU,
> the utime and stime I get are less than the actual time, unlike when I run
> getrusage(RUSAGE_SELF) on a single thread.
> This problem seems to be caused by the fact that se.sum_exec_runtime is not
> updated just before getting the information from 'current'.
> In the current implementation, task_cputime_adjusted() calls task_cputime() to
> get the 'current' utime and stime, then calls cputime_adjust() to adjust the
> sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless
> CPU, sum_exec_runtime is not updated periodically, so there seems to be a
> discrepancy with the actual time.
> Therefore, I think I should include a process to update se.sum_exec_runtime
> just before getting the information from 'current' (as in other processes
> except RUSAGE_THREAD). I'm thinking of the following improvement.
> 
> @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
>         if (who == RUSAGE_THREAD) {
> +               task_sched_runtime(current);
>                 task_cputime_adjusted(current, &utime, &stime);
> 
> Is there any possible problem with this?

Would be superfluous for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
architectures at the very least.

It also doesn't help any of the other callers, like for example procfs.

Something like the below ought to work and fix all variants I think. But
it does make the call significantly more expensive.

Looking at thread_group_cputime() that already does something like this,
but that's also susceptible to a variant of this very same issue; since
it doesn't call it unconditionally, nor on all tasks, so if current
isn't part of the threadgroup and/or another task is on a nohz_full cpu,
things will go wobbly again.

There's a note about syscall performance there, so clearly someone seems
to care about that aspect of things, but it does suck for nohz_full.
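
For reference, the logic in question looks roughly like this (simplified
paraphrase of kernel/sched/cputime.c, not verbatim; the locking and the
totals carried over from already-exited threads are omitted):

void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
{
	struct task_struct *t;
	u64 utime, stime;

	/*
	 * Only current gets its pending runtime folded in, and only when
	 * it belongs to the group being queried; updating every running
	 * thread would hurt syscall performance.
	 */
	if (same_thread_group(current, tsk))
		(void)task_sched_runtime(current);

	rcu_read_lock();
	/* times->* start from the totals of exited threads (omitted). */
	for_each_thread(tsk, t) {
		task_cputime(t, &utime, &stime);
		times->utime += utime;
		times->stime += stime;
		times->sum_exec_runtime += t->se.sum_exec_runtime;
	}
	rcu_read_unlock();
}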

Frederic, didn't we have remote ticks that should help with this stuff?

And mostly I think the trade-off here is that if you run on nohz_full,
you're not expected to go do syscalls anyway (because they're sodding
expensive) and hence the accuracy of these sort of things is mostly
irrelevant.

So it might be the use-case is just fundamentally bonkers and we
shouldn't really bother fixing this.

Anyway?

---
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481d5098..620871c8e4f8 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
 void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 {
 	struct task_cputime cputime = {
-		.sum_exec_runtime = p->se.sum_exec_runtime,
+		.sum_exec_runtime = task_sched_runtime(p),
 	};
 
 	task_cputime(p, &cputime.utime, &cputime.stime);


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-19  9:24       ` Peter Zijlstra
@ 2021-05-19  9:28         ` Peter Zijlstra
  2021-05-21  6:40           ` hasegawa-hitomi
  2021-06-16 12:31         ` Frederic Weisbecker
  2021-06-16 12:54         ` Frederic Weisbecker
  2 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2021-05-19  9:28 UTC (permalink / raw)
  To: hasegawa-hitomi
  Cc: 'mingo@kernel.org', 'fweisbec@gmail.com',
	'tglx@linutronix.de', 'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org',
	'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

On Wed, May 19, 2021 at 11:24:58AM +0200, Peter Zijlstra wrote:
> On Wed, May 19, 2021 at 06:30:36AM +0000, hasegawa-hitomi@fujitsu.com wrote:
> > Hi Ingo, Peter, Juri, and Vincent.
> > 
> > 
> > > Your email is malformed.
> > 
> > I'm sorry, it was sent in the wrong format. I have corrected it and am resending.
> > Thank you, Peter, for pointing this out.
> > 
> > 
> > I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU,
> > the utime and stime I get are less than the actual time, unlike when I run
> > getrusage(RUSAGE_SELF) on a single thread.
> > This problem seems to be caused by the fact that se.sum_exec_runtime is not
> > updated just before getting the information from 'current'.
> > In the current implementation, task_cputime_adjusted() calls task_cputime() to
> > get the 'current' utime and stime, then calls cputime_adjust() to adjust the
> > sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless
> > CPU, sum_exec_runtime is not updated periodically, so there seems to be a
> > discrepancy with the actual time.
> > Therefore, I think I should include a process to update se.sum_exec_runtime
> > just before getting the information from 'current' (as in other processes
> > except RUSAGE_THREAD). I'm thinking of the following improvement.
> > 
> > @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
> >         if (who == RUSAGE_THREAD) {
> > +               task_sched_runtime(current);
> >                 task_cputime_adjusted(current, &utime, &stime);
> > 
> > Is there any possible problem with this?
> 
> Would be superfluous for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
> architectures at the very least.
> 
> It also doesn't help any of the other callers, like for example procfs.
> 
> Something like the below ought to work and fix all variants I think. But
> it does make the call significantly more expensive.
> 
> Looking at thread_group_cputime() that already does something like this,
> but that's also susceptible to a variant of this very same issue; since
> it doesn't call it unconditionally, nor on all tasks, so if current
> isn't part of the threadgroup and/or another task is on a nohz_full cpu,
> things will go wobbly again.
> 
> There's a note about syscall performance there, so clearly someone seems
> to care about that aspect of things, but it does suck for nohz_full.
> 
> Frederic, didn't we have remote ticks that should help with this stuff?
> 
> And mostly I think the trade-off here is that if you run on nohz_full,
> you're not expected to go do syscalls anyway (because they're sodding
> expensive) and hence the accuracy of these sort of things is mostly
> irrelevant.
> 
> So it might be the use-case is just fundamentally bonkers and we
> shouldn't really bother fixing this.
> 
> Anyway?

Typing be hard... that should 'obviously' be reading: Anyone?

> 
> ---
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 872e481d5098..620871c8e4f8 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>  void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
>  {
>  	struct task_cputime cputime = {
> -		.sum_exec_runtime = p->se.sum_exec_runtime,
> +		.sum_exec_runtime = task_sched_runtime(p),
>  	};
>  
>  	task_cputime(p, &cputime.utime, &cputime.stime);


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-19  9:28         ` Peter Zijlstra
@ 2021-05-21  6:40           ` hasegawa-hitomi
  2021-05-21  8:41             ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: hasegawa-hitomi @ 2021-05-21  6:40 UTC (permalink / raw)
  To: Peter Zijlstra, 'fweisbec@gmail.com',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Peter and Frederic


> > Would be superfluous for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
> > architectures at the very least.
> > 
> > It also doesn't help any of the other callers, like for example procfs.
> > 
> > Something like the below ought to work and fix all variants I think. But
> > it does make the call significantly more expensive.
> > 
> > Looking at thread_group_cputime() that already does something like this,
> > but that's also susceptible to a variant of this very same issue; since
> > it doesn't call it unconditionally, nor on all tasks, so if current
> > isn't part of the threadgroup and/or another task is on a nohz_full cpu,
> > things will go wobbly again.
> > 
> > There's a note about syscall performance there, so clearly someone seems
> > to care about that aspect of things, but it does suck for nohz_full.
> > 
> > Frederic, didn't we have remote ticks that should help with this stuff?
> > 
> > And mostly I think the trade-off here is that if you run on nohz_full,
> > you're not expected to go do syscalls anyway (because they're sodding
> > expensive) and hence the accuracy of these sort of things is mostly
> > irrelevant.
> > 
> > So it might be the use-case is just fundamentally bonkers and we
> > shouldn't really bother fixing this.
> > 
> > Anyway?
> 
> Typing be hard... that should 'obviously' be reading: Anyone?


I understand that there is a trade-off between performance and accuracy
and that this issue may have been discussed before.
However, as Peter mentions, updating sum_exec_runtime just before
retrieving the information is already done in thread_group_cputime(),
which is on the RUSAGE_SELF path. So I think RUSAGE_THREAD should
follow suit and do the same.

Thanks.
Hitomi Hasegawa


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-21  6:40           ` hasegawa-hitomi
@ 2021-05-21  8:41             ` Mel Gorman
  2021-06-16  2:27               ` hasegawa-hitomi
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2021-05-21  8:41 UTC (permalink / raw)
  To: hasegawa-hitomi
  Cc: Peter Zijlstra, 'fweisbec@gmail.com',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org',
	'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

On Fri, May 21, 2021 at 06:40:53AM +0000, hasegawa-hitomi@fujitsu.com wrote:
> Hi Peter and Frederic
> 
> 
> > > Would be superfluous for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
> > > architectures at the very least.
> > > 
> > > It also doesn't help any of the other callers, like for example procfs.
> > > 
> > > Something like the below ought to work and fix all variants I think. But
> > > it does make the call significantly more expensive.
> > > 
> > > Looking at thread_group_cputime() that already does something like this,
> > > but that's also susceptible to a variant of this very same issue; since
> > > it doesn't call it unconditionally, nor on all tasks, so if current
> > > isn't part of the threadgroup and/or another task is on a nohz_full cpu,
> > > things will go wobbly again.
> > > 
> > > There's a note about syscall performance there, so clearly someone seems
> > > to care about that aspect of things, but it does suck for nohz_full.
> > > 
> > > Frederic, didn't we have remote ticks that should help with this stuff?
> > > 
> > > And mostly I think the trade-off here is that if you run on nohz_full,
> > > you're not expected to go do syscalls anyway (because they're sodding
> > > expensive) and hence the accuracy of these sort of things is mostly
> > > irrelevant.
> > > 
> > > So it might be the use-case is just fundamentally bonkers and we
> > > shouldn't really bother fixing this.
> > > 
> > > Anyway?
> > 
> > Typing be hard... that should 'obviously' be reading: Anyone?
> 
> 
> I understand that there is a trade-off between performance and accuracy
> and that this issue may have been discussed before.
> However, as Peter mentions, updating sum_exec_runtime just before
> retrieving the information is already done in thread_group_cputime(),
> which is on the RUSAGE_SELF path. So I think RUSAGE_THREAD should
> follow suit and do the same.
> 

I don't think it's a straight-forward issue. I know we've had to deal with
bugs in the past where the overhead of getting CPU usage statistics was
high enough to dominate workloads that had self-monitoring capabilities to
the extent the self-monitoring was counter-productive. It was particularly
problematic when self-monitoring was being activated to find the source
of a slowdown. I tend to agree with Peter here that the fix may be worse
than the problem ultimately where workloads are not necessarily willing
to pay the cost of accuracy and as he pointed out already, it's expected
nohz_full tasks are avoiding syscalls as much as possible.

-- 
Mel Gorman
SUSE Labs


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-21  8:41             ` Mel Gorman
@ 2021-06-16  2:27               ` hasegawa-hitomi
  0 siblings, 0 replies; 16+ messages in thread
From: hasegawa-hitomi @ 2021-06-16  2:27 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra, 'fweisbec@gmail.com',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Mel, Peter, and Frederic

I'm sorry for the late reply.

> I don't think it's a straight-forward issue. I know we've had to deal with
> bugs in the past where the overhead of getting CPU usage statistics was
> high enough to dominate workloads that had self-monitoring capabilities to
> the extent the self-monitoring was counter-productive. It was particularly
> problematic when self-monitoring was being activated to find the source
> of a slowdown. I tend to agree with Peter here that the fix may be worse
> than the problem ultimately where workloads are not necessarily willing
> to pay the cost of accuracy and as he pointed out already, it's expected
> nohz_full tasks are avoiding syscalls as much as possible.

I understand that syscalls should be avoided as much as possible on
nohz_full CPUs. However, I think accurate utime and stime are still
needed for performance tuning with nohz_full.
For example, how about introducing a new procfs entry or kernel
configuration option that controls whether the RUSAGE_THREAD path
calls task_sched_runtime()? That way, the choice between performance
and accuracy would be left to the user.
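
Concretely, I am imagining a fragment along these lines (the sysctl name is
made up, purely to illustrate making the extra update opt-in):

@@ void getrusage(struct task_struct *p, int who, struct rusage *r)
        if (who == RUSAGE_THREAD) {
+               if (sysctl_getrusage_thread_precise)    /* hypothetical knob */
+                       task_sched_runtime(current);
                task_cputime_adjusted(current, &utime, &stime);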

Thanks.
Hitomi Hasegawa


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-19  9:24       ` Peter Zijlstra
  2021-05-19  9:28         ` Peter Zijlstra
@ 2021-06-16 12:31         ` Frederic Weisbecker
  2021-06-16 12:54         ` Frederic Weisbecker
  2 siblings, 0 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2021-06-16 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hasegawa-hitomi, 'mingo@kernel.org',
	'fweisbec@gmail.com', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org',
	'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

On Wed, May 19, 2021 at 11:24:58AM +0200, Peter Zijlstra wrote:
> On Wed, May 19, 2021 at 06:30:36AM +0000, hasegawa-hitomi@fujitsu.com wrote:
> > Hi Ingo, Peter, Juri, and Vincent.
> > 
> > 
> > > Your email is malformed.
> > 
> > I'm sorry, it was sent in the wrong format. I have corrected it and am resending.
> > Thank you, Peter, for pointing this out.
> > 
> > 
> > I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU,
> > the utime and stime I get are less than the actual time, unlike when I run
> > getrusage(RUSAGE_SELF) on a single thread.
> > This problem seems to be caused by the fact that se.sum_exec_runtime is not
> > updated just before getting the information from 'current'.
> > In the current implementation, task_cputime_adjusted() calls task_cputime() to
> > get the 'current' utime and stime, then calls cputime_adjust() to adjust the
> > sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless
> > CPU, sum_exec_runtime is not updated periodically, so there seems to be a
> > discrepancy with the actual time.
> > Therefore, I think I should include a process to update se.sum_exec_runtime
> > just before getting the information from 'current' (as in other processes
> > except RUSAGE_THREAD). I'm thinking of the following improvement.
> > 
> > @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
> >         if (who == RUSAGE_THREAD) {
> > +               task_sched_runtime(current);
> >                 task_cputime_adjusted(current, &utime, &stime);
> > 
> > Is there any possible problem with this?
> 
> Would be superfluous for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
> architectures at the very least.
> 
> It also doesn't help any of the other callers, like for example procfs.
> 
> Something like the below ought to work and fix all variants I think. But
> it does make the call significantly more expensive.
> 
> Looking at thread_group_cputime() that already does something like this,
> but that's also susceptible to a variant of this very same issue; since
> it doesn't call it unconditionally, nor on all tasks, so if current
> isn't part of the threadgroup and/or another task is on a nohz_full cpu,
> things will go wobbly again.
> 
> There's a note about syscall performance there, so clearly someone seems
> to care about that aspect of things, but it does suck for nohz_full.
> 
> Frederic, didn't we have remote ticks that should help with this stuff?

Only once per second :-s

> 
> And mostly I think the trade-off here is that if you run on nohz_full,
> you're not expected to go do syscalls anyway (because they're sodding
> expensive) and hence the accuracy of these sort of things is mostly
> irrelevant.

I guess you can, before you enter some critical workload.

> 
> So it might be the use-case is just fundamentally bonkers and we
> shouldn't really bother fixing this.
> 
> Anyway?
> 
> ---
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 872e481d5098..620871c8e4f8 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>  void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
>  {
>  	struct task_cputime cputime = {
> -		.sum_exec_runtime = p->se.sum_exec_runtime,
> +		.sum_exec_runtime = task_sched_runtime(p),
>  	};
>  
>  	task_cputime(p, &cputime.utime, &cputime.stime);

Or perhaps just return task_cputime() if the task runs on a
nohz_full CPU? The check for that would be racy though, and there
might be jumps between values if the thread enters/leaves a nohz_full
CPU.
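
Something like the following sketch, say (illustrative only; as said, the
nohz_full check is racy against the thread migrating):

void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
{
	struct task_cputime cputime = {
		.sum_exec_runtime = p->se.sum_exec_runtime,
	};

	/*
	 * On a nohz_full CPU, vtime keeps utime/stime up to date, so the
	 * raw values can be returned without scaling them against a
	 * possibly stale sum_exec_runtime.
	 */
	if (tick_nohz_full_cpu(task_cpu(p))) {
		task_cputime(p, ut, st);
		return;
	}

	task_cputime(p, &cputime.utime, &cputime.stime);
	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
}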


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-05-19  9:24       ` Peter Zijlstra
  2021-05-19  9:28         ` Peter Zijlstra
  2021-06-16 12:31         ` Frederic Weisbecker
@ 2021-06-16 12:54         ` Frederic Weisbecker
  2021-06-22  6:49           ` hasegawa-hitomi
  2 siblings, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2021-06-16 12:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hasegawa-hitomi, 'mingo@kernel.org',
	'fweisbec@gmail.com', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org',
	'dietmar.eggemann@arm.com', 'rostedt@goodmis.org',
	'bsegall@google.com', 'mgorman@suse.de',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

On Wed, May 19, 2021 at 11:24:58AM +0200, Peter Zijlstra wrote:
> On Wed, May 19, 2021 at 06:30:36AM +0000, hasegawa-hitomi@fujitsu.com wrote:
> > Hi Ingo, Peter, Juri, and Vincent.
> > 
> > 
> > > Your email is malformed.
> > 
> > I'm sorry, it was sent in the wrong format. I have corrected it and am resending.
> > Thank you, Peter, for pointing this out.
> > 
> > 
> > I found that when I run getrusage(RUSAGE_THREAD) on a tickless CPU,
> > the utime and stime I get are less than the actual time, unlike when I run
> > getrusage(RUSAGE_SELF) on a single thread.
> > This problem seems to be caused by the fact that se.sum_exec_runtime is not
> > updated just before getting the information from 'current'.
> > In the current implementation, task_cputime_adjusted() calls task_cputime() to
> > get the 'current' utime and stime, then calls cputime_adjust() to adjust the
> > sum of utime and stime to be equal to cputime.sum_exec_runtime. On a tickless
> > CPU, sum_exec_runtime is not updated periodically, so there seems to be a
> > discrepancy with the actual time.
> > Therefore, I think I should include a process to update se.sum_exec_runtime
> > just before getting the information from 'current' (as in other processes
> > except RUSAGE_THREAD). I'm thinking of the following improvement.
> > 
> > @@ void getrusage(struct task_struct *p, int who, struct rusage *r)
> >         if (who == RUSAGE_THREAD) {
> > +               task_sched_runtime(current);
> >                 task_cputime_adjusted(current, &utime, &stime);
> > 
> > Is there any possible problem with this?
> 
> Would be superfluous for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y
> architectures at the very least.
> 
> It also doesn't help any of the other callers, like for example procfs.
> 
> Something like the below ought to work and fix all variants I think. But
> it does make the call significantly more expensive.
> 
> Looking at thread_group_cputime() that already does something like this,
> but that's also susceptible to a variant of this very same issue; since
> it doesn't call it unconditionally, nor on all tasks, so if current
> isn't part of the threadgroup and/or another task is on a nohz_full cpu,
> things will go wobbly again.
> 
> There's a note about syscall performance there, so clearly someone seems
> to care about that aspect of things, but it does suck for nohz_full.
> 
> Frederic, didn't we have remote ticks that should help with this stuff?
> 
> And mostly I think the trade-off here is that if you run on nohz_full,
> you're not expected to go do syscalls anyway (because they're sodding
> expensive) and hence the accuracy of these sort of things is mostly
> irrelevant.
> 
> So it might be the use-case is just fundamentally bonkers and we
> shouldn't really bother fixing this.
> 
> Anyway?
> 
> ---
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 872e481d5098..620871c8e4f8 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -612,7 +612,7 @@ void cputime_adjust(struct task_cputime *curr, struct prev_cputime *prev,
>  void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
>  {
>  	struct task_cputime cputime = {
> -		.sum_exec_runtime = p->se.sum_exec_runtime,
> +		.sum_exec_runtime = task_sched_runtime(p),
>  	};
>  
>  	task_cputime(p, &cputime.utime, &cputime.stime);

If necessary I guess we can do something like the below, which
would only add the overhead where it's required:

diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
index 6c9f19a33865..ce3c58286062 100644
--- a/include/linux/sched/cputime.h
+++ b/include/linux/sched/cputime.h
@@ -18,15 +18,16 @@
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-extern void task_cputime(struct task_struct *t,
+extern bool task_cputime(struct task_struct *t,
 			 u64 *utime, u64 *stime);
 extern u64 task_gtime(struct task_struct *t);
 #else
-static inline void task_cputime(struct task_struct *t,
+static inline bool task_cputime(struct task_struct *t,
 				u64 *utime, u64 *stime)
 {
 	*utime = t->utime;
 	*stime = t->stime;
+	return false;
 }
 
 static inline u64 task_gtime(struct task_struct *t)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 872e481d5098..9392aea1804e 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
 		.sum_exec_runtime = p->se.sum_exec_runtime,
 	};
 
-	task_cputime(p, &cputime.utime, &cputime.stime);
+	if (task_cputime(p, &cputime.utime, &cputime.stime))
+		cputime.sum_exec_runtime = task_sched_runtime(p);
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 EXPORT_SYMBOL_GPL(task_cputime_adjusted);
@@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
  * add up the pending nohz execution time since the last
  * cputime snapshot.
  */
-void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
+bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 {
 	struct vtime *vtime = &t->vtime;
 	unsigned int seq;
 	u64 delta;
+	int ret;
 
 	if (!vtime_accounting_enabled()) {
 		*utime = t->utime;
 		*stime = t->stime;
-		return;
+		return false;
 	}
 
 	do {
+		ret = false;
 		seq = read_seqcount_begin(&vtime->seqcount);
 
 		*utime = t->utime;
@@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		if (vtime->state < VTIME_SYS)
 			continue;
 
+		ret = true;
 		delta = vtime_delta(vtime);
 
 		/*
@@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
 		else
 			*utime += vtime->utime + delta;
 	} while (read_seqcount_retry(&vtime->seqcount, seq));
+
+	return ret;
 }
 
 static int vtime_state_fetch(struct vtime *vtime, int cpu)


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-06-16 12:54         ` Frederic Weisbecker
@ 2021-06-22  6:49           ` hasegawa-hitomi
  2021-06-28  2:36             ` hasegawa-hitomi
  0 siblings, 1 reply; 16+ messages in thread
From: hasegawa-hitomi @ 2021-06-22  6:49 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra, 'mgorman@suse.de',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'fweisbec@gmail.com', 'dietmar.eggemann@arm.com',
	'rostedt@goodmis.org', 'bsegall@google.com',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Frederic, Peter, and Mel

> If necessary I guess we can do something like the below, which
> would only add the overhead where it's required:
> 
> diff --git a/include/linux/sched/cputime.h b/include/linux/sched/cputime.h
> index 6c9f19a33865..ce3c58286062 100644
> --- a/include/linux/sched/cputime.h
> +++ b/include/linux/sched/cputime.h
> @@ -18,15 +18,16 @@
>  #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
>  
>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> -extern void task_cputime(struct task_struct *t,
> +extern bool task_cputime(struct task_struct *t,
>  			 u64 *utime, u64 *stime);
>  extern u64 task_gtime(struct task_struct *t);
>  #else
> -static inline void task_cputime(struct task_struct *t,
> +static inline bool task_cputime(struct task_struct *t,
>  				u64 *utime, u64 *stime)
>  {
>  	*utime = t->utime;
>  	*stime = t->stime;
> +	return false;
>  }
>  
>  static inline u64 task_gtime(struct task_struct *t)
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 872e481d5098..9392aea1804e 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -615,7 +615,8 @@ void task_cputime_adjusted(struct task_struct *p, u64 *ut, u64 *st)
>  		.sum_exec_runtime = p->se.sum_exec_runtime,
>  	};
>  
> -	task_cputime(p, &cputime.utime, &cputime.stime);
> +	if (task_cputime(p, &cputime.utime, &cputime.stime))
> +		cputime.sum_exec_runtime = task_sched_runtime(p);
>  	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
>  }
>  EXPORT_SYMBOL_GPL(task_cputime_adjusted);
> @@ -828,19 +829,21 @@ u64 task_gtime(struct task_struct *t)
>   * add up the pending nohz execution time since the last
>   * cputime snapshot.
>   */
> -void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
> +bool task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
>  {
>  	struct vtime *vtime = &t->vtime;
>  	unsigned int seq;
>  	u64 delta;
> +	int ret;
>  
>  	if (!vtime_accounting_enabled()) {
>  		*utime = t->utime;
>  		*stime = t->stime;
> -		return;
> +		return false;
>  	}
>  
>  	do {
> +		ret = false;
>  		seq = read_seqcount_begin(&vtime->seqcount);
>  
>  		*utime = t->utime;
> @@ -850,6 +853,7 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
>  		if (vtime->state < VTIME_SYS)
>  			continue;
>  
> +		ret = true;
>  		delta = vtime_delta(vtime);
>  
>  		/*
> @@ -861,6 +865,8 @@ void task_cputime(struct task_struct *t, u64 *utime, u64 *stime)
>  		else
>  			*utime += vtime->utime + delta;
>  	} while (read_seqcount_retry(&vtime->seqcount, seq));
> +
> +	return ret;
>  }
>  
>  static int vtime_state_fetch(struct vtime *vtime, int cpu)

I tested this and confirmed that it gives accurate values.
Also, task_sched_runtime() is called only on nohz_full CPUs, so I think
it's a better approach than the previous one.
I hope you will incorporate this fix into mainline.

Thanks.
Hitomi Hasegawa


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-06-22  6:49           ` hasegawa-hitomi
@ 2021-06-28  2:36             ` hasegawa-hitomi
  2021-06-28 14:13               ` Frederic Weisbecker
  0 siblings, 1 reply; 16+ messages in thread
From: hasegawa-hitomi @ 2021-06-28  2:36 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra, 'mgorman@suse.de',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'fweisbec@gmail.com', 'dietmar.eggemann@arm.com',
	'rostedt@goodmis.org', 'bsegall@google.com',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Frederic, Peter, and Mel

> I tested this and confirmed that it gives accurate values.
> Also, task_sched_runtime() is called only on nohz_full CPUs, so I think
> it's a better approach than the previous one.
> I hope you will incorporate this fix into mainline.

I have also confirmed that CPUs other than nohz_full have very little overhead.
Please consider merging it into the mainline.

Thanks.
Hitomi Hasegawa


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-06-28  2:36             ` hasegawa-hitomi
@ 2021-06-28 14:13               ` Frederic Weisbecker
  2021-07-21  8:24                 ` hasegawa-hitomi
  2021-08-31  4:18                 ` hasegawa-hitomi
  0 siblings, 2 replies; 16+ messages in thread
From: Frederic Weisbecker @ 2021-06-28 14:13 UTC (permalink / raw)
  To: hasegawa-hitomi
  Cc: Peter Zijlstra, 'mgorman@suse.de',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org',
	'fweisbec@gmail.com', 'dietmar.eggemann@arm.com',
	'rostedt@goodmis.org', 'bsegall@google.com',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

On Mon, Jun 28, 2021 at 02:36:27AM +0000, hasegawa-hitomi@fujitsu.com wrote:
> Hi Frederic, Peter, and Mel
> 
> > I tested this and confirmed that it gives accurate values.
> > Also, task_sched_runtime() is called only on nohz_full CPUs, so I think
> > it's a better approach than the previous one.
> > I hope you will incorporate this fix into mainline.
> 
> I have also confirmed that CPUs other than nohz_full have very little overhead.
> Please consider merging it into the mainline.

Ok, I'm going to submit a proper patch.

Thanks!


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-06-28 14:13               ` Frederic Weisbecker
@ 2021-07-21  8:24                 ` hasegawa-hitomi
  2021-08-31  4:18                 ` hasegawa-hitomi
  1 sibling, 0 replies; 16+ messages in thread
From: hasegawa-hitomi @ 2021-07-21  8:24 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra, 'mgorman@suse.de',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'fweisbec@gmail.com', 'dietmar.eggemann@arm.com',
	'rostedt@goodmis.org', 'bsegall@google.com',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Frederic

> > I have also confirmed that CPUs other than nohz_full have very little overhead.
> > Please consider merging it into the mainline.
> 
> Ok, I'm going to submit a proper patch.

I'm looking forward to your patch being posted.
Do you have a specific plan? I would like to know roughly when this
fix is expected to land.

Thanks.
Hitomi Hasegawa


* Re: Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU.
  2021-06-28 14:13               ` Frederic Weisbecker
  2021-07-21  8:24                 ` hasegawa-hitomi
@ 2021-08-31  4:18                 ` hasegawa-hitomi
  1 sibling, 0 replies; 16+ messages in thread
From: hasegawa-hitomi @ 2021-08-31  4:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra, 'mgorman@suse.de',
	'mingo@kernel.org', 'tglx@linutronix.de',
	'juri.lelli@redhat.com',
	'vincent.guittot@linaro.org'
  Cc: 'fweisbec@gmail.com', 'dietmar.eggemann@arm.com',
	'rostedt@goodmis.org', 'bsegall@google.com',
	'bristot@redhat.com',
	'linux-kernel@vger.kernel.org'

Hi Frederic,

> > I have also confirmed that CPUs other than nohz_full have very little overhead.
> > Please consider merging it into the mainline.
> 
> Ok, I'm going to submit a proper patch.

Thank you for working on the patch.
Do you have any update?

Thanks.
Hitomi Hasegawa


Thread overview: 16+ messages
2021-05-12  3:28 Utime and stime are less when getrusage (RUSAGE_THREAD) is executed on a tickless CPU hasegawa-hitomi
2021-05-18  7:59 ` hasegawa-hitomi
2021-05-18  8:23   ` Peter Zijlstra
2021-05-19  6:30     ` hasegawa-hitomi
2021-05-19  9:24       ` Peter Zijlstra
2021-05-19  9:28         ` Peter Zijlstra
2021-05-21  6:40           ` hasegawa-hitomi
2021-05-21  8:41             ` Mel Gorman
2021-06-16  2:27               ` hasegawa-hitomi
2021-06-16 12:31         ` Frederic Weisbecker
2021-06-16 12:54         ` Frederic Weisbecker
2021-06-22  6:49           ` hasegawa-hitomi
2021-06-28  2:36             ` hasegawa-hitomi
2021-06-28 14:13               ` Frederic Weisbecker
2021-07-21  8:24                 ` hasegawa-hitomi
2021-08-31  4:18                 ` hasegawa-hitomi
