LKML Archive on lore.kernel.org
 help / color / Atom feed
* process time < thread time?
@ 2011-09-01  3:07 David Miller
  2011-09-01  9:56 ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: David Miller @ 2011-09-01  3:07 UTC (permalink / raw)
  To: peterz; +Cc: tglx, linux-kernel


If someone who understands our thread/process time implementation can
look into this, I'd appreciate it.

Attached below is a watered-down version of rt/tst-cpuclock2.c from
GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
similar.

Run it several times, and you will see cases where the main thread
will measure a process clock difference before and after the nanosleep
which is smaller than the cpu-burner thread's individual thread clock
difference.  This doesn't make any sense since the cpu-burner thread
is part of the top-level process's thread group.

I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
64-bit binaries).

For example:

[davem@boricha build-x86_64-linux]$ ./test
process: before(0.001221967) after(0.498624371) diff(497402404)
thread:  before(0.000081692) after(0.498316431) diff(498234739)
self:    before(0.001223521) after(0.001240219) diff(16698)
[davem@boricha build-x86_64-linux]$ 

The diff of 'process' should always be >= the diff of 'thread'.

I make sure to wrap the 'thread' clock measurements the most tightly
around the nanosleep() call, and that the 'process' clock measurements
are the outer-most ones.

I suspect this might be some kind of artifact of how the partial
runqueue ->clock and ->clock_task updates work?  Maybe some weird
interaction with ->skip_clock_update?

Or is this some known issue?

Thanks!

--------------------
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

static pthread_barrier_t barrier;

static void *chew_cpu(void *arg)
{
	pthread_barrier_wait(&barrier);
	while (1)
		__asm__ __volatile__("" : : : "memory");
	return NULL;
}

int main(void)
{
	clockid_t process_clock, my_thread_clock, th_clock;
	struct timespec process_before, process_after;
	struct timespec me_before, me_after;
	struct timespec th_before, th_after;
	struct timespec sleeptime;
	unsigned long diff;
	pthread_t th;
	int err;

	err = clock_getcpuclockid(0, &process_clock);
	if (err)
		return 1;

	err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
	if (err)
		return 1;

	pthread_barrier_init(&barrier, NULL, 2);
	err = pthread_create(&th, NULL, chew_cpu, NULL);
	if (err)
		return 1;

	err = pthread_getcpuclockid(th, &th_clock);
	if (err)
		return 1;

	pthread_barrier_wait(&barrier);

	err = clock_gettime(process_clock, &process_before);
	if (err)
		return 1;

	err = clock_gettime(my_thread_clock, &me_before);
	if (err)
		return 1;

	err = clock_gettime(th_clock, &th_before);
	if (err)
		return 1;

	sleeptime.tv_sec = 0;
	sleeptime.tv_nsec = 500000000;
	nanosleep(&sleeptime, NULL);

	err = clock_gettime(th_clock, &th_after);
	if (err)
		return 1;

	err = clock_gettime(my_thread_clock, &me_after);
	if (err)
		return 1;

	err = clock_gettime(process_clock, &process_after);
	if (err)
		return 1;

	diff = process_after.tv_nsec - process_before.tv_nsec;
	printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       process_before.tv_sec, process_before.tv_nsec,
	       process_after.tv_sec, process_after.tv_nsec, diff);
	diff = th_after.tv_nsec - th_before.tv_nsec;
	printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       th_before.tv_sec, th_before.tv_nsec,
	       th_after.tv_sec, th_after.tv_nsec, diff);
	diff = me_after.tv_nsec - me_before.tv_nsec;
	printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
	       me_before.tv_sec, me_before.tv_nsec,
	       me_after.tv_sec, me_after.tv_nsec, diff);

	return 0;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: process time < thread time?
  2011-09-01  3:07 process time < thread time? David Miller
@ 2011-09-01  9:56 ` Thomas Gleixner
  2011-09-01 10:11   ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2011-09-01  9:56 UTC (permalink / raw)
  To: David Miller; +Cc: peterz, linux-kernel

Dave,

On Wed, 31 Aug 2011, David Miller wrote:
> If someone who understands our thread/process time implementation can
> look into this, I'd appreciate it.
> 
> Attached below is a watered-down version of rt/tst-cpuclock2.c from
> GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
> similar.
> 
> Run it several times, and you will see cases where the main thread
> will measure a process clock difference before and after the nanosleep
> which is smaller than the cpu-burner thread's individual thread clock
> difference.  This doesn't make any sense since the cpu-burner thread
> is part of the top-level process's thread group.
> 
> I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
> 64-bit binaries).
> 
> For example:
> 
> [davem@boricha build-x86_64-linux]$ ./test
> process: before(0.001221967) after(0.498624371) diff(497402404)
> thread:  before(0.000081692) after(0.498316431) diff(498234739)
> self:    before(0.001223521) after(0.001240219) diff(16698)
> [davem@boricha build-x86_64-linux]$ 
> 
> The diff of 'process' should always be >= the diff of 'thread'.
> 
> I make sure to wrap the 'thread' clock measurements the most tightly
> around the nanosleep() call, and that the 'process' clock measurements
> are the outer-most ones.
> 
> I suspect this might be some kind of artifact of how the partial
> runqueue ->clock and ->clock_task updates work?  Maybe some weird
> interaction with ->skip_clock_update?
> 
> Or is this some known issue?

That's an SMP artifact. If you run "taskset 01 ./test" the result is
always correct.

The reason why this shows deviations on SMP is how the thread times
are accumulated in thread_group_cputime(). We sum
t->se.sum_exec_runtime of all threads. So if the hog thread is
currently running on the other core (which is likely) then the runtime
field of that thread is not up to date.

The untested patch below should cure this.

Thanks,

	tglx

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 58f405b..42378cb 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -250,7 +250,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	do {
 		times->utime = cputime_add(times->utime, t->utime);
 		times->stime = cputime_add(times->stime, t->stime);
-		times->sum_exec_runtime += t->se.sum_exec_runtime;
+		times->sum_exec_runtime += task_sched_runtime(t);
 	} while_each_thread(tsk, t);
 out:
 	rcu_read_unlock();



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: process time < thread time?
  2011-09-01  9:56 ` Thomas Gleixner
@ 2011-09-01 10:11   ` Peter Zijlstra
  2011-09-01 10:39     ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2011-09-01 10:11 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: David Miller, linux-kernel

On Thu, 2011-09-01 at 11:56 +0200, Thomas Gleixner wrote:
> The untested patch below should cure this.
> 

> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> index 58f405b..42378cb 100644
> --- a/kernel/posix-cpu-timers.c
> +++ b/kernel/posix-cpu-timers.c
> @@ -250,7 +250,7 @@ void thread_group_cputime(struct task_struct *tsk,
> struct task_cputime *times)
>         do {
>                 times->utime = cputime_add(times->utime, t->utime);
>                 times->stime = cputime_add(times->stime, t->stime);
> -               times->sum_exec_runtime += t->se.sum_exec_runtime;
> +               times->sum_exec_runtime += task_sched_runtime(t);
>         } while_each_thread(tsk, t);
>  out:
>         rcu_read_unlock(); 

I think we just deadlocked:

cpu_clock_sample_group()
  CPUCLOCK_SCHED: thread_group_sched_runtime()
    task_rq_lock()        <-.
    thread_group_cputime()  |
      task_sched_runtime()  |
       task_rq_lock() ------'


Also, having to take locks here is sad, but yeah if we want to cure this
there's not much we can do about that.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: process time < thread time?
  2011-09-01 10:11   ` Peter Zijlstra
@ 2011-09-01 10:39     ` Thomas Gleixner
  2011-09-01 10:54       ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2011-09-01 10:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: David Miller, linux-kernel

On Thu, 1 Sep 2011, Peter Zijlstra wrote:

> On Thu, 2011-09-01 at 11:56 +0200, Thomas Gleixner wrote:
> > The untested patch below should cure this.
> > 
> 
> > diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
> > index 58f405b..42378cb 100644
> > --- a/kernel/posix-cpu-timers.c
> > +++ b/kernel/posix-cpu-timers.c
> > @@ -250,7 +250,7 @@ void thread_group_cputime(struct task_struct *tsk,
> > struct task_cputime *times)
> >         do {
> >                 times->utime = cputime_add(times->utime, t->utime);
> >                 times->stime = cputime_add(times->stime, t->stime);
> > -               times->sum_exec_runtime += t->se.sum_exec_runtime;
> > +               times->sum_exec_runtime += task_sched_runtime(t);
> >         } while_each_thread(tsk, t);
> >  out:
> >         rcu_read_unlock(); 
> 
> I think we just deadlocked:

That's why I said untested :)
 
> cpu_clock_sample_group()
>   CPUCLOCK_SCHED: thread_group_sched_runtime()
>     task_rq_lock()        <-.
>     thread_group_cputime()  |
>       task_sched_runtime()  |
>        task_rq_lock() ------'
> 
> 
> Also, having to take locks here is sad, but yeah if we want to cure this
> there's not much we can do about that.

Yeah :(
 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: process time < thread time?
  2011-09-01 10:39     ` Thomas Gleixner
@ 2011-09-01 10:54       ` Peter Zijlstra
  2011-09-01 14:54         ` Thomas Gleixner
  2011-09-01 14:56         ` David Miller
  0 siblings, 2 replies; 7+ messages in thread
From: Peter Zijlstra @ 2011-09-01 10:54 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: David Miller, linux-kernel

On Thu, 2011-09-01 at 12:39 +0200, Thomas Gleixner wrote:
> > I think we just deadlocked:
> 
> That's why I said untested :) 

the below seems to build boot and give consistent output.. 

---
Subject: posix-cpu-timers: Cure SMP wobbles
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Thu Sep 01 12:42:04 CEST 2011

David reported:

  Attached below is a watered-down version of rt/tst-cpuclock2.c from
  GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
  similar.

  Run it several times, and you will see cases where the main thread
  will measure a process clock difference before and after the nanosleep
  which is smaller than the cpu-burner thread's individual thread clock
  difference.  This doesn't make any sense since the cpu-burner thread
  is part of the top-level process's thread group.

  I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
  64-bit binaries).

  For example:

  [davem@boricha build-x86_64-linux]$ ./test
  process: before(0.001221967) after(0.498624371) diff(497402404)
  thread:  before(0.000081692) after(0.498316431) diff(498234739)
  self:    before(0.001223521) after(0.001240219) diff(16698)
  [davem@boricha build-x86_64-linux]$ 

  The diff of 'process' should always be >= the diff of 'thread'.

  I make sure to wrap the 'thread' clock measurements the most tightly
  around the nanosleep() call, and that the 'process' clock measurements
  are the outer-most ones.

  ---
  #include <unistd.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <fcntl.h>
  #include <string.h>
  #include <errno.h>
  #include <pthread.h>

  static pthread_barrier_t barrier;

  static void *chew_cpu(void *arg)
  {
	  pthread_barrier_wait(&barrier);
	  while (1)
		  __asm__ __volatile__("" : : : "memory");
	  return NULL;
  }

  int main(void)
  {
	  clockid_t process_clock, my_thread_clock, th_clock;
	  struct timespec process_before, process_after;
	  struct timespec me_before, me_after;
	  struct timespec th_before, th_after;
	  struct timespec sleeptime;
	  unsigned long diff;
	  pthread_t th;
	  int err;

	  err = clock_getcpuclockid(0, &process_clock);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
	  if (err)
		  return 1;

	  pthread_barrier_init(&barrier, NULL, 2);
	  err = pthread_create(&th, NULL, chew_cpu, NULL);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(th, &th_clock);
	  if (err)
		  return 1;

	  pthread_barrier_wait(&barrier);

	  err = clock_gettime(process_clock, &process_before);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &me_before);
	  if (err)
		  return 1;

	  err = clock_gettime(th_clock, &th_before);
	  if (err)
		  return 1;

	  sleeptime.tv_sec = 0;
	  sleeptime.tv_nsec = 500000000;
	  nanosleep(&sleeptime, NULL);

	  err = clock_gettime(th_clock, &th_after);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &me_after);
	  if (err)
		  return 1;

	  err = clock_gettime(process_clock, &process_after);
	  if (err)
		  return 1;

	  diff = process_after.tv_nsec - process_before.tv_nsec;
	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 process_before.tv_sec, process_before.tv_nsec,
		 process_after.tv_sec, process_after.tv_nsec, diff);
	  diff = th_after.tv_nsec - th_before.tv_nsec;
	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 th_before.tv_sec, th_before.tv_nsec,
		 th_after.tv_sec, th_after.tv_nsec, diff);
	  diff = me_after.tv_nsec - me_before.tv_nsec;
	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 me_before.tv_sec, me_before.tv_nsec,
		 me_after.tv_sec, me_after.tv_nsec, diff);

	  return 0;
  }

This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().

Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-nk5489zjtxsln3ixjonjs3h3@git.kernel.org
---
 include/linux/sched.h     |    1 -
 kernel/posix-cpu-timers.c |    5 +++--
 kernel/sched.c            |   24 ------------------------
 3 files changed, 3 insertions(+), 27 deletions(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1956,7 +1956,6 @@ static inline void disable_sched_clock_i
 
 extern unsigned long long
 task_sched_runtime(struct task_struct *task);
-extern unsigned long long thread_group_sched_runtime(struct task_struct *task);
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
Index: linux-2.6/kernel/posix-cpu-timers.c
===================================================================
--- linux-2.6.orig/kernel/posix-cpu-timers.c
+++ linux-2.6/kernel/posix-cpu-timers.c
@@ -250,7 +250,7 @@ void thread_group_cputime(struct task_st
 	do {
 		times->utime = cputime_add(times->utime, t->utime);
 		times->stime = cputime_add(times->stime, t->stime);
-		times->sum_exec_runtime += t->se.sum_exec_runtime;
+		times->sum_exec_runtime += task_sched_runtime(t);
 	} while_each_thread(tsk, t);
 out:
 	rcu_read_unlock();
@@ -312,7 +312,8 @@ static int cpu_clock_sample_group(const 
 		cpu->cpu = cputime.utime;
 		break;
 	case CPUCLOCK_SCHED:
-		cpu->sched = thread_group_sched_runtime(p);
+		thread_group_cputime(p, &cputime);
+		cpu->sched = cputime.sum_exec_runtime;
 		break;
 	}
 	return 0;
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -3866,30 +3866,6 @@ unsigned long long task_sched_runtime(st
 }
 
 /*
- * Return sum_exec_runtime for the thread group.
- * In case the task is currently running, return the sum plus current's
- * pending runtime that have not been accounted yet.
- *
- * Note that the thread group might have other running tasks as well,
- * so the return value not includes other pending runtime that other
- * running tasks might have.
- */
-unsigned long long thread_group_sched_runtime(struct task_struct *p)
-{
-	struct task_cputime totals;
-	unsigned long flags;
-	struct rq *rq;
-	u64 ns;
-
-	rq = task_rq_lock(p, &flags);
-	thread_group_cputime(p, &totals);
-	ns = totals.sum_exec_runtime + do_task_delta_exec(p, rq);
-	task_rq_unlock(rq, p, &flags);
-
-	return ns;
-}
-
-/*
  * Account user cpu time to a process.
  * @p: the process that the cpu time gets accounted to
  * @cputime: the cpu time spent in user space since the last update


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: process time < thread time?
  2011-09-01 10:54       ` Peter Zijlstra
@ 2011-09-01 14:54         ` Thomas Gleixner
  2011-09-01 14:56         ` David Miller
  1 sibling, 0 replies; 7+ messages in thread
From: Thomas Gleixner @ 2011-09-01 14:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: David Miller, linux-kernel

On Thu, 1 Sep 2011, Peter Zijlstra wrote:
> On Thu, 2011-09-01 at 12:39 +0200, Thomas Gleixner wrote:
> > > I think we just deadlocked:
> > 
> > That's why I said untested :) 
> 
> Reported-by: David Miller <davem@davemloft.net>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Link: http://lkml.kernel.org/n/tip-nk5489zjtxsln3ixjonjs3h3@git.kernel.org

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: process time < thread time?
  2011-09-01 10:54       ` Peter Zijlstra
  2011-09-01 14:54         ` Thomas Gleixner
@ 2011-09-01 14:56         ` David Miller
  1 sibling, 0 replies; 7+ messages in thread
From: David Miller @ 2011-09-01 14:56 UTC (permalink / raw)
  To: peterz; +Cc: tglx, linux-kernel

From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 01 Sep 2011 12:54:19 +0200

> On Thu, 2011-09-01 at 12:39 +0200, Thomas Gleixner wrote:
>> > I think we just deadlocked:
>> 
>> That's why I said untested :) 
> 
> the below seems to build boot and give consistent output.. 

Works here too.

> Reported-by: David Miller <davem@davemloft.net>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Link: http://lkml.kernel.org/n/tip-nk5489zjtxsln3ixjonjs3h3@git.kernel.org

Tested-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, back to index

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-01  3:07 process time < thread time? David Miller
2011-09-01  9:56 ` Thomas Gleixner
2011-09-01 10:11   ` Peter Zijlstra
2011-09-01 10:39     ` Thomas Gleixner
2011-09-01 10:54       ` Peter Zijlstra
2011-09-01 14:54         ` Thomas Gleixner
2011-09-01 14:56         ` David Miller

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git