* [PATCH 0/3] lockless sys_times and posix_cpu_clock_get
@ 2014-08-15 20:05 riel
  2014-08-15 20:05 ` [PATCH 1/3] exit: always reap resource stats in __exit_signal riel
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: riel @ 2014-08-15 20:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: oleg, peterz, umgwanakikbuti, fweisbec, akpm, srao, lwoodman, atheurer

Thanks to the feedback from Oleg, Peter, Mike, and Frederic,
I seem to have a patch series that manages to do times()
locklessly, and apparently correctly.

Oleg points out that the monotonicity alone is not enough of a
guarantee, but that should probably be attacked separately, since
that issue is equally present with and without these patches...

The test case below, slightly changed from the one posted by Spencer
Candland in 2009, now runs in 11 seconds instead of 5 minutes.

Is it worthwhile?  There apparently are some real workloads that call
times() a lot, and I believe Sanjay and Andrew have one sitting around.

--------

/*

Based on the test case from the following bug report, but changed
to measure utime on a per thread basis. (Rik van Riel)

https://lkml.org/lkml/2009/11/3/522

From: Spencer Candland
Subject: utime/stime decreasing on thread exit

I am seeing a problem with utime/stime decreasing on thread exit in a
multi-threaded process.  I have been able to track this regression down
to the "process wide cpu clocks/timers" changes introduces in
2.6.29-rc5, specifically when I revert the following commits I know
longer see decreasing utime/stime values:

4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
3fccfd67df79c6351a156eb25a7a514e5f39c4d9
7d8e23df69820e6be42bcc41d441f4860e8c76f7
4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0

I poked around a little, but I am afraid I have to admit that I am not
familiar enough with how this works to resolve this or suggest a fix.

I have verified this is happening in kernels 2.6.29-rc5 - 2.6.32-rc6. I
have been testing this on x86 vanilla kernels, but have also verified it
on several x86 2.6.29+ distro kernels (fedora and ubuntu).

I first noticed this in a production environment running Apache with the
worker MPM; however, while tracking this down I put together a simple
program that has been reliable in showing me utime decreasing. Hopefully
it will be helpful in demonstrating the issue:
*/

#include <stdio.h>
#include <pthread.h>
#include <sys/times.h>

#define NUM_THREADS 500

struct tms start;

void *pound (void *threadid)
{
  struct tms end;
  int oldutime = 0;
  int utime;
  int c, i;
  for (i = 0; i < 10000; i++) {
	  for (c = 0; c < 10000; c++);
	  times(&end);
	  utime = ((int)end.tms_utime - (int)start.tms_utime);
	  if (oldutime > utime) {
	    printf("utime decreased, was %d, now %d!\n", oldutime, utime);
	  }
	  oldutime = utime;
  }
  pthread_exit(NULL);
}

int main()
{
  pthread_t th[NUM_THREADS];
  long i;
  times(&start);
  for (i = 0; i < NUM_THREADS; i++) {
    pthread_create (&th[i], NULL, pound, (void *)i);
  }
  pthread_exit(NULL);
  return 0;
}
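
/*
 * Build note (an assumption on my part, not from the original report):
 * compile with "gcc -O0 -pthread pound.c -o pound".  With optimization
 * enabled the empty inner busy-loop may be optimized away entirely,
 * which changes the timing behaviour of the test.
 */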


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] exit: always reap resource stats in __exit_signal
  2014-08-15 20:05 [PATCH 0/3] lockless sys_times and posix_cpu_clock_get riel
@ 2014-08-15 20:05 ` riel
  2014-09-08  6:39   ` [tip:sched/core] exit: Always reap resource stats in __exit_signal() tip-bot for Rik van Riel
  2014-08-15 20:05 ` [PATCH 2/3] time,signal: protect resource use statistics with seqlock riel
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-08-15 20:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: oleg, peterz, umgwanakikbuti, fweisbec, akpm, srao, lwoodman, atheurer

From: Rik van Riel <riel@redhat.com>

Oleg pointed out that wait_task_zombie adds a task's usage statistics
to the parent's signal struct, but the statistics should also be
propagated into the task's own signal struct at exit time.

This allows thread_group_cputime(reaped_zombie) to get the statistics
after __unhash_process() has made the task invisible to for_each_thread,
but before the thread has actually been rcu freed, making sure no
non-monotonic results are returned inside that window.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/exit.c | 43 +++++++++++++++++++++----------------------
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..b93d46d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -115,30 +115,29 @@ static void __exit_signal(struct task_struct *tsk)
 
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
-		/*
-		 * Accumulate here the counters for all threads but the
-		 * group leader as they die, so they can be added into
-		 * the process-wide totals when those are taken.
-		 * The group leader stays around as a zombie as long
-		 * as there are other threads.  When it gets reaped,
-		 * the exit.c code will add its counts into these totals.
-		 * We won't ever get here for the group leader, since it
-		 * will have been the last reference on the signal_struct.
-		 */
-		task_cputime(tsk, &utime, &stime);
-		sig->utime += utime;
-		sig->stime += stime;
-		sig->gtime += task_gtime(tsk);
-		sig->min_flt += tsk->min_flt;
-		sig->maj_flt += tsk->maj_flt;
-		sig->nvcsw += tsk->nvcsw;
-		sig->nivcsw += tsk->nivcsw;
-		sig->inblock += task_io_get_inblock(tsk);
-		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
-		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}
 
+	/*
+	 * Accumulate here the counters for all threads but the group leader
+	 * as they die, so they can be added into the process-wide totals
+	 * when those are taken.  The group leader stays around as a zombie as
+	 * long as there are other threads.  When it gets reaped, the exit.c
+	 * code will add its counts into these totals.  We won't ever get here
+	 * for the group leader, since it will have been the last reference on
+	 * the signal_struct.
+	 */
+	task_cputime(tsk, &utime, &stime);
+	sig->utime += utime;
+	sig->stime += stime;
+	sig->gtime += task_gtime(tsk);
+	sig->min_flt += tsk->min_flt;
+	sig->maj_flt += tsk->maj_flt;
+	sig->nvcsw += tsk->nvcsw;
+	sig->nivcsw += tsk->nivcsw;
+	sig->inblock += task_io_get_inblock(tsk);
+	sig->oublock += task_io_get_oublock(tsk);
+	task_io_accounting_add(&sig->ioac, &tsk->ioac);
+	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
 	__unhash_process(tsk, group_dead);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-15 20:05 [PATCH 0/3] lockless sys_times and posix_cpu_clock_get riel
  2014-08-15 20:05 ` [PATCH 1/3] exit: always reap resource stats in __exit_signal riel
@ 2014-08-15 20:05 ` riel
  2014-08-16 14:11   ` Oleg Nesterov
  2014-08-15 20:05 ` [PATCH 3/3] sched,time: atomically increment stime & utime riel
  2014-08-19 21:21 ` [PATCH 0/3] lockless sys_times and posix_cpu_clock_get Andrew Theurer
  3 siblings, 1 reply; 19+ messages in thread
From: riel @ 2014-08-15 20:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: oleg, peterz, umgwanakikbuti, fweisbec, akpm, srao, lwoodman, atheurer

From: Rik van Riel <riel@redhat.com>

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value while a thread in the
task group exits and its statistics are rolled up into the signal struct,
a window in which the exiting task's statistics could otherwise be
counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  |  4 ++++
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++++++++++++++++++++++-------------
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 --------------
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index b93d46d..fa09b86 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -127,6 +127,7 @@ static void __exit_signal(struct task_struct *tsk)
 	 * the signal_struct.
 	 */
 	task_cputime(tsk, &utime, &stime);
+	write_seqlock(&sig->stats_lock);
 	sig->utime += utime;
 	sig->stime += stime;
 	sig->gtime += task_gtime(tsk);
@@ -140,6 +141,7 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
 	__unhash_process(tsk, group_dead);
+	write_sequnlock(&sig->stats_lock);
 
 	/*
 	 * Do this under ->siglock, we can race with another thread
@@ -1042,6 +1044,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
+		write_seqlock(&psig->stats_lock);
 		psig->cutime += tgutime + sig->cutime;
 		psig->cstime += tgstime + sig->cstime;
 		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1064,6 +1067,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 			psig->cmaxrss = maxrss;
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		write_sequnlock(&psig->stats_lock);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..5d7cf2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
+	seqlock_init(&sig->stats_lock);
 
 	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..b5f1c58 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	cputime_t utime, stime;
 	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	unsigned int seq, nextseq;
 
 	rcu_read_lock();
-	for_each_thread(tsk, t) {
-		task_cputime(t, &utime, &stime);
-		times->utime += utime;
-		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
-	}
+	/* Attempt a lockless read on the first round. */
+	nextseq = 0;
+	do {
+		seq = nextseq;
+		read_seqbegin_or_lock(&sig->stats_lock, &seq);
+		times->utime = sig->utime;
+		times->stime = sig->stime;
+		times->sum_exec_runtime = sig->sum_sched_runtime;
+
+		for_each_thread(tsk, t) {
+			task_cputime(t, &utime, &stime);
+			times->utime += utime;
+			times->stime += stime;
+			times->sum_exec_runtime += task_sched_runtime(t);
+		}
+		/*
+		 * If a writer is currently active, seq will be odd, and
+		 * read_seqbegin_or_lock will take the lock.
+		 */
+		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
+	} while (need_seqretry(&sig->stats_lock, seq));
+	done_seqretry(&sig->stats_lock, seq);
 	rcu_read_unlock();
 }
 
@@ -611,9 +624,6 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 
-/*
- * Must be called with siglock held.
- */
 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce81291..b663664 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
-	spin_lock_irq(&current->sighand->siglock);
 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
-	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 3b89464..492b986 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
 		if (same_thread_group(tsk, current))
 			err = cpu_clock_sample(which_clock, tsk, &rtn);
 	} else {
-		unsigned long flags;
-		struct sighand_struct *sighand;
-
-		/*
-		 * while_each_thread() is not yet entirely RCU safe,
-		 * keep locking the group while sampling process
-		 * clock for now.
-		 */
-		sighand = lock_task_sighand(tsk, &flags);
-		if (!sighand)
-			return err;
-
 		if (tsk == current || thread_group_leader(tsk))
 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
-
-		unlock_task_sighand(tsk, &flags);
 	}
 
 	if (!err)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] sched,time: atomically increment stime & utime
  2014-08-15 20:05 [PATCH 0/3] lockless sys_times and posix_cpu_clock_get riel
  2014-08-15 20:05 ` [PATCH 1/3] exit: always reap resource stats in __exit_signal riel
  2014-08-15 20:05 ` [PATCH 2/3] time,signal: protect resource use statistics with seqlock riel
@ 2014-08-15 20:05 ` riel
  2014-08-16 14:55   ` Oleg Nesterov
  2014-09-08  6:40   ` [tip:sched/core] sched, time: Atomically " tip-bot for Rik van Riel
  2014-08-19 21:21 ` [PATCH 0/3] lockless sys_times and posix_cpu_clock_get Andrew Theurer
  3 siblings, 2 replies; 19+ messages in thread
From: riel @ 2014-08-15 20:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: oleg, peterz, umgwanakikbuti, fweisbec, akpm, srao, lwoodman, atheurer

From: Rik van Riel <riel@redhat.com>

The functions task_cputime_adjusted and thread_group_cputime_adjusted
can be called locklessly, as well as concurrently on many different CPUs.

This can occasionally lead to the utime and stime reported by times(), and
other syscalls like it, going backward. The cause appears to be multiple
threads racing in cputime_adjust, each with a utime or stime value that is
larger than the previously stored one, but different from each other.

Sometimes the larger value gets saved first, only to be immediately
overwritten with a smaller value by another thread.

Using atomic exchange prevents that problem, and ensures time
progresses monotonically.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b5f1c58..ab84270 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
 	 * If the tick based count grows faster than the scheduler one,
 	 * the result of the scaling may go backward.
 	 * Let's enforce monotonicity.
+	 * Atomic exchange protects against concurrent cputime_adjust.
 	 */
-	prev->stime = max(prev->stime, stime);
-	prev->utime = max(prev->utime, utime);
+	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
+		cmpxchg(&prev->stime, rtime, stime);
+	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
+		cmpxchg(&prev->utime, rtime, utime);
 
 out:
 	*ut = prev->utime;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-15 20:05 ` [PATCH 2/3] time,signal: protect resource use statistics with seqlock riel
@ 2014-08-16 14:11   ` Oleg Nesterov
  2014-08-16 15:07     ` Rik van Riel
  2014-08-16 17:40     ` [PATCH v2 " Rik van Riel
  0 siblings, 2 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-08-16 14:11 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman, atheurer

On 08/15, riel@redhat.com wrote:
>
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	struct signal_struct *sig = tsk->signal;
>  	cputime_t utime, stime;
>  	struct task_struct *t;
> -
> -	times->utime = sig->utime;
> -	times->stime = sig->stime;
> -	times->sum_exec_runtime = sig->sum_sched_runtime;
> +	unsigned int seq, nextseq;
>
>  	rcu_read_lock();
> -	for_each_thread(tsk, t) {
> -		task_cputime(t, &utime, &stime);
> -		times->utime += utime;
> -		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> -	}
> +	/* Attempt a lockless read on the first round. */
> +	nextseq = 0;
> +	do {
> +		seq = nextseq;
> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> +		times->utime = sig->utime;
> +		times->stime = sig->stime;
> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> +		for_each_thread(tsk, t) {
> +			task_cputime(t, &utime, &stime);
> +			times->utime += utime;
> +			times->stime += stime;
> +			times->sum_exec_runtime += task_sched_runtime(t);
> +		}
> +		/*
> +		 * If a writer is currently active, seq will be odd, and
> +		 * read_seqbegin_or_lock will take the lock.
> +		 */
> +		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> +	} while (need_seqretry(&sig->stats_lock, seq));
> +	done_seqretry(&sig->stats_lock, seq);
>  	rcu_read_unlock();
>  }

Rik, I do not understand why you silently ignored my comments about
this change twice ;)

Please see

	http://marc.info/?l=linux-kernel&m=140802271907396
	http://marc.info/?l=linux-kernel&m=140811486607850

I still do not think that the read_seqbegin_or_lock logic is correct,
in the sense that, unless I missed something, it does not guarantee
forward progress.
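
To make that concrete, here is a minimal sketch of a retry loop that is
bounded (the reader function name is hypothetical; this matches the shape
the v2 patch later in the thread adopts): the first pass is lockless with
an even seq, and any retry forces seq odd, so read_seqbegin_or_lock()
takes the lock and the loop ends after at most two passes.

	static void snapshot_stats(struct signal_struct *sig)
	{
		unsigned int seq, nextseq = 0;	/* even: lockless first pass */

		do {
			seq = nextseq;
			read_seqbegin_or_lock(&sig->stats_lock, &seq);
			/* ... copy the seqlock-protected statistics here ... */
			nextseq = 1;	/* odd: take the lock if we must retry */
		} while (need_seqretry(&sig->stats_lock, seq));
		done_seqretry(&sig->stats_lock, seq);
	}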

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched,time: atomically increment stime & utime
  2014-08-15 20:05 ` [PATCH 3/3] sched,time: atomically increment stime & utime riel
@ 2014-08-16 14:55   ` Oleg Nesterov
  2014-08-16 14:56     ` Oleg Nesterov
  2014-09-08  6:40   ` [tip:sched/core] sched, time: Atomically " tip-bot for Rik van Riel
  1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2014-08-16 14:55 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman, atheurer

On 08/15, riel@redhat.com wrote:
>
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
>  	 * If the tick based count grows faster than the scheduler one,
>  	 * the result of the scaling may go backward.
>  	 * Let's enforce monotonicity.
> +	 * Atomic exchange protects against concurrent cputime_adjust.
>  	 */
> -	prev->stime = max(prev->stime, stime);
> -	prev->utime = max(prev->utime, utime);
> +	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> +		cmpxchg(&prev->stime, rtime, stime);
> +	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> +		cmpxchg(&prev->utime, rtime, utime);
>
>  out:
>  	*ut = prev->utime;

I am still not sure about this change. At least I think it needs some
discussion.

Let me repeat, afaics this can lead to inconsistent results. Just
suppose that the caller of thread_group_cputime_adjusted() gets a long
preemption between thread_group_cputime() and cputime_adjust(), and
the numbers in signal->prev_cputime grow significantly when this task
resumes. If cputime_adjust() sees both prev->stime and prev->utime
updated everything is fine. But we can race with cputime_adjust() on
another CPU and miss, say, the change in ->utime.

IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime
with zeros. Then the caller X is preempted.

Another task does thread_group_cputime(T) and this time task_cputime is
{ .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust()
and sets prev->stime = A_LOT_S.

X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }.

If you think that we do not care, probably I won't argue. But at least
this should be documented/discussed. And if we can tolerate this, then we
can probably simply remove the scale_stime recalculation and change it to
just do

	static void cputime_adjust(struct task_cputime *curr,
				   struct cputime *prev,
				   cputime_t *ut, cputime_t *st)
	{
		cputime_t rtime, stime, utime;
		/*
		 * Let's enforce monotonicity.
		 * Atomic exchange protects against concurrent cputime_adjust.
		 */
		while (stime > (rtime = ACCESS_ONCE(prev->stime)))
			cmpxchg(&prev->stime, rtime, stime);
		while (utime > (rtime = ACCESS_ONCE(prev->utime)))
			cmpxchg(&prev->utime, rtime, utime);

		*ut = prev->utime;
		*st = prev->stime;
	}

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/3] sched,time: atomically increment stime & utime
  2014-08-16 14:55   ` Oleg Nesterov
@ 2014-08-16 14:56     ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2014-08-16 14:56 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman, atheurer

Aaah, Rik, I am sorry!

You documented this in 0/3 which I didn't bother to read.

Sorry for noise.

On 08/16, Oleg Nesterov wrote:
>
> On 08/15, riel@redhat.com wrote:
> >
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
> >  	 * If the tick based count grows faster than the scheduler one,
> >  	 * the result of the scaling may go backward.
> >  	 * Let's enforce monotonicity.
> > +	 * Atomic exchange protects against concurrent cputime_adjust.
> >  	 */
> > -	prev->stime = max(prev->stime, stime);
> > -	prev->utime = max(prev->utime, utime);
> > +	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> > +		cmpxchg(&prev->stime, rtime, stime);
> > +	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> > +		cmpxchg(&prev->utime, rtime, utime);
> >
> >  out:
> >  	*ut = prev->utime;
> 
> I am still not sure about this change. At least I think it needs some
> discussion.
> 
> Let me repeat, afaics this can lead to inconsistent results. Just
> suppose that the caller of thread_group_cputime_adjusted() gets a long
> preemption between thread_group_cputime() and cputime_adjust(), and
> the numbers in signal->prev_cputime grow significantly when this task
> resumes. If cputime_adjust() sees both prev->stime and prev->utime
> updated everything is fine. But we can race with cputime_adjust() on
> another CPU and miss, say, the change in ->utime.
> 
> IOW. To simplify, suppose that thread_group_cputime(T) fills task_cputime
> with zeros. Then the caller X is preempted.
> 
> Another task does thread_group_cputime(T) and this time task_cputime is
> { .utime = A_LOT_U, .stime = A_LOT_S }. This task calls cputime_adjust()
> and sets prev->stime = A_LOT_S.
> 
> X resumes, calls cputime_adjust(), and returns { 0, A_LOT_S }.
> 
> If you think that we do not care, probably I won't argue. But at least
> this should be documented/discussed. And if we can tolerate this, then we
> can probably simply remove the scale_stime recalculation and change it to
> just do
> 
> 	static void cputime_adjust(struct task_cputime *curr,
> 				   struct cputime *prev,
> 				   cputime_t *ut, cputime_t *st)
> 	{
> 		cputime_t rtime, stime, utime;
> 		/*
> 		 * Let's enforce monotonicity.
> 		 * Atomic exchange protects against concurrent cputime_adjust.
> 		 */
> 		while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> 			cmpxchg(&prev->stime, rtime, stime);
> 		while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> 			cmpxchg(&prev->utime, rtime, utime);
> 
> 		*ut = prev->utime;
> 		*st = prev->stime;
> 	}
> 
> Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-16 14:11   ` Oleg Nesterov
@ 2014-08-16 15:07     ` Rik van Riel
  2014-08-16 17:40     ` [PATCH v2 " Rik van Riel
  1 sibling, 0 replies; 19+ messages in thread
From: Rik van Riel @ 2014-08-16 15:07 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman, atheurer

On 08/16/2014 10:11 AM, Oleg Nesterov wrote:

>> +		/*
>> +		 * If a writer is currently active, seq will be odd, and
>> +		 * read_seqbegin_or_lock will take the lock.
>> +		 */
>> +		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
>> +	} while (need_seqretry(&sig->stats_lock, seq));
>> +	done_seqretry(&sig->stats_lock, seq);
>>  	rcu_read_unlock();
>>  }
> 
> Rik, I do not understand why you silently ignored my comments
> about this change twice ;)

I can explain why it appears that way.

I made the fix you suggested in one tree, and did my
"guilt patchbomb" from another tree.  Doh...

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-16 14:11   ` Oleg Nesterov
  2014-08-16 15:07     ` Rik van Riel
@ 2014-08-16 17:40     ` Rik van Riel
  2014-08-16 17:50       ` Oleg Nesterov
  2014-09-08  6:39       ` [tip:sched/core] time, signal: Protect " tip-bot for Rik van Riel
  1 sibling, 2 replies; 19+ messages in thread
From: Rik van Riel @ 2014-08-16 17:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman, atheurer

On Sat, 16 Aug 2014 16:11:59 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> Rik, I do not understand why you silently ignored my comments about
> this change twice ;)

Here is the version I have actually been running my latest
tests with. This one addresses the forward progress issue
you pointed out.

---8<---

Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value while a thread in the
task group exits and its statistics are rolled up into the signal struct,
a window in which the exiting task's statistics could otherwise be
counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  |  4 ++++
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 33 ++++++++++++++++++++-------------
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 --------------
 6 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index b93d46d..fa09b86 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -127,6 +127,7 @@ static void __exit_signal(struct task_struct *tsk)
 	 * the signal_struct.
 	 */
 	task_cputime(tsk, &utime, &stime);
+	write_seqlock(&sig->stats_lock);
 	sig->utime += utime;
 	sig->stime += stime;
 	sig->gtime += task_gtime(tsk);
@@ -140,6 +141,7 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
 	__unhash_process(tsk, group_dead);
+	write_sequnlock(&sig->stats_lock);
 
 	/*
 	 * Do this under ->siglock, we can race with another thread
@@ -1042,6 +1044,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
+		write_seqlock(&psig->stats_lock);
 		psig->cutime += tgutime + sig->cutime;
 		psig->cstime += tgstime + sig->cstime;
 		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1064,6 +1067,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 			psig->cmaxrss = maxrss;
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		write_sequnlock(&psig->stats_lock);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..5d7cf2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
+	seqlock_init(&sig->stats_lock);
 
 	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..49b7cfe 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,28 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	cputime_t utime, stime;
 	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	unsigned int seq, nextseq;
 
 	rcu_read_lock();
-	for_each_thread(tsk, t) {
-		task_cputime(t, &utime, &stime);
-		times->utime += utime;
-		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
-	}
+	/* Attempt a lockless read on the first round. */
+	nextseq = 0;
+	do {
+		seq = nextseq;
+		read_seqbegin_or_lock(&sig->stats_lock, &seq);
+		times->utime = sig->utime;
+		times->stime = sig->stime;
+		times->sum_exec_runtime = sig->sum_sched_runtime;
+
+		for_each_thread(tsk, t) {
+			task_cputime(t, &utime, &stime);
+			times->utime += utime;
+			times->stime += stime;
+			times->sum_exec_runtime += task_sched_runtime(t);
+		}
+		/* If lockless access failed, take the lock. */
+		nextseq = 1;
+	} while (need_seqretry(&sig->stats_lock, seq));
+	done_seqretry(&sig->stats_lock, seq);
 	rcu_read_unlock();
 }
 
@@ -611,9 +621,6 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 
-/*
- * Must be called with siglock held.
- */
 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce81291..b663664 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
-	spin_lock_irq(&current->sighand->siglock);
 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
-	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 3b89464..492b986 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
 		if (same_thread_group(tsk, current))
 			err = cpu_clock_sample(which_clock, tsk, &rtn);
 	} else {
-		unsigned long flags;
-		struct sighand_struct *sighand;
-
-		/*
-		 * while_each_thread() is not yet entirely RCU safe,
-		 * keep locking the group while sampling process
-		 * clock for now.
-		 */
-		sighand = lock_task_sighand(tsk, &flags);
-		if (!sighand)
-			return err;
-
 		if (tsk == current || thread_group_leader(tsk))
 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
-
-		unlock_task_sighand(tsk, &flags);
 	}
 
 	if (!err)

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-16 17:40     ` [PATCH v2 " Rik van Riel
@ 2014-08-16 17:50       ` Oleg Nesterov
  2014-08-18  4:44         ` Mike Galbraith
  2014-09-08  6:39       ` [tip:sched/core] time, signal: Protect " tip-bot for Rik van Riel
  1 sibling, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2014-08-16 17:50 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman, atheurer

On 08/16, Rik van Riel wrote:
>
> +	do {
> +		seq = nextseq;
> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> +		times->utime = sig->utime;
> +		times->stime = sig->stime;
> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> +		for_each_thread(tsk, t) {
> +			task_cputime(t, &utime, &stime);
> +			times->utime += utime;
> +			times->stime += stime;
> +			times->sum_exec_runtime += task_sched_runtime(t);
> +		}
> +		/* If lockless access failed, take the lock. */
> +		nextseq = 1;

Yes, thanks, this answers my concerns.

Cough... can't resist: I still think that we should take rcu_read_lock()
only around for_each_thread(); the patch expands the critical section for
no reason. But this is minor, I won't insist.
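
A minimal sketch of what that narrower scoping might look like (an
illustration of the suggestion, not what the patch does; it reuses the
loop from the v2 patch above):

	do {
		seq = nextseq;
		read_seqbegin_or_lock(&sig->stats_lock, &seq);
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = sig->sum_sched_runtime;

		/* Only the thread-list walk needs the RCU read side. */
		rcu_read_lock();
		for_each_thread(tsk, t) {
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += task_sched_runtime(t);
		}
		rcu_read_unlock();

		/* If the lockless pass raced with a writer, lock next time. */
		nextseq = 1;
	} while (need_seqretry(&sig->stats_lock, seq));
	done_seqretry(&sig->stats_lock, seq);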

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-16 17:50       ` Oleg Nesterov
@ 2014-08-18  4:44         ` Mike Galbraith
  2014-08-18 14:03           ` Rik van Riel
  0 siblings, 1 reply; 19+ messages in thread
From: Mike Galbraith @ 2014-08-18  4:44 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, linux-kernel, peterz, fweisbec, akpm, srao,
	lwoodman, atheurer

On Sat, 2014-08-16 at 19:50 +0200, Oleg Nesterov wrote: 
> On 08/16, Rik van Riel wrote:
> >
> > +	do {
> > +		seq = nextseq;
> > +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> > +		times->utime = sig->utime;
> > +		times->stime = sig->stime;
> > +		times->sum_exec_runtime = sig->sum_sched_runtime;
> > +
> > +		for_each_thread(tsk, t) {
> > +			task_cputime(t, &utime, &stime);
> > +			times->utime += utime;
> > +			times->stime += stime;
> > +			times->sum_exec_runtime += task_sched_runtime(t);
> > +		}
> > +		/* If lockless access failed, take the lock. */
> > +		nextseq = 1;
> 
> Yes, thanks, this answers my concerns.
> 
> Cough... can't resist, and I still think that we should take rcu_read_lock()
> only around for_each_thread() and the patch expands the critical section for
> no reason. But this is minor, I won't insist.

Hm.  Should traversal not also disable preemption to preserve the error
bound Peter mentioned?

-Mike


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-18  4:44         ` Mike Galbraith
@ 2014-08-18 14:03           ` Rik van Riel
  2014-08-19 14:26             ` Mike Galbraith
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-08-18 14:03 UTC (permalink / raw)
  To: Mike Galbraith, Oleg Nesterov
  Cc: linux-kernel, peterz, fweisbec, akpm, srao, lwoodman, atheurer

On 08/18/2014 12:44 AM, Mike Galbraith wrote:
> On Sat, 2014-08-16 at 19:50 +0200, Oleg Nesterov wrote:
>> On 08/16, Rik van Riel wrote:
>>>
>>> +	do {
>>> +		seq = nextseq;
>>> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
>>> +		times->utime = sig->utime;
>>> +		times->stime = sig->stime;
>>> +		times->sum_exec_runtime = sig->sum_sched_runtime;
>>> +
>>> +		for_each_thread(tsk, t) {
>>> +			task_cputime(t, &utime, &stime);
>>> +			times->utime += utime;
>>> +			times->stime += stime;
>>> +			times->sum_exec_runtime += task_sched_runtime(t);
>>> +		}
>>> +		/* If lockless access failed, take the lock. */
>>> +		nextseq = 1;
>>
>> Yes, thanks, this answers my concerns.
>>
>> Cough... can't resist, and I still think that we should take rcu_read_lock()
>> only around for_each_thread() and the patch expands the critical section for
>> no reason. But this is minor, I won't insist.
>
> Hm.  Should traversal not also disable preemption to preserve the error
> bound Peter mentioned?

The second traversal takes the spinlock, which automatically
disables preemption.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/3] time,signal: protect resource use statistics with seqlock
  2014-08-18 14:03           ` Rik van Riel
@ 2014-08-19 14:26             ` Mike Galbraith
  0 siblings, 0 replies; 19+ messages in thread
From: Mike Galbraith @ 2014-08-19 14:26 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Oleg Nesterov, linux-kernel, peterz, fweisbec, akpm, srao,
	lwoodman, atheurer

On Mon, 2014-08-18 at 10:03 -0400, Rik van Riel wrote: 
> On 08/18/2014 12:44 AM, Mike Galbraith wrote:
> > On Sat, 2014-08-16 at 19:50 +0200, Oleg Nesterov wrote:
> >> On 08/16, Rik van Riel wrote:
> >>>
> >>> +	do {
> >>> +		seq = nextseq;
> >>> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> >>> +		times->utime = sig->utime;
> >>> +		times->stime = sig->stime;
> >>> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> >>> +
> >>> +		for_each_thread(tsk, t) {
> >>> +			task_cputime(t, &utime, &stime);
> >>> +			times->utime += utime;
> >>> +			times->stime += stime;
> >>> +			times->sum_exec_runtime += task_sched_runtime(t);
> >>> +		}
> >>> +		/* If lockless access failed, take the lock. */
> >>> +		nextseq = 1;
> >>
> >> Yes, thanks, this answers my concerns.
> >>
> >> Cough... can't resist, and I still think that we should take rcu_read_lock()
> >> only around for_each_thread() and the patch expands the critical section for
> >> no reason. But this is minor, I won't insist.
> >
> > Hm.  Should traversal not also disable preemption to preserve the error
> > bound Peter mentioned?
> 
> The second traversal takes the spinlock, which automatically
> disables preemption.

According to my testing, a PREEMPT kernel can get all the way through
thread_group_cputime() locklessly; preemption can and does happen during
traversal, and the call can then take more than ticks * CPUs (LOTS more
if you get silly), so Peter's bound appears to be toast for PREEMPT.

Not that I really care, mind you; it just seemed that the folks who don't
run zillions of threads would never feel the pain you're alleviating, yet
now get some accuracy loss if running PREEMPT.

BTW, something else that doesn't matter one bit, but that I was curious
about: as noted, clock_gettime() used to use the tasklist_lock, which is
loads better than the siglock, at least on a modest box.  On a 64 core
box with 200 threads, the crusty old 3.0 kernel is faster than patched-up
master, and both configs are NOPREEMPT tune-for-maximum-bloat.

('course what zillion cores + zillion threads does with tasklist_lock
ain't _at all_ pretty, but it doesn't demolish modest boxen)

patched master 
vogelweide:/abuild/mike/:[0]# time ./pound_clock_gettime

real    0m2.953s
user    0m0.036s
sys     3m2.588s
vogelweide:/abuild/mike/:[0]# time ./pound_clock_gettime

real    0m2.930s
user    0m0.076s
sys     3m1.800s
vogelweide:/abuild/mike/:[0]# time ./pound_clock_gettime

real    0m2.988s
user    0m0.052s
sys     3m5.208s

sle11-sp3 (3.0.101)
vogelweide:/abuild/mike/:[0]# time ./pound_clock_gettime

real    0m1.521s
user    0m0.072s
sys     0m8.397s
vogelweide:/abuild/mike/:[0]# time ./pound_clock_gettime

real    0m1.260s
user    0m0.032s
sys     0m6.244s
vogelweide:/abuild/mike/:[0]# time ./pound_clock_gettime

real    0m1.391s
user    0m0.020s
sys     0m7.016s


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] lockless sys_times and posix_cpu_clock_get
  2014-08-15 20:05 [PATCH 0/3] lockless sys_times and posix_cpu_clock_get riel
                   ` (2 preceding siblings ...)
  2014-08-15 20:05 ` [PATCH 3/3] sched,time: atomically increment stime & utime riel
@ 2014-08-19 21:21 ` Andrew Theurer
  2014-09-03 18:38   ` Rik van Riel
  3 siblings, 1 reply; 19+ messages in thread
From: Andrew Theurer @ 2014-08-19 21:21 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, oleg, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman


> Thanks to the feedback from Oleg, Peter, Mike, and Frederic,
> I seem to have a patch series that manages to do times()
> locklessly, and apparently correctly.


> 
> Oleg points out that the monotonicity alone is not enough of a
> guarantee, but that should probably be attacked separately, since
> that issue is equally present with and without these patches...
> 
> The test case below, slightly changed from the one posted by Spencer
> Candland in 2009, now runs in 11 seconds instead of 5 minutes.
> 
> Is it worthwhile?  There apparently are some real workloads that call
> times() a lot, and I believe Sanjay and Andrew have one sitting around.

Thanks for doing this.  When running an OLTP workload in a KVM VM, we saw a 71% increase in performance!  do_sys_times() was a big bottleneck for us.

-Andrew
> 
> --------
> 
> /*
> 
> Based on the test case from the following bug report, but changed
> to measure utime on a per thread basis. (Rik van Riel)
> 
> https://lkml.org/lkml/2009/11/3/522
> 
> From: Spencer Candland
> Subject: utime/stime decreasing on thread exit
> 
> I am seeing a problem with utime/stime decreasing on thread exit in a
> multi-threaded process.  I have been able to track this regression down
> to the "process wide cpu clocks/timers" changes introduces in
> 2.6.29-rc5, specifically when I revert the following commits I know
> longer see decreasing utime/stime values:
> 
> 4da94d49b2ecb0a26e716a8811c3ecc542c2a65d
> 3fccfd67df79c6351a156eb25a7a514e5f39c4d9
> 7d8e23df69820e6be42bcc41d441f4860e8c76f7
> 4cd4c1b40d40447fb5e7ba80746c6d7ba91d7a53
> 32bd671d6cbeda60dc73be77fa2b9037d9a9bfa0
> 
> I poked around a little, but I am afraid I have to admit that I am not
> familiar enough with how this works to resolve this or suggest a fix.
> 
> I have verified this is happening in kernels 2.6.29-rc5 - 2.6.32-rc6. I
> have been testing this on x86 vanilla kernels, but have also verified it
> on several x86 2.6.29+ distro kernels (fedora and ubuntu).
> 
> I first noticed this in a production environment running Apache with the
> worker MPM; however, while tracking this down I put together a simple
> program that has been reliable in showing me utime decreasing. Hopefully
> it will be helpful in demonstrating the issue:
> */
> 
> #include <stdio.h>
> #include <pthread.h>
> #include <sys/times.h>
> 
> #define NUM_THREADS 500
> 
> struct tms start;
> 
> void *pound (void *threadid)
> {
>   struct tms end;
>   int oldutime = 0;
>   int utime;
>   int c, i;
>   for (i = 0; i < 10000; i++) {
> 	  for (c = 0; c < 10000; c++);
> 	  times(&end);
> 	  utime = ((int)end.tms_utime - (int)start.tms_utime);
> 	  if (oldutime > utime) {
> 	    printf("utime decreased, was %d, now %d!\n", oldutime, utime);
> 	  }
> 	  oldutime = utime;
>   }
>   pthread_exit(NULL);
> }
> 
> int main()
> {
>   pthread_t th[NUM_THREADS];
>   long i;
>   times(&start);
>   for (i = 0; i < NUM_THREADS; i++) {
>     pthread_create (&th[i], NULL, pound, (void *)i);
>   }
>   pthread_exit(NULL);
>   return 0;
> }
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] lockless sys_times and posix_cpu_clock_get
  2014-08-19 21:21 ` [PATCH 0/3] lockless sys_times and posix_cpu_clock_get Andrew Theurer
@ 2014-09-03 18:38   ` Rik van Riel
  2014-09-04  7:48     ` Peter Zijlstra
  0 siblings, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2014-09-03 18:38 UTC (permalink / raw)
  To: Andrew Theurer
  Cc: linux-kernel, oleg, peterz, umgwanakikbuti, fweisbec, akpm, srao,
	lwoodman, Peter Zijlstra, Andrew Morton, Ingo Molnar

On 08/19/2014 05:21 PM, Andrew Theurer wrote:
> 
>> Thanks to the feedback from Oleg, Peter, Mike, and Frederic, I
>> seem to have a patch series that manages to do times() 
>> locklessly, and apparently correctly.
> 
> 
>> 
>> Oleg points out that the monotonicity alone is not enough of a 
>> guarantee, but that should probably be attacked separately,
>> since that issue is equally present with and without these
>> patches...
>> 
>> The test case below, slightly changed from the one posted by
>> Spencer Candland in 2009, now runs in 11 seconds instead of 5
>> minutes.
>> 
>> Is it worthwhile?  There apparently are some real workloads that
>> call times() a lot, and I believe Sanjay and Andrew have one
>> sitting around.
> 
> Thanks for doing this.  When running a OLTP workload in a KVM VM,
> we saw a 71% increase in performance!  do_sys_times() was a big
> bottleneck for us.

Thanks Andrew, a 71% performance increase seems like it would be
enough to justify merging these patches...


Peter, Ingo, Andrew,

Do any of you have an objection to these patches?

Which tree should I merge them through?

I am happy to resubmit them against any tree, just let
me know where you want the patches to go.

thanks,

Rik
-- 
All rights reversed

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] lockless sys_times and posix_cpu_clock_get
  2014-09-03 18:38   ` Rik van Riel
@ 2014-09-04  7:48     ` Peter Zijlstra
  0 siblings, 0 replies; 19+ messages in thread
From: Peter Zijlstra @ 2014-09-04  7:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrew Theurer, linux-kernel, oleg, umgwanakikbuti, fweisbec,
	akpm, srao, lwoodman, Ingo Molnar

> Peter, Ingo, Andrew,
> 
> Do any of you have an objection to these patches?
> 
> Which tree should I merge them through?
> 
> I am happy to resubmit them against any tree, just let
> me know where you want the patches to go.

I suppose I can try and stuff them in -tip. I picked up this lot which
seems to still apply fine.

Thanks.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [tip:sched/core] exit: Always reap resource stats in __exit_signal()
  2014-08-15 20:05 ` [PATCH 1/3] exit: always reap resource stats in __exit_signal riel
@ 2014-09-08  6:39   ` tip-bot for Rik van Riel
  0 siblings, 0 replies; 19+ messages in thread
From: tip-bot for Rik van Riel @ 2014-09-08  6:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, guillaume, mschmidt,
	riel, akpm, ionut.m.alexa, tglx, oleg, mhocko, rientjes, lizefan

Commit-ID:  90ed9cbe765ad358b3151a12b8bf889a3cbcd573
Gitweb:     http://git.kernel.org/tip/90ed9cbe765ad358b3151a12b8bf889a3cbcd573
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 15 Aug 2014 16:05:36 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 8 Sep 2014 08:17:00 +0200

exit: Always reap resource stats in __exit_signal()

Oleg pointed out that wait_task_zombie adds a task's usage statistics
to the parent's signal struct, but the statistics should also be
propagated into the task's own signal struct at exit time.

This allows thread_group_cputime(reaped_zombie) to get the statistics
after __unhash_process() has made the task invisible to for_each_thread,
but before the thread has actually been rcu freed, making sure no
non-monotonic results are returned inside that window.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/1408133138-22048-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/exit.c | 43 +++++++++++++++++++++----------------------
 1 file changed, 21 insertions(+), 22 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..b93d46d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -115,30 +115,29 @@ static void __exit_signal(struct task_struct *tsk)
 
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
-		/*
-		 * Accumulate here the counters for all threads but the
-		 * group leader as they die, so they can be added into
-		 * the process-wide totals when those are taken.
-		 * The group leader stays around as a zombie as long
-		 * as there are other threads.  When it gets reaped,
-		 * the exit.c code will add its counts into these totals.
-		 * We won't ever get here for the group leader, since it
-		 * will have been the last reference on the signal_struct.
-		 */
-		task_cputime(tsk, &utime, &stime);
-		sig->utime += utime;
-		sig->stime += stime;
-		sig->gtime += task_gtime(tsk);
-		sig->min_flt += tsk->min_flt;
-		sig->maj_flt += tsk->maj_flt;
-		sig->nvcsw += tsk->nvcsw;
-		sig->nivcsw += tsk->nivcsw;
-		sig->inblock += task_io_get_inblock(tsk);
-		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
-		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}
 
+	/*
+	 * Accumulate here the counters for all threads but the group leader
+	 * as they die, so they can be added into the process-wide totals
+	 * when those are taken.  The group leader stays around as a zombie as
+	 * long as there are other threads.  When it gets reaped, the exit.c
+	 * code will add its counts into these totals.  We won't ever get here
+	 * for the group leader, since it will have been the last reference on
+	 * the signal_struct.
+	 */
+	task_cputime(tsk, &utime, &stime);
+	sig->utime += utime;
+	sig->stime += stime;
+	sig->gtime += task_gtime(tsk);
+	sig->min_flt += tsk->min_flt;
+	sig->maj_flt += tsk->maj_flt;
+	sig->nvcsw += tsk->nvcsw;
+	sig->nivcsw += tsk->nivcsw;
+	sig->inblock += task_io_get_inblock(tsk);
+	sig->oublock += task_io_get_oublock(tsk);
+	task_io_accounting_add(&sig->ioac, &tsk->ioac);
+	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
 	__unhash_process(tsk, group_dead);
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [tip:sched/core] time, signal: Protect resource use statistics with seqlock
  2014-08-16 17:40     ` [PATCH v2 " Rik van Riel
  2014-08-16 17:50       ` Oleg Nesterov
@ 2014-09-08  6:39       ` tip-bot for Rik van Riel
  1 sibling, 0 replies; 19+ messages in thread
From: tip-bot for Rik van Riel @ 2014-09-08  6:39 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, peterz, guillaume, mschmidt, riel, akpm, oleg,
	tglx, vdavydov, yangds.fnst, rientjes, linux-kernel, hpa,
	daeseok.youn, athorlton, geert, keescook, ionut.m.alexa, mhocko,
	lizefan

Commit-ID:  e78c3496790ee8a36522a838b59b388e8a709e65
Gitweb:     http://git.kernel.org/tip/e78c3496790ee8a36522a838b59b388e8a709e65
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Sat, 16 Aug 2014 13:40:10 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 8 Sep 2014 08:17:01 +0200

time, signal: Protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value while a thread in the
task group exits and its statistics are rolled up into the signal struct,
a window in which the exiting task's statistics could otherwise be
counted twice (or not at all).

Protecting that with a lock results in times() and clock_gettime() being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime(), slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task() things can be simplified a
lot, because the calling function already ensures that the task sticks
around, and the rest is now taken care of in thread_group_cputime().

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guillaume Morin <guillaume@morinfr.org>
Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Michal Schmidt <mschmidt@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  |  4 ++++
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 33 ++++++++++++++++++++-------------
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 --------------
 6 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c2c885..dd9eb48 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -645,6 +645,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index b93d46d..fa09b86 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -127,6 +127,7 @@ static void __exit_signal(struct task_struct *tsk)
 	 * the signal_struct.
 	 */
 	task_cputime(tsk, &utime, &stime);
+	write_seqlock(&sig->stats_lock);
 	sig->utime += utime;
 	sig->stime += stime;
 	sig->gtime += task_gtime(tsk);
@@ -140,6 +141,7 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
 	__unhash_process(tsk, group_dead);
+	write_sequnlock(&sig->stats_lock);
 
 	/*
 	 * Do this under ->siglock, we can race with another thread
@@ -1042,6 +1044,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
+		write_seqlock(&psig->stats_lock);
 		psig->cutime += tgutime + sig->cutime;
 		psig->cstime += tgstime + sig->cstime;
 		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1064,6 +1067,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 			psig->cmaxrss = maxrss;
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		write_sequnlock(&psig->stats_lock);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..9387ae8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
+	seqlock_init(&sig->stats_lock);
 
 	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..49b7cfe 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,28 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	cputime_t utime, stime;
 	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	unsigned int seq, nextseq;
 
 	rcu_read_lock();
-	for_each_thread(tsk, t) {
-		task_cputime(t, &utime, &stime);
-		times->utime += utime;
-		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
-	}
+	/* Attempt a lockless read on the first round. */
+	nextseq = 0;
+	do {
+		seq = nextseq;
+		read_seqbegin_or_lock(&sig->stats_lock, &seq);
+		times->utime = sig->utime;
+		times->stime = sig->stime;
+		times->sum_exec_runtime = sig->sum_sched_runtime;
+
+		for_each_thread(tsk, t) {
+			task_cputime(t, &utime, &stime);
+			times->utime += utime;
+			times->stime += stime;
+			times->sum_exec_runtime += task_sched_runtime(t);
+		}
+		/* If lockless access failed, take the lock. */
+		nextseq = 1;
+	} while (need_seqretry(&sig->stats_lock, seq));
+	done_seqretry(&sig->stats_lock, seq);
 	rcu_read_unlock();
 }
 
@@ -611,9 +621,6 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 
-/*
- * Must be called with siglock held.
- */
 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce81291..b663664 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
-	spin_lock_irq(&current->sighand->siglock);
 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
-	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 3b89464..492b986 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
 		if (same_thread_group(tsk, current))
 			err = cpu_clock_sample(which_clock, tsk, &rtn);
 	} else {
-		unsigned long flags;
-		struct sighand_struct *sighand;
-
-		/*
-		 * while_each_thread() is not yet entirely RCU safe,
-		 * keep locking the group while sampling process
-		 * clock for now.
-		 */
-		sighand = lock_task_sighand(tsk, &flags);
-		if (!sighand)
-			return err;
-
 		if (tsk == current || thread_group_leader(tsk))
 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
-
-		unlock_task_sighand(tsk, &flags);
 	}
 
 	if (!err)

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [tip:sched/core] sched, time: Atomically increment stime & utime
  2014-08-15 20:05 ` [PATCH 3/3] sched,time: atomically increment stime & utime riel
  2014-08-16 14:55   ` Oleg Nesterov
@ 2014-09-08  6:40   ` tip-bot for Rik van Riel
  1 sibling, 0 replies; 19+ messages in thread
From: tip-bot for Rik van Riel @ 2014-09-08  6:40 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, riel, hpa, mingo, torvalds, peterz, tglx

Commit-ID:  eb1b4af0a64ac7bb0ee36f579c1c7cefcbc3ac2c
Gitweb:     http://git.kernel.org/tip/eb1b4af0a64ac7bb0ee36f579c1c7cefcbc3ac2c
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 15 Aug 2014 16:05:38 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 8 Sep 2014 08:17:02 +0200

sched, time: Atomically increment stime & utime

The functions task_cputime_adjusted() and thread_group_cputime_adjusted()
can be called locklessly, as well as concurrently on many different CPUs.

This can occasionally lead to the utime and stime reported by times(), and
other syscalls like it, going backward. The cause appears to be multiple
threads racing in cputime_adjust(), each with a utime or stime value that
is larger than the previously stored one, but different from the others.

Sometimes the larger value gets saved first, only to be immediately
overwritten with a smaller value by another thread.

Using atomic exchange prevents that problem, and ensures time
progresses monotonically.
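
[Note: the hunk below is an instance of the classic compare-and-exchange
"monotonic max" loop.  A rough userspace equivalent using C11 atomics,
with an invented function name, purely for illustration; it is not the
kernel code, which uses ACCESS_ONCE() and cmpxchg() on cputime_t.]

#include <stdatomic.h>
#include <stdint.h>

/*
 * Advance *prev to at least val, never letting the stored value move
 * backward, even when several threads race on the update.
 */
static void advance_monotonic(_Atomic uint64_t *prev, uint64_t val)
{
	uint64_t old = atomic_load(prev);

	while (val > old) {
		/*
		 * On failure, 'old' is refreshed with the current value
		 * and the loop re-checks it; another thread may already
		 * have stored something at least as large.
		 */
		if (atomic_compare_exchange_weak(prev, &old, val))
			break;
	}
}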

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Cc: fweisbec@gmail.com
Cc: akpm@linux-foundation.org
Cc: srao@redhat.com
Cc: lwoodman@redhat.com
Cc: atheurer@redhat.com
Cc: oleg@redhat.com
Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 49b7cfe..2b57031 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -602,9 +602,12 @@ static void cputime_adjust(struct task_cputime *curr,
 	 * If the tick based count grows faster than the scheduler one,
 	 * the result of the scaling may go backward.
 	 * Let's enforce monotonicity.
+	 * Atomic exchange protects against concurrent cputime_adjust().
 	 */
-	prev->stime = max(prev->stime, stime);
-	prev->utime = max(prev->utime, utime);
+	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
+		cmpxchg(&prev->stime, rtime, stime);
+	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
+		cmpxchg(&prev->utime, rtime, utime);
 
 out:
 	*ut = prev->utime;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2014-09-08  6:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-15 20:05 [PATCH 0/3] lockless sys_times and posix_cpu_clock_get riel
2014-08-15 20:05 ` [PATCH 1/3] exit: always reap resource stats in __exit_signal riel
2014-09-08  6:39   ` [tip:sched/core] exit: Always reap resource stats in __exit_signal() tip-bot for Rik van Riel
2014-08-15 20:05 ` [PATCH 2/3] time,signal: protect resource use statistics with seqlock riel
2014-08-16 14:11   ` Oleg Nesterov
2014-08-16 15:07     ` Rik van Riel
2014-08-16 17:40     ` [PATCH v2 " Rik van Riel
2014-08-16 17:50       ` Oleg Nesterov
2014-08-18  4:44         ` Mike Galbraith
2014-08-18 14:03           ` Rik van Riel
2014-08-19 14:26             ` Mike Galbraith
2014-09-08  6:39       ` [tip:sched/core] time, signal: Protect " tip-bot for Rik van Riel
2014-08-15 20:05 ` [PATCH 3/3] sched,time: atomically increment stime & utime riel
2014-08-16 14:55   ` Oleg Nesterov
2014-08-16 14:56     ` Oleg Nesterov
2014-09-08  6:40   ` [tip:sched/core] sched, time: Atomically " tip-bot for Rik van Riel
2014-08-19 21:21 ` [PATCH 0/3] lockless sys_times and posix_cpu_clock_get Andrew Theurer
2014-09-03 18:38   ` Rik van Riel
2014-09-04  7:48     ` Peter Zijlstra
