* [PATCH RFC] time: drop do_sys_times spinlock
@ 2014-08-12 18:25 Rik van Riel
  2014-08-12 19:12 ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-12 18:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Oleg Nesterov, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

Back in 2009, Spencer Candland pointed out there is a race with
do_sys_times, where multiple threads calling do_sys_times can
sometimes get decreasing results.

https://lkml.org/lkml/2009/11/3/522

As a result of that discussion, some of the code in do_sys_times
was moved under a spinlock.

However, that does not seem to actually make the race go away on
larger systems. One obvious remaining race is that just as one thread
is about to return from do_sys_times, it is preempted by another
thread, which also runs do_sys_times and stores a larger value in
the shared variable than what the first thread got.

This race is on the kernel/userspace boundary, and not fixable
with spinlocks.

Removing the spinlock from do_sys_times does not seem to result
in an increase in the number of times a decreasing utime is
observed when running the test case. In fact, on the 80 CPU test
system that I tried, I saw a small decrease, from an average of
14.8 down to 6.5 instances of backwards utime when running the
test case.
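
For illustration, here is a minimal sketch of what such a monotonicity
test can look like. This is not the exact test case used for the
numbers above; the thread count, iteration count and busy loop are
assumptions. Each thread calls times() and compares the result against
the largest utime any thread has stored in a shared variable so far.

#include <pthread.h>
#include <stdio.h>
#include <sys/times.h>

#define NTHREADS 80
#define ITERS    100000

static clock_t highest;		/* largest utime seen by any thread */
static int backwards;		/* how often utime appeared to go backwards */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
	volatile unsigned long burn = 0;
	struct tms t;
	int i;

	(void)arg;
	for (i = 0; i < ITERS; i++) {
		burn += i;			/* accumulate some user time */
		times(&t);
		pthread_mutex_lock(&lock);
		if (t.tms_utime < highest)
			backwards++;		/* utime went backwards */
		else
			highest = t.tms_utime;
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("utime went backwards %d times\n", backwards);
	return 0;
}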

Back in 2009, in changeset 2b5fe6de5 Oleg Nesterov already found
that it should be safe to remove the spinlock.  I believe this is
true, because it appears that nobody changes another task's ->sighand
pointer, except at fork time and exit time, during which the task
cannot be in do_sys_times.

This is subtle enough to warrant documenting.

The increased scalability of removing the spinlock should help
things like databases and middleware that measure the resource
use of every query processed.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Frank Mayhar <fmayhar@google.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sanjay Rao <srao@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sys.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sys.c b/kernel/sys.c
index 66a751e..cb81ce4 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -862,11 +862,15 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
-	spin_lock_irq(&current->sighand->siglock);
+	/*
+	 * sys_times gets away with not locking &current->sighand->siglock
+	 * because most of the time only the current process gets to change
+	 * its own sighand pointer. The exception is exit, which changes
+	 * the sighand pointer of an exiting process.
+	 */
 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
-	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);

* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-12 18:25 [PATCH RFC] time: drop do_sys_times spinlock Rik van Riel
@ 2014-08-12 19:12 ` Oleg Nesterov
  2014-08-12 19:22   ` Rik van Riel
                     ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-12 19:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/12, Rik van Riel wrote:
>
> Back in 2009, Spencer Candland pointed out there is a race with
> do_sys_times, where multiple threads calling do_sys_times can
> sometimes get decreasing results.
>
> https://lkml.org/lkml/2009/11/3/522
>
> As a result of that discussion, some of the code in do_sys_times
> was moved under a spinlock.
>
> However, that does not seem to actually make the race go away on
> larger systems. One obvious remaining race is that after one thread
> is about to return from do_sys_times, it is preempted by another
> thread, which also runs do_sys_times, and stores a larger value in
> the shared variable than what the first thread got.
>
> This race is on the kernel/userspace boundary, and not fixable
> with spinlocks.

Not sure I understand...

Afaics, the problem is that a single thread can observe the decreasing
(say) sum_exec_runtime if it calls do_sys_times() twice without the lock.

This is because it can account the exiting sub-thread twice if it races
with __exit_signal() which increments sig->sum_sched_runtime, but this
exiting thread can still be visible to thread_group_cputime().

IOW, it is not actually about decreasing, the problem is that the lockless
thread_group_cputime() can return the wrong result, and the next sys_times()
can show the right value.

> Back in 2009, in changeset 2b5fe6de5 Oleg Nesterov already found
> that it should be safe to remove the spinlock.

Yes, it is safe, but only in the sense that for_each_thread() is fine lockless.
So this change was reverted.

Oleg.


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-12 19:12 ` Oleg Nesterov
@ 2014-08-12 19:22   ` Rik van Riel
  2014-08-12 22:27   ` Rik van Riel
  2014-08-13  6:59   ` Mike Galbraith
  2 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2014-08-12 19:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/12/2014 03:12 PM, Oleg Nesterov wrote:
> On 08/12, Rik van Riel wrote:
>> 
>> Back in 2009, Spencer Candland pointed out there is a race with 
>> do_sys_times, where multiple threads calling do_sys_times can 
>> sometimes get decreasing results.
>> 
>> https://lkml.org/lkml/2009/11/3/522
>> 
>> As a result of that discussion, some of the code in do_sys_times 
>> was moved under a spinlock.
>> 
>> However, that does not seem to actually make the race go away on 
>> larger systems. One obvious remaining race is that after one
>> thread is about to return from do_sys_times, it is preempted by
>> another thread, which also runs do_sys_times, and stores a larger
>> value in the shared variable than what the first thread got.
>> 
>> This race is on the kernel/userspace boundary, and not fixable 
>> with spinlocks.
> 
> Not sure I understand...
> 
> Afaics, the problem is that a single thread can observe the
> decreasing (say) sum_exec_runtime if it calls do_sys_times() twice
> without the lock.
> 
> This is because it can account the exiting sub-thread twice if it
> races with __exit_signal() which increments sig->sum_sched_runtime,
> but this exiting thread can still be visible to
> thread_group_cputime().
> 
> IOW, it is not actually about decreasing, the problem is that the
> lockless thread_group_cputime() can return the wrong result, and
> the next sys_times() can show the right value.

Hmmm, that is not what the test case does.

The test case simply calls times() once in each thread, and saves
the value in a global variable for the next thread to use.

Does the seq_lock in task_cputime() prevent the problem you are
describing, or does the exit/zombie reaping code need to block the
seq_lock while it moves the stats from the zombie to the group?

--
All rights reversed

* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-12 19:12 ` Oleg Nesterov
  2014-08-12 19:22   ` Rik van Riel
@ 2014-08-12 22:27   ` Rik van Riel
  2014-08-13 17:22     ` Oleg Nesterov
  2014-08-13  6:59   ` Mike Galbraith
  2 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-12 22:27 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/12/2014 03:12 PM, Oleg Nesterov wrote:

> Afaics, the problem is that a single thread can observe the
> decreasing (say) sum_exec_runtime if it calls do_sys_times() twice
> without the lock.
> 
> This is because it can account the exiting sub-thread twice if it
> races with __exit_signal() which increments sig->sum_sched_runtime,
> but this exiting thread can still be visible to
> thread_group_cputime().
> 
> IOW, it is not actually about decreasing, the problem is that the
> lockless thread_group_cputime() can return the wrong result, and
> the next sys_times() can show the right value.

You are right, changing the test case to call times() many
times in a row in each thread can result in the wrong value
being returned.

Not entirely sure what I can do there...

Replacing the spinlock with a seqlock, and taking it for
write in most places is pretty gross, and may lead to other
issues like reader livelock when there is a lot of write
activity.

Having a seqlock just for the stats?  Not sure the calls
to times() are a big enough issue for most workloads to
justify that...

Any other ideas?

--
All rights reversed

* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-12 19:12 ` Oleg Nesterov
  2014-08-12 19:22   ` Rik van Riel
  2014-08-12 22:27   ` Rik van Riel
@ 2014-08-13  6:59   ` Mike Galbraith
  2014-08-13 11:11     ` Peter Zijlstra
  2 siblings, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2014-08-13  6:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, linux-kernel, Peter Zijlstra, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On Tue, 2014-08-12 at 21:12 +0200, Oleg Nesterov wrote: 
> On 08/12, Rik van Riel wrote:
> >
> > Back in 2009, Spencer Candland pointed out there is a race with
> > do_sys_times, where multiple threads calling do_sys_times can
> > sometimes get decreasing results.
> >
> > https://lkml.org/lkml/2009/11/3/522
> >
> > As a result of that discussion, some of the code in do_sys_times
> > was moved under a spinlock.
> >
> > However, that does not seem to actually make the race go away on
> > larger systems. One obvious remaining race is that after one thread
> > is about to return from do_sys_times, it is preempted by another
> > thread, which also runs do_sys_times, and stores a larger value in
> > the shared variable than what the first thread got.
> >
> > This race is on the kernel/userspace boundary, and not fixable
> > with spinlocks.
> 
> Not sure I understand...
> 
> Afaics, the problem is that a single thread can observe the decreasing
> (say) sum_exec_runtime if it calls do_sys_times() twice without the lock.
> 
> This is because it can account the exiting sub-thread twice if it races
> with __exit_signal() which increments sig->sum_sched_runtime, but this
> exiting thread can still be visible to thread_group_cputime().
> 
> IOW, it is not actually about decreasing, the problem is that the lockless
> thread_group_cputime() can return the wrong result, and the next sys_times()
> can show the right value.
> 
> > Back in 2009, in changeset 2b5fe6de5 Oleg Nesterov already found
> > that it should be safe to remove the spinlock.
> 
> Yes, it is safe but only in a sense that for_each_thread() is fine lockless.
> So this change was reverted.

Funny that thread_group_cputime() should come up just now..

Could you take tasklist_lock ala posix_cpu_clock_get_task()?  If so,
would that improve things at all?

I was told that clock_gettime(CLOCK_PROCESS_CPUTIME_ID) has scalability
issues on BIG boxen, but perhaps less so than times()?

I'm sure the real clock_gettime() using proggy that gummed up a ~1200
core box for "a while" wasn't the testcase below, which will gum it up
for a long while, but looks to me like using CLOCK_PROCESS_CPUTIME_ID
from LOTS of threads is a "Don't do that, it'll hurt a LOT".

#include <sys/time.h>
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int
main(int argc, char **argv){
  struct timeval tv;
  struct timespec tp;
  int rc;
  int i;

  MPI_Init(&argc, &argv);
  for(i=0;i<100000;i++){
    rc = gettimeofday(&tv, NULL);
    if(rc < 0) perror("gettimeofday");
    rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &tp);
    if(rc < 0) perror("clock_gettime");
  }
  MPI_Finalize();
  return 0;
}




* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13  6:59   ` Mike Galbraith
@ 2014-08-13 11:11     ` Peter Zijlstra
  2014-08-13 13:24       ` Rik van Riel
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-08-13 11:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Oleg Nesterov, Rik van Riel, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On Wed, Aug 13, 2014 at 08:59:50AM +0200, Mike Galbraith wrote:
 
> I was told that clock_gettime(CLOCK_PROCESS_CPUTIME_ID) has scalability
> issues on BIG boxen

> I'm sure the real clock_gettime() using proggy that gummed up a ~1200
> core box for "a while" wasn't the testcase below, which will gum it up
> for a long while, but looks to me like using CLOCK_PROCESS_CPUTIME_ID
> from LOTS of threads is a "Don't do that, it'll hurt a LOT".

Yes, don't do that. It's unavoidably slow and bad.


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 11:11     ` Peter Zijlstra
@ 2014-08-13 13:24       ` Rik van Riel
  2014-08-13 13:39         ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-13 13:24 UTC (permalink / raw)
  To: Peter Zijlstra, Mike Galbraith
  Cc: Oleg Nesterov, linux-kernel, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13/2014 07:11 AM, Peter Zijlstra wrote:
> On Wed, Aug 13, 2014 at 08:59:50AM +0200, Mike Galbraith wrote:
> 
>> I was told that clock_gettime(CLOCK_PROCESS_CPUTIME_ID) has
>> scalability issues on BIG boxen
> 
>> I'm sure the real clock_gettime() using proggy that gummed up a
>> ~1200 core box for "a while" wasn't the testcase below, which
>> will gum it up for a long while, but looks to me like using
>> CLOCK_PROCESS_CPUTIME_ID from LOTS of threads is a "Don't do
>> that, it'll hurt a LOT".
> 
> Yes, don't do that. Its unavoidably slow and bad.

I don't see why that needs the tasklist_lock, when do_sys_times
grabs a different lock.

If the same bottleneck exists from multiple places, maybe it does
make sense to have a seqlock for the statistics at the sighand
level?

I can code up a patch that does that, and throw it over the wall
to people with big systems who hit that bottleneck on a regular
basis...

--
All rights reversed

* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 13:24       ` Rik van Riel
@ 2014-08-13 13:39         ` Peter Zijlstra
  2014-08-13 14:09           ` Mike Galbraith
  0 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-08-13 13:39 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Mike Galbraith, Oleg Nesterov, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On Wed, Aug 13, 2014 at 09:24:06AM -0400, Rik van Riel wrote:
> On 08/13/2014 07:11 AM, Peter Zijlstra wrote:
> > On Wed, Aug 13, 2014 at 08:59:50AM +0200, Mike Galbraith wrote:
> > 
> >> I was told that clock_gettime(CLOCK_PROCESS_CPUTIME_ID) has
> >> scalability issues on BIG boxen
> > 
> >> I'm sure the real clock_gettime() using proggy that gummed up a
> >> ~1200 core box for "a while" wasn't the testcase below, which
> >> will gum it up for a long while, but looks to me like using
> >> CLOCK_PROCESS_CPUTIME_ID from LOTS of threads is a "Don't do
> >> that, it'll hurt a LOT".
> > 
> > Yes, don't do that. Its unavoidably slow and bad.
> 
> I don't see why that needs the tasklist_lock, when do_sys_times
> grabs a different lock.
> 
> If the same bottleneck exists from multiple places, maybe it does
> make sense to have a seqlock for the statistics at the sighand
> level?
> 
> I can code up a patch that does that, and throw it over the wall
> to people with big systems who hit that bottleneck on a regular
> basis...

PROCESS_CPUTIME doesn't need tasklist lock; it only takes the sighand
lock. It needs that to stabilize the thread list; you cannot give a
straight answer if threads are coming/going.

It further needs to take the rq->lock for any active task in the thread
group.

Combined it's painful; and it being painful should be no surprise to
anybody, seeing how it's basically a 'global' property -- the more CPUs
you stick in a machine, the more expensive those become.


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 13:39         ` Peter Zijlstra
@ 2014-08-13 14:09           ` Mike Galbraith
  0 siblings, 0 replies; 49+ messages in thread
From: Mike Galbraith @ 2014-08-13 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rik van Riel, Oleg Nesterov, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On Wed, 2014-08-13 at 15:39 +0200, Peter Zijlstra wrote:

> PROCESS_CPUTIME doesn't need tasklist lock;.

Oops, sorry, didn't notice that that had changed.

(not that it looks less painful, just different flavor of agony)

-Mike


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-12 22:27   ` Rik van Riel
@ 2014-08-13 17:22     ` Oleg Nesterov
  2014-08-13 17:35       ` Rik van Riel
  2014-08-13 17:40       ` [PATCH RFC] time: drop do_sys_times spinlock Peter Zijlstra
  0 siblings, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-13 17:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/12, Rik van Riel wrote:
>
> Any other ideas?

To simplify, let's suppose that we only need sum_exec_runtime.

Perhaps we can do something like this

	u64 thread_group_sched_runtime(void)
	{
		struct task_struct *tsk = current;
		spinlock_t *siglock = &tsk->sighand->siglock; /* stable */
		struct task_struct *t;
		u64 x1, x2;

	retry:
		x1 = tsk->signal->sum_sched_runtime;
		rmb();
		spin_unlock_wait(siglock);
		rmb();

		x2 = 0;
		rcu_read_lock();
		for_each_thread(tsk, t)
			x2 += task_sched_runtime(t);
		rcu_read_unlock();

		rmb();
		spin_unlock_wait(siglock);
		rmb();

		if (x1 != tsk->signal->sum_sched_runtime)
			goto retry;

		return x1 + x2;
	}

?

We do not care if for_each_thread() misses the new thread, we can pretend
thread_group_sched_runtime() was called before clone.

We do not care if a thread with sum_sched_runtime == 0 exits, obviously.

Otherwise "x1 != tsk->signal->sum_sched_runtime" should tell us that we
raced with __exit_signal().

Oleg.


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 17:22     ` Oleg Nesterov
@ 2014-08-13 17:35       ` Rik van Riel
  2014-08-13 18:08         ` Oleg Nesterov
  2014-08-13 17:40       ` [PATCH RFC] time: drop do_sys_times spinlock Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-13 17:35 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On Wed, 13 Aug 2014 19:22:30 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> On 08/12, Rik van Riel wrote:
> >
> > Any other ideas?
> 
> To simplify, lets suppose that we only need sum_exec_runtime.
> 
> Perhaps we can do something like this

That would probably work, indeed.

However, it turns out that a seqcount doesn't look too badly either.

The following patch has only been compile tested so far, I am about to
give it a real test.

I believe k_getrusage can probably be changed in the same way.

---8<---

Subject: time,signal: protect cpu use statistics with seqcount

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqcount around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  |  4 ++++
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++++++++++++++++++++---------------
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c |  9 ++++-----
 6 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..5670d33 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -461,6 +461,7 @@ struct sighand_struct {
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
 	spinlock_t		siglock;
+	seqcount_t		stats_seq; /* write nests inside spinlock */
 	wait_queue_head_t	signalfd_wqh;
 };
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..019c263 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -126,6 +126,7 @@ static void __exit_signal(struct task_struct *tsk)
 		 * will have been the last reference on the signal_struct.
 		 */
 		task_cputime(tsk, &utime, &stime);
+		write_seqcount_begin(&sighand->stats_seq);
 		sig->utime += utime;
 		sig->stime += stime;
 		sig->gtime += task_gtime(tsk);
@@ -137,6 +138,7 @@ static void __exit_signal(struct task_struct *tsk)
 		sig->oublock += task_io_get_oublock(tsk);
 		task_io_accounting_add(&sig->ioac, &tsk->ioac);
 		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
+		write_seqcount_end(&sighand->stats_seq);
 	}
 
 	sig->nr_threads--;
@@ -1041,6 +1043,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		 */
 		thread_group_cputime_adjusted(p, &tgutime, &tgstime);
 		spin_lock_irq(&p->real_parent->sighand->siglock);
+		write_seqcount_begin(&p->real_parent->sighand->stats_seq);
 		psig = p->real_parent->signal;
 		sig = p->signal;
 		psig->cutime += tgutime + sig->cutime;
@@ -1065,6 +1068,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 			psig->cmaxrss = maxrss;
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		write_seqcount_end(&p->real_parent->sighand->stats_seq);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..4681694 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1749,6 +1749,7 @@ static void sighand_ctor(void *data)
 	struct sighand_struct *sighand = data;
 
 	spin_lock_init(&sighand->siglock);
+	seqcount_init(&sighand->stats_seq);
 	init_waitqueue_head(&sighand->signalfd_wqh);
 }
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 72fdf06..370fd67 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -286,25 +286,34 @@ static __always_inline bool steal_account_process_tick(void)
 void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 {
 	struct signal_struct *sig = tsk->signal;
+	struct sighand_struct *sighand;
 	cputime_t utime, stime;
 	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	int seq;
 
 	rcu_read_lock();
-	/* make sure we can trust tsk->thread_group list */
-	if (!likely(pid_alive(tsk)))
+	sighand = rcu_dereference(tsk->sighand);
+	if (unlikely(!sighand))
 		goto out;
 
-	t = tsk;
 	do {
-		task_cputime(t, &utime, &stime);
-		times->utime += utime;
-		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
-	} while_each_thread(tsk, t);
+		seq = read_seqcount_begin(&sighand->stats_seq);
+		times->utime = sig->utime;
+		times->stime = sig->stime;
+		times->sum_exec_runtime = sig->sum_sched_runtime;
+
+		/* make sure we can trust tsk->thread_group list */
+		if (!likely(pid_alive(tsk)))
+			goto out;
+
+		t = tsk;
+		do {
+			task_cputime(t, &utime, &stime);
+			times->utime += utime;
+			times->stime += stime;
+			times->sum_exec_runtime += task_sched_runtime(t);
+		} while_each_thread(tsk, t);
+	} while (read_seqcount_retry(&sighand->stats_seq, seq));
 out:
 	rcu_read_unlock();
 }
@@ -617,9 +626,6 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 
-/*
- * Must be called with siglock held.
- */
 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce81291..b663664 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
-	spin_lock_irq(&current->sighand->siglock);
 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
-	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 3b89464..1bde818 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -781,14 +781,14 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
 		cpu_clock_sample(timer->it_clock, p, &now);
 	} else {
 		struct sighand_struct *sighand;
-		unsigned long flags;
 
 		/*
 		 * Protect against sighand release/switch in exit/exec and
 		 * also make timer sampling safe if it ends up calling
 		 * thread_group_cputime().
 		 */
-		sighand = lock_task_sighand(p, &flags);
+		rcu_read_lock();
+		sighand = rcu_dereference(p->sighand);
 		if (unlikely(sighand == NULL)) {
 			/*
 			 * The process has been reaped.
@@ -798,10 +798,9 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
 			timer->it.cpu.expires = 0;
 			sample_to_timespec(timer->it_clock, timer->it.cpu.expires,
 					   &itp->it_value);
-		} else {
+		} else
 			cpu_timer_sample_group(timer->it_clock, p, &now);
-			unlock_task_sighand(p, &flags);
-		}
+		rcu_read_unlock();
 	}
 
 	if (now < timer->it.cpu.expires) {

* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 17:22     ` Oleg Nesterov
  2014-08-13 17:35       ` Rik van Riel
@ 2014-08-13 17:40       ` Peter Zijlstra
  2014-08-13 17:50         ` Rik van Riel
  1 sibling, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2014-08-13 17:40 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, linux-kernel, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On Wed, Aug 13, 2014 at 07:22:30PM +0200, Oleg Nesterov wrote:
> On 08/12, Rik van Riel wrote:
> >
> > Any other ideas?
> 
> To simplify, lets suppose that we only need sum_exec_runtime.
> 
> Perhaps we can do something like this
> 
> 	u64 thread_group_sched_runtime(void)
> 	{
> 		struct task_struct *tsk = current;
> 		spinlock_t *siglock = &tsk->sighand->siglock; /* stable */
> 		struct task_struct *t;
> 		u64 x1, x2;
> 
> 	retry:
> 		x1 = tsk->signal->sum_sched_runtime;
> 		rmb();
> 		spin_unlock_wait(siglock);
> 		rmb();
> 
> 		x2 = 0;
> 		rcu_read_lock();
> 		for_each_thread(tsk, t)
> 			x2 += task_sched_runtime(t);
> 		rcu_read_unlock();
> 
> 		rmb();
> 		spin_unlock_wait(siglock);
> 		rmb();
> 
> 		if (x1 != tsk->signal->sum_sched_runtime)
> 			goto retry;
> 
> 		return x1 + x2;
> 	}
> 
> ?
> 
> We do not care if for_each_thread() misses the new thread, we can pretend
> thread_group_sched_runtime() was called before clone.
> 
> We do not care if a thread with sum_sched_runtime == 0 exits, obviously.
> 
> Otherwise "x1 != tsk->signal->sum_sched_runtime" should tell us that we
> raced with __exit_signal().

So the problem with the above is the lack of fwd progress; if there's
enough clone()/exit() happening in the thread group (and the more CPUs
the more possible), we'll keep repeating.





* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 17:40       ` [PATCH RFC] time: drop do_sys_times spinlock Peter Zijlstra
@ 2014-08-13 17:50         ` Rik van Riel
  2014-08-13 17:53           ` Peter Zijlstra
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-13 17:50 UTC (permalink / raw)
  To: Peter Zijlstra, Oleg Nesterov
  Cc: linux-kernel, Hidetoshi Seto, Frank Mayhar, Frederic Weisbecker,
	Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13/2014 01:40 PM, Peter Zijlstra wrote:

> So the problem with the above is the lack of fwd progress; if
> there's enough clone()/exit() happening in the thread group (and
> the more CPUs the more possible), we'll keep repeating.

We can fall back to taking the lock if we circle around,
or if there is a writer active when we are in seqcount_read,
similar to what the semaphore (ipc/sem.c) code is doing.

read_seqbegin_or_lock would do the trick...
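
Something along these lines, maybe (sketch only; stats_lock is a
seqlock that would have to be added to signal_struct, the function is
made up, and note the explicit seq = 1: the fallback to the lock does
not happen automatically on retry):

	/*
	 * Reader pattern for read_seqbegin_or_lock(), modeled on
	 * d_walk() in fs/dcache.c.  seq starts out even, so the first
	 * pass is lockless; if that pass raced with a writer,
	 * need_seqretry() returns true and the retry forces seq odd,
	 * which makes read_seqbegin_or_lock() take the lock and
	 * guarantees forward progress.
	 */
	static void read_stats(struct signal_struct *sig)
	{
		int seq = 0;		/* even: lockless first pass */
	retry:
		read_seqbegin_or_lock(&sig->stats_lock, &seq);

		/* ... copy the statistics out of *sig ... */

		if (need_seqretry(&sig->stats_lock, seq)) {
			seq = 1;	/* odd: take the lock this time */
			goto retry;
		}
		done_seqretry(&sig->stats_lock, seq);
	}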

* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 17:50         ` Rik van Riel
@ 2014-08-13 17:53           ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2014-08-13 17:53 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Oleg Nesterov, linux-kernel, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On Wed, Aug 13, 2014 at 01:50:24PM -0400, Rik van Riel wrote:
> On 08/13/2014 01:40 PM, Peter Zijlstra wrote:
> 
> > So the problem with the above is the lack of fwd progress; if
> > there's enough clone()/exit() happening in the thread group (and
> > the more CPUs the more possible), we'll keep repeating.
> 
> We can fall back to taking the lock if we circle around,
> or if there is a writer active when we are in seqcount_read,
> similar to what the semaphore (ipc/sem.c) code is doing.
> 
> read_seqbegin_or_lock would do the trick...

Yep that would work.


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 17:35       ` Rik van Riel
@ 2014-08-13 18:08         ` Oleg Nesterov
  2014-08-13 18:25           ` Rik van Riel
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-13 18:08 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13, Rik van Riel wrote:
>
> On Wed, 13 Aug 2014 19:22:30 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > On 08/12, Rik van Riel wrote:
> > >
> > > Any other ideas?
> >
> > To simplify, lets suppose that we only need sum_exec_runtime.
> >
> > Perhaps we can do something like this
>
> That would probably work, indeed.

OK, perhaps I'll try to make a patch tomorrow for review.

> However, it turns out that a seqcount doesn't look too badly either.

Well, I disagree. This is more complex, and this adds yet another lock
which only protects the stats...

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -461,6 +461,7 @@ struct sighand_struct {
>  	atomic_t		count;
>  	struct k_sigaction	action[_NSIG];
>  	spinlock_t		siglock;
> +	seqcount_t		stats_seq; /* write nests inside spinlock */

No, no, at least it should go to signal_struct. Unlike ->sighand, ->signal
is stable as long as task_struct can't go away.

>  void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  {
>  	struct signal_struct *sig = tsk->signal;
> +	struct sighand_struct *sighand;
>  	cputime_t utime, stime;
>  	struct task_struct *t;
> -
> -	times->utime = sig->utime;
> -	times->stime = sig->stime;
> -	times->sum_exec_runtime = sig->sum_sched_runtime;
> +	int seq;
>  
>  	rcu_read_lock();
> -	/* make sure we can trust tsk->thread_group list */
> -	if (!likely(pid_alive(tsk)))
> +	sighand = rcu_dereference(tsk->sighand);
> +	if (unlikely(!sighand))
>  		goto out;
>  
> -	t = tsk;
>  	do {
> -		task_cputime(t, &utime, &stime);
> -		times->utime += utime;
> -		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> -	} while_each_thread(tsk, t);
> +		seq = read_seqcount_begin(&sighand->stats_seq);
> +		times->utime = sig->utime;
> +		times->stime = sig->stime;
> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> +		/* make sure we can trust tsk->thread_group list */
> +		if (!likely(pid_alive(tsk)))
> +			goto out;

Whatever we do, we should convert thread_group_cputime() to use
for_each_thread() first.

> @@ -781,14 +781,14 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
>  		cpu_clock_sample(timer->it_clock, p, &now);
>  	} else {
>  		struct sighand_struct *sighand;
> -		unsigned long flags;
>
>  		/*
>  		 * Protect against sighand release/switch in exit/exec and
>  		 * also make timer sampling safe if it ends up calling
>  		 * thread_group_cputime().
>  		 */
> -		sighand = lock_task_sighand(p, &flags);
> +		rcu_read_lock();
> +		sighand = rcu_dereference(p->sighand);

This looks unneeded at first glance.

Oleg.


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 18:08         ` Oleg Nesterov
@ 2014-08-13 18:25           ` Rik van Riel
  2014-08-13 18:45             ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-13 18:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13/2014 02:08 PM, Oleg Nesterov wrote:
> On 08/13, Rik van Riel wrote:
>>
>> On Wed, 13 Aug 2014 19:22:30 +0200
>> Oleg Nesterov <oleg@redhat.com> wrote:
>>
>>> On 08/12, Rik van Riel wrote:
>>>>
>>>> Any other ideas?
>>>
>>> To simplify, lets suppose that we only need sum_exec_runtime.
>>>
>>> Perhaps we can do something like this
>>
>> That would probably work, indeed.
> 
> OK, perhaps I'll try to make a patch tomorrow for review.
> 
>> However, it turns out that a seqcount doesn't look too badly either.
> 
> Well, I disagree. This is more complex, and this adds yet another lock
> which only protects the stats...

The other lock is what can tell us that there is a writer active
NOW, which may be useful when it comes to guaranteeing forward
progress for readers when there are lots of threads exiting...

>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -461,6 +461,7 @@ struct sighand_struct {
>>  	atomic_t		count;
>>  	struct k_sigaction	action[_NSIG];
>>  	spinlock_t		siglock;
>> +	seqcount_t		stats_seq; /* write nests inside spinlock */
> 
> No, no, at least it should go to signal_struct. Unlike ->sighand, ->signal
> is stable as long as task_struct can't go away.

I can move it to signal_struct, no problem.

>>  void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>>  {
>>  	struct signal_struct *sig = tsk->signal;
>> +	struct sighand_struct *sighand;
>>  	cputime_t utime, stime;
>>  	struct task_struct *t;
>> -
>> -	times->utime = sig->utime;
>> -	times->stime = sig->stime;
>> -	times->sum_exec_runtime = sig->sum_sched_runtime;
>> +	int seq;
>>  
>>  	rcu_read_lock();
>> -	/* make sure we can trust tsk->thread_group list */
>> -	if (!likely(pid_alive(tsk)))
>> +	sighand = rcu_dereference(tsk->sighand);
>> +	if (unlikely(!sighand))
>>  		goto out;
>>  
>> -	t = tsk;
>>  	do {
>> -		task_cputime(t, &utime, &stime);
>> -		times->utime += utime;
>> -		times->stime += stime;
>> -		times->sum_exec_runtime += task_sched_runtime(t);
>> -	} while_each_thread(tsk, t);
>> +		seq = read_seqcount_begin(&sighand->stats_seq);
>> +		times->utime = sig->utime;
>> +		times->stime = sig->stime;
>> +		times->sum_exec_runtime = sig->sum_sched_runtime;
>> +
>> +		/* make sure we can trust tsk->thread_group list */
>> +		if (!likely(pid_alive(tsk)))
>> +			goto out;
> 
> Whatever we do, we should convert thread_group_cputime() to use
> for_each_thread() first().

What is the advantage of for_each_thread over while_each_thread,
besides getting rid of that t = tsk line?

>> @@ -781,14 +781,14 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
>>  		cpu_clock_sample(timer->it_clock, p, &now);
>>  	} else {
>>  		struct sighand_struct *sighand;
>> -		unsigned long flags;
>>
>>  		/*
>>  		 * Protect against sighand release/switch in exit/exec and
>>  		 * also make timer sampling safe if it ends up calling
>>  		 * thread_group_cputime().
>>  		 */
>> -		sighand = lock_task_sighand(p, &flags);
>> +		rcu_read_lock();
>> +		sighand = rcu_dereference(p->sighand);
> 
> This looks unneeded at first glance.

You are right. This change should be made to posix_cpu_clock_get_task
and not posix_cpu_timer_get. I think this is where I got distracted
by the way the sighand struct was RCU freed.

Sigh...


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 18:25           ` Rik van Riel
@ 2014-08-13 18:45             ` Oleg Nesterov
  2014-08-13 18:57               ` Rik van Riel
                                 ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-13 18:45 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13, Rik van Riel wrote:
>
> On 08/13/2014 02:08 PM, Oleg Nesterov wrote:
> >
> > Well, I disagree. This is more complex, and this adds yet another lock
> > which only protects the stats...
>
> The other lock is what can tell us that there is a writer active
> NOW, which may be useful when it comes to guaranteeing forward
> progress for readers when there are lots of threads exiting...

I don't really understand why seqcount_t is better in this sense; either
way we need to take the lock if we want to guarantee forward
progress. read_seqbegin_or_lock() doesn't even work "automagically",
and it can't be used in this case anyway.

That said, it is not that I am really sure that seqcount_t in ->signal
is actually worse, not to mention that this is subjective anyway. IOW,
I am not going to really fight with your approach ;)

> > Whatever we do, we should convert thread_group_cputime() to use
> > for_each_thread() first().
>
> What is the advantage of for_each_thread over while_each_thread,
> besides getting rid of that t = tsk line?

It is buggy and should die, see 0c740d0afc3bff0a097ad.

Oleg.


* Re: [PATCH RFC] time: drop do_sys_times spinlock
  2014-08-13 18:45             ` Oleg Nesterov
@ 2014-08-13 18:57               ` Rik van Riel
  2014-08-13 21:03               ` [PATCH RFC] time,signal: protect resource use statistics with seqlock Rik van Riel
  2014-08-13 21:03               ` Rik van Riel
  2 siblings, 0 replies; 49+ messages in thread
From: Rik van Riel @ 2014-08-13 18:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13/2014 02:45 PM, Oleg Nesterov wrote:
> On 08/13, Rik van Riel wrote:
>>
>> On 08/13/2014 02:08 PM, Oleg Nesterov wrote:
>>>
>>> Well, I disagree. This is more complex, and this adds yet another lock
>>> which only protects the stats...
>>
>> The other lock is what can tell us that there is a writer active
>> NOW, which may be useful when it comes to guaranteeing forward
>> progress for readers when there are lots of threads exiting...
> 
> I don't really understand why seqcount_t is better in this sense, either
> way we need to take the lock if we want to guarantee a forward
> progress. read_seqbegin_or_lock() doesn't even work "automagically",
> and it can't be used in this case anyway.

It allows subsequent readers to fall back into lockless mode,
once the first reader (that got blocked behind writers) takes
the lock, temporarily locking out writers.

This protects forward progress, without the danger of
permanently degrading throughput due to increased contention.

> That said, it is not that I am really sure that seqcount_t in ->signal
> is actually worse, not to mention that this is subjective anyway. IOW,
> I am not going to really fight with your approach ;)

I agree that both approaches have their advantages and
disadvantages.

>>> Whatever we do, we should convert thread_group_cputime() to use
>>> for_each_thread() first().
>>
>> What is the advantage of for_each_thread over while_each_thread,
>> besides getting rid of that t = tsk line?
> 
> It is buggy and should die, see 0c740d0afc3bff0a097ad.

I just got rid of it in the code that I touched.

Thanks for pointing it out.


* [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-13 18:45             ` Oleg Nesterov
  2014-08-13 18:57               ` Rik van Riel
@ 2014-08-13 21:03               ` Rik van Riel
  2014-08-14  0:43                 ` Frederic Weisbecker
                                   ` (2 more replies)
  2014-08-13 21:03               ` Rik van Riel
  2 siblings, 3 replies; 49+ messages in thread
From: Rik van Riel @ 2014-08-13 21:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On Wed, 13 Aug 2014 20:45:11 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> That said, it is not that I am really sure that seqcount_t in ->signal
> is actually worse, not to mention that this is subjective anyway. IOW,
> I am not going to really fight with your approach ;)

This is what it looks like, on top of your for_each_thread series
from yesterday:

---8<---

Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  |  4 ++++
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++++++++++++++++++++++-------------
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 --------------
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..8092e59 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -126,6 +126,7 @@ static void __exit_signal(struct task_struct *tsk)
 		 * will have been the last reference on the signal_struct.
 		 */
 		task_cputime(tsk, &utime, &stime);
+		write_seqlock(&sig->stats_lock);
 		sig->utime += utime;
 		sig->stime += stime;
 		sig->gtime += task_gtime(tsk);
@@ -137,6 +138,7 @@ static void __exit_signal(struct task_struct *tsk)
 		sig->oublock += task_io_get_oublock(tsk);
 		task_io_accounting_add(&sig->ioac, &tsk->ioac);
 		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
+		write_sequnlock(&sig->stats_lock);
 	}
 
 	sig->nr_threads--;
@@ -1043,6 +1045,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
+		write_seqlock(&psig->stats_lock);
 		psig->cutime += tgutime + sig->cutime;
 		psig->cstime += tgstime + sig->cstime;
 		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1065,6 +1068,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 			psig->cmaxrss = maxrss;
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		write_sequnlock(&psig->stats_lock);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..5d7cf2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
+	seqlock_init(&sig->stats_lock);
 
 	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..b5f1c58 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	cputime_t utime, stime;
 	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	unsigned int seq, nextseq;
 
 	rcu_read_lock();
-	for_each_thread(tsk, t) {
-		task_cputime(t, &utime, &stime);
-		times->utime += utime;
-		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
-	}
+	/* Attempt a lockless read on the first round. */
+	nextseq = 0;
+	do {
+		seq = nextseq;
+		read_seqbegin_or_lock(&sig->stats_lock, &seq);
+		times->utime = sig->utime;
+		times->stime = sig->stime;
+		times->sum_exec_runtime = sig->sum_sched_runtime;
+
+		for_each_thread(tsk, t) {
+			task_cputime(t, &utime, &stime);
+			times->utime += utime;
+			times->stime += stime;
+			times->sum_exec_runtime += task_sched_runtime(t);
+		}
+		/*
+		 * If a writer is currently active, seq will be odd, and
+		 * read_seqbegin_or_lock will take the lock.
+		 */
+		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
+	} while (need_seqretry(&sig->stats_lock, seq));
+	done_seqretry(&sig->stats_lock, seq);
 	rcu_read_unlock();
 }
 
@@ -611,9 +624,6 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 
-/*
- * Must be called with siglock held.
- */
 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce81291..b663664 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
-	spin_lock_irq(&current->sighand->siglock);
 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
-	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 3b89464..492b986 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
 		if (same_thread_group(tsk, current))
 			err = cpu_clock_sample(which_clock, tsk, &rtn);
 	} else {
-		unsigned long flags;
-		struct sighand_struct *sighand;
-
-		/*
-		 * while_each_thread() is not yet entirely RCU safe,
-		 * keep locking the group while sampling process
-		 * clock for now.
-		 */
-		sighand = lock_task_sighand(tsk, &flags);
-		if (!sighand)
-			return err;
-
 		if (tsk == current || thread_group_leader(tsk))
 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
-
-		unlock_task_sighand(tsk, &flags);
 	}
 
 	if (!err)

-
-		unlock_task_sighand(tsk, &flags);
 	}
 
 	if (!err)

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-13 21:03               ` [PATCH RFC] time,signal: protect resource use statistics with seqlock Rik van Riel
@ 2014-08-14  0:43                 ` Frederic Weisbecker
  2014-08-14  1:57                   ` Rik van Riel
  2014-08-14 13:22                 ` Oleg Nesterov
  2014-08-14 14:24                 ` Oleg Nesterov
  2 siblings, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2014-08-14  0:43 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Oleg Nesterov, linux-kernel, Peter Zijlstra, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> --- a/kernel/time/posix-cpu-timers.c
> +++ b/kernel/time/posix-cpu-timers.c
> @@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
>  		if (same_thread_group(tsk, current))
>  			err = cpu_clock_sample(which_clock, tsk, &rtn);
>  	} else {
> -		unsigned long flags;
> -		struct sighand_struct *sighand;
> -
> -		/*
> -		 * while_each_thread() is not yet entirely RCU safe,
> -		 * keep locking the group while sampling process
> -		 * clock for now.
> -		 */
> -		sighand = lock_task_sighand(tsk, &flags);
> -		if (!sighand)
> -			return err;
> -
>  		if (tsk == current || thread_group_leader(tsk))
>  			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
> -
> -		unlock_task_sighand(tsk, &flags);
>  	}

I'm worried about such lockless solution based on RCU or read seqcount because
we lose the guarantee that an update is immediately visible by all subsequent
readers.

Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right after that
call clock_gettime(), with the spinlock we were guaranteed to see the new
update. Now with a pure seqlock read approach, we guarantee a
read sequence coherency but we don't guarantee the freshest update result.

So that looks like a source of non monotonic results.
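
[Editor's note: for illustration only — a minimal userspace checker in the spirit of the test cases discussed in this thread, not code posted by any participant. It hammers clock_gettime(CLOCK_PROCESS_CPUTIME_ID) from several threads and reports any sample that is older than a value already published by another thread; the acquire/release ordering on the shared mark makes any report a genuine monotonicity violation.]

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>

	static _Atomic uint64_t max_seen;	/* highest value any thread has published */

	static void *worker(void *arg)
	{
		struct timespec ts;
		uint64_t prev, now;
		int i;

		(void)arg;
		for (i = 0; i < 1000000; i++) {
			/* Read the mark first: anything published here was
			   sampled before our own sample below, so now < prev
			   is a real monotonicity violation. */
			prev = atomic_load_explicit(&max_seen, memory_order_acquire);
			clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
			now = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
			if (now < prev)
				fprintf(stderr, "backwards: %llu < %llu\n",
					(unsigned long long)now,
					(unsigned long long)prev);
			/* Raise the mark to our sample. */
			while (prev < now &&
			       !atomic_compare_exchange_weak_explicit(&max_seen,
					&prev, now, memory_order_release,
					memory_order_acquire))
				;
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t tid[4];
		int i;

		for (i = 0; i < 4; i++)
			pthread_create(&tid[i], NULL, worker, NULL);
		for (i = 0; i < 4; i++)
			pthread_join(tid[i], NULL);
		return 0;
	}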

>  
>  	if (!err)

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14  0:43                 ` Frederic Weisbecker
@ 2014-08-14  1:57                   ` Rik van Riel
  2014-08-14 13:34                     ` Frederic Weisbecker
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-14  1:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Oleg Nesterov, linux-kernel, Peter Zijlstra, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman


On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>> --- a/kernel/time/posix-cpu-timers.c +++
>> b/kernel/time/posix-cpu-timers.c @@ -272,22 +272,8 @@ static int
>> posix_cpu_clock_get_task(struct task_struct *tsk, if
>> (same_thread_group(tsk, current)) err =
>> cpu_clock_sample(which_clock, tsk, &rtn); } else { -		unsigned
>> long flags; -		struct sighand_struct *sighand; - -		/* -		 *
>> while_each_thread() is not yet entirely RCU safe, -		 * keep
>> locking the group while sampling process -		 * clock for now. -
>> */ -		sighand = lock_task_sighand(tsk, &flags); -		if (!sighand) 
>> -			return err; - if (tsk == current ||
>> thread_group_leader(tsk)) err =
>> cpu_clock_sample_group(which_clock, tsk, &rtn); - -
>> unlock_task_sighand(tsk, &flags); }
> 
> I'm worried about such lockless solution based on RCU or read
> seqcount because we lose the guarantee that an update is
> immediately visible by all subsequent readers.
> 
> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
> after that call clock_gettime(), with the spinlock we were
> guaranteed to see the new update. Now with a pure seqlock read
> approach, we guarantee a read sequence coherency but we don't
> guarantee the freshest update result.
> 
> So that looks like a source of non monotonic results.

Which update are you worried about, specifically?

The seq_write_lock to update the usage stat in p->signal will lock out
the seqlock read side used to check those results.

Is there another kind of thing read by cpu_clock_sample_group that you
believe is not excluded by the seq_lock?


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-13 21:03               ` [PATCH RFC] time,signal: protect resource use statistics with seqlock Rik van Riel
  2014-08-14  0:43                 ` Frederic Weisbecker
@ 2014-08-14 13:22                 ` Oleg Nesterov
  2014-08-14 13:38                   ` Frederic Weisbecker
  2014-08-14 17:48                   ` Oleg Nesterov
  2014-08-14 14:24                 ` Oleg Nesterov
  2 siblings, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 13:22 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13, Rik van Riel wrote:
>
> On Wed, 13 Aug 2014 20:45:11 +0200
> Oleg Nesterov <oleg@redhat.com> wrote:
>
> > That said, it is not that I am really sure that seqcount_t in ->signal
> > is actually worse, not to mention that this is subjective anyway. IOW,
> > I am not going to really fight with your approach ;)
>
> This is what it looks like, on top of your for_each_thread series
> from yesterday:

OK, let's forget about the alternative approach for now. We can reconsider
it later. At least I have to admit that seqlock is more straightforward.

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -646,6 +646,7 @@ struct signal_struct {
>  	 * Live threads maintain their own counters and add to these
>  	 * in __exit_signal, except for the group leader.
>  	 */
> +	seqlock_t stats_lock;

Ah. Somehow I thought that you were going to use seqcount_t and fall back
to taking ->siglock if seqcount_retry, but this patch adds the "full-blown"
seqlock_t.

OK, I won't argue, this can make the seqbegin_or_lock simpler...

> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	struct signal_struct *sig = tsk->signal;
>  	cputime_t utime, stime;
>  	struct task_struct *t;
> -
> -	times->utime = sig->utime;
> -	times->stime = sig->stime;
> -	times->sum_exec_runtime = sig->sum_sched_runtime;
> +	unsigned int seq, nextseq;
>
>  	rcu_read_lock();

Almost cosmetic nit, but afaics this patch expands the rcu critical section
for no reason. We only need rcu_read_lock/unlock around for_each_thread()
below.

> +	nextseq = 0;
> +	do {
> +		seq = nextseq;
> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> +		times->utime = sig->utime;
> +		times->stime = sig->stime;
> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> +		for_each_thread(tsk, t) {
> +			task_cputime(t, &utime, &stime);
> +			times->utime += utime;
> +			times->stime += stime;
> +			times->sum_exec_runtime += task_sched_runtime(t);
> +		}
> +		/*
> +		 * If a writer is currently active, seq will be odd, and
> +		 * read_seqbegin_or_lock will take the lock.
> +		 */
> +		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> +	} while (need_seqretry(&sig->stats_lock, seq));
> +	done_seqretry(&sig->stats_lock, seq);

Hmm. It seems that read_seqbegin_or_lock() is not used correctly. I mean,
this code can still livelock in theory. Just suppose that another CPU does
write_seqlock/write_sequnlock right after read_seqbegin_or_lock(). In this
case "seq & 1" will never be true and thus "or_lock" will never happen.

IMO, this should be fixed. Either we should guarantee forward progress
or we should not play with read_seqbegin_or_lock() at all. This code assumes
that sooner or later "nextseq = raw_read_seqcount()" will return an odd
counter, but in theory that may never happen.

And if we want to fix this we do not need 2 counters, we just need to set
"seq = 1" manually after need_seqretry() == T. Say, like __dentry_path() does.
(but unlike __dentry_path() we do not need to worry about rcu_read_unlock so
the code will be simpler).

I am wondering if it makes sense to introduce

	bool read_seqretry_or_lock(seqlock_t *lock, int *seq)
	{
		if (*seq & 1) {
			read_sequnlock_excl(lock);
			return false;
		}
	
		if (!read_seqretry(lock, *seq))
			return false;
	
		*seq = 1;
		return true;
	}
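
[Editor's note: an illustrative sketch, not from the thread, of how the retry loop in thread_group_cputime() could use such a helper. read_seqretry_or_lock() is the hypothetical function proposed above; sig, times, tsk, t, utime and stime are as in the patch. On the second pass the read automatically runs under the lock:]

	int seq = 0;	/* even: first pass is lockless */

	rcu_read_lock();
	do {
		read_seqbegin_or_lock(&sig->stats_lock, &seq);
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = sig->sum_sched_runtime;
		for_each_thread(tsk, t) {
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += task_sched_runtime(t);
		}
	} while (read_seqretry_or_lock(&sig->stats_lock, &seq));
	rcu_read_unlock();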

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14  1:57                   ` Rik van Riel
@ 2014-08-14 13:34                     ` Frederic Weisbecker
  2014-08-14 14:39                       ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2014-08-14 13:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Oleg Nesterov, LKML, Peter Zijlstra, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

2014-08-14 3:57 GMT+02:00 Rik van Riel <riel@redhat.com>:
>
> On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
>> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>>> --- a/kernel/time/posix-cpu-timers.c +++
>>> b/kernel/time/posix-cpu-timers.c @@ -272,22 +272,8 @@ static int
>>> posix_cpu_clock_get_task(struct task_struct *tsk, if
>>> (same_thread_group(tsk, current)) err =
>>> cpu_clock_sample(which_clock, tsk, &rtn); } else { -         unsigned
>>> long flags; -                struct sighand_struct *sighand; - -             /* -             *
>>> while_each_thread() is not yet entirely RCU safe, -           * keep
>>> locking the group while sampling process -            * clock for now. -
>>> */ -         sighand = lock_task_sighand(tsk, &flags); -             if (!sighand)
>>> -                    return err; - if (tsk == current ||
>>> thread_group_leader(tsk)) err =
>>> cpu_clock_sample_group(which_clock, tsk, &rtn); - -
>>> unlock_task_sighand(tsk, &flags); }
>>
>> I'm worried about such lockless solution based on RCU or read
>> seqcount because we lose the guarantee that an update is
>> immediately visible by all subsequent readers.
>>
>> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
>> after that call clock_gettime(), with the spinlock we were
>> guaranteed to see the new update. Now with a pure seqlock read
>> approach, we guarantee a read sequence coherency but we don't
>> guarantee the freshest update result.
>>
>> So that looks like a source of non monotonic results.
>
> Which update are you worried about, specifically?
>
> The seq_write_lock to update the usage stat in p->signal will lock out
> the seqlock read side used to check those results.
>
> Is there another kind of thing read by cpu_clock_sample_group that you
> believe is not excluded by the seq_lock?

I mean the read side doesn't use a lock with seqlocks. It's only made
of barriers and sequence numbers to ensure the reader doesn't read
some half-complete update. But other than that it may well see
update n - 1, since the barriers don't guarantee the latest result.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 13:22                 ` Oleg Nesterov
@ 2014-08-14 13:38                   ` Frederic Weisbecker
  2014-08-14 13:53                     ` Oleg Nesterov
  2014-08-14 17:48                   ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2014-08-14 13:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, LKML, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

2014-08-14 15:22 GMT+02:00 Oleg Nesterov <oleg@redhat.com>:
> On 08/13, Rik van Riel wrote:
>>
>> On Wed, 13 Aug 2014 20:45:11 +0200
>> Oleg Nesterov <oleg@redhat.com> wrote:
>>
>> > That said, it is not that I am really sure that seqcount_t in ->signal
>> > is actually worse, not to mention that this is subjective anyway. IOW,
>> > I am not going to really fight with your approach ;)
>>
>> This is what it looks like, on top of your for_each_thread series
>> from yesterday:
>
> OK, let's forget about the alternative approach for now. We can reconsider
> it later. At least I have to admit that seqlock is more straightforward.
>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -646,6 +646,7 @@ struct signal_struct {
>>        * Live threads maintain their own counters and add to these
>>        * in __exit_signal, except for the group leader.
>>        */
>> +     seqlock_t stats_lock;
>
> Ah. Somehow I thought that you were going to use seqcount_t and fallback
> to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
> seqlock_t.
>
> OK, I won't argue, this can make the seqbegin_or_lock simpler...

Is this really needed? seqlocks are useful when we have concurrent
updaters. But updaters of thread stats should be under the thread lock
already, right? If we have only one updater at a time, seqcount should
be enough.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 13:38                   ` Frederic Weisbecker
@ 2014-08-14 13:53                     ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 13:53 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rik van Riel, LKML, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Frederic Weisbecker wrote:
>
> 2014-08-14 15:22 GMT+02:00 Oleg Nesterov <oleg@redhat.com>:
> > On 08/13, Rik van Riel wrote:
> >>
> >> @@ -646,6 +646,7 @@ struct signal_struct {
> >>        * Live threads maintain their own counters and add to these
> >>        * in __exit_signal, except for the group leader.
> >>        */
> >> +     seqlock_t stats_lock;
> >
> > Ah. Somehow I thought that you were going to use seqcount_t and fallback
> > to taking ->siglock if seqcount_retry, but this patch adds the "full blown"
> > seqlock_t.
> >
> > OK, I won't argue, this can make the seqbegin_or_lock simpler...
>
> Is this really needed? seqlock are useful when we have concurrent
> updaters. But updaters of thread stats should be under the thread lock
> already, right? If we have only one updater at a time, seqcount should
> be enough.

Yes, this is what I meant. Although I can see 2 reasons to use seqlock_t:

	1. It can simplify the seqbegin-or-lock logic. If nothing else,
	   you simply can't use read_seqbegin_or_lock() to take ->siglock.
	   But this is just syntactic sugar.

	2. If we use ->siglock in the fallback path, we need to verify that
	   thread_group_cputime() is never called with ->siglock held first.

	   Or, we need a fat comment to explain that need_seqretry == T is not
	   possible if it is called under ->siglock, and thus "fallback to
	   lock_task_sighand" must always be safe. But in this case we need
	   to ensure that the caller didn't do write_seqcount_begin().

So perhaps seqlock_t makes more sense at least initially...
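
[Editor's note: a rough sketch, not posted in the thread, of the seqcount_t variant being weighed here. "stats_seq" is a hypothetical seqcount_t in signal_struct, writers are assumed to bump it while already holding ->siglock, and sig, tsk and times are as in thread_group_cputime():]

	unsigned long flags;
	unsigned int seq;

	seq = read_seqcount_begin(&sig->stats_seq);
	/* ... fill *times from sig->* and the live threads ... */
	if (read_seqcount_retry(&sig->stats_seq, seq)) {
		/* Slow path: exclude the writers via ->siglock. */
		if (lock_task_sighand(tsk, &flags)) {
			/* ... fill *times again, writers locked out ... */
			unlock_task_sighand(tsk, &flags);
		} else {
			/* Group already reaped: no writers are left and the
			   signal_struct totals are stable, re-read once. */
			/* ... fill *times from sig->* ... */
		}
	}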

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-13 21:03               ` [PATCH RFC] time,signal: protect resource use statistics with seqlock Rik van Riel
  2014-08-14  0:43                 ` Frederic Weisbecker
  2014-08-14 13:22                 ` Oleg Nesterov
@ 2014-08-14 14:24                 ` Oleg Nesterov
  2014-08-14 15:37                   ` Rik van Riel
  2 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 14:24 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/13, Rik van Riel wrote:
>
> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
>  {
>  	cputime_t tgutime, tgstime, cutime, cstime;
>
> -	spin_lock_irq(&current->sighand->siglock);
>  	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
>  	cutime = current->signal->cutime;
>  	cstime = current->signal->cstime;
> -	spin_unlock_irq(&current->sighand->siglock);

Ah, wait, there is another problem afaics...

thread_group_cputime_adjusted()->cputime_adjust() plays with
signal->prev_cputime and thus it needs siglock or stats_lock to ensure
it can't race with itself. Not sure it is safe to simply take the lock
in cputime_adjust(), this should be checked.

OTOH, do_task_stat() already calls task_cputime_adjusted() locklessly, and
this looks wrong, or I missed something. So perhaps we need a lock in or
around cputime_adjust() anyway.
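
[Editor's note: a hypothetical illustration, not something proposed in the thread, of one way the cputime_adjust() self-race could be closed — give the prev values their own lock. Neither the struct layout nor the "lock" field exists in the kernel being discussed; the existing adjustment logic is elided:]

	struct prev_cputime {
		cputime_t	utime;
		cputime_t	stime;
		raw_spinlock_t	lock;	/* serializes concurrent adjusters */
	};

	static void cputime_adjust(struct task_cputime *curr,
				   struct prev_cputime *prev,
				   cputime_t *ut, cputime_t *st)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&prev->lock, flags);
		/*
		 * ... existing scaling logic, including the "never report
		 * less than we reported last time" clamping of
		 * prev->utime and prev->stime ...
		 */
		*ut = prev->utime;
		*st = prev->stime;
		raw_spin_unlock_irqrestore(&prev->lock, flags);
	}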

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 13:34                     ` Frederic Weisbecker
@ 2014-08-14 14:39                       ` Oleg Nesterov
  2014-08-15  2:52                         ` Frederic Weisbecker
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 14:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rik van Riel, LKML, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Frederic Weisbecker wrote:
>
> 2014-08-14 3:57 GMT+02:00 Rik van Riel <riel@redhat.com>:
> >
> > On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
> >> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
> >>
> >> I'm worried about such lockless solution based on RCU or read
> >> seqcount because we lose the guarantee that an update is
> >> immediately visible by all subsequent readers.
> >>
> >> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
> >> after that call clock_gettime(), with the spinlock we were
> >> guaranteed to see the new update. Now with a pure seqlock read
> >> approach, we guarantee a read sequence coherency but we don't
> >> guarantee the freshest update result.
> >>
> >> So that looks like a source of non monotonic results.
> >
> > Which update are you worried about, specifically?
> >
> > The seq_write_lock to update the usage stat in p->signal will lock out
> > the seqlock read side used to check those results.
> >
> > Is there another kind of thing read by cpu_clock_sample_group that you
> > believe is not excluded by the seq_lock?
>
> I mean the read side doesn't use a lock with seqlocks. It's only made
> of barriers and sequence numbers to ensure the reader doesn't read
> some half-complete update. But other than that it can as well see the
> update n - 1 since barriers don't enforce latest results.

Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
write_seqcount_begin-update-write_seqcount_end can miss the "update" part
along with ->sequence modifications.

But I still can't understand how this can lead to non-monotonic results,
could you spell it out?

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 14:24                 ` Oleg Nesterov
@ 2014-08-14 15:37                   ` Rik van Riel
  2014-08-14 16:12                     ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-14 15:37 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman


On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
> On 08/13, Rik van Riel wrote:
>> 
>> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms) { 
>> cputime_t tgutime, tgstime, cutime, cstime;
>> 
>> -	spin_lock_irq(&current->sighand->siglock); 
>> thread_group_cputime_adjusted(current, &tgutime, &tgstime); 
>> cutime = current->signal->cutime; cstime =
>> current->signal->cstime; -
>> spin_unlock_irq(&current->sighand->siglock);
> 
> Ah, wait, there is another problem afaics...

Last night I worked on another problem with this code.

After propagating the stats from a dying task to the signal struct,
we need to make sure that that task's stats are not counted twice.

This requires zeroing the stats under the write_seqlock, which was
easy enough to add. We cannot rely on any state in the task that
was set outside of the write_seqlock...

> thread_group_cputime_adjusted()->cputime_adjust() plays with 
> signal->prev_cputime and thus it needs siglock or stats_lock to
> ensure it can't race with itself. Not sure it is safe to simply
> take the lock in cputime_adjust(), this should be checked.
> 
> OTOH, do_task_stat() already calls task_cputime_adjusted() lockless
> and this looks wrong or I missed something. So perhaps we need a
> lock in or around cputime_adjust() anyway.

I'll take a look at this.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 15:37                   ` Rik van Riel
@ 2014-08-14 16:12                     ` Oleg Nesterov
  2014-08-14 17:36                       ` Rik van Riel
  2014-08-15  2:14                       ` Rik van Riel
  0 siblings, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 16:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Rik van Riel wrote:
>
>
> On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
> > On 08/13, Rik van Riel wrote:
> >>
> >> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms) {
> >> cputime_t tgutime, tgstime, cutime, cstime;
> >>
> >> -	spin_lock_irq(&current->sighand->siglock);
> >> thread_group_cputime_adjusted(current, &tgutime, &tgstime);
> >> cutime = current->signal->cutime; cstime =
> >> current->signal->cstime; -
> >> spin_unlock_irq(&current->sighand->siglock);
> >
> > Ah, wait, there is another problem afaics...
>
> Last night I worked on another problem with this code.
>
> After propagating the stats from a dying task to the signal struct,
> we need to make sure that that task's stats are not counted twice.

Heh indeed ;) Can't understand how I missed that.

> This requires zeroing the stats under the write_seqlock, which was
> easy enough to add.

Or you can expand the scope of write_seqlock/write_sequnlock, so that
__unhash_process is called from inside the critical section. This looks
simpler at first glance.

Hmm, wait, it seems there is yet another problem ;) Afaics, you also
need to modify __exit_signal() so that ->sum_sched_runtime/etc are
accounted unconditionally, even if the group leader exits.

Probably this is not a big problem, and sys_times() or clock_gettime()
do not care at all because they use current.

But without this change thread_group_cputime(reaped_zombie) won't look
at this task_struct at all, this can lead to non-monotonic result if
it was previously called when this task was alive (non-reaped).

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 16:12                     ` Oleg Nesterov
@ 2014-08-14 17:36                       ` Rik van Riel
  2014-08-14 18:15                         ` Oleg Nesterov
  2014-08-15  2:14                       ` Rik van Riel
  1 sibling, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-14 17:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> On 08/14, Rik van Riel wrote:
>>
>>
>> On 08/14/2014 10:24 AM, Oleg Nesterov wrote:
>>> On 08/13, Rik van Riel wrote:
>>>>
>>>> @@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms) {
>>>> cputime_t tgutime, tgstime, cutime, cstime;
>>>>
>>>> -	spin_lock_irq(&current->sighand->siglock);
>>>> thread_group_cputime_adjusted(current, &tgutime, &tgstime);
>>>> cutime = current->signal->cutime; cstime =
>>>> current->signal->cstime; -
>>>> spin_unlock_irq(&current->sighand->siglock);
>>>
>>> Ah, wait, there is another problem afaics...
>>
>> Last night I worked on another problem with this code.
>>
>> After propagating the stats from a dying task to the signal struct,
>> we need to make sure that that task's stats are not counted twice.
> 
> Heh indeed ;) Can't understand how I missed that.
> 
>> This requires zeroing the stats under the write_seqlock, which was
>> easy enough to add.
> 
> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> __unhash_process is called from inside the critical section. This looks
> simpler at first glance.

The problem with that is that wait_task_zombie() calls
thread_group_cputime_adjusted() in that if() branch, and
that code ends up taking the seqlock for read...

However, in __exit_signal that approach should work.

> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> accounted unconditionally, even if the group leader exits.
> 
> Probably this is not a big problem, and sys_times() or clock_gettime()
> do not care at all because they use current.
> 
> But without this change thread_group_cputime(reaped_zombie) won't look
> at this task_struct at all, this can lead to non-monotonic result if
> it was previously called when this task was alive (non-reaped).

You mean this whole block needs to run regardless of whether
the group is dead?

                task_cputime(tsk, &utime, &stime);
                write_seqlock(&sig->stats_lock);
                sig->utime += utime;
                sig->stime += stime;
                sig->gtime += task_gtime(tsk);
                sig->min_flt += tsk->min_flt;
                sig->maj_flt += tsk->maj_flt;
                sig->nvcsw += tsk->nvcsw;
                sig->nivcsw += tsk->nivcsw;
                sig->inblock += task_io_get_inblock(tsk);
                sig->oublock += task_io_get_oublock(tsk);
                task_io_accounting_add(&sig->ioac, &tsk->ioac);
                sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

How does that square with wait_task_zombie reaping the
statistics of the whole group with thread_group_cputime_adjusted()
when the group leader is exiting?

Could that lead to things being double-counted?

Or do you mean ONLY ->sum_sched_runtime is unconditionally
accounted in __exit_signal(), because wait_task_zombie() seems
to be missing that one?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 13:22                 ` Oleg Nesterov
  2014-08-14 13:38                   ` Frederic Weisbecker
@ 2014-08-14 17:48                   ` Oleg Nesterov
  2014-08-14 18:34                     ` Oleg Nesterov
  2014-08-15  5:19                     ` Mike Galbraith
  1 sibling, 2 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 17:48 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Oleg Nesterov wrote:
>
> OK, let's forget about the alternative approach for now. We can reconsider
> it later. At least I have to admit that seqlock is more straightforward.

Yes.

But just for the record, the "lockless" version doesn't look that bad to me,

	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
	{
		struct signal_struct *sig = tsk->signal;
		bool lockless, is_dead;
		struct task_struct *t;
		unsigned long flags;
		u64 exec;

		lockless = true;
		is_dead = !lock_task_sighand(tsk, &flags);
	 retry:
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
		if (is_dead)
			return;

		if (lockless)
			unlock_task_sighand(tsk, &flags);

		rcu_read_lock();
		for_each_thread(tsk, t) {
			cputime_t utime, stime;
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += task_sched_runtime(t);
		}
		rcu_read_unlock();

		if (lockless) {
			lockless = false;
			is_dead = !lock_task_sighand(tsk, &flags);
			if (is_dead || exec != sig->sum_sched_runtime)
				goto retry;
		}
		unlock_task_sighand(tsk, &flags);
	}

The obvious problem is that we should shift lock_task_sighand() from the
callers to thread_group_cputime() first, or add thread_group_cputime_lockless()
and change the current users one by one.

And of course, stats_lock is more generic.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 17:36                       ` Rik van Riel
@ 2014-08-14 18:15                         ` Oleg Nesterov
  2014-08-14 19:03                           ` Rik van Riel
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 18:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Rik van Riel wrote:
>
> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> >
> > Or you can expand the scope of write_seqlock/write_sequnlock, so that
>>> __unhash_process is called from inside the critical section. This looks
> > simpler at first glance.
>
> The problem with that is that wait_task_zombie() calls
> thread_group_cputime_adjusted() in that if() branch, and
> that code ends up taking the seqlock for read...

Not sure I understand... This modifies parent->signal->c* counters,
and obviously the exiting thread is not a member of the parent's thread
group, so thread_group_cputime_adjusted(parent) can never account the
exiting child twice simply because it won't see it?

> However, in __exit_signal that approach should work.

Yes,

> > Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> > need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> > accounted unconditionally, even if the group leader exits.
> >
> > Probably this is not a big problem, and sys_times() or clock_gettime()
> > do not care at all because they use current.
> >
> > But without this change thread_group_cputime(reaped_zombie) won't look
> > at this task_struct at all, this can lead to non-monotonic result if
> > it was previously called when this task was alive (non-reaped).
>
> You mean this whole block needs to run regardless of whether
> the group is dead?
>
>                 task_cputime(tsk, &utime, &stime);
>                 write_seqlock(&sig->stats_lock);
>                 sig->utime += utime;
>                 sig->stime += stime;
>                 sig->gtime += task_gtime(tsk);
>                 sig->min_flt += tsk->min_flt;
>                 sig->maj_flt += tsk->maj_flt;
>                 sig->nvcsw += tsk->nvcsw;
>                 sig->nivcsw += tsk->nivcsw;
>                 sig->inblock += task_io_get_inblock(tsk);
>                 sig->oublock += task_io_get_oublock(tsk);
>                 task_io_accounting_add(&sig->ioac, &tsk->ioac);
>                 sig->sum_sched_runtime += tsk->se.sum_exec_runtime;

Yes.

> How does that square with wait_task_zombie reaping the
> statistics of the whole group with thread_group_cputime_adjusted()
> when the group leader is exiting?

Again, not sure I understand... thread_group_cputime_adjusted() in
wait_task_zombie() is fine in any case. Nobody but us can reap this
zombie.

It seems that we misunderstood each other, let me try again. Just to
simplify, suppose we have, say,

	sys_times_by_pid(pid, ...)
	{
		rcu_read_lock();
		task = find_task_by_vpid(pid);
		if (task)
			get_task_struct(task);
		rcu_read_unlock();

		if (!task)
			return -ESRCH;

		thread_group_cputime(task, ...);
		copy_to_user();
		return 0;
	}

Note that this task can exit right after rcu_read_unlock(), and it can
also be reaped (by its parent or by itself) and removed from the thread
list. In this case for_each_thread() will see no threads, and thus it
will only read task->signal->*time.

This means that sys_times_by_pid() can simply return the wrong result
instead of failure. Say, it can even return "all zeros" if this task was
single-threaded.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 17:48                   ` Oleg Nesterov
@ 2014-08-14 18:34                     ` Oleg Nesterov
  2014-08-15  5:19                     ` Mike Galbraith
  1 sibling, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 18:34 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Oleg Nesterov wrote:
>
> But just for record, the "lockless" version doesn't look that bad to me,
>
> 	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 	{
> 		struct signal_struct *sig = tsk->signal;
> 		bool lockless, is_dead;
> 		struct task_struct *t;
> 		unsigned long flags;
> 		u64 exec;
>
> 		lockless = true;
> 		is_dead = !lock_task_sighand(p, &flags);
> 	 retry:
> 		times->utime = sig->utime;
> 		times->stime = sig->stime;
> 		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
> 		if (is_dead)
> 			return;
>
> 		if (lockless)
> 			unlock_task_sighand(p, &flags);
>
> 		rcu_read_lock();
> 		for_each_thread(tsk, t) {
> 			cputime_t utime, stime;
> 			task_cputime(t, &utime, &stime);
> 			times->utime += utime;
> 			times->stime += stime;
> 			times->sum_exec_runtime += task_sched_runtime(t);
> 		}
> 		rcu_read_unlock();
>
> 		if (lockless) {
> 			lockless = false;
> 			is_dead = !lock_task_sighand(p, &flags);
> 			if (is_dead || exec != sig->sum_sched_runtime)
> 				goto retry;
> 		}
> 		unlock_task_sighand(p, &flags);
> 	}
>
> The obvious problem is that we should shift lock_task_sighand() from the
> callers to thread_group_cputime() first, or add thread_group_cputime_lockless()
> and change the current users one by one.

OTOH, it is simple to convert do_sys_times() and posix_cpu_clock_get_task()
to use the lockless version, and avoid the new stats_lock and other changes
it needs.

> And of course, stats_lock is more generic.

Yes, this is true in any case.

So I simply do not know.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 18:15                         ` Oleg Nesterov
@ 2014-08-14 19:03                           ` Rik van Riel
  2014-08-14 19:37                             ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-14 19:03 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> On 08/14, Rik van Riel wrote:
>>
>> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
>>>
>>> Or you can expand the scope of write_seqlock/write_sequnlock, so that
>>> __unhash_process is called from inside the critical section. This looks
>>> simpler at first glance.
>>
>> The problem with that is that wait_task_zombie() calls
>> thread_group_cputime_adjusted() in that if() branch, and
>> that code ends up taking the seqlock for read...
> 
> Not sure I understand... This modifies parent->signal->c* counters,
> and obviously the exiting thread is not the member of parent's thread
> group, so thread_group_cputime_adjusted(parent) can never account the
> exiting child twice simply because it won't see it?

You are right, the tree of processes only goes one way,
so there should be no deadlock in taking psig->stats_lock
and having thread_group_cputime_adjusted take sig->stats_lock
for read within that section.

However, it might need some lockdep annotation to keep
lockdep from thinking we might take the same lock recursively :)

>> However, in __exit_signal that approach should work.
> 
> Yes,
> 
>>> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
>>> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
>>> accounted unconditionally, even if the group leader exits.
>>>
>>> Probably this is not a big problem, and sys_times() or clock_gettime()
>>> do not care at all because they use current.
>>>
>>> But without this change thread_group_cputime(reaped_zombie) won't look
>>> at this task_struct at all, this can lead to non-monotonic result if
>>> it was previously called when this task was alive (non-reaped).
>>
>> You mean this whole block needs to run regardless of whether
>> the group is dead?
>>
>>                 task_cputime(tsk, &utime, &stime);
>>                 write_seqlock(&sig->stats_lock);
>>                 sig->utime += utime;
>>                 sig->stime += stime;
>>                 sig->gtime += task_gtime(tsk);
>>                 sig->min_flt += tsk->min_flt;
>>                 sig->maj_flt += tsk->maj_flt;
>>                 sig->nvcsw += tsk->nvcsw;
>>                 sig->nivcsw += tsk->nivcsw;
>>                 sig->inblock += task_io_get_inblock(tsk);
>>                 sig->oublock += task_io_get_oublock(tsk);
>>                 task_io_accounting_add(&sig->ioac, &tsk->ioac);
>>                 sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
> 
> Yes.

Let me give that a try and see what happens :)

>> How does that square with wait_task_zombie reaping the
>> statistics of the whole group with thread_group_cputime_adjusted()
>> when the group leader is exiting?
> 
> Again, not sure I understand... thread_group_cputime_adjusted() in
> wait_task_zombie() is fine in any case. Nobody but us can reap this
> zombie.
> 
> It seems that we misunderstood each other, let me try again. Just to
> simplify, suppose we have, say,
> 
> 	sys_times_by_pid(pid, ...)
> 	{
> 		rcu_read_lock();
> 		task = find_task_by_vpid(pid);
> 		if (task)
> 			get_task_struct(task);
> 		rcu_read_unlock();
> 
> 		if (!task)
> 			return -ESRCH;
> 
> 		thread_group_cputime(task, ...);
> 		copy_to_user();
> 		return 0;
> 	}
> 
> Note that this task can exit right after rcu_read_unlock(), and it can
> be also reaped (by its parent or by itself) and removed from the thread
> list. In this case for_each_thread() will see no threads, and thus it
> will only read task->signal->*time.
> 
> This means that sys_times_by_pid() can simply return the wrong result
> instead of failure. Say, It can even return "all zeros" if this task was
> single-threaded.

Ahh, that makes sense.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 19:03                           ` Rik van Riel
@ 2014-08-14 19:37                             ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-14 19:37 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Rik van Riel wrote:
>
> On 08/14/2014 02:15 PM, Oleg Nesterov wrote:
> > On 08/14, Rik van Riel wrote:
> >>
> >> On 08/14/2014 12:12 PM, Oleg Nesterov wrote:
> >>>
> >>> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> >>> __unhash_process is called from inside the critical section. This looks
> >>> simpler at first glance.
> >>
> >> The problem with that is that wait_task_zombie() calls
> >> thread_group_cputime_adjusted() in that if() branch, and
> >> that code ends up taking the seqlock for read...
> >
> > Not sure I understand... This modifies parent->signal->c* counters,
> > and obviously the exiting thread is not the member of parent's thread
> > group, so thread_group_cputime_adjusted(parent) can never account the
> > exiting child twice simply because it won't see it?
>
> You are right, the tree of processes only goes one way,
> so there should be no deadlock in taking psig->stats_lock
> and having thread_group_cputime_adjusted take sig->stats_lock
> for read within that section.
>
> However, it might need some lockdep annotation to keep
> lockdep from thinking we might the same lock recursively :)

But wait_task_zombie() can (and should) call
thread_group_cputime_adjusted(zombie_child) outside of parent's ->siglock
or ->stats_lock, so this should be safe.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 16:12                     ` Oleg Nesterov
  2014-08-14 17:36                       ` Rik van Riel
@ 2014-08-15  2:14                       ` Rik van Riel
  2014-08-15 14:58                         ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-15  2:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On Thu, 14 Aug 2014 18:12:47 +0200
Oleg Nesterov <oleg@redhat.com> wrote:

> Or you can expand the scope of write_seqlock/write_sequnlock, so that
> __unhash_process is called from inside the critical section. This looks
> simpler at first glance.
> 
> Hmm, wait, it seems there is yet another problem ;) Afaics, you also
> need to modify __exit_signal() so that ->sum_sched_runtime/etc are
> accounted unconditionally, even if the group leader exits.

OK, this is what I have now.

I am still seeing time go backwards sometimes, but only by tiny
increments. This suggests that cputime_adjust() may be the
culprit, and I have no good idea how to fix that yet...

Should task_cputime_adjusted and thread_group_cputime_adjusted
be passed the address of a seqlock to use in case the values in
prev need to be updated?

Should we check whether the values in prev changed during the
time spent in the function?

Is this a race between task_cputime_adjusted and other writers
of signal->utime and signal->stime, instead of task_cputime_adjusted
racing with itself?

I am not sure what the best approach here is...

---8<---

Subject: time,signal: protect resource use statistics with seqlock

Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
issues on large systems, due to both functions being serialized with a
lock.

The lock protects against reporting a wrong value, due to a thread in the
task group exiting, its statistics reporting up to the signal struct, and
that exited task's statistics being counted twice (or not at all).

Protecting that with a lock results in times and clock_gettime being
completely serialized on large systems.

This can be fixed by using a seqlock around the events that gather and
propagate statistics. As an additional benefit, the protection code can
be moved into thread_group_cputime, slightly simplifying the calling
functions.

In the case of posix_cpu_clock_get_task things can be simplified a
lot, because the calling function already ensures tsk sticks around,
and the rest is now taken care of in thread_group_cputime.

This way the statistics reporting code can run lockless.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h          |  1 +
 kernel/exit.c                  | 48 +++++++++++++++++++++++-------------------
 kernel/fork.c                  |  1 +
 kernel/sched/cputime.c         | 36 +++++++++++++++++++------------
 kernel/sys.c                   |  2 --
 kernel/time/posix-cpu-timers.c | 14 ------------
 6 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..91f9209 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -646,6 +646,7 @@ struct signal_struct {
 	 * Live threads maintain their own counters and add to these
 	 * in __exit_signal, except for the group leader.
 	 */
+	seqlock_t stats_lock;
 	cputime_t utime, stime, cutime, cstime;
 	cputime_t gtime;
 	cputime_t cgtime;
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..c1a0ef2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -115,32 +115,34 @@ static void __exit_signal(struct task_struct *tsk)
 
 		if (tsk == sig->curr_target)
 			sig->curr_target = next_thread(tsk);
-		/*
-		 * Accumulate here the counters for all threads but the
-		 * group leader as they die, so they can be added into
-		 * the process-wide totals when those are taken.
-		 * The group leader stays around as a zombie as long
-		 * as there are other threads.  When it gets reaped,
-		 * the exit.c code will add its counts into these totals.
-		 * We won't ever get here for the group leader, since it
-		 * will have been the last reference on the signal_struct.
-		 */
-		task_cputime(tsk, &utime, &stime);
-		sig->utime += utime;
-		sig->stime += stime;
-		sig->gtime += task_gtime(tsk);
-		sig->min_flt += tsk->min_flt;
-		sig->maj_flt += tsk->maj_flt;
-		sig->nvcsw += tsk->nvcsw;
-		sig->nivcsw += tsk->nivcsw;
-		sig->inblock += task_io_get_inblock(tsk);
-		sig->oublock += task_io_get_oublock(tsk);
-		task_io_accounting_add(&sig->ioac, &tsk->ioac);
-		sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	}
 
+	/*
+	 * Accumulate here the counters for all threads but the
+	 * group leader as they die, so they can be added into
+	 * the process-wide totals when those are taken.
+	 * The group leader stays around as a zombie as long
+	 * as there are other threads.  When it gets reaped,
+	 * the exit.c code will add its counts into these totals.
+	 * We won't ever get here for the group leader, since it
+	 * will have been the last reference on the signal_struct.
+	 */
+	task_cputime(tsk, &utime, &stime);
+	write_seqlock(&sig->stats_lock);
+	sig->utime += utime;
+	sig->stime += stime;
+	sig->gtime += task_gtime(tsk);
+	sig->min_flt += tsk->min_flt;
+	sig->maj_flt += tsk->maj_flt;
+	sig->nvcsw += tsk->nvcsw;
+	sig->nivcsw += tsk->nivcsw;
+	sig->inblock += task_io_get_inblock(tsk);
+	sig->oublock += task_io_get_oublock(tsk);
+	task_io_accounting_add(&sig->ioac, &tsk->ioac);
+	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
 	sig->nr_threads--;
 	__unhash_process(tsk, group_dead);
+	write_sequnlock(&sig->stats_lock);
 
 	/*
 	 * Do this under ->siglock, we can race with another thread
@@ -1043,6 +1045,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 		spin_lock_irq(&p->real_parent->sighand->siglock);
 		psig = p->real_parent->signal;
 		sig = p->signal;
+		write_seqlock(&psig->stats_lock);
 		psig->cutime += tgutime + sig->cutime;
 		psig->cstime += tgstime + sig->cstime;
 		psig->cgtime += task_gtime(p) + sig->gtime + sig->cgtime;
@@ -1065,6 +1068,7 @@ static int wait_task_zombie(struct wait_opts *wo, struct task_struct *p)
 			psig->cmaxrss = maxrss;
 		task_io_accounting_add(&psig->ioac, &p->ioac);
 		task_io_accounting_add(&psig->ioac, &sig->ioac);
+		write_sequnlock(&psig->stats_lock);
 		spin_unlock_irq(&p->real_parent->sighand->siglock);
 	}
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1380d8a..5d7cf2b 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1068,6 +1068,7 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	sig->curr_target = tsk;
 	init_sigpending(&sig->shared_pending);
 	INIT_LIST_HEAD(&sig->posix_timers);
+	seqlock_init(&sig->stats_lock);
 
 	hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 	sig->real_timer.function = it_real_fn;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 3e52836..b5f1c58 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	struct signal_struct *sig = tsk->signal;
 	cputime_t utime, stime;
 	struct task_struct *t;
-
-	times->utime = sig->utime;
-	times->stime = sig->stime;
-	times->sum_exec_runtime = sig->sum_sched_runtime;
+	unsigned int seq, nextseq;
 
 	rcu_read_lock();
-	for_each_thread(tsk, t) {
-		task_cputime(t, &utime, &stime);
-		times->utime += utime;
-		times->stime += stime;
-		times->sum_exec_runtime += task_sched_runtime(t);
-	}
+	/* Attempt a lockless read on the first round. */
+	nextseq = 0;
+	do {
+		seq = nextseq;
+		read_seqbegin_or_lock(&sig->stats_lock, &seq);
+		times->utime = sig->utime;
+		times->stime = sig->stime;
+		times->sum_exec_runtime = sig->sum_sched_runtime;
+
+		for_each_thread(tsk, t) {
+			task_cputime(t, &utime, &stime);
+			times->utime += utime;
+			times->stime += stime;
+			times->sum_exec_runtime += task_sched_runtime(t);
+		}
+		/*
+		 * If a writer is currently active, seq will be odd, and
+		 * read_seqbegin_or_lock will take the lock.
+		 */
+		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
+	} while (need_seqretry(&sig->stats_lock, seq));
+	done_seqretry(&sig->stats_lock, seq);
 	rcu_read_unlock();
 }
 
@@ -611,9 +624,6 @@ void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 	cputime_adjust(&cputime, &p->prev_cputime, ut, st);
 }
 
-/*
- * Must be called with siglock held.
- */
 void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime;
diff --git a/kernel/sys.c b/kernel/sys.c
index ce81291..b663664 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -862,11 +862,9 @@ void do_sys_times(struct tms *tms)
 {
 	cputime_t tgutime, tgstime, cutime, cstime;
 
-	spin_lock_irq(&current->sighand->siglock);
 	thread_group_cputime_adjusted(current, &tgutime, &tgstime);
 	cutime = current->signal->cutime;
 	cstime = current->signal->cstime;
-	spin_unlock_irq(&current->sighand->siglock);
 	tms->tms_utime = cputime_to_clock_t(tgutime);
 	tms->tms_stime = cputime_to_clock_t(tgstime);
 	tms->tms_cutime = cputime_to_clock_t(cutime);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 3b89464..492b986 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -272,22 +272,8 @@ static int posix_cpu_clock_get_task(struct task_struct *tsk,
 		if (same_thread_group(tsk, current))
 			err = cpu_clock_sample(which_clock, tsk, &rtn);
 	} else {
-		unsigned long flags;
-		struct sighand_struct *sighand;
-
-		/*
-		 * while_each_thread() is not yet entirely RCU safe,
-		 * keep locking the group while sampling process
-		 * clock for now.
-		 */
-		sighand = lock_task_sighand(tsk, &flags);
-		if (!sighand)
-			return err;
-
 		if (tsk == current || thread_group_leader(tsk))
 			err = cpu_clock_sample_group(which_clock, tsk, &rtn);
-
-		unlock_task_sighand(tsk, &flags);
 	}
 
 	if (!err)

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 14:39                       ` Oleg Nesterov
@ 2014-08-15  2:52                         ` Frederic Weisbecker
  2014-08-15 14:26                           ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Frederic Weisbecker @ 2014-08-15  2:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, LKML, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

2014-08-14 16:39 GMT+02:00 Oleg Nesterov <oleg@redhat.com>:
> On 08/14, Frederic Weisbecker wrote:
>>
>> 2014-08-14 3:57 GMT+02:00 Rik van Riel <riel@redhat.com>:
>> > -----BEGIN PGP SIGNED MESSAGE-----
>> > Hash: SHA1
>> >
>> > On 08/13/2014 08:43 PM, Frederic Weisbecker wrote:
>> >> On Wed, Aug 13, 2014 at 05:03:24PM -0400, Rik van Riel wrote:
>> >>
>> >> I'm worried about such a lockless solution based on RCU or read
>> >> seqcount because we lose the guarantee that an update is
>> >> immediately visible by all subsequent readers.
>> >>
>> >> Say CPU 0 updates the thread time and both CPU 1 and CPU 2 right
>> >> after that call clock_gettime(), with the spinlock we were
>> >> guaranteed to see the new update. Now with a pure seqlock read
>> >> approach, we guarantee a read sequence coherency but we don't
>> >> guarantee the freshest update result.
>> >>
>> >> So that looks like a source of non monotonic results.
>> >
>> > Which update are you worried about, specifically?
>> >
>> > The seq_write_lock to update the usage stat in p->signal will lock out
>> > the seqlock read side used to check those results.
>> >
>> > Is there another kind of thing read by cpu_clock_sample_group that you
>> > believe is not excluded by the seq_lock?
>>
>> I mean the read side doesn't use a lock with seqlocks. It's only made
>> of barriers and sequence numbers to ensure the reader doesn't read
>> some half-complete update. But other than that it can as well see the
>> update n - 1 since barriers don't enforce latest results.
>
> Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> write_seqcount_begin-update-write_seqcount_end can miss "update" part
> along with ->sequence modifications.
>
> But I still can't understand how this can lead to non-monotonic results,
> could you spell?

Well, let's say clock = T.
CPU 0 updates at T + 1.
Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
while CPU 2 still reads T.
If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
possible that CPU 2 still sees T. With the spinlocked version that
thing can't happen, the second round would read at least T + 1 for
both CPUs.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-14 17:48                   ` Oleg Nesterov
  2014-08-14 18:34                     ` Oleg Nesterov
@ 2014-08-15  5:19                     ` Mike Galbraith
  2014-08-15  6:28                       ` Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2014-08-15  5:19 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, linux-kernel, Peter Zijlstra, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On Thu, 2014-08-14 at 19:48 +0200, Oleg Nesterov wrote: 
> On 08/14, Oleg Nesterov wrote:
> >
> > OK, lets forget about alternative approach for now. We can reconsider
> > it later. At least I have to admit that seqlock is more straightforward.
> 
> Yes.
> 
> But just for record, the "lockless" version doesn't look that bad to me,
> 
> 	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> 	{
> 		struct signal_struct *sig = tsk->signal;
> 		bool lockless, is_dead;
> 		struct task_struct *t;
> 		unsigned long flags;
> 		u64 exec;
> 
> 		lockless = true;
> 		is_dead = !lock_task_sighand(tsk, &flags);
> 	 retry:
> 		times->utime = sig->utime;
> 		times->stime = sig->stime;
> 		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
> 		if (is_dead)
> 			return;
> 
> 		if (lockless)
> 			unlock_task_sighand(tsk, &flags);
> 
> 		rcu_read_lock();
> 		for_each_thread(tsk, t) {
> 			cputime_t utime, stime;
> 			task_cputime(t, &utime, &stime);
> 			times->utime += utime;
> 			times->stime += stime;
> 			times->sum_exec_runtime += task_sched_runtime(t);
> 		}
> 		rcu_read_unlock();
> 
> 		if (lockless) {
> 			lockless = false;
> 			is_dead = !lock_task_sighand(tsk, &flags);
> 			if (is_dead || exec != sig->sum_sched_runtime)
> 				goto retry;
> 		}
> 		unlock_task_sighand(tsk, &flags);
> 	}
> 
> The obvious problem is that we should shift lock_task_sighand() from the
> callers to thread_group_cputime() first, or add thread_group_cputime_lockless()
> and change the current users one by one.
> 
> And of course, stats_lock is more generic.

Yours looks nice to me, particularly in that it doesn't munge structure
layout, and could perhaps be backported to fix up production kernels.

For the N threads doing this on N cores case, seems rq->lock hammering
will still be a source of major box wide pain.  Is there any correctness
reason to add up unaccounted ->on_cpu beans, or is that just value
added?  Seems to me it can't matter, as you traverse, what you added up
on previous threads becomes ever more stale as you proceed, so big boxen
would be better off not doing that.

-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15  5:19                     ` Mike Galbraith
@ 2014-08-15  6:28                       ` Peter Zijlstra
  2014-08-15  9:37                         ` Mike Galbraith
  2014-08-15 16:36                         ` Oleg Nesterov
  0 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2014-08-15  6:28 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Oleg Nesterov, Rik van Riel, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

[-- Attachment #1: Type: text/plain, Size: 689 bytes --]

On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> For the N threads doing this on N cores case, seems rq->lock hammering
> will still be a source of major box wide pain.  Is there any correctness
> reason to add up unaccounted ->on_cpu beans, or is that just value
> added?  

That delta can be arbitrarily large with nohz_full. And without
nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
reported clock resolution.

Having a non-constant error bound is annoying for you never quite know
what to expect.
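
(For scale, an illustrative number not taken from the thread: with HZ=1000,
TICK_NSEC is roughly 1 ms, so on an 80-CPU box that bound is on the order
of 80 ms.)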

Also; why do we care about PROCESS_CPUTIME? People should really not use
it. What are the 'valid' usecases you guys care about?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15  6:28                       ` Peter Zijlstra
@ 2014-08-15  9:37                         ` Mike Galbraith
  2014-08-15  9:44                           ` Peter Zijlstra
  2014-08-15 16:36                         ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: Mike Galbraith @ 2014-08-15  9:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Rik van Riel, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote: 
> On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> > For the N threads doing this on N cores case, seems rq->lock hammering
> > will still be a source of major box wide pain.  Is there any correctness
> > reason to add up unaccounted ->on_cpu beans, or is that just value
> > added?  
> 
> That delta can be arbitrarily large with nohz_full. And without
> nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
> reported clock resolution.
> 
> Having a non-constant error bound is annoying for you never quite know
> what to expect.

Ah, yeah, that could get rather large.

> Also; why do we care about PROCESS_CPUTIME? People should really not use
> it. What are the 'valid' usecases you guys care about?

I don't care much, said "don't do that" before I saw a similar big box
problem had popped up with times().

-Mike


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15  9:37                         ` Mike Galbraith
@ 2014-08-15  9:44                           ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2014-08-15  9:44 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Oleg Nesterov, Rik van Riel, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

[-- Attachment #1: Type: text/plain, Size: 1352 bytes --]

On Fri, Aug 15, 2014 at 11:37:33AM +0200, Mike Galbraith wrote:
> On Fri, 2014-08-15 at 08:28 +0200, Peter Zijlstra wrote: 
> > On Fri, Aug 15, 2014 at 07:19:31AM +0200, Mike Galbraith wrote:
> > > For the N threads doing this on N cores case, seems rq->lock hammering
> > > will still be a source of major box wide pain.  Is there any correctness
> > > reason to add up unaccounted ->on_cpu beans, or is that just value
> > > added?  
> > 
> > That delta can be arbitrarily large with nohz_full. And without
> > nohz_full the error is nr_cpus*TICK_NSEC, which I bet is larger than the
> > reported clock resolution.
> > 
> > Having a non-constant error bound is annoying for you never quite know
> > what to expect.
> 
> Ah, yeah, that could get rather large.
> 
> > Also; why do we care about PROCESS_CPUTIME? People should really not use
> > it. What are the 'valid' usecases you guys care about?
> 
> I don't care much, said "don't do that" before I saw a similar big box
> problem had popped up with times().

Urgh, yes times().. Now I don't think we do very accurate accounting of
those particular numbers, so we could fudge some of that. Typically we
only do TICK_NSEC granularity accounting on user/system divide anyhow,
seeing how putting timestamp reads in the kernel<>user switch is
_expensive_ -- see NOHZ_FULL.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15  2:52                         ` Frederic Weisbecker
@ 2014-08-15 14:26                           ` Oleg Nesterov
  2014-08-15 22:33                             ` Frederic Weisbecker
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-15 14:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Rik van Riel, LKML, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/15, Frederic Weisbecker wrote:
>
> 2014-08-14 16:39 GMT+02:00 Oleg Nesterov <oleg@redhat.com>:
> > On 08/14, Frederic Weisbecker wrote:
> >>
> >> I mean the read side doesn't use a lock with seqlocks. It's only made
> >> of barriers and sequence numbers to ensure the reader doesn't read
> >> some half-complete update. But other than that it can as well see the
> >> update n - 1 since barriers don't enforce latest results.
> >
> > Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> > write_seqcount_begin-update-write_seqcount_end can miss "update" part
> > along with ->sequence modifications.
> >
> > But I still can't understand how this can lead to non-monotonic results,
> > could you spell?
>
> Well, let's say clock = T.
> CPU 0 updates at T + 1.
> Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
> while CPU 2 still reads T.
> If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
> possible that CPU 2 still sees T. With the spinlocked version that
> thing can't happen, the second round would read at least T + 1 for
> both CPUs.

But this is fine? And CPU 2 doesn't see a non-monotonic result?

OK, this could be wrong if, say,

	void print_clock(void)
	{
		lock(SOME_LOCK);
		printk(..., clock_gettime());
		unlock(SOME_LOCK);
	}
	
printed the non-monotonic numbers if print_clock() is called on CPU_1 and
then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if
they were already visible to CPU_1 under the same lock. IOW,

	int T = 0;	/* can be incremented at any time */

	void check_monotony(void)
	{
		static int t = 0;

		lock(SOME_LOCK);
		BUG(t > T);
		t = T;
		unlock(SOME_LOCK);
	}

must work correctly (ignoring overflow) even if T is changed without
SOME_LOCK.

Otherwise, without some sort of synchronization the different results on
CPU_1/2 should be fine.

Or I am still missing your point?

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15  2:14                       ` Rik van Riel
@ 2014-08-15 14:58                         ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-15 14:58 UTC (permalink / raw)
  To: Rik van Riel
  Cc: linux-kernel, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On 08/14, Rik van Riel wrote:
>
> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	struct signal_struct *sig = tsk->signal;
>  	cputime_t utime, stime;
>  	struct task_struct *t;
> -
> -	times->utime = sig->utime;
> -	times->stime = sig->stime;
> -	times->sum_exec_runtime = sig->sum_sched_runtime;
> +	unsigned int seq, nextseq;
>  
>  	rcu_read_lock();
> -	for_each_thread(tsk, t) {
> -		task_cputime(t, &utime, &stime);
> -		times->utime += utime;
> -		times->stime += stime;
> -		times->sum_exec_runtime += task_sched_runtime(t);
> -	}
> +	/* Attempt a lockless read on the first round. */
> +	nextseq = 0;
> +	do {
> +		seq = nextseq;
> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
> +		times->utime = sig->utime;
> +		times->stime = sig->stime;
> +		times->sum_exec_runtime = sig->sum_sched_runtime;
> +
> +		for_each_thread(tsk, t) {
> +			task_cputime(t, &utime, &stime);
> +			times->utime += utime;
> +			times->stime += stime;
> +			times->sum_exec_runtime += task_sched_runtime(t);
> +		}
> +		/*
> +		 * If a writer is currently active, seq will be odd, and
> +		 * read_seqbegin_or_lock will take the lock.
> +		 */
> +		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
> +	} while (need_seqretry(&sig->stats_lock, seq));
> +	done_seqretry(&sig->stats_lock, seq);
>  	rcu_read_unlock();
>  }

I still think this is not right. Let me quote my previous email,

	> @@ -288,18 +288,31 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
	>  	struct signal_struct *sig = tsk->signal;
	>  	cputime_t utime, stime;
	>  	struct task_struct *t;
	> -
	> -	times->utime = sig->utime;
	> -	times->stime = sig->stime;
	> -	times->sum_exec_runtime = sig->sum_sched_runtime;
	> +	unsigned int seq, nextseq;
	>
	>  	rcu_read_lock();

	Almost cosmetic nit, but afaics this patch expands the rcu critical section
	for no reason. We only need rcu_read_lock/unlock around for_each_thread()
	below.

	> +	nextseq = 0;
	> +	do {
	> +		seq = nextseq;
	> +		read_seqbegin_or_lock(&sig->stats_lock, &seq);
	> +		times->utime = sig->utime;
	> +		times->stime = sig->stime;
	> +		times->sum_exec_runtime = sig->sum_sched_runtime;
	> +
	> +		for_each_thread(tsk, t) {
	> +			task_cputime(t, &utime, &stime);
	> +			times->utime += utime;
	> +			times->stime += stime;
	> +			times->sum_exec_runtime += task_sched_runtime(t);
	> +		}
	> +		/*
	> +		 * If a writer is currently active, seq will be odd, and
	> +		 * read_seqbegin_or_lock will take the lock.
	> +		 */
	> +		nextseq = raw_read_seqcount(&sig->stats_lock.seqcount);
	> +	} while (need_seqretry(&sig->stats_lock, seq));
	> +	done_seqretry(&sig->stats_lock, seq);

	Hmm. It seems that read_seqbegin_or_lock() is not used correctly. I mean,
	this code can still livelock in theory. Just suppose that another CPU does
	write_seqlock/write_sequnlock right after read_seqbegin_or_lock(). In this
	case "seq & 1" will never be true and thus "or_lock" will never happen.

	IMO, this should be fixed. Either we should guarantee the forward progress
	or we should not play with read_seqbegin_or_lock() at all. This code assumes
	that sooner or later "nextseq = raw_read_seqcount()" should return the odd
	counter, but in theory this can never happen.

	And if we want to fix this we do not need 2 counters, we just need to set
	"seq = 1" manually after need_seqretry() == T. Say, like __dentry_path() does.
	(but unlike __dentry_path() we do not need to worry about rcu_read_unlock so
	the code will be simpler).

	I am wondering if it makes sense to introduce

		bool read_seqretry_or_lock(seqlock_t *lock, int *seq)
		{
			if (*seq & 1) {
				read_sequnlock_excl(lock);
				return false;
			}
		
			if (!read_seqretry(lock, *seq))
				return false;
		
			*seq = 1;
			return true;
		}
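
	With such a helper the reader loop above could become (just a sketch of
	the intended usage, not a tested patch):

		seq = 0;
		do {
			read_seqbegin_or_lock(&sig->stats_lock, &seq);
			/* ... sum up utime/stime/sum_exec_runtime ... */
		} while (read_seqretry_or_lock(&sig->stats_lock, &seq));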

Or I missed your reply?

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15  6:28                       ` Peter Zijlstra
  2014-08-15  9:37                         ` Mike Galbraith
@ 2014-08-15 16:36                         ` Oleg Nesterov
  2014-08-15 16:49                           ` Oleg Nesterov
  1 sibling, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-15 16:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Rik van Riel, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On 08/15, Peter Zijlstra wrote:
>
> Also; why do we care about PROCESS_CPUTIME? People should really not use
> it. What are the 'valid' usecases you guys care about?

I do not really know. IIUC, the problematic usecase is sys_times().

I agree with Mike, "don't do this if you have a lot of threads". But
perhaps the kernel can help applications which already abuse times().

However, if we only want to make sys_times() more scalable, then
perhaps the "lockless" version of thread_group_cputime() makes more
sense. And given that do_sys_times() uses current we can simplify it;
is_dead is not possible and we do not need to take ->siglock twice:

	void current_group_cputime(struct task_cputime *times)
	{
		struct task_struct *tsk = current, *t;
		spinlock_t *siglock = &tsk->sighand->siglock;
		struct signal_struct *sig = tsk->signal;
		bool lockless = true;
		u64 exec;

	 retry:
		spin_lock_irq(siglock);
		times->utime = sig->utime;
		times->stime = sig->stime;
		times->sum_exec_runtime = exec = sig->sum_sched_runtime;

		if (lockless)
			spin_unlock_irq(siglock);

		rcu_read_lock();
		for_each_thread(tsk, t) {
			cputime_t utime, stime;
			task_cputime(t, &utime, &stime);
			times->utime += utime;
			times->stime += stime;
			times->sum_exec_runtime += task_sched_runtime(t);
		}
		rcu_read_unlock();

		if (lockless) {
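			/*
			 * Lockless pass done: wait for any writer currently
			 * holding siglock, then recheck sum_sched_runtime.
			 * If it changed, an update (e.g. an exiting thread)
			 * raced with us; redo the walk, now under siglock.
			 */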
			lockless = false;
			spin_unlock_wait(siglock);
			smp_rmb();
			if (exec != sig->sum_sched_runtime)
				goto retry;
		} else {
			spin_unlock_irq(siglock);
		}
	}

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15 16:36                         ` Oleg Nesterov
@ 2014-08-15 16:49                           ` Oleg Nesterov
  2014-08-15 17:25                             ` Rik van Riel
  0 siblings, 1 reply; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-15 16:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Rik van Riel, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On 08/15, Oleg Nesterov wrote:
>
> However, if we only want to make sys_times() more scalable, then
> perhaps the "lockless" version of thread_group_cputime() makes more
> sense. And given that do_sys_times() uses current we can simplify it;
> is_dead is not possible and we do not need to take ->siglock twice:
>
> 	void current_group_cputime(struct task_cputime *times)
> 	{
> 		struct task_struct *tsk = current, *t;
> 		spinlock_t *siglock = &tsk->sighand->siglock;
> 		struct signal_struct *sig = tsk->signal;
> 		bool lockless = true;
> 		u64 exec;
>
> 	 retry:
> 		spin_lock_irq(siglock);
> 		times->utime = sig->utime;
> 		times->stime = sig->stime;
> 		times->sum_exec_runtime = exec = sig->sum_sched_runtime;
>
> 		if (lockless)
> 			spin_unlock_irq(siglock);
>
> 		rcu_read_lock();
> 		for_each_thread(tsk, t) {
> 			cputime_t utime, stime;
> 			task_cputime(t, &utime, &stime);
> 			times->utime += utime;
> 			times->stime += stime;
> 			times->sum_exec_runtime += task_sched_runtime(t);
> 		}
> 		rcu_read_unlock();
>
> 		if (lockless) {
> 			lockless = false;
> 			spin_unlock_wait(siglock);
> 			smp_rmb();
> 			if (exec != sig->sum_sched_runtime)
> 				goto retry;
> 		} else {
> 			spin_unlock_irq(siglock);
> 		}
> 	}

Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable. Just
I do not know if it's worth the trouble.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15 16:49                           ` Oleg Nesterov
@ 2014-08-15 17:25                             ` Rik van Riel
  2014-08-15 18:36                               ` Oleg Nesterov
  0 siblings, 1 reply; 49+ messages in thread
From: Rik van Riel @ 2014-08-15 17:25 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra
  Cc: Mike Galbraith, linux-kernel, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 08/15/2014 12:49 PM, Oleg Nesterov wrote:

> Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable.
> Just I do not know if it's worth the trouble.

If we don't know whether it is worth the trouble, it is probably best
to stick to a well-known generic locking algorithm, instead of brewing
our own and trying to maintain it.

I have fixed the other locking issue you pointed out, Oleg.

Now to see if this change to cputime_adjust does the trick :)

+++ b/kernel/sched/cputime.c
@@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
 	 * If the tick based count grows faster than the scheduler one,
 	 * the result of the scaling may go backward.
 	 * Let's enforce monotonicity.
+	 * Atomic exchange protects against concurrent cputime_adjust.
 	 */
- -	prev->stime = max(prev->stime, stime);
- -	prev->utime = max(prev->utime, utime);
+	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
+		cmpxchg(&prev->stime, rtime, stime);
+	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
+		cmpxchg(&prev->utime, rtime, utime);
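
The same idea as a stand-alone helper, for illustration (hypothetical name,
and it assumes cputime_t is a word-sized type so cmpxchg() can operate on
it directly):

	/* Advance *prev to val, but never move it backwards, without a lock. */
	static void cputime_advance(cputime_t *prev, cputime_t val)
	{
		cputime_t old;

		while (val > (old = ACCESS_ONCE(*prev)))
			cmpxchg(prev, old, val);
	}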

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJT7kJ4AAoJEM553pKExN6Do/oH/2lA5X/CrVuhOLBK1sVq3kRh
gGiOTT9pDQZH1wwafVNHKWaro3T/s9GNqemgvgt4UiKbjFeYkaOycHp1cuntJj8j
Wk8zNnWBOuGqqcSxzk1Duco3CByxshLNXxuYJfpdkdEXPqRyvURAOL58pxSybZzh
E6lT747ntFJu3GIbfC6Ta3q58pWLpVrhWlvonhSaqat6tOvlzo4MKiJxz3SbT6i0
cCpmQ5p/JoQ5+IUEbTOZYbE2bK2y5tSrMggAFwKWLB3/0zJm1h4+2Q/5PenCX59X
VDFmaOJLkNxGcVXg8x87itvqzfq/LkvDtwl9tTJmA5ccG37MPvM3803XF5OWVo0=
=aKES
-----END PGP SIGNATURE-----

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15 17:25                             ` Rik van Riel
@ 2014-08-15 18:36                               ` Oleg Nesterov
  0 siblings, 0 replies; 49+ messages in thread
From: Oleg Nesterov @ 2014-08-15 18:36 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Peter Zijlstra, Mike Galbraith, linux-kernel, Hidetoshi Seto,
	Frank Mayhar, Frederic Weisbecker, Andrew Morton, Sanjay Rao,
	Larry Woodman

On 08/15, Rik van Riel wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 08/15/2014 12:49 PM, Oleg Nesterov wrote:
>
> > Just in case... Yes, sure, "seqlock_t stats_lock" is more scalable.
> > Just I do not know if it's worth the trouble.
>
> If we don't know whether it is worth the trouble, it is probably best
> to stick to a well-known generic locking algorithm, instead of brewing
> our own and trying to maintain it.

Perhaps. I am obviously biased and can't judge ;) Plus, again, I do
understand that your approach has some advantages too.

> Now to see if this change to cputime_adjust does the trick :)
>
> +++ b/kernel/sched/cputime.c
> @@ -605,9 +605,12 @@ static void cputime_adjust(struct task_cputime *curr,
>  	 * If the tick based count grows faster than the scheduler one,
>  	 * the result of the scaling may go backward.
>  	 * Let's enforce monotonicity.
> +	 * Atomic exchange protects against concurrent cputime_adjust.
>  	 */
> - -	prev->stime = max(prev->stime, stime);
> - -	prev->utime = max(prev->utime, utime);
> +	while (stime > (rtime = ACCESS_ONCE(prev->stime)))
> +		cmpxchg(&prev->stime, rtime, stime);
> +	while (utime > (rtime = ACCESS_ONCE(prev->utime)))
> +		cmpxchg(&prev->utime, rtime, utime);

Yes, perhaps we need something like this in any case. As a reminder, at least
do_task_stat() calls task_cputime_adjusted() lockless, although we could
fix this separately.

But I do not think the change above is enough. With this change cputime_adjust()
can race with itself. Yes, this guarantees monotonicity even if it is called
lockless, but this can lead to "obviously inconsistent" numbers.

And I don't think we can ignore this. If we could, then we can remove the
scale_stime recalculation and change cputime_adjust() to simply do:

	static void cputime_adjust(struct task_cputime *curr,
				   struct cputime *prev,
				   cputime_t *ut, cputime_t *st)
	{
		/* enforce monotonicity */
		*ut = prev->utime = max(prev->utime, curr->utime);
		*st = prev->stime = max(prev->stime, curr->stime);
	}

Yes, we have this problem either way. And personally I think that this
"enforce monotonicity" logic is pointless, userspace could take care,
but it is too late to complain.

Oleg.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC] time,signal: protect resource use statistics with seqlock
  2014-08-15 14:26                           ` Oleg Nesterov
@ 2014-08-15 22:33                             ` Frederic Weisbecker
  0 siblings, 0 replies; 49+ messages in thread
From: Frederic Weisbecker @ 2014-08-15 22:33 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Rik van Riel, LKML, Peter Zijlstra, Hidetoshi Seto, Frank Mayhar,
	Frederic Weisbecker, Andrew Morton, Sanjay Rao, Larry Woodman

On Fri, Aug 15, 2014 at 04:26:01PM +0200, Oleg Nesterov wrote:
> On 08/15, Frederic Weisbecker wrote:
> >
> > 2014-08-14 16:39 GMT+02:00 Oleg Nesterov <oleg@redhat.com>:
> > > On 08/14, Frederic Weisbecker wrote:
> > >>
> > >> I mean the read side doesn't use a lock with seqlocks. It's only made
> > >> of barriers and sequence numbers to ensure the reader doesn't read
> > >> some half-complete update. But other than that it can as well see the
> > >> update n - 1 since barriers don't enforce latest results.
> > >
> > > Yes, sure, read_seqcount_begin/read_seqcount_retry "right after"
> > > write_seqcount_begin-update-write_seqcount_end can miss "update" part
> > > along with ->sequence modifications.
> > >
> > > But I still can't understand how this can lead to non-monotonic results,
> > > could you spell?
> >
> > Well, let's say clock = T.
> > CPU 0 updates at T + 1.
> > Then I call clock_gettime() from CPU 1 and CPU 2. CPU 1 reads T + 1
> > while CPU 2 still reads T.
> > If I do yet another round of clock_gettime() on CPU 1 and CPU 2, it's
> > possible that CPU 2 still sees T. With the spinlocked version that
> > thing can't happen, the second round would read at least T + 1 for
> > both CPUs.
> 
> But this is fine? And CPU 2 doesn't see a non-monotonic result?
> 
> OK, this could be wrong if, say,
> 
> 	void print_clock(void)
> 	{
> 		lock(SOME_LOCK);
> 		printk(..., clock_gettime());
> 		unlock(SOME_LOCK);
> 	}
> 	
> printed the non-monotonic numbers if print_clock() is called on CPU_1 and
> then on CPU_2. But in this case CPU_2 can't miss the changes on CPU_0 if
> they were already visible to CPU_1 under the same lock. IOW,
> 
> 	int T = 0;	/* can be incremented at any time */
> 
> 	void check_monotony(void)
> 	{
> 		static int t = 0;
> 
> 		lock(SOME_LOCK);
> 		BUG(t > T);
> 		t = T;
> 		unlock(SOME_LOCK);
> 	}
> 
> must work correctly (ignoring overflow) even if T is changed without
> SOME_LOCK.
> 
> Otherwise, without some sort of synchronization the different results on
> CPU_1/2 should be fine.
> 
> Or I am still missing your point?

No I think you're right, as long as ordering against something else is involved,
monotonicity is enforced.

Now I'm trying to think about a case where SMP ordering isn't involved.
Perhaps some usecase based on coupling CPU local clocks and clock_gettime()
where a drift between both can appear. Now using a local clock probably only
makes sense in the context of local usecases where the thread clock update
would be local as well. So that's probably not a problem. Now what if somebody
couples multithread process-wide clocks with per-CPU local clocks? Well, that's
probably too foolish to be considered.

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2014-08-15 22:33 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-12 18:25 [PATCH RFC] time: drop do_sys_times spinlock Rik van Riel
2014-08-12 19:12 ` Oleg Nesterov
2014-08-12 19:22   ` Rik van Riel
2014-08-12 22:27   ` Rik van Riel
2014-08-13 17:22     ` Oleg Nesterov
2014-08-13 17:35       ` Rik van Riel
2014-08-13 18:08         ` Oleg Nesterov
2014-08-13 18:25           ` Rik van Riel
2014-08-13 18:45             ` Oleg Nesterov
2014-08-13 18:57               ` Rik van Riel
2014-08-13 21:03               ` [PATCH RFC] time,signal: protect resource use statistics with seqlock Rik van Riel
2014-08-14  0:43                 ` Frederic Weisbecker
2014-08-14  1:57                   ` Rik van Riel
2014-08-14 13:34                     ` Frederic Weisbecker
2014-08-14 14:39                       ` Oleg Nesterov
2014-08-15  2:52                         ` Frederic Weisbecker
2014-08-15 14:26                           ` Oleg Nesterov
2014-08-15 22:33                             ` Frederic Weisbecker
2014-08-14 13:22                 ` Oleg Nesterov
2014-08-14 13:38                   ` Frederic Weisbecker
2014-08-14 13:53                     ` Oleg Nesterov
2014-08-14 17:48                   ` Oleg Nesterov
2014-08-14 18:34                     ` Oleg Nesterov
2014-08-15  5:19                     ` Mike Galbraith
2014-08-15  6:28                       ` Peter Zijlstra
2014-08-15  9:37                         ` Mike Galbraith
2014-08-15  9:44                           ` Peter Zijlstra
2014-08-15 16:36                         ` Oleg Nesterov
2014-08-15 16:49                           ` Oleg Nesterov
2014-08-15 17:25                             ` Rik van Riel
2014-08-15 18:36                               ` Oleg Nesterov
2014-08-14 14:24                 ` Oleg Nesterov
2014-08-14 15:37                   ` Rik van Riel
2014-08-14 16:12                     ` Oleg Nesterov
2014-08-14 17:36                       ` Rik van Riel
2014-08-14 18:15                         ` Oleg Nesterov
2014-08-14 19:03                           ` Rik van Riel
2014-08-14 19:37                             ` Oleg Nesterov
2014-08-15  2:14                       ` Rik van Riel
2014-08-15 14:58                         ` Oleg Nesterov
2014-08-13 21:03               ` Rik van Riel
2014-08-13 17:40       ` [PATCH RFC] time: drop do_sys_times spinlock Peter Zijlstra
2014-08-13 17:50         ` Rik van Riel
2014-08-13 17:53           ` Peter Zijlstra
2014-08-13  6:59   ` Mike Galbraith
2014-08-13 11:11     ` Peter Zijlstra
2014-08-13 13:24       ` Rik van Riel
2014-08-13 13:39         ` Peter Zijlstra
2014-08-13 14:09           ` Mike Galbraith
