* [PATCH] sched/cputime: do not account thread group tasks pending runtime to improve performance
@ 2016-08-17  9:30 Stanislaw Gruszka
From: Stanislaw Gruszka @ 2016-08-17  9:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Giovanni Gherdovich,
	Mel Gorman

Commit d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") made us
account thread group tasks' pending runtime in thread_group_cputime().
Another commit, 6e998916dfe32 ("sched/cputime:
Fix clock_nanosleep()/clock_gettime() inconsistency"), made us update
scheduler runtime statistics (call update_curr()) when reading a task's
pending runtime. Those changes cause bad performance of the times() and
clock_gettime(CLOCK_PROCESS_CPUTIME_ID) syscalls.

While we would like to keep cpuclock monotonicity, i.e. keep the
problems fixed by the above commits fixed, we would also like to have
good performance.

However, the change from commit d670ec13178d0 is no longer needed to
solve the problem it addressed, because of the change from the second
commit, 6e998916dfe32, and that gives us room for optimization. Since we
update a task while reading its pending runtime in task_sched_runtime(),
clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will see updated values, and on
the test case from d670ec13178d0 the process cpuclock will not be
smaller than the thread cpuclock.
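
This can be checked from userspace; below is a minimal sketch of such a
monotonicity check (an illustration only, not the original test case
from d670ec13178d0). The thread cpuclock is read first and the process
cpuclock second, so the process value must never be the smaller one:

  #include <assert.h>
  #include <stdio.h>
  #include <time.h>

  static long long ns(clockid_t clk)
  {
          struct timespec ts;

          clock_gettime(clk, &ts);
          return ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  int main(void)
  {
          long long t, p;
          int i;

          for (i = 0; i < 1000000; i++) {
                  /* The process cpuclock includes this thread's time
                   * and is read later, so it must never lag behind. */
                  t = ns(CLOCK_THREAD_CPUTIME_ID);
                  p = ns(CLOCK_PROCESS_CPUTIME_ID);
                  assert(p >= t);
          }
          printf("no monotonicity violation observed\n");
          return 0;
  }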

I tested the patch on the test cases from commits d670ec13178d0 and
6e998916dfe32 and on some other cpuclock/cputimer test cases, and
did not find any cpuclock monotonicity problems or other malfunctions.

The patch has the drawback that we no longer provide thread group
cputime that is up to date to the last moment. For example, when arming
a cputime timer, we will arm it with possibly slightly outdated values,
and that timer will trigger earlier than it would without the patch.
However, that was the behaviour before commit d670ec13178d0 (kernel v3.1).
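
To make the affected interface concrete, here is a minimal, hypothetical
sketch of arming such a per-process CPU-time timer (the 100ms value is
made up for illustration; link with -lrt on older glibc):

  #include <signal.h>
  #include <time.h>

  int main(void)
  {
          timer_t timerid;
          struct sigevent sev = {
                  .sigev_notify = SIGEV_SIGNAL,
                  .sigev_signo  = SIGALRM,  /* default action kills us */
          };
          /* Fire after 100ms of process CPU time. With the patch the
           * kernel may arm this against slightly stale group runtime,
           * so expiry can come marginally early. */
          struct itimerspec its = {
                  .it_value.tv_nsec = 100 * 1000 * 1000,
          };

          if (timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &timerid))
                  return 1;
          if (timer_settime(timerid, 0, &its, NULL))
                  return 1;

          for (;;)        /* burn CPU until the timer fires */
                  ;
  }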

The patch improves the performance of the related syscalls, as measured
by Giovanni with the benchmarks he described in -tip commit 6075620b0590e
("sched/cputime: Mitigate performance regression in
times()/clock_gettime()"). Giovanni's benchmark results are below:
 
clock_gettime():

threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                       (pre-6e998916dfe3)
2          3.48        2.23 ( 35.68%)        3.06 ( 11.83%)        1.08 ( 68.81%)
5          3.33        2.83 ( 14.84%)        3.25 (  2.40%)        0.71 ( 78.55%)
8          3.37        2.84 ( 15.80%)        3.26 (  3.30%)        0.56 ( 83.49%)
12         3.32        3.09 (  6.69%)        3.37 ( -1.60%)        0.42 ( 87.28%)
21         4.01        3.14 ( 21.70%)        3.90 (  2.74%)        0.35 ( 91.35%)
30         3.63        3.28 (  9.75%)        3.36 (  7.41%)        0.28 ( 92.23%)
48         3.71        3.02 ( 18.69%)        3.11 ( 16.27%)        0.39 ( 89.39%)
79         3.75        2.88 ( 23.23%)        3.16 ( 15.74%)        0.46 ( 87.76%)
110        3.81        2.95 ( 22.62%)        3.25 ( 14.80%)        0.56 ( 85.41%)
128        3.88        3.05 ( 21.28%)        3.31 ( 14.76%)        0.62 ( 84.10%)

times():

threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                       (pre-6e998916dfe3)
2          3.65        2.27 ( 37.94%)        3.25 ( 11.03%)        1.62 ( 55.71%)
5          3.45        2.78 ( 19.34%)        3.17 (  7.92%)        2.33 ( 32.28%)
8          3.52        2.79 ( 20.66%)        3.22 (  8.69%)        2.06 ( 41.44%)
12         3.29        3.02 (  8.33%)        3.36 ( -2.04%)        2.00 ( 39.18%)
21         4.07        3.10 ( 23.86%)        3.92 (  3.78%)        2.07 ( 49.18%)
30         3.87        3.33 ( 13.80%)        3.40 ( 12.17%)        1.89 ( 51.12%)
48         3.79        2.96 ( 21.94%)        3.16 ( 16.61%)        1.69 ( 55.46%)
79         3.88        2.88 ( 25.82%)        3.28 ( 15.42%)        1.60 ( 58.81%)
110        3.90        2.98 ( 23.73%)        3.38 ( 13.35%)        1.73 ( 55.61%)
128        4.00        3.10 ( 22.40%)        3.38 ( 15.45%)        1.66 ( 58.52%)
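
The shape of such a throughput benchmark is roughly as follows. This is
only a sketch under assumptions (N spinning threads plus a timed tight
loop of the syscall under test), not Giovanni's actual harness from
commit 6075620b0590e. Build with e.g. "gcc -O2 -pthread" and pass the
thread count as the first argument:

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define CALLS 1000000
  #define MAX_THREADS 128

  static volatile int stop;

  static void *burn(void *arg)
  {
          (void)arg;
          while (!stop)   /* keep runnable threads on other CPUs */
                  ;
          return NULL;
  }

  int main(int argc, char **argv)
  {
          int i, nthreads = argc > 1 ? atoi(argv[1]) : 2;
          pthread_t thr[MAX_THREADS];
          struct timespec ts, t0, t1;
          long long elapsed;

          if (nthreads < 1)
                  nthreads = 1;
          if (nthreads > MAX_THREADS)
                  nthreads = MAX_THREADS;

          for (i = 0; i < nthreads; i++)
                  pthread_create(&thr[i], NULL, burn, NULL);

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (i = 0; i < CALLS; i++)
                  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
                    (t1.tv_nsec - t0.tv_nsec);
          printf("%d threads: %lld ns per call\n",
                 nthreads, elapsed / CALLS);

          stop = 1;
          for (i = 0; i < nthreads; i++)
                  pthread_join(thr[i], NULL);
          return 0;
  }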

Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 kernel/sched/cputime.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 1934f65..4fca604 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -301,6 +301,26 @@ static inline cputime_t account_other_time(cputime_t max)
 	return accounted;
 }
 
+#ifdef CONFIG_64BIT
+static inline u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	return t->se.sum_exec_runtime;
+}
+#else
+static u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	u64 ns;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(t, &rf);
+	ns = t->se.sum_exec_runtime;
+	task_rq_unlock(rq, t, &rf);
+
+	return ns;
+}
+#endif
+
 /*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
@@ -313,6 +333,17 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	/*
+	 * Update the current task's runtime to account for pending time
+	 * since the last scheduler action or thread_group_cputime() call.
+	 * This thread group might have other running tasks on different
+	 * CPUs, but updating their runtime can affect syscall performance,
+	 * so we skip accounting those pending times and rely only on values
+	 * updated on tick or other scheduler action.
+	 */
+	if (same_thread_group(current, tsk))
+		(void) task_sched_runtime(current);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -327,7 +358,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;
-- 
1.8.3.1


* [tip:sched/core] sched/cputime: Improve scalability by not accounting thread group tasks pending runtime
@ 2016-08-18 11:04 ` tip-bot for Stanislaw Gruszka
From: tip-bot for Stanislaw Gruszka @ 2016-08-18 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: pbonzini, tglx, ggherdovich, mgorman, wanpeng.li, sgruszka,
	mgalbraith, mingo, linux-kernel, hpa, peterz, torvalds, riel

Commit-ID:  a1eb1411b4e4251db02179e39d234c2ee5192c72
Gitweb:     http://git.kernel.org/tip/a1eb1411b4e4251db02179e39d234c2ee5192c72
Author:     Stanislaw Gruszka <sgruszka@redhat.com>
AuthorDate: Wed, 17 Aug 2016 11:30:44 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 18 Aug 2016 11:53:46 +0200

sched/cputime: Improve scalability by not accounting thread group tasks pending runtime

Commit:

  d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles")

started accounting thread group tasks' pending runtime in thread_group_cputime().

Another commit:

  6e998916dfe32 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

updated scheduler runtime statistics (call update_curr()) when reading a
task's pending runtime. Those changes cause bad performance of the
SYS_times() and SYS_clock_gettime(CLOCK_PROCESS_CPUTIME_ID) syscalls,
especially on larger systems with many CPUs.

While we would like to keep cpuclock monotonicity, i.e. keep the
problems fixed by the above commits fixed, we would also like to have
good performance.

However, the change from commit d670ec13178d0 is no longer needed to
solve the problem it addressed, because of the change from the second
commit, 6e998916dfe32, and that gives us room for optimization. Since we
update a task while reading its pending runtime in task_sched_runtime(),
clock_gettime(CLOCK_PROCESS_CPUTIME_ID) will see updated values, and on
the test case from d670ec13178d0 the process cpuclock will not be
smaller than the thread cpuclock.

I tested the patch on the test cases from commits d670ec13178d0 and
6e998916dfe32 and on some other cpuclock/cputimer test cases, and
did not find any cpuclock monotonicity problems or other malfunctions.

This patch has the drawback that we no longer provide thread group
cputime that is up to date to the last moment. For example, when arming
a cputime timer, we will arm it with possibly slightly outdated values,
and that timer will trigger earlier than it would without the patch.
However, that was the behaviour before commit d670ec13178d0 (kernel
v3.1), so it's unlikely to affect applications.

The patch improves the performance of the related syscalls, as measured
by Giovanni's benchmarks described in commit:

  6075620b0590e ("sched/cputime: Mitigate performance regression in times()/clock_gettime()")

The benchmark results are:

SYS_clock_gettime():

  threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                         (pre-6e998916dfe3)
  2          3.48        2.23 ( 35.68%)        3.06 ( 11.83%)        1.08 ( 68.81%)
  5          3.33        2.83 ( 14.84%)        3.25 (  2.40%)        0.71 ( 78.55%)
  8          3.37        2.84 ( 15.80%)        3.26 (  3.30%)        0.56 ( 83.49%)
  12         3.32        3.09 (  6.69%)        3.37 ( -1.60%)        0.42 ( 87.28%)
  21         4.01        3.14 ( 21.70%)        3.90 (  2.74%)        0.35 ( 91.35%)
  30         3.63        3.28 (  9.75%)        3.36 (  7.41%)        0.28 ( 92.23%)
  48         3.71        3.02 ( 18.69%)        3.11 ( 16.27%)        0.39 ( 89.39%)
  79         3.75        2.88 ( 23.23%)        3.16 ( 15.74%)        0.46 ( 87.76%)
  110        3.81        2.95 ( 22.62%)        3.25 ( 14.80%)        0.56 ( 85.41%)
  128        3.88        3.05 ( 21.28%)        3.31 ( 14.76%)        0.62 ( 84.10%)

SYS_times():

  threads    4.7-rc7     3.18-rc3              4.7-rc7 + prefetch    4.7-rc7 + patch
                         (pre-6e998916dfe3)
  2          3.65        2.27 ( 37.94%)        3.25 ( 11.03%)        1.62 ( 55.71%)
  5          3.45        2.78 ( 19.34%)        3.17 (  7.92%)        2.33 ( 32.28%)
  8          3.52        2.79 ( 20.66%)        3.22 (  8.69%)        2.06 ( 41.44%)
  12         3.29        3.02 (  8.33%)        3.36 ( -2.04%)        2.00 ( 39.18%)
  21         4.07        3.10 ( 23.86%)        3.92 (  3.78%)        2.07 ( 49.18%)
  30         3.87        3.33 ( 13.80%)        3.40 ( 12.17%)        1.89 ( 51.12%)
  48         3.79        2.96 ( 21.94%)        3.16 ( 16.61%)        1.69 ( 55.46%)
  79         3.88        2.88 ( 25.82%)        3.28 ( 15.42%)        1.60 ( 58.81%)
  110        3.90        2.98 ( 23.73%)        3.38 ( 13.35%)        1.73 ( 55.61%)
  128        4.00        3.10 ( 22.40%)        3.38 ( 15.45%)        1.66 ( 58.52%)

Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/20160817093043.GA25206@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 33 ++++++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..b93c72d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -306,6 +306,26 @@ static inline cputime_t account_other_time(cputime_t max)
 	return accounted;
 }
 
+#ifdef CONFIG_64BIT
+static inline u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	return t->se.sum_exec_runtime;
+}
+#else
+static u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	u64 ns;
+	struct rq_flags rf;
+	struct rq *rq;
+
+	rq = task_rq_lock(t, &rf);
+	ns = t->se.sum_exec_runtime;
+	task_rq_unlock(rq, t, &rf);
+
+	return ns;
+}
+#endif
+
 /*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
@@ -318,6 +338,17 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	unsigned int seq, nextseq;
 	unsigned long flags;
 
+	/*
+	 * Update the current task's runtime to account for pending time
+	 * since the last scheduler action or thread_group_cputime() call.
+	 * This thread group might have other running tasks on different
+	 * CPUs, but updating their runtime can affect syscall performance,
+	 * so we skip accounting those pending times and rely only on values
+	 * updated on tick or other scheduler action.
+	 */
+	if (same_thread_group(current, tsk))
+		(void) task_sched_runtime(current);
+
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
 	nextseq = 0;
@@ -332,7 +363,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 			task_cputime(t, &utime, &stime);
 			times->utime += utime;
 			times->stime += stime;
-			times->sum_exec_runtime += task_sched_runtime(t);
+			times->sum_exec_runtime += read_sum_exec_runtime(t);
 		}
 		/* If lockless access failed, take the lock. */
 		nextseq = 1;


* Re: [PATCH] sched/cputime: do not account thread group tasks pending runtime to improve performance
@ 2016-08-26 15:24 ` Giovanni Gherdovich
From: Giovanni Gherdovich @ 2016-08-26 15:24 UTC (permalink / raw)
  To: Stanislaw Gruszka, linux-kernel
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Mel Gorman

On Wed, 2016-08-17 at 11:30 +0200, Stanislaw Gruszka wrote:
> Commit d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") made us
> account thread group tasks' pending runtime in thread_group_cputime().
> Another commit, 6e998916dfe32 ("sched/cputime:
> Fix clock_nanosleep()/clock_gettime() inconsistency"), made us update
> scheduler runtime statistics (call update_curr()) when reading a task's
> pending runtime. Those changes cause bad performance of the times() and
> clock_gettime(CLOCK_PROCESS_CPUTIME_ID) syscalls.
> 
> While we would like to keep cpuclock monotonicity, i.e. keep the
> problems fixed by the above commits fixed, we would also like to have
> good performance.
>
>                  [... snip ...]
>
> Reported-and-tested-by: Giovanni Gherdovich <ggherdovich@suse.cz>
> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
> ---
>  kernel/sched/cputime.c | 33 ++++++++++++++++++++++++++++++++-
>  1 file changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 1934f65..4fca604 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -301,6 +301,26 @@ static inline cputime_t account_other_time(cputime_t max)
>  	return accounted;
>  }
>  
> +#ifdef CONFIG_64BIT
> +static inline u64 read_sum_exec_runtime(struct task_struct *t)
> +{
> +	return t->se.sum_exec_runtime;
> +}
> +#else
> +static u64 read_sum_exec_runtime(struct task_struct *t)
> +{
> +	u64 ns;
> +	struct rq_flags rf;
> +	struct rq *rq;
> +
> +	rq = task_rq_lock(t, &rf);
> +	ns = t->se.sum_exec_runtime;
> +	task_rq_unlock(rq, t, &rf);
> +
> +	return ns;
> +}
> +#endif
> +
>  /*
>   * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
>   * tasks (sum on group iteration) belonging to @tsk's group.
> @@ -313,6 +333,17 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  	unsigned int seq, nextseq;
>  	unsigned long flags;
>  
> +	/*
> +	 * Update the current task's runtime to account for pending time
> +	 * since the last scheduler action or thread_group_cputime() call.
> +	 * This thread group might have other running tasks on different
> +	 * CPUs, but updating their runtime can affect syscall performance,
> +	 * so we skip accounting those pending times and rely only on values
> +	 * updated on tick or other scheduler action.
> +	 */
> +	if (same_thread_group(current, tsk))
> +		(void) task_sched_runtime(current);
> +
>  	rcu_read_lock();
>  	/* Attempt a lockless read on the first round. */
>  	nextseq = 0;
> @@ -327,7 +358,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>  			task_cputime(t, &utime, &stime);
>  			times->utime += utime;
>  			times->stime += stime;
> -			times->sum_exec_runtime += task_sched_runtime(t);
> +			times->sum_exec_runtime += read_sum_exec_runtime(t);
>  		}
>  		/* If lockless access failed, take the lock. */
>  		nextseq = 1;

Hello Stanislaw and all,

I know I'm quite late to the party as this patch has already been taken
into Ingo's "tip" repo, but I want to chime in anyway and give my positive
review and acknowledgment of the patch.

The patch works as advertised in the commit message; the time accounting
behaviour you're changing is consistent with what happened before
d670ec13178d0 "posix-cpu-timers: Cure SMP wobbles", i.e. only the runtime
statistics for the current task are up-to-date and not those for all the other
threads in the group. As you say, that's how things used to work -- I'm
in favor of this trade-off.

You correctly address Mel Gorman's remark ("how do you know that tsk ==
current?") by using the "current" macro when you call task_sched_runtime().
As you note, task_sched_runtime(current) (which in turn calls update_curr()
on that task) is all you need to solve the problem of "the diff of 'process'
should always be >= the diff of 'thread'" that you initially addressed in
your commit 6e998916df "sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency".
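
For illustration, that invariant can be spelled out as a small check like
the one below; this is only a sketch of the idea, not the original test
case:

  #include <assert.h>
  #include <time.h>

  static long long ns(clockid_t clk)
  {
          struct timespec ts;

          clock_gettime(clk, &ts);
          return ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  int main(void)
  {
          volatile long dummy = 0;
          long long p0, t0, t1, p1;
          long i;

          /* Bracket the 'thread' samples with 'process' samples: the
           * process accrues at least what this thread accrues, so the
           * process diff must be >= the thread diff. */
          p0 = ns(CLOCK_PROCESS_CPUTIME_ID);
          t0 = ns(CLOCK_THREAD_CPUTIME_ID);

          for (i = 0; i < 100000000; i++)
                  dummy++;        /* burn some CPU */

          t1 = ns(CLOCK_THREAD_CPUTIME_ID);
          p1 = ns(CLOCK_PROCESS_CPUTIME_ID);

          assert(p1 - p0 >= t1 - t0);
          return 0;
  }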

Acked-by: Giovanni Gherdovich <ggherdovich@suse.cz>


--
Giovanni Gherdovich
SUSE Labs

