* [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
@ 2016-08-05 8:21 Giovanni Gherdovich
2016-08-05 8:21 ` [PATCH 1/1] " Giovanni Gherdovich
0 siblings, 1 reply; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05 8:21 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
Giovanni Gherdovich
As per Peter Zijlstra's review, these are the difference wrt V1:
* inclusion of appropriate header file linux/prefetch.h
* factorized the calls to prefetch into a separate function
* introduction of the local variable curr as a form of compiler
subexpression elimination (CSE)
* fixed Signed-off-by chain
* added comment as per why the prefetches are needed
Giovanni Gherdovich (1):
sched/cputime: Mitigate performance regression in
times()/clock_gettime()
kernel/sched/core.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
--
2.6.6
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-05 8:21 [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
@ 2016-08-05 8:21 ` Giovanni Gherdovich
2016-08-10 11:26 ` Ingo Molnar
2016-08-10 18:00 ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
0 siblings, 2 replies; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-05 8:21 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Mike Galbraith, Stanislaw Gruszka, linux-kernel, Mel Gorman,
Giovanni Gherdovich
Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
inconsistency") fixed a problem whereby clock_nanosleep() followed by
clock_gettime() could allow a task to wake early. It addressed the problem
by calling the scheduling classes update_curr when the cputimer starts.
Said change induced a considerable performance regression on the syscalls
times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.
This patch mitigates the performace loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.
Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percents for 5-20
threads and more than 10% for 2 or >20 threads.
pound_clock_gettime:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.48 3.06 ( 11.83%)
5 3.33 3.25 ( 2.40%)
8 3.37 3.26 ( 3.30%)
12 3.32 3.37 ( -1.60%)
21 4.01 3.90 ( 2.74%)
30 3.63 3.36 ( 7.41%)
48 3.71 3.11 ( 16.27%)
79 3.75 3.16 ( 15.74%)
110 3.81 3.25 ( 14.80%)
128 3.88 3.31 ( 14.76%)
pound_times:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.65 3.25 ( 11.03%)
5 3.45 3.17 ( 7.92%)
8 3.52 3.22 ( 8.69%)
12 3.29 3.36 ( -2.04%)
21 4.07 3.92 ( 3.78%)
30 3.87 3.40 ( 12.17%)
48 3.79 3.16 ( 16.61%)
79 3.88 3.28 ( 15.42%)
110 3.90 3.38 ( 13.35%)
128 4.00 3.38 ( 15.45%)
pound_clock_gettime and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettimes(). The results above can be
reproduced with cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above will run "poundtime" measuring the kernel currently running on
the machine; Once a new kernel is installed and the machine rebooted,
running again
$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)
will produce results to compare with. A comparison table will be output
with
$ cd mmtests/work/log
$ ../../compare-kernels.sh
the table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clairity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be
struct sched_entity *curr = cfs_rq->curr;
and
delta_exec = now - curr->exec_start;
in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr is called in the
interested execution path.
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):
threads cycles before cycles after gain
2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88%
5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85%
8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74%
12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74%
21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89%
30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92%
48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10%
79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33%
110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21%
128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99%
A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):
threads L1 misses/total*100 L1 misses/total*100 gain
before after
2 7.43 +-4.90% 7.36 +-4.70% 0.94%
5 13.09 +-4.74% 13.52 +-3.73% -3.28%
8 13.79 +-5.61% 12.90 +-3.27% 6.45%
12 11.57 +-2.44% 8.71 +-1.40% 24.72%
21 12.39 +-3.92% 9.97 +-1.84% 19.53%
30 13.91 +-2.53% 11.73 +-2.28% 15.67%
48 13.71 +-1.59% 12.32 +-1.97% 10.14%
79 14.44 +-0.66% 13.40 +-1.06% 7.20%
110 15.86 +-0.50% 14.46 +-0.59% 8.83%
128 16.51 +-0.32% 15.06 +-0.78% 8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.23 3.68 ( -64.56%) 3.48 (-55.48%)
5 2.83 3.78 ( -33.42%) 3.33 (-17.43%)
8 2.84 4.31 ( -52.12%) 3.37 (-18.76%)
12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%)
21 3.14 4.63 ( -47.36%) 4.01 (-27.71%)
30 3.28 5.75 ( -75.37%) 3.63 (-10.80%)
48 3.02 6.05 (-100.56%) 3.71 (-22.99%)
79 2.88 6.30 (-118.90%) 3.75 (-30.26%)
110 2.95 6.46 (-119.00%) 3.81 (-29.24%)
128 3.05 6.42 (-110.08%) 3.88 (-27.04%)
pound_times:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.27 3.73 ( -64.71%) 3.65 (-61.14%)
5 2.78 3.77 ( -35.56%) 3.45 (-23.98%)
8 2.79 4.41 ( -57.71%) 3.52 (-26.05%)
12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%)
21 3.10 4.61 ( -48.74%) 4.07 (-31.34%)
30 3.33 5.75 ( -72.53%) 3.87 (-16.01%)
48 2.96 6.06 (-105.04%) 3.79 (-28.10%)
79 2.88 6.24 (-116.83%) 3.88 (-34.81%)
110 2.98 6.37 (-114.08%) 3.90 (-31.12%)
128 3.10 6.35 (-104.61%) 4.00 (-28.87%)
The source code of the two benchmarks follows. To compile the two:
NR_THREADS=42
for FILE in pound_times pound_clock_gettime; do
gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
done
==== BEGIN pound_times.c ====
struct tms start;
void *pound (void *threadid)
{
struct tms end;
int oldutime = 0;
int utime;
int i;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
times(&end);
utime = ((int)end.tms_utime - (int)start.tms_utime);
if (oldutime > utime) {
printf("utime decreased, was %d, now %d!\n", oldutime, utime);
}
oldutime = utime;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long i;
times(&start);
for (i = 0; i < NUM_THREADS; i++) {
pthread_create (&th[i], NULL, pound, (void *)i);
}
pthread_exit(NULL);
return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
void *pound (void *threadid)
{
struct timespec ts;
int rc, i;
unsigned long prev = 0, this = 0;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
if (rc < 0)
perror("clock_gettime");
this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
if (0 && this < prev)
printf("%lu ns timewarp at iteration %d\n", prev - this, i);
prev = this;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long rc, i;
pid_t pgid;
for (i = 0; i < NUM_THREADS; i++) {
rc = pthread_create(&th[i], NULL, pound, (void *)i);
if (rc < 0)
perror("pthread_create");
}
pthread_exit(NULL);
return 0;
}
==== END pound_clock_gettime.c ====
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Suggested-by: Mike Galbraith <mgalbraith@suse.de>
---
kernel/sched/core.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 51d7105..4500421 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
#include <linux/context_tracking.h>
#include <linux/compiler.h>
#include <linux/frame.h>
+#include <linux/prefetch.h>
#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -2965,6 +2966,23 @@ EXPORT_PER_CPU_SYMBOL(kstat);
EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
/*
+ * The function fair_sched_class.update_curr accesses the struct curr
+ * and its field curr->exec_start; when called from task_sched_runtime,
+ * we observe a high rate of cache misses in practice.
+ * Prefetching this data results in improved performance.
+ */
+static inline void prefetch_curr_exec_start(struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+#else
+ struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
+#endif
+ prefetch(curr);
+ prefetch(&curr->exec_start);
+}
+
+/*
* Return accounted runtime for the task.
* In case the task is currently running, return the runtime plus current's
* pending runtime that have not been accounted yet.
@@ -2998,6 +3016,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* thread, breaking clock_gettime().
*/
if (task_current(rq, p) && task_on_rq_queued(p)) {
+ prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
}
--
2.6.6
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-05 8:21 ` [PATCH 1/1] " Giovanni Gherdovich
@ 2016-08-10 11:26 ` Ingo Molnar
2016-08-10 13:02 ` Giovanni Gherdovich
2016-08-12 12:10 ` Stanislaw Gruszka
2016-08-10 18:00 ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
1 sibling, 2 replies; 14+ messages in thread
From: Ingo Molnar @ 2016-08-10 11:26 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Stanislaw Gruszka,
linux-kernel, Mel Gorman
* Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> Commit 6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime()
> inconsistency") fixed a problem whereby clock_nanosleep() followed by
> clock_gettime() could allow a task to wake early. It addressed the problem
> by calling the scheduling classes update_curr when the cputimer starts.
>
> Said change induced a considerable performance regression on the syscalls
> times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some
> debuggers and applications that monitor their own performance that
> accidentally depend on the performance of these specific calls.
>
> This patch mitigates the performace loss by prefetching data in the CPU
> cache, as stalls due to cache misses appear to be where most time is spent
> in our benchmarks.
>
> Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
> box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
> variable number of threads, from 2 to 4*num_cpus; the results are in
> seconds and correspond to the average of 10 runs; the percentage gain is
> computed with (before-after)/before so a positive value is an improvement
> (it's faster). The improvement varies between a few percents for 5-20
> threads and more than 10% for 2 or >20 threads.
>
> pound_clock_gettime:
>
> threads 4.7-rc7 patched 4.7-rc7
> [num] [secs] [secs (percent)]
> 2 3.48 3.06 ( 11.83%)
> 5 3.33 3.25 ( 2.40%)
> 8 3.37 3.26 ( 3.30%)
> 12 3.32 3.37 ( -1.60%)
> 21 4.01 3.90 ( 2.74%)
> 30 3.63 3.36 ( 7.41%)
> 48 3.71 3.11 ( 16.27%)
> 79 3.75 3.16 ( 15.74%)
> 110 3.81 3.25 ( 14.80%)
> 128 3.88 3.31 ( 14.76%)
Nice detective work! I'm wondering, where do we stand if compared with a
pre-6e998916dfe3 kernel?
I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
make the result correct, just see what the performance gap is, roughly.
If there's still a significant gap then it might make sense to optimize this some
more.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-10 11:26 ` Ingo Molnar
@ 2016-08-10 13:02 ` Giovanni Gherdovich
2016-08-12 12:10 ` Stanislaw Gruszka
1 sibling, 0 replies; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-10 13:02 UTC (permalink / raw)
To: Ingo Molnar
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Stanislaw Gruszka,
linux-kernel, Mel Gorman
Hello Ingo,
thank you for your reply.
Ingo Molnar <mingo@kernel.org>
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?
The data follows. A considerable part of the performance loss is recovered;
something is still on the table.
"3.18-pre-bug" is the parent of 6e998916dfe3, i.e. 6e998916dfe3^1
"3.18-bug" is the revision 6e998916dfe3 itself.
Figures are in seconds. Percentages refer to 3.18-pre-bug, negative = worse.
times()
threads 3.18-pre-bug 3.18-bug 4.7.0-rc7 4.7.0-rc7-patched
2 2.27 ( 0.00%) 3.73 (-64.71%) 3.65 (-61.14%) 3.06 (-35.16%)
5 2.78 ( 0.00%) 3.77 (-35.56%) 3.45 (-23.98%) 3.25 (-16.79%)
8 2.79 ( 0.00%) 4.41 (-57.71%) 3.52 (-26.05%) 3.26 (-16.53%)
12 3.02 ( 0.00%) 3.56 (-17.94%) 3.29 ( -9.08%) 3.37 (-11.74%)
21 3.10 ( 0.00%) 4.61 (-48.74%) 4.07 (-31.34%) 3.90 (-25.89%)
30 3.33 ( 0.00%) 5.75 (-72.53%) 3.87 (-16.01%) 3.36 ( -0.81%)
48 2.96 ( 0.00%) 6.06 (-105.04%) 3.79 (-28.10%) 3.11 ( -5.14%)
79 2.88 ( 0.00%) 6.24 (-116.83%) 3.88 (-34.81%) 3.16 ( -9.84%)
110 2.98 ( 0.00%) 6.37 (-114.08%) 3.90 (-31.12%) 3.25 ( -9.07%)
128 3.10 ( 0.00%) 6.35 (-104.61%) 4.00 (-28.87%) 3.31 ( -6.57%)
clock_gettime()
threads 3.18-pre-bug 3.18-bug 4.7.0-rc7 4.7.0-rc7-patched
2 2.23 ( 0.00%) 3.68 (-64.56%) 3.48 (-55.48%) 3.25 (-45.41%)
5 2.83 ( 0.00%) 3.78 (-33.42%) 3.33 (-17.43%) 3.17 (-12.03%)
8 2.84 ( 0.00%) 4.31 (-52.12%) 3.37 (-18.76%) 3.22 (-13.43%)
12 3.09 ( 0.00%) 3.61 (-16.74%) 3.32 ( -7.17%) 3.36 ( -8.47%)
21 3.14 ( 0.00%) 4.63 (-47.36%) 4.01 (-27.71%) 3.92 (-24.68%)
30 3.28 ( 0.00%) 5.75 (-75.37%) 3.63 (-10.80%) 3.40 ( -3.69%)
48 3.02 ( 0.00%) 6.05 (-100.56%) 3.71 (-22.99%) 3.16 ( -4.64%)
79 2.88 ( 0.00%) 6.30 (-118.90%) 3.75 (-30.26%) 3.28 (-13.93%)
110 2.95 ( 0.00%) 6.46 (-119.00%) 3.81 (-29.24%) 3.38 (-14.69%)
128 3.05 ( 0.00%) 6.42 (-110.08%) 3.88 (-27.04%) 3.38 (-10.70%)
Regards,
Giovanni Gherdovich
^ permalink raw reply [flat|nested] 14+ messages in thread
* [tip:sched/core] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-05 8:21 ` [PATCH 1/1] " Giovanni Gherdovich
2016-08-10 11:26 ` Ingo Molnar
@ 2016-08-10 18:00 ` tip-bot for Giovanni Gherdovich
1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Giovanni Gherdovich @ 2016-08-10 18:00 UTC (permalink / raw)
To: linux-tip-commits
Cc: hpa, ggherdovich, tglx, sgruszka, mgorman, mingo, torvalds,
mgalbraith, linux-kernel, peterz
Commit-ID: 6075620b0590eaf22f10ce88833eb20a57f760d6
Gitweb: http://git.kernel.org/tip/6075620b0590eaf22f10ce88833eb20a57f760d6
Author: Giovanni Gherdovich <ggherdovich@suse.cz>
AuthorDate: Fri, 5 Aug 2016 10:21:56 +0200
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 10 Aug 2016 13:32:56 +0200
sched/cputime: Mitigate performance regression in times()/clock_gettime()
Commit:
6e998916dfe3 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
allow a task to wake early. It addressed the problem by calling the scheduling
classes update_curr() when the cputimer starts.
Said change induced a considerable performance regression on the syscalls
times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.
This patch mitigates the performace loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.
Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percents for 5-20
threads and more than 10% for 2 or >20 threads.
pound_clock_gettime:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.48 3.06 ( 11.83%)
5 3.33 3.25 ( 2.40%)
8 3.37 3.26 ( 3.30%)
12 3.32 3.37 ( -1.60%)
21 4.01 3.90 ( 2.74%)
30 3.63 3.36 ( 7.41%)
48 3.71 3.11 ( 16.27%)
79 3.75 3.16 ( 15.74%)
110 3.81 3.25 ( 14.80%)
128 3.88 3.31 ( 14.76%)
pound_times:
threads 4.7-rc7 patched 4.7-rc7
[num] [secs] [secs (percent)]
2 3.65 3.25 ( 11.03%)
5 3.45 3.17 ( 7.92%)
8 3.52 3.22 ( 8.69%)
12 3.29 3.36 ( -2.04%)
21 4.07 3.92 ( 3.78%)
30 3.87 3.40 ( 12.17%)
48 3.79 3.16 ( 16.61%)
79 3.88 3.28 ( 15.42%)
110 3.90 3.38 ( 13.35%)
128 4.00 3.38 ( 15.45%)
pound_clock_gettime and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettimes(). The results above can be
reproduced with cloning MMTests from github.com and running the "poundtime"
workload:
$ git clone https://github.com/gormanm/mmtests.git
$ cd mmtests
$ cp configs/config-global-dhp__workload_poundtime config
$ ./run-mmtests.sh --run-monitor $(uname -r)
The above will run "poundtime" measuring the kernel currently running on
the machine; Once a new kernel is installed and the machine rebooted,
running again
$ cd mmtests
$ ./run-mmtests.sh --run-monitor $(uname -r)
will produce results to compare with. A comparison table will be output
with:
$ cd mmtests/work/log
$ ../../compare-kernels.sh
the table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clairity.
The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be
struct sched_entity *curr = cfs_rq->curr;
and
delta_exec = now - curr->exec_start;
in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr is called in the
interested execution path.
A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):
threads cycles before cycles after gain
2 19,699,563,964 +-1.19% 17,358,917,517 +-1.85% 11.88%
5 47,401,089,566 +-2.96% 45,103,730,829 +-0.97% 4.85%
8 80,923,501,004 +-3.01% 71,419,385,977 +-0.77% 11.74%
12 112,326,485,473 +-0.47% 110,371,524,403 +-0.47% 1.74%
21 193,455,574,299 +-0.72% 180,120,667,904 +-0.36% 6.89%
30 315,073,519,013 +-1.64% 271,222,225,950 +-1.29% 13.92%
48 321,969,515,332 +-1.48% 273,353,977,321 +-1.16% 15.10%
79 337,866,003,422 +-0.97% 289,462,481,538 +-1.05% 14.33%
110 338,712,691,920 +-0.78% 290,574,233,170 +-0.77% 14.21%
128 348,384,794,006 +-0.50% 292,691,648,206 +-0.66% 15.99%
A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):
threads L1 misses/total*100 L1 misses/total*100 gain
before after
2 7.43 +-4.90% 7.36 +-4.70% 0.94%
5 13.09 +-4.74% 13.52 +-3.73% -3.28%
8 13.79 +-5.61% 12.90 +-3.27% 6.45%
12 11.57 +-2.44% 8.71 +-1.40% 24.72%
21 12.39 +-3.92% 9.97 +-1.84% 19.53%
30 13.91 +-2.53% 11.73 +-2.28% 15.67%
48 13.71 +-1.59% 12.32 +-1.97% 10.14%
79 14.44 +-0.66% 13.40 +-1.06% 7.20%
110 15.86 +-0.50% 14.46 +-0.59% 8.83%
128 16.51 +-0.32% 15.06 +-0.78% 8.78%
As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916dfe3
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).
pound_clock_gettime:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.23 3.68 ( -64.56%) 3.48 (-55.48%)
5 2.83 3.78 ( -33.42%) 3.33 (-17.43%)
8 2.84 4.31 ( -52.12%) 3.37 (-18.76%)
12 3.09 3.61 ( -16.74%) 3.32 ( -7.17%)
21 3.14 4.63 ( -47.36%) 4.01 (-27.71%)
30 3.28 5.75 ( -75.37%) 3.63 (-10.80%)
48 3.02 6.05 (-100.56%) 3.71 (-22.99%)
79 2.88 6.30 (-118.90%) 3.75 (-30.26%)
110 2.95 6.46 (-119.00%) 3.81 (-29.24%)
128 3.05 6.42 (-110.08%) 3.88 (-27.04%)
pound_times:
threads parent of 6e998916dfe3 4.7-rc7
6e998916dfe3 itself
2 2.27 3.73 ( -64.71%) 3.65 (-61.14%)
5 2.78 3.77 ( -35.56%) 3.45 (-23.98%)
8 2.79 4.41 ( -57.71%) 3.52 (-26.05%)
12 3.02 3.56 ( -17.94%) 3.29 ( -9.08%)
21 3.10 4.61 ( -48.74%) 4.07 (-31.34%)
30 3.33 5.75 ( -72.53%) 3.87 (-16.01%)
48 2.96 6.06 (-105.04%) 3.79 (-28.10%)
79 2.88 6.24 (-116.83%) 3.88 (-34.81%)
110 2.98 6.37 (-114.08%) 3.90 (-31.12%)
128 3.10 6.35 (-104.61%) 4.00 (-28.87%)
The source code of the two benchmarks follows. To compile the two:
NR_THREADS=42
for FILE in pound_times pound_clock_gettime; do
gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
done
==== BEGIN pound_times.c ====
struct tms start;
void *pound (void *threadid)
{
struct tms end;
int oldutime = 0;
int utime;
int i;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
times(&end);
utime = ((int)end.tms_utime - (int)start.tms_utime);
if (oldutime > utime) {
printf("utime decreased, was %d, now %d!\n", oldutime, utime);
}
oldutime = utime;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long i;
times(&start);
for (i = 0; i < NUM_THREADS; i++) {
pthread_create (&th[i], NULL, pound, (void *)i);
}
pthread_exit(NULL);
return 0;
}
==== END pound_times.c ====
==== BEGIN pound_clock_gettime.c ====
void *pound (void *threadid)
{
struct timespec ts;
int rc, i;
unsigned long prev = 0, this = 0;
for (i = 0; i < 5000000 / NUM_THREADS; i++) {
rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
if (rc < 0)
perror("clock_gettime");
this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
if (0 && this < prev)
printf("%lu ns timewarp at iteration %d\n", prev - this, i);
prev = this;
}
pthread_exit(NULL);
}
int main()
{
pthread_t th[NUM_THREADS];
long rc, i;
pid_t pgid;
for (i = 0; i < NUM_THREADS; i++) {
rc = pthread_create(&th[i], NULL, pound, (void *)i);
if (rc < 0)
perror("pthread_create");
}
pthread_exit(NULL);
return 0;
}
==== END pound_clock_gettime.c ====
Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
kernel/sched/core.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5c883fe..2a906f2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
#include <linux/context_tracking.h>
#include <linux/compiler.h>
#include <linux/frame.h>
+#include <linux/prefetch.h>
#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -2972,6 +2973,23 @@ EXPORT_PER_CPU_SYMBOL(kstat);
EXPORT_PER_CPU_SYMBOL(kernel_cpustat);
/*
+ * The function fair_sched_class.update_curr accesses the struct curr
+ * and its field curr->exec_start; when called from task_sched_runtime(),
+ * we observe a high rate of cache misses in practice.
+ * Prefetching this data results in improved performance.
+ */
+static inline void prefetch_curr_exec_start(struct task_struct *p)
+{
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ struct sched_entity *curr = (&p->se)->cfs_rq->curr;
+#else
+ struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
+#endif
+ prefetch(curr);
+ prefetch(&curr->exec_start);
+}
+
+/*
* Return accounted runtime for the task.
* In case the task is currently running, return the runtime plus current's
* pending runtime that have not been accounted yet.
@@ -3005,6 +3023,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
* thread, breaking clock_gettime().
*/
if (task_current(rq, p) && task_on_rq_queued(p)) {
+ prefetch_curr_exec_start(p);
update_rq_clock(rq);
p->sched_class->update_curr(rq);
}
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-10 11:26 ` Ingo Molnar
2016-08-10 13:02 ` Giovanni Gherdovich
@ 2016-08-12 12:10 ` Stanislaw Gruszka
2016-08-15 7:49 ` Giovanni Gherdovich
2016-08-15 9:13 ` Wanpeng Li
1 sibling, 2 replies; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-12 12:10 UTC (permalink / raw)
To: Ingo Molnar
Cc: Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra, Mike Galbraith,
linux-kernel, Mel Gorman
[-- Attachment #1: Type: text/plain, Size: 3629 bytes --]
Hi
On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> Nice detective work! I'm wondering, where do we stand if compared with a
> pre-6e998916dfe3 kernel?
>
> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
> make the result correct, just see what the performance gap is, roughly.
>
> If there's still a significant gap then it might make sense to optimize this some
> more.
I measured (partial) revert performance on 4.7 using mmtest instructions
from Giovanni and also tested some other possible fix (draft version):
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 75f98c5..54fdf6d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
unsigned int seq, nextseq;
unsigned long flags;
+ (void) task_sched_runtime(tsk);
+
rcu_read_lock();
/* Attempt a lockless read on the first round. */
nextseq = 0;
@@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
task_cputime(t, &utime, &stime);
times->utime += utime;
times->stime += stime;
- times->sum_exec_runtime += task_sched_runtime(t);
+ times->sum_exec_runtime += t->se.sum_exec_runtime;
}
/* If lockless access failed, take the lock. */
nextseq = 1;
---
mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
vanila-4.7 revert prefetch patch
4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
Patch works because we we update sum_exec_runtime on current thread
what assure we see proper sum_exec_runtime value on different CPUs. I
tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0,
patch did not break them. I'm going to run some other test.
Patch is draft version for early review, task_sched_runtime() will be
simplified (since it's called only current thread) and possibly split
into two functions: one that call update_curr() and other that return
sum_exec_runtime (assure it's consistent on 32 bit arches).
Stanislaw
[-- Attachment #2: compare.txt --]
[-- Type: text/plain, Size: 27653 bytes --]
poundtime
vanilla rever prefetc mas
4.7 revert prefetch mask
Min real-pound_clock_gettime-2 4.38 ( 0.00%) 2.73 ( 37.67%) 3.62 ( 17.35%) 1.19 ( 72.83%)
Min real-pound_clock_gettime-5 5.40 ( 0.00%) 4.76 ( 11.85%) 4.49 ( 16.85%) 0.99 ( 81.67%)
Min real-pound_clock_gettime-8 5.83 ( 0.00%) 4.88 ( 16.30%) 5.91 ( -1.37%) 0.88 ( 84.91%)
Min real-pound_clock_gettime-12 6.55 ( 0.00%) 4.87 ( 25.65%) 5.98 ( 8.70%) 0.84 ( 87.18%)
Min real-pound_clock_gettime-21 7.11 ( 0.00%) 5.10 ( 28.27%) 6.63 ( 6.75%) 0.85 ( 88.05%)
Min real-pound_clock_gettime-30 7.56 ( 0.00%) 5.20 ( 31.22%) 7.08 ( 6.35%) 0.87 ( 88.49%)
Min real-pound_clock_gettime-48 7.78 ( 0.00%) 5.24 ( 32.65%) 7.20 ( 7.46%) 0.92 ( 88.17%)
Min real-pound_clock_gettime-79 7.89 ( 0.00%) 5.23 ( 33.71%) 7.20 ( 8.75%) 1.00 ( 87.33%)
Min real-pound_clock_gettime-96 7.88 ( 0.00%) 5.24 ( 33.50%) 7.29 ( 7.49%) 1.09 ( 86.17%)
Min real-pound_times-2 4.87 ( 0.00%) 3.19 ( 34.50%) 4.00 ( 17.86%) 2.06 ( 57.70%)
Min real-pound_times-5 5.59 ( 0.00%) 3.91 ( 30.05%) 4.61 ( 17.53%) 3.61 ( 35.42%)
Min real-pound_times-8 5.74 ( 0.00%) 4.88 ( 14.98%) 5.80 ( -1.05%) 3.56 ( 37.98%)
Min real-pound_times-12 6.44 ( 0.00%) 4.90 ( 23.91%) 6.00 ( 6.83%) 3.52 ( 45.34%)
Min real-pound_times-21 7.11 ( 0.00%) 5.11 ( 28.13%) 6.61 ( 7.03%) 3.59 ( 49.51%)
Min real-pound_times-30 7.60 ( 0.00%) 5.20 ( 31.58%) 7.03 ( 7.50%) 3.54 ( 53.42%)
Min real-pound_times-48 7.80 ( 0.00%) 5.24 ( 32.82%) 7.20 ( 7.69%) 3.61 ( 53.72%)
Min real-pound_times-79 7.92 ( 0.00%) 5.24 ( 33.84%) 7.31 ( 7.70%) 3.61 ( 54.42%)
Min real-pound_times-96 7.94 ( 0.00%) 5.24 ( 34.01%) 7.29 ( 8.19%) 3.58 ( 54.91%)
Min syst-pound_clock_gettime-2 8.54 ( 0.00%) 4.89 ( 42.74%) 6.98 ( 18.27%) 2.16 ( 74.71%)
Min syst-pound_clock_gettime-5 26.57 ( 0.00%) 23.29 ( 12.34%) 22.09 ( 16.86%) 4.47 ( 83.18%)
Min syst-pound_clock_gettime-8 45.82 ( 0.00%) 38.02 ( 17.02%) 46.61 ( -1.72%) 6.44 ( 85.95%)
Min syst-pound_clock_gettime-12 77.23 ( 0.00%) 56.61 ( 26.70%) 69.25 ( 10.33%) 9.34 ( 87.91%)
Min syst-pound_clock_gettime-21 147.44 ( 0.00%) 103.97 ( 29.48%) 134.76 ( 8.60%) 15.12 ( 89.74%)
Min syst-pound_clock_gettime-30 176.07 ( 0.00%) 117.81 ( 33.09%) 162.77 ( 7.55%) 15.95 ( 90.94%)
Min syst-pound_clock_gettime-48 182.93 ( 0.00%) 119.92 ( 34.44%) 168.06 ( 8.13%) 19.82 ( 89.17%)
Min syst-pound_clock_gettime-79 186.13 ( 0.00%) 123.31 ( 33.75%) 170.34 ( 8.48%) 22.90 ( 87.70%)
Min syst-pound_clock_gettime-96 187.05 ( 0.00%) 124.22 ( 33.59%) 172.67 ( 7.69%) 25.19 ( 86.53%)
Min syst-pound_times-2 9.55 ( 0.00%) 6.22 ( 34.87%) 7.80 ( 18.32%) 3.90 ( 59.16%)
Min syst-pound_times-5 27.68 ( 0.00%) 19.24 ( 30.49%) 22.76 ( 17.77%) 17.56 ( 36.56%)
Min syst-pound_times-8 45.11 ( 0.00%) 38.75 ( 14.10%) 45.15 ( -0.09%) 27.77 ( 38.44%)
Min syst-pound_times-12 76.60 ( 0.00%) 56.89 ( 25.73%) 71.06 ( 7.23%) 41.64 ( 45.64%)
Min syst-pound_times-21 145.25 ( 0.00%) 102.48 ( 29.45%) 136.15 ( 6.27%) 72.98 ( 49.76%)
Min syst-pound_times-30 175.03 ( 0.00%) 118.89 ( 32.07%) 161.32 ( 7.83%) 79.91 ( 54.34%)
Min syst-pound_times-48 183.61 ( 0.00%) 121.06 ( 34.07%) 167.26 ( 8.90%) 83.24 ( 54.66%)
Min syst-pound_times-79 187.18 ( 0.00%) 123.24 ( 34.16%) 173.22 ( 7.46%) 84.36 ( 54.93%)
Min syst-pound_times-96 188.88 ( 0.00%) 124.04 ( 34.33%) 173.52 ( 8.13%) 83.02 ( 56.05%)
Amean real-pound_clock_gettime-2 4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
Amean real-pound_clock_gettime-5 5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
Amean real-pound_clock_gettime-8 6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
Amean real-pound_clock_gettime-12 6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
Amean real-pound_clock_gettime-21 7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
Amean real-pound_clock_gettime-30 7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
Amean real-pound_clock_gettime-48 7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
Amean real-pound_clock_gettime-79 7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
Amean real-pound_clock_gettime-96 8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
Amean real-pound_times-2 5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
Amean real-pound_times-5 5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
Amean real-pound_times-8 5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
Amean real-pound_times-12 6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
Amean real-pound_times-21 7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
Amean real-pound_times-30 7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
Amean real-pound_times-48 7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
Amean real-pound_times-79 7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
Amean real-pound_times-96 8.01 ( 0.00%) 5.35 ( 33.20%) 7.36 ( 8.09%) 3.64 ( 54.49%)
Amean syst-pound_clock_gettime-2 9.22 ( 0.00%) 5.45 ( 40.95%) 7.90 ( 14.32%) 2.34 ( 74.66%)
Amean syst-pound_clock_gettime-5 27.03 ( 0.00%) 24.21 ( 10.40%) 26.24 ( 2.90%) 4.73 ( 82.48%)
Amean syst-pound_clock_gettime-8 48.33 ( 0.00%) 38.40 ( 20.55%) 47.11 ( 2.52%) 6.64 ( 86.25%)
Amean syst-pound_clock_gettime-12 78.93 ( 0.00%) 57.30 ( 27.41%) 71.04 ( 10.00%) 9.69 ( 87.72%)
Amean syst-pound_clock_gettime-21 149.27 ( 0.00%) 105.34 ( 29.43%) 138.19 ( 7.42%) 16.50 ( 88.95%)
Amean syst-pound_clock_gettime-30 178.36 ( 0.00%) 119.83 ( 32.82%) 166.75 ( 6.51%) 18.67 ( 89.53%)
Amean syst-pound_clock_gettime-48 185.77 ( 0.00%) 124.80 ( 32.82%) 171.14 ( 7.88%) 21.12 ( 88.63%)
Amean syst-pound_clock_gettime-79 188.17 ( 0.00%) 126.34 ( 32.86%) 173.99 ( 7.53%) 24.07 ( 87.21%)
Amean syst-pound_clock_gettime-96 190.24 ( 0.00%) 126.63 ( 33.44%) 175.32 ( 7.84%) 26.12 ( 86.27%)
Amean syst-pound_times-2 11.02 ( 0.00%) 6.91 ( 37.27%) 8.85 ( 19.68%) 4.36 ( 60.45%)
Amean syst-pound_times-5 27.99 ( 0.00%) 21.31 ( 23.88%) 23.42 ( 16.32%) 17.95 ( 35.87%)
Amean syst-pound_times-8 47.33 ( 0.00%) 39.27 ( 17.04%) 47.16 ( 0.35%) 28.56 ( 39.66%)
Amean syst-pound_times-12 78.24 ( 0.00%) 58.26 ( 25.55%) 71.55 ( 8.55%) 42.78 ( 45.32%)
Amean syst-pound_times-21 148.75 ( 0.00%) 106.28 ( 28.55%) 138.22 ( 7.08%) 74.25 ( 50.09%)
Amean syst-pound_times-30 177.74 ( 0.00%) 121.16 ( 31.83%) 166.70 ( 6.21%) 81.82 ( 53.96%)
Amean syst-pound_times-48 184.85 ( 0.00%) 125.37 ( 32.18%) 170.87 ( 7.56%) 84.20 ( 54.45%)
Amean syst-pound_times-79 189.50 ( 0.00%) 127.45 ( 32.74%) 175.58 ( 7.34%) 86.01 ( 54.61%)
Amean syst-pound_times-96 190.56 ( 0.00%) 127.11 ( 33.30%) 175.08 ( 8.12%) 86.03 ( 54.85%)
Stddev real-pound_clock_gettime-2 0.25 ( 0.00%) 0.27 ( -7.76%) 0.41 (-65.62%) 0.10 ( 60.73%)
Stddev real-pound_clock_gettime-5 0.07 ( 0.00%) 0.09 (-35.10%) 0.51 (-674.46%) 0.05 ( 26.28%)
Stddev real-pound_clock_gettime-8 0.28 ( 0.00%) 0.02 ( 92.09%) 0.04 ( 86.10%) 0.02 ( 93.65%)
Stddev real-pound_clock_gettime-12 0.08 ( 0.00%) 0.02 ( 78.31%) 0.04 ( 52.02%) 0.02 ( 78.95%)
Stddev real-pound_clock_gettime-21 0.06 ( 0.00%) 0.02 ( 68.54%) 0.11 (-70.01%) 0.01 ( 78.27%)
Stddev real-pound_clock_gettime-30 0.05 ( 0.00%) 0.01 ( 75.00%) 0.10 (-98.93%) 0.04 ( 20.82%)
Stddev real-pound_clock_gettime-48 0.09 ( 0.00%) 0.19 (-106.51%) 0.08 ( 15.24%) 0.04 ( 58.70%)
Stddev real-pound_clock_gettime-79 0.03 ( 0.00%) 0.10 (-191.56%) 0.08 (-138.02%) 0.04 (-21.18%)
Stddev real-pound_clock_gettime-96 0.05 ( 0.00%) 0.08 (-56.69%) 0.07 (-21.04%) 0.04 ( 31.40%)
Stddev real-pound_times-2 0.55 ( 0.00%) 0.25 ( 53.82%) 0.38 ( 30.80%) 0.14 ( 74.19%)
Stddev real-pound_times-5 0.06 ( 0.00%) 0.28 (-358.77%) 0.13 (-108.26%) 0.03 ( 54.64%)
Stddev real-pound_times-8 0.25 ( 0.00%) 0.04 ( 83.52%) 0.06 ( 76.99%) 0.06 ( 76.94%)
Stddev real-pound_times-12 0.09 ( 0.00%) 0.05 ( 41.52%) 0.02 ( 77.55%) 0.04 ( 51.60%)
Stddev real-pound_times-21 0.06 ( 0.00%) 0.15 (-141.91%) 0.11 (-74.22%) 0.03 ( 48.73%)
Stddev real-pound_times-30 0.06 ( 0.00%) 0.14 (-129.04%) 0.10 (-66.59%) 0.04 ( 30.36%)
Stddev real-pound_times-48 0.05 ( 0.00%) 0.13 (-151.20%) 0.07 (-37.30%) 0.02 ( 54.64%)
Stddev real-pound_times-79 0.04 ( 0.00%) 0.11 (-205.48%) 0.07 (-97.82%) 0.03 ( 28.17%)
Stddev real-pound_times-96 0.05 ( 0.00%) 0.05 ( -1.83%) 0.04 ( 24.17%) 0.04 ( 20.00%)
Stddev syst-pound_clock_gettime-2 0.47 ( 0.00%) 0.45 ( 4.96%) 0.79 (-66.33%) 0.18 ( 61.36%)
Stddev syst-pound_clock_gettime-5 0.32 ( 0.00%) 0.39 (-20.09%) 2.49 (-666.63%) 0.25 ( 21.71%)
Stddev syst-pound_clock_gettime-8 2.25 ( 0.00%) 0.26 ( 88.54%) 0.40 ( 82.10%) 0.17 ( 92.55%)
Stddev syst-pound_clock_gettime-12 1.23 ( 0.00%) 0.43 ( 64.59%) 0.73 ( 40.82%) 0.19 ( 84.58%)
Stddev syst-pound_clock_gettime-21 1.15 ( 0.00%) 1.06 ( 7.62%) 2.64 (-129.56%) 0.66 ( 42.45%)
Stddev syst-pound_clock_gettime-30 1.34 ( 0.00%) 1.26 ( 6.25%) 2.69 (-99.81%) 1.58 (-17.86%)
Stddev syst-pound_clock_gettime-48 2.52 ( 0.00%) 4.85 (-92.44%) 2.12 ( 15.94%) 1.08 ( 57.23%)
Stddev syst-pound_clock_gettime-79 1.22 ( 0.00%) 2.51 (-105.82%) 1.99 (-62.56%) 0.96 ( 21.62%)
Stddev syst-pound_clock_gettime-96 1.54 ( 0.00%) 2.21 (-43.34%) 1.74 (-12.67%) 0.80 ( 48.24%)
Stddev syst-pound_times-2 1.09 ( 0.00%) 0.50 ( 53.61%) 0.76 ( 30.43%) 0.28 ( 74.11%)
Stddev syst-pound_times-5 0.30 ( 0.00%) 1.41 (-367.82%) 0.65 (-115.62%) 0.21 ( 29.66%)
Stddev syst-pound_times-8 2.12 ( 0.00%) 0.27 ( 87.24%) 0.71 ( 66.44%) 0.55 ( 73.94%)
Stddev syst-pound_times-12 1.03 ( 0.00%) 0.74 ( 27.70%) 0.37 ( 64.41%) 0.47 ( 54.81%)
Stddev syst-pound_times-21 1.60 ( 0.00%) 3.07 (-92.49%) 2.30 (-43.99%) 0.93 ( 41.93%)
Stddev syst-pound_times-30 1.75 ( 0.00%) 3.05 (-74.55%) 2.84 (-62.67%) 1.17 ( 32.95%)
Stddev syst-pound_times-48 0.79 ( 0.00%) 3.36 (-327.41%) 2.51 (-219.14%) 0.51 ( 34.63%)
Stddev syst-pound_times-79 1.08 ( 0.00%) 2.77 (-156.12%) 1.84 (-70.34%) 0.86 ( 20.82%)
Stddev syst-pound_times-96 1.19 ( 0.00%) 1.35 (-13.61%) 1.01 ( 15.16%) 1.29 ( -8.56%)
CoeffVar real-pound_clock_gettime-2 5.19 ( 0.00%) 8.73 (-68.19%) 9.97 (-92.16%) 7.43 (-43.23%)
CoeffVar real-pound_clock_gettime-5 1.19 ( 0.00%) 1.77 (-48.40%) 9.49 (-696.07%) 4.66 (-291.28%)
CoeffVar real-pound_clock_gettime-8 4.53 ( 0.00%) 0.45 ( 90.14%) 0.64 ( 85.76%) 1.95 ( 56.89%)
CoeffVar real-pound_clock_gettime-12 1.24 ( 0.00%) 0.37 ( 70.42%) 0.66 ( 46.83%) 2.00 (-60.60%)
CoeffVar real-pound_clock_gettime-21 0.88 ( 0.00%) 0.39 ( 55.78%) 1.61 (-82.98%) 1.58 (-79.84%)
CoeffVar real-pound_clock_gettime-30 0.68 ( 0.00%) 0.25 ( 63.35%) 1.44 (-112.50%) 4.49 (-560.29%)
CoeffVar real-pound_clock_gettime-48 1.18 ( 0.00%) 3.61 (-204.68%) 1.09 ( 8.14%) 4.06 (-243.23%)
CoeffVar real-pound_clock_gettime-79 0.43 ( 0.00%) 1.85 (-333.44%) 1.10 (-157.77%) 3.87 (-808.42%)
CoeffVar real-pound_clock_gettime-96 0.68 ( 0.00%) 1.59 (-134.97%) 0.89 (-31.18%) 3.28 (-383.77%)
CoeffVar real-pound_times-2 9.79 ( 0.00%) 7.15 ( 26.98%) 8.39 ( 14.33%) 6.21 ( 36.61%)
CoeffVar real-pound_times-5 1.06 ( 0.00%) 6.39 (-501.98%) 2.64 (-148.46%) 0.75 ( 29.71%)
CoeffVar real-pound_times-8 4.24 ( 0.00%) 0.84 ( 80.17%) 0.98 ( 76.91%) 1.61 ( 61.95%)
CoeffVar real-pound_times-12 1.29 ( 0.00%) 1.01 ( 22.11%) 0.32 ( 75.51%) 1.14 ( 12.18%)
CoeffVar real-pound_times-21 0.87 ( 0.00%) 2.91 (-236.03%) 1.63 (-87.60%) 0.87 ( -0.97%)
CoeffVar real-pound_times-30 0.78 ( 0.00%) 2.62 (-233.35%) 1.40 (-78.41%) 1.16 (-47.41%)
CoeffVar real-pound_times-48 0.65 ( 0.00%) 2.40 (-269.32%) 0.97 (-48.56%) 0.64 ( 2.00%)
CoeffVar real-pound_times-79 0.45 ( 0.00%) 2.03 (-352.70%) 0.96 (-113.68%) 0.71 (-57.00%)
CoeffVar real-pound_times-96 0.61 ( 0.00%) 0.93 (-52.43%) 0.50 ( 17.50%) 1.07 (-75.79%)
CoeffVar syst-pound_clock_gettime-2 5.12 ( 0.00%) 8.25 (-60.95%) 9.95 (-94.12%) 7.81 (-52.47%)
CoeffVar syst-pound_clock_gettime-5 1.20 ( 0.00%) 1.61 (-34.04%) 9.48 (-689.57%) 5.37 (-346.99%)
CoeffVar syst-pound_clock_gettime-8 4.66 ( 0.00%) 0.67 ( 85.58%) 0.86 ( 81.64%) 2.53 ( 45.79%)
CoeffVar syst-pound_clock_gettime-12 1.56 ( 0.00%) 0.76 ( 51.21%) 1.02 ( 34.25%) 1.95 (-25.60%)
CoeffVar syst-pound_clock_gettime-21 0.77 ( 0.00%) 1.01 (-30.89%) 1.91 (-147.96%) 4.01 (-420.63%)
CoeffVar syst-pound_clock_gettime-30 0.75 ( 0.00%) 1.05 (-39.54%) 1.61 (-113.72%) 8.48 (-1026.12%)
CoeffVar syst-pound_clock_gettime-48 1.36 ( 0.00%) 3.89 (-186.46%) 1.24 ( 8.75%) 5.11 (-276.18%)
CoeffVar syst-pound_clock_gettime-79 0.65 ( 0.00%) 1.99 (-206.55%) 1.14 (-75.81%) 3.98 (-512.73%)
CoeffVar syst-pound_clock_gettime-96 0.81 ( 0.00%) 1.74 (-115.35%) 0.99 (-22.26%) 3.05 (-277.01%)
CoeffVar syst-pound_times-2 9.86 ( 0.00%) 7.29 ( 26.04%) 8.54 ( 13.39%) 6.45 ( 34.55%)
CoeffVar syst-pound_times-5 1.08 ( 0.00%) 6.62 (-514.60%) 2.78 (-157.67%) 1.18 ( -9.68%)
CoeffVar syst-pound_times-8 4.48 ( 0.00%) 0.69 ( 84.62%) 1.51 ( 66.32%) 1.94 ( 56.81%)
CoeffVar syst-pound_times-12 1.32 ( 0.00%) 1.28 ( 2.89%) 0.51 ( 61.08%) 1.09 ( 17.35%)
CoeffVar syst-pound_times-21 1.07 ( 0.00%) 2.89 (-169.42%) 1.66 (-54.96%) 1.25 (-16.34%)
CoeffVar syst-pound_times-30 0.98 ( 0.00%) 2.52 (-156.06%) 1.71 (-73.43%) 1.43 (-45.64%)
CoeffVar syst-pound_times-48 0.43 ( 0.00%) 2.68 (-530.20%) 1.47 (-245.25%) 0.61 (-43.50%)
CoeffVar syst-pound_times-79 0.57 ( 0.00%) 2.17 (-280.81%) 1.05 (-83.84%) 1.00 (-74.46%)
CoeffVar syst-pound_times-96 0.63 ( 0.00%) 1.07 (-70.33%) 0.58 ( 7.66%) 1.50 (-140.44%)
Max real-pound_clock_gettime-2 5.10 ( 0.00%) 3.56 ( 30.20%) 4.98 ( 2.35%) 1.47 ( 71.18%)
Max real-pound_clock_gettime-5 5.59 ( 0.00%) 5.10 ( 8.77%) 6.00 ( -7.33%) 1.17 ( 79.07%)
Max real-pound_clock_gettime-8 6.82 ( 0.00%) 4.95 ( 27.42%) 6.02 ( 11.73%) 0.93 ( 86.36%)
Max real-pound_clock_gettime-12 6.82 ( 0.00%) 4.93 ( 27.71%) 6.13 ( 10.12%) 0.90 ( 86.80%)
Max real-pound_clock_gettime-21 7.33 ( 0.00%) 5.17 ( 29.47%) 7.01 ( 4.37%) 0.89 ( 87.86%)
Max real-pound_clock_gettime-30 7.71 ( 0.00%) 5.24 ( 32.04%) 7.38 ( 4.28%) 1.00 ( 87.03%)
Max real-pound_clock_gettime-48 8.11 ( 0.00%) 5.86 ( 27.74%) 7.47 ( 7.89%) 1.05 ( 87.05%)
Max real-pound_clock_gettime-79 8.03 ( 0.00%) 5.53 ( 31.13%) 7.48 ( 6.85%) 1.13 ( 85.93%)
Max real-pound_clock_gettime-96 8.05 ( 0.00%) 5.55 ( 31.06%) 7.51 ( 6.71%) 1.21 ( 84.97%)
Max real-pound_times-2 6.66 ( 0.00%) 3.89 ( 41.59%) 5.23 ( 21.47%) 2.56 ( 61.56%)
Max real-pound_times-5 5.77 ( 0.00%) 4.96 ( 14.04%) 5.01 ( 13.17%) 3.69 ( 36.05%)
Max real-pound_times-8 6.42 ( 0.00%) 5.04 ( 21.50%) 6.02 ( 6.23%) 3.72 ( 42.06%)
Max real-pound_times-12 6.69 ( 0.00%) 5.07 ( 24.22%) 6.07 ( 9.27%) 3.67 ( 45.14%)
Max real-pound_times-21 7.32 ( 0.00%) 5.63 ( 23.09%) 7.00 ( 4.37%) 3.68 ( 49.73%)
Max real-pound_times-30 7.78 ( 0.00%) 5.68 ( 26.99%) 7.36 ( 5.40%) 3.66 ( 52.96%)
Max real-pound_times-48 7.98 ( 0.00%) 5.58 ( 30.08%) 7.41 ( 7.14%) 3.68 ( 53.88%)
Max real-pound_times-79 8.05 ( 0.00%) 5.61 ( 30.31%) 7.53 ( 6.46%) 3.69 ( 54.16%)
Max real-pound_times-96 8.08 ( 0.00%) 5.42 ( 32.92%) 7.42 ( 8.17%) 3.71 ( 54.08%)
Max syst-pound_clock_gettime-2 9.91 ( 0.00%) 6.30 ( 36.43%) 9.64 ( 2.72%) 2.68 ( 72.96%)
Max syst-pound_clock_gettime-5 27.53 ( 0.00%) 24.74 ( 10.13%) 29.35 ( -6.61%) 5.43 ( 80.28%)
Max syst-pound_clock_gettime-8 53.96 ( 0.00%) 38.82 ( 28.06%) 47.75 ( 11.51%) 6.99 ( 87.05%)
Max syst-pound_clock_gettime-12 81.09 ( 0.00%) 57.99 ( 28.49%) 71.93 ( 11.30%) 10.04 ( 87.62%)
Max syst-pound_clock_gettime-21 151.50 ( 0.00%) 107.03 ( 29.35%) 145.33 ( 4.07%) 17.48 ( 88.46%)
Max syst-pound_clock_gettime-30 179.94 ( 0.00%) 121.68 ( 32.38%) 172.10 ( 4.36%) 21.29 ( 88.17%)
Max syst-pound_clock_gettime-48 191.29 ( 0.00%) 136.82 ( 28.48%) 174.84 ( 8.60%) 23.80 ( 87.56%)
Max syst-pound_clock_gettime-79 190.22 ( 0.00%) 130.28 ( 31.51%) 177.26 ( 6.81%) 25.71 ( 86.48%)
Max syst-pound_clock_gettime-96 192.02 ( 0.00%) 132.27 ( 31.12%) 178.26 ( 7.17%) 27.66 ( 85.60%)
Max syst-pound_times-2 13.10 ( 0.00%) 7.57 ( 42.21%) 10.21 ( 22.06%) 4.89 ( 62.67%)
Max syst-pound_times-5 28.56 ( 0.00%) 24.55 ( 14.04%) 24.80 ( 13.17%) 18.20 ( 36.27%)
Max syst-pound_times-8 50.89 ( 0.00%) 39.54 ( 22.30%) 47.78 ( 6.11%) 29.45 ( 42.13%)
Max syst-pound_times-12 79.85 ( 0.00%) 59.80 ( 25.11%) 72.21 ( 9.57%) 43.27 ( 45.81%)
Max syst-pound_times-21 151.33 ( 0.00%) 115.02 ( 23.99%) 144.60 ( 4.45%) 75.85 ( 49.88%)
Max syst-pound_times-30 180.79 ( 0.00%) 130.12 ( 28.03%) 171.98 ( 4.87%) 83.31 ( 53.92%)
Max syst-pound_times-48 186.61 ( 0.00%) 130.89 ( 29.86%) 174.40 ( 6.54%) 84.85 ( 54.53%)
Max syst-pound_times-79 190.96 ( 0.00%) 133.09 ( 30.30%) 179.58 ( 5.96%) 87.17 ( 54.35%)
Max syst-pound_times-96 192.42 ( 0.00%) 128.95 ( 32.99%) 177.09 ( 7.97%) 87.82 ( 54.36%)
vanilla rever prefetc mas
4.7 revert prefetch mask
User 54.91 73.30 56.08 47.56
System 21115.14 14616.16 19553.36 6360.52
Elapsed 1247.71 890.24 1149.26 409.20
vanilla rever prefetc mas
4.7 revert prefetch mask
Minor Faults 291321 267632 324632 274236
Major Faults 196 272 279 279
Swap Ins 0 0 0 0
Swap Outs 0 0 0 0
Allocation stalls 0 0 0 0
DMA allocs 0 0 0 0
DMA32 allocs 12836 11773 23439 21745
Normal allocs 252492 245667 302327 270404
Movable allocs 0 0 0 0
Direct pages scanned 0 0 0 0
Kswapd pages scanned 0 0 0 0
Kswapd pages reclaimed 0 0 0 0
Direct pages reclaimed 0 0 0 0
Kswapd efficiency 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0%
Zone normal velocity 0.000 0.000 0.000 0.000
Zone dma32 velocity 0.000 0.000 0.000 0.000
Zone dma velocity 0.000 0.000 0.000 0.000
Page writes by reclaim 0.000 0.000 0.000 0.000
Page writes file 0 0 0 0
Page writes anon 0 0 0 0
Page reclaim immediate 0 0 0 0
Sector Reads 24440 38464 144944 143876
Sector Writes 569300 12712 16036 6956
Page rescued immediate 0 0 0 0
Slabs scanned 0 0 0 0
Direct inode steals 0 0 0 0
Kswapd inode steals 0 0 0 0
Kswapd skipped wait 0 0 0 0
THP fault alloc 0 0 0 0
THP collapse alloc 0 0 0 0
THP splits 0 0 0 0
THP fault fallback 0 0 0 0
THP collapse fail 0 0 0 0
Compaction stalls 0 0 0 0
Compaction success 0 0 0 0
Compaction failures 0 0 0 0
Page migrate success 11177 10858 14598 9857
Page migrate failure 0 2 1 1
Compaction pages isolated 0 0 0 0
Compaction migrate scanned 0 0 0 0
Compaction free scanned 0 0 0 0
Compaction cost 11 11 15 10
NUMA alloc hit 237281 229068 296261 263464
NUMA alloc miss 7 5 5 6
NUMA interleave hit 0 0 0 0
NUMA alloc local 237281 229068 296261 263464
NUMA base PTE updates 25433 20398 35883 22264
NUMA huge PMD updates 0 0 0 0
NUMA page range updates 25433 20398 35883 22264
NUMA hint faults 23242 18097 31026 17002
NUMA hint local faults 10012 6038 14657 6903
NUMA hint local percent 43 33 47 40
NUMA pages migrated 11177 10858 14598 9857
AutoNUMA cost 116% 90% 155% 85%
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-12 12:10 ` Stanislaw Gruszka
@ 2016-08-15 7:49 ` Giovanni Gherdovich
2016-08-15 8:33 ` Mel Gorman
2016-08-15 9:13 ` Wanpeng Li
1 sibling, 1 reply; 14+ messages in thread
From: Giovanni Gherdovich @ 2016-08-15 7:49 UTC (permalink / raw)
To: Stanislaw Gruszka, Ingo Molnar
Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, linux-kernel, Mel Gorman
Hello Stanislaw,
On Fri, 2016-08-12 at 14:10 +0200, Stanislaw Gruszka wrote:
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> unsigned int seq, nextseq;
> unsigned long flags;
>
> + (void) task_sched_runtime(tsk);
> +
> rcu_read_lock();
> /* Attempt a lockless read on the first round. */
> nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> task_cputime(t, &utime, &stime);
> times->utime += utime;
> times->stime += stime;
> - times->sum_exec_runtime += task_sched_runtime(t);
> + times->sum_exec_runtime += t->se.sum_exec_runtime;
> }
> /* If lockless access failed, take the lock. */
> nextseq = 1;
> ---
> mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
>
> vanila-4.7 revert prefetch patch
> 4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
> 5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
> 6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
> 6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
> 7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
> 7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
> 7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
> 7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
> 8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
> 5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
> 5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
> 5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
> 6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
> 7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
> 7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
> 7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
> 7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
>
> Patch works because we we update sum_exec_runtime on current thread
> what assure we see proper sum_exec_runtime value on different CPUs. I
> tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0,
> patch did not break them. I'm going to run some other test.
>
> Patch is draft version for early review, task_sched_runtime() will be
> simplified (since it's called only current thread) and possibly split
> into two functions: one that call update_curr() and other that return
> sum_exec_runtime (assure it's consistent on 32 bit arches).
>
> Stanislaw
Thank you for having a look at this.
Your patch performs very well, even better than the pre-6e998916dfe3
numbers I was aiming for. I confirm your results on my test machine
(Sandy Bridge, 32 cores, 2 NUMA nodes).
I didn't apply on the very latest 4.8-rc but used what I had handy for
comparison (i.e. 4.7-rc7 and the parent of 6e998916dfe3).
As I said, my measurements match yours (my tables follow); looks like
your diff cures the problem while mine cures the symptoms.
clock_gettime():
threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + Stanislaw
(pre-6e998916dfe3)
2 3.48 2.23 ( 35.68%) 3.06 ( 11.83%) 1.08 ( 68.81%)
5 3.33 2.83 ( 14.84%) 3.25 ( 2.40%) 0.71 ( 78.55%)
8 3.37 2.84 ( 15.80%) 3.26 ( 3.30%) 0.56 ( 83.49%)
12 3.32 3.09 ( 6.69%) 3.37 ( -1.60%) 0.42 ( 87.28%)
21 4.01 3.14 ( 21.70%) 3.90 ( 2.74%) 0.35 ( 91.35%)
30 3.63 3.28 ( 9.75%) 3.36 ( 7.41%) 0.28 ( 92.23%)
48 3.71 3.02 ( 18.69%) 3.11 ( 16.27%) 0.39 ( 89.39%)
79 3.75 2.88 ( 23.23%) 3.16 ( 15.74%) 0.46 ( 87.76%)
110 3.81 2.95 ( 22.62%) 3.25 ( 14.80%) 0.56 ( 85.41%)
128 3.88 3.05 ( 21.28%) 3.31 ( 14.76%) 0.62 ( 84.10%)
times():
threads 4.7-rc7 3.18-rc3 4.7-rc7 + prefetch 4.7-rc7 + Stanislaw
(pre-6e998916dfe3)
2 3.65 2.27 ( 37.94%) 3.25 ( 11.03%) 1.62 ( 55.71%)
5 3.45 2.78 ( 19.34%) 3.17 ( 7.92%) 2.33 ( 32.28%)
8 3.52 2.79 ( 20.66%) 3.22 ( 8.69%) 2.06 ( 41.44%)
12 3.29 3.02 ( 8.33%) 3.36 ( -2.04%) 2.00 ( 39.18%)
21 4.07 3.10 ( 23.86%) 3.92 ( 3.78%) 2.07 ( 49.18%)
30 3.87 3.33 ( 13.80%) 3.40 ( 12.17%) 1.89 ( 51.12%)
48 3.79 2.96 ( 21.94%) 3.16 ( 16.61%) 1.69 ( 55.46%)
79 3.88 2.88 ( 25.82%) 3.28 ( 15.42%) 1.60 ( 58.81%)
110 3.90 2.98 ( 23.73%) 3.38 ( 13.35%) 1.73 ( 55.61%)
128 4.00 3.10 ( 22.40%) 3.38 ( 15.45%) 1.66 ( 58.52%)
Regards,
Giovanni
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 7:49 ` Giovanni Gherdovich
@ 2016-08-15 8:33 ` Mel Gorman
2016-08-15 9:19 ` Stanislaw Gruszka
0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-08-15 8:33 UTC (permalink / raw)
To: Giovanni Gherdovich
Cc: Stanislaw Gruszka, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
Mike Galbraith, linux-kernel
On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
> >
> > vanila-4.7 revert prefetch patch
> > 4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
> > 5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
> > 6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
> > 6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
> > 7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
> > 7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
> > 7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
> > 7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
> > 8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
> > 5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
> > 5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
> > 5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
> > 6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
> > 7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
> > 7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
> > 7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
> > 7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
> >
> > Patch works because we we update sum_exec_runtime on current thread
> > what assure we see proper sum_exec_runtime value on different CPUs. I
> > tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0,
> > patch did not break them. I'm going to run some other test.
> >
> > Patch is draft version for early review, task_sched_runtime() will be
> > simplified (since it's called only current thread) and possibly split
> > into two functions: one that call update_curr() and other that return
> > sum_exec_runtime (assure it's consistent on 32 bit arches).
> >
> > Stanislaw
>
Is this really equivalent though? It updates one task instead of all
tasks in the group and there is no guarantee that tsk == current.
Glancing at it, it should monotonically increase but it looks like it
would calculate stale data.
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-12 12:10 ` Stanislaw Gruszka
2016-08-15 7:49 ` Giovanni Gherdovich
@ 2016-08-15 9:13 ` Wanpeng Li
2016-08-15 9:21 ` Stanislaw Gruszka
1 sibling, 1 reply; 14+ messages in thread
From: Wanpeng Li @ 2016-08-15 9:13 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra,
Mike Galbraith, linux-kernel, Mel Gorman
2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> Hi
>
> On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> Nice detective work! I'm wondering, where do we stand if compared with a
>> pre-6e998916dfe3 kernel?
>>
>> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
>> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
>> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
>> make the result correct, just see what the performance gap is, roughly.
>>
>> If there's still a significant gap then it might make sense to optimize this some
>> more.
>
> I measured (partial) revert performance on 4.7 using mmtest instructions
> from Giovanni and also tested some other possible fix (draft version):
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 75f98c5..54fdf6d 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> unsigned int seq, nextseq;
> unsigned long flags;
>
> + (void) task_sched_runtime(tsk);
> +
> rcu_read_lock();
> /* Attempt a lockless read on the first round. */
> nextseq = 0;
> @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> task_cputime(t, &utime, &stime);
> times->utime += utime;
> times->stime += stime;
> - times->sum_exec_runtime += task_sched_runtime(t);
> + times->sum_exec_runtime += t->se.sum_exec_runtime;
If this will not have updated stats for other threads?
Regards,
Wanpeng Li
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 8:33 ` Mel Gorman
@ 2016-08-15 9:19 ` Stanislaw Gruszka
2016-08-15 9:58 ` Mel Gorman
0 siblings, 1 reply; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15 9:19 UTC (permalink / raw)
To: Mel Gorman
Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
Mike Galbraith, linux-kernel
On Mon, Aug 15, 2016 at 09:33:49AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 09:49:05AM +0200, Giovanni Gherdovich wrote:
> > > mmtest benchmark results are below (full compare-kernels.sh output is in attachment):
> > >
> > > vanila-4.7 revert prefetch patch
> > > 4.74 ( 0.00%) 3.04 ( 35.93%) 4.09 ( 13.81%) 1.30 ( 72.59%)
> > > 5.49 ( 0.00%) 5.00 ( 8.97%) 5.34 ( 2.72%) 1.03 ( 81.16%)
> > > 6.12 ( 0.00%) 4.91 ( 19.73%) 5.97 ( 2.40%) 0.90 ( 85.27%)
> > > 6.68 ( 0.00%) 4.90 ( 26.66%) 6.02 ( 9.75%) 0.88 ( 86.89%)
> > > 7.21 ( 0.00%) 5.13 ( 28.85%) 6.70 ( 7.09%) 0.87 ( 87.91%)
> > > 7.66 ( 0.00%) 5.22 ( 31.80%) 7.17 ( 6.39%) 0.92 ( 88.01%)
> > > 7.91 ( 0.00%) 5.36 ( 32.22%) 7.30 ( 7.72%) 0.95 ( 87.97%)
> > > 7.95 ( 0.00%) 5.35 ( 32.73%) 7.34 ( 7.66%) 1.06 ( 86.66%)
> > > 8.00 ( 0.00%) 5.33 ( 33.31%) 7.38 ( 7.73%) 1.13 ( 85.82%)
> > > 5.61 ( 0.00%) 3.55 ( 36.76%) 4.53 ( 19.23%) 2.29 ( 59.28%)
> > > 5.66 ( 0.00%) 4.32 ( 23.79%) 4.75 ( 16.18%) 3.65 ( 35.46%)
> > > 5.98 ( 0.00%) 4.97 ( 16.87%) 5.96 ( 0.35%) 3.62 ( 39.40%)
> > > 6.58 ( 0.00%) 4.94 ( 24.93%) 6.04 ( 8.32%) 3.63 ( 44.89%)
> > > 7.19 ( 0.00%) 5.18 ( 28.01%) 6.68 ( 7.13%) 3.65 ( 49.22%)
> > > 7.67 ( 0.00%) 5.27 ( 31.29%) 7.16 ( 6.63%) 3.62 ( 52.76%)
> > > 7.88 ( 0.00%) 5.36 ( 31.98%) 7.28 ( 7.58%) 3.65 ( 53.71%)
> > > 7.99 ( 0.00%) 5.39 ( 32.52%) 7.40 ( 7.42%) 3.65 ( 54.25%)
> > >
> > > Patch works because we we update sum_exec_runtime on current thread
> > > what assure we see proper sum_exec_runtime value on different CPUs. I
> > > tested it with reproducers from commits 6e998916dfe32 and d670ec13178d0,
> > > patch did not break them. I'm going to run some other test.
> > >
> > > Patch is draft version for early review, task_sched_runtime() will be
> > > simplified (since it's called only current thread) and possibly split
> > > into two functions: one that call update_curr() and other that return
> > > sum_exec_runtime (assure it's consistent on 32 bit arches).
> > >
> > > Stanislaw
> >
>
> Is this really equivalent though? It updates one task instead of all
> tasks in the group and there is no guarantee that tsk == current.
Oh, my intention was to update runtime on current.
> Glancing at it, it should monotonically increase but it looks like it
> would calculate stale data.
Yes, until the next tick on a CPU, the patch does not count partial
runtime of thread running on that CPU. However that was the behaviour
before commit d670ec13178d0 - that how old thread_group_sched_runtime()
function worked:
/*
- * Return sum_exec_runtime for the thread group.
- * In case the task is currently running, return the sum plus current's
- * pending runtime that have not been accounted yet.
- *
- * Note that the thread group might have other running tasks as well,
- * so the return value not includes other pending runtime that other
- * running tasks might have.
- */
-unsigned long long thread_group_sched_runtime(struct task_struct *p)
Stanislaw
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:13 ` Wanpeng Li
@ 2016-08-15 9:21 ` Stanislaw Gruszka
2016-08-15 9:28 ` Wanpeng Li
0 siblings, 1 reply; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15 9:21 UTC (permalink / raw)
To: Wanpeng Li
Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra,
Mike Galbraith, linux-kernel, Mel Gorman
On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> > Hi
> >
> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
> >> Nice detective work! I'm wondering, where do we stand if compared with a
> >> pre-6e998916dfe3 kernel?
> >>
> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
> >> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
> >> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
> >> make the result correct, just see what the performance gap is, roughly.
> >>
> >> If there's still a significant gap then it might make sense to optimize this some
> >> more.
> >
> > I measured (partial) revert performance on 4.7 using mmtest instructions
> > from Giovanni and also tested some other possible fix (draft version):
> >
> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> > index 75f98c5..54fdf6d 100644
> > --- a/kernel/sched/cputime.c
> > +++ b/kernel/sched/cputime.c
> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> > unsigned int seq, nextseq;
> > unsigned long flags;
> >
> > + (void) task_sched_runtime(tsk);
> > +
> > rcu_read_lock();
> > /* Attempt a lockless read on the first round. */
> > nextseq = 0;
> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
> > task_cputime(t, &utime, &stime);
> > times->utime += utime;
> > times->stime += stime;
> > - times->sum_exec_runtime += task_sched_runtime(t);
> > + times->sum_exec_runtime += t->se.sum_exec_runtime;
>
> If this will not have updated stats for other threads?
No, until tick/sched() on CPUs running threads.
Stanislaw
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:21 ` Stanislaw Gruszka
@ 2016-08-15 9:28 ` Wanpeng Li
0 siblings, 0 replies; 14+ messages in thread
From: Wanpeng Li @ 2016-08-15 9:28 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Ingo Molnar, Giovanni Gherdovich, Ingo Molnar, Peter Zijlstra,
Mike Galbraith, linux-kernel, Mel Gorman
2016-08-15 17:21 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
> On Mon, Aug 15, 2016 at 05:13:30PM +0800, Wanpeng Li wrote:
>> 2016-08-12 20:10 GMT+08:00 Stanislaw Gruszka <sgruszka@redhat.com>:
>> > Hi
>> >
>> > On Wed, Aug 10, 2016 at 01:26:41PM +0200, Ingo Molnar wrote:
>> >> Nice detective work! I'm wondering, where do we stand if compared with a
>> >> pre-6e998916dfe3 kernel?
>> >>
>> >> I admit this is a difficult question: 6e998916dfe3 does not revert cleanly and I
>> >> suspect v3.17 does not run easily on a recent distro. Could you attempt to revert
>> >> the bad effects of 6e998916dfe3 perhaps, just to get numbers - i.e. don't try to
>> >> make the result correct, just see what the performance gap is, roughly.
>> >>
>> >> If there's still a significant gap then it might make sense to optimize this some
>> >> more.
>> >
>> > I measured (partial) revert performance on 4.7 using mmtest instructions
>> > from Giovanni and also tested some other possible fix (draft version):
>> >
>> > diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
>> > index 75f98c5..54fdf6d 100644
>> > --- a/kernel/sched/cputime.c
>> > +++ b/kernel/sched/cputime.c
>> > @@ -294,6 +294,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> > unsigned int seq, nextseq;
>> > unsigned long flags;
>> >
>> > + (void) task_sched_runtime(tsk);
>> > +
>> > rcu_read_lock();
>> > /* Attempt a lockless read on the first round. */
>> > nextseq = 0;
>> > @@ -308,7 +310,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>> > task_cputime(t, &utime, &stime);
>> > times->utime += utime;
>> > times->stime += stime;
>> > - times->sum_exec_runtime += task_sched_runtime(t);
>> > + times->sum_exec_runtime += t->se.sum_exec_runtime;
>>
>> If this will not have updated stats for other threads?
>
> No, until tick/sched() on CPUs running threads.
Yeah, I think this change will result in not updated stats for other
threads if they are running and before next update_curr() is called.
Regards,
Wanpeng Li
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:19 ` Stanislaw Gruszka
@ 2016-08-15 9:58 ` Mel Gorman
2016-08-15 10:29 ` Stanislaw Gruszka
0 siblings, 1 reply; 14+ messages in thread
From: Mel Gorman @ 2016-08-15 9:58 UTC (permalink / raw)
To: Stanislaw Gruszka
Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
Mike Galbraith, linux-kernel
On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > Is this really equivalent though? It updates one task instead of all
> > tasks in the group and there is no guarantee that tsk == current.
>
> Oh, my intention was to update runtime on current.
>
Ok, so minimally that would need addressing. However, then I would worry
that two tasks in a group calling the function at the same time would
see different results because each of them updated a different task.
Such a situation is inherently race-prone anyway but it's a large enough
functional difference to be worth calling out.
Minimally, I don't think such a patch is a replacement for Giovanni's
which is functionally equivalent to the current code but could be layered
on top if it is proven to be ok.
> > Glancing at it, it should monotonically increase but it looks like it
> > would calculate stale data.
>
> Yes, until the next tick on a CPU, the patch does not count partial
> runtime of thread running on that CPU. However that was the behaviour
> before commit d670ec13178d0 - that how old thread_group_sched_runtime()
> function worked:
>
Sure, but does this patch not reintroduce the "SMP wobble" and the
problem of "the diff of 'process' should always be >= the diff of
'thread'" ?
--
Mel Gorman
SUSE Labs
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] sched/cputime: Mitigate performance regression in times()/clock_gettime()
2016-08-15 9:58 ` Mel Gorman
@ 2016-08-15 10:29 ` Stanislaw Gruszka
0 siblings, 0 replies; 14+ messages in thread
From: Stanislaw Gruszka @ 2016-08-15 10:29 UTC (permalink / raw)
To: Mel Gorman
Cc: Giovanni Gherdovich, Ingo Molnar, Ingo Molnar, Peter Zijlstra,
Mike Galbraith, linux-kernel
On Mon, Aug 15, 2016 at 10:58:04AM +0100, Mel Gorman wrote:
> On Mon, Aug 15, 2016 at 11:19:01AM +0200, Stanislaw Gruszka wrote:
> > > Is this really equivalent though? It updates one task instead of all
> > > tasks in the group and there is no guarantee that tsk == current.
> >
> > Oh, my intention was to update runtime on current.
> >
>
> Ok, so minimally that would need addressing. However, then I would worry
> that two tasks in a group calling the function at the same time would
> see different results because each of them updated a different task.
> Such a situation is inherently race-prone anyway but it's a large enough
> functional difference to be worth calling out.
It races bacause we don't know which thread will call the clock_gettime()
first. But once that happen, second thread will see updated runtime value
from first thread as we call update_curr() for it with task_rq_lock (change
from commit 6e998916dfe3).
> Minimally, I don't think such a patch is a replacement for Giovanni's
> which is functionally equivalent to the current code but could be layered
> on top if it is proven to be ok.
I agree. I wanted to post my patch on top of Giovanni's.
> > > Glancing at it, it should monotonically increase but it looks like it
> > > would calculate stale data.
> >
> > Yes, until the next tick on a CPU, the patch does not count partial
> > runtime of thread running on that CPU. However that was the behaviour
> > before commit d670ec13178d0 - that how old thread_group_sched_runtime()
> > function worked:
> >
>
> Sure, but does this patch not reintroduce the "SMP wobble" and the
> problem of "the diff of 'process' should always be >= the diff of
> 'thread'" ?
It should not reintroduce that problem, also because of change from
commit 6e998916dfe3. When a thread reads sum_exec_runtime it also
update that value, then process reads updated value. I run test
case from "SMP wobble" commit and the problem do not happen
on my tests.
Perhaps I should post patch with a descriptive changelog and things
would be clearer ...
Stanislaw
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2016-08-15 10:32 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-05 8:21 [PATCH 0/1] sched/cputime: Mitigate performance regression in times()/clock_gettime() Giovanni Gherdovich
2016-08-05 8:21 ` [PATCH 1/1] " Giovanni Gherdovich
2016-08-10 11:26 ` Ingo Molnar
2016-08-10 13:02 ` Giovanni Gherdovich
2016-08-12 12:10 ` Stanislaw Gruszka
2016-08-15 7:49 ` Giovanni Gherdovich
2016-08-15 8:33 ` Mel Gorman
2016-08-15 9:19 ` Stanislaw Gruszka
2016-08-15 9:58 ` Mel Gorman
2016-08-15 10:29 ` Stanislaw Gruszka
2016-08-15 9:13 ` Wanpeng Li
2016-08-15 9:21 ` Stanislaw Gruszka
2016-08-15 9:28 ` Wanpeng Li
2016-08-10 18:00 ` [tip:sched/core] " tip-bot for Giovanni Gherdovich
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).