All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] sched/cputime: sum_exec_runtime fixes for 32-bit cpus
@ 2016-09-01  9:27 Stanislaw Gruszka
  2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2016-09-01  9:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Giovanni Gherdovich, Linus Torvalds, Mel Gorman, Mike Galbraith,
	Paolo Bonzini, Peter Zijlstra, Rik van Riel, Thomas Gleixner,
	Wanpeng Li, Ingo Molnar, Stanislaw Gruszka

We already improved the scalability of the SYS_times() and
SYS_clock_gettime(CLOCK_PROCESS_CPUTIME_ID) syscalls on 64-bit cpus.
However, performance on 32-bit cpus is still poor. These patches improve
that and fix a possible (but very unlikely) inconsistency problem with
the 64-bit sum_exec_runtime variable on 32-bit architectures.

The second patch could be merged with the first one, but I posted it
separately to allow for discussion.

Stanislaw Gruszka (3):
  sched/cputime: Improve scalability of times()/clock_gettime() on 32
    bit cpus
  sched/cputime: Make nr_migrations u32 on 32 bit cpus
  sched/cputime: Protect other sum_exec_runtime reads on 32 bit cpus

 fs/proc/base.c                 |  2 +-
 include/linux/sched.h          | 40 ++++++++++++++++++++++++++++++++++++++--
 kernel/delayacct.c             |  2 +-
 kernel/exit.c                  |  2 +-
 kernel/sched/core.c            | 17 +++++------------
 kernel/sched/cputime.c         | 24 ++----------------------
 kernel/sched/deadline.c        |  2 +-
 kernel/sched/debug.c           |  4 ++--
 kernel/sched/fair.c            |  2 +-
 kernel/sched/rt.c              |  2 +-
 kernel/time/posix-cpu-timers.c |  7 ++++---
 11 files changed, 57 insertions(+), 47 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01  9:27 [PATCH 0/3] sched/cputime: sum_exec_runtime fixes for 32-bit cpus Stanislaw Gruszka
@ 2016-09-01  9:27 ` Stanislaw Gruszka
  2016-09-01  9:49   ` Peter Zijlstra
                     ` (3 more replies)
  2016-09-01  9:27 ` [PATCH 2/3] sched/cputime: Make nr_migrations u32 " Stanislaw Gruszka
  2016-09-01  9:27 ` [PATCH 3/3] sched/cputime: Protect other sum_exec_runtime reads " Stanislaw Gruszka
  2 siblings, 4 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2016-09-01  9:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Giovanni Gherdovich, Linus Torvalds, Mel Gorman, Mike Galbraith,
	Paolo Bonzini, Peter Zijlstra, Rik van Riel, Thomas Gleixner,
	Wanpeng Li, Ingo Molnar, Stanislaw Gruszka

My previous commit:

  a1eb1411b4e4 ("sched/cputime: Improve scalability by not accounting thread group tasks pending runtime")

helped to achieve good performance of SYS_times() and
SYS_clock_gettime(CLOCK_PROCESS_CPUTIME_ID) on 64 bit architectures.
However, taking task_rq_lock() when reading t->se.sum_exec_runtime on
32 bit architectures still makes those syscalls slow.

The reason we take the lock is to keep the 64-bit sum_exec_runtime
variable consistent. While an inconsistency scenario is very unlikely,
I assume it may still happen, at least on some 32 bit architectures.

To protect the variable I introduced a new seqcount lock. Performance
improvements on a machine with 32 cores (32-bit cpus), measured by the
benchmarks described in commit:

  6075620b0590 ("sched/cputime: Mitigate performance regression in times()/clock_gettime()")

are shown below:

syscall-nr_threads:      not patched:          patched:

clock_gettime-2          3.42 (  0.00%)        2.24 ( 34.29%)
clock_gettime-5          3.72 (  0.00%)        1.41 ( 61.98%)
clock_gettime-8          3.76 (  0.00%)        1.25 ( 66.74%)
clock_gettime-12         3.68 (  0.00%)        1.14 ( 68.91%)
clock_gettime-21         4.24 (  0.00%)        1.02 ( 75.88%)
clock_gettime-30         4.56 (  0.00%)        0.89 ( 80.43%)
clock_gettime-48         5.98 (  0.00%)        0.87 ( 85.38%)
clock_gettime-79         8.87 (  0.00%)        0.87 ( 90.18%)
clock_gettime-110       12.30 (  0.00%)        0.88 ( 92.82%)
clock_gettime-128       14.98 (  0.00%)        0.89 ( 94.02%)
times-2                  4.03 (  0.00%)        2.50 ( 38.08%)
times-5                  3.71 (  0.00%)        1.61 ( 56.55%)
times-8                  3.77 (  0.00%)        1.47 ( 61.04%)
times-12                 3.79 (  0.00%)        2.30 ( 39.37%)
times-21                 4.32 (  0.00%)        2.62 ( 39.25%)
times-30                 4.58 (  0.00%)        2.87 ( 37.31%)
times-48                 5.98 (  0.00%)        2.75 ( 53.98%)
times-79                 8.96 (  0.00%)        2.82 ( 68.55%)
times-110               12.83 (  0.00%)        2.88 ( 77.59%)
times-128               15.39 (  0.00%)        2.82 ( 81.67%)
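
For readers unfamiliar with the pattern, a userspace sketch of the
protection introduced here (simplified stand-ins for the kernel's
seqcount primitives; assumptions mine): the writer makes the counter odd
while updating, and the reader retries until it observes a stable even
value around its load of the 64-bit variable.

```c
#include <assert.h>
#include <stdint.h>

static volatile unsigned int seq;
static uint64_t sum_exec_runtime;

static void update_sum_exec_runtime(uint64_t delta)
{
	__atomic_fetch_add(&seq, 1, __ATOMIC_RELEASE);	/* odd: update in flight */
	sum_exec_runtime += delta;
	__atomic_fetch_add(&seq, 1, __ATOMIC_RELEASE);	/* even: update done */
}

static uint64_t read_sum_exec_runtime(void)
{
	unsigned int s;
	uint64_t ns;

	do {
		s = __atomic_load_n(&seq, __ATOMIC_ACQUIRE);
		ns = sum_exec_runtime;
		/* retry if a write was in flight or completed meanwhile */
	} while ((s & 1) || s != __atomic_load_n(&seq, __ATOMIC_ACQUIRE));

	return ns;
}
```

The read side takes no lock, which is why the syscall-side numbers above
improve while the writer (update_curr()) only pays two counter bumps.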

I also tested some alternatives to the patch: using vtime_seqcount
instead of a new seqcount, and making sum_exec_runtime an atomic64_t
variable. The performance results of those were worse:

syscall-nr_threads:      New seqcount:         vtime_seqcount:       atomic64_t

clock_gettime-5          1.41 (  0.00%)        1.89 (-33.43%)        2.37 (-67.70%)
clock_gettime-8          1.25 (  0.00%)        1.64 (-31.39%)        2.25 (-79.82%)
clock_gettime-12         1.14 (  0.00%)        1.52 (-32.63%)        2.36 (-106.56%)
clock_gettime-21         1.02 (  0.00%)        1.30 (-27.17%)        2.51 (-145.85%)
clock_gettime-30         0.89 (  0.00%)        1.09 (-22.17%)        3.06 (-243.11%)
clock_gettime-48         0.87 (  0.00%)        1.05 (-20.14%)        3.81 (-336.16%)
clock_gettime-79         0.87 (  0.00%)        1.00 (-14.47%)        4.61 (-429.51%)
clock_gettime-110        0.88 (  0.00%)        0.98 (-11.33%)        5.28 (-497.73%)
clock_gettime-128        0.89 (  0.00%)        1.00 (-11.96%)        5.64 (-530.39%)
times-2                  2.50 (  0.00%)        2.66 ( -6.65%)        3.45 (-38.19%)
times-5                  1.61 (  0.00%)        1.52 (  5.40%)        2.33 (-44.54%)
times-8                  1.47 (  0.00%)        1.67 (-13.83%)        2.43 (-65.67%)
times-12                 2.30 (  0.00%)        2.07 ( 10.02%)        2.24 (  2.35%)
times-21                 2.62 (  0.00%)        2.48 (  5.53%)        2.49 (  5.26%)
times-30                 2.87 (  0.00%)        2.71 (  5.75%)        3.13 ( -8.88%)
times-48                 2.75 (  0.00%)        2.70 (  2.03%)        3.78 (-37.46%)
times-79                 2.82 (  0.00%)        3.01 ( -6.92%)        4.53 (-60.55%)
times-110                2.88 (  0.00%)        2.76 (  4.17%)        5.22 (-81.57%)
times-128                2.82 (  0.00%)        2.90 ( -2.87%)        5.48 (-94.43%)

The atomic64_t results in particular are very disappointing; it looks
like the atomic64 operation variants on this machine (atomic64_*_cx8())
are slow.
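
A hedged sketch (mine) of what the rejected atomic64_t alternative looks
like in C11 terms; on 32-bit x86 a 64-bit atomic add cannot be a plain
add and is typically lowered to a cmpxchg8b-style loop (the
atomic64_*_cx8() helpers above), which the numbers suggest is the
bottleneck:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t sum_atomic;

/* Writer side: every update_curr() would pay the atomic RMW cost. */
static void add_runtime(uint64_t delta)
{
	atomic_fetch_add_explicit(&sum_atomic, delta, memory_order_relaxed);
}

/* Reader side: a single atomic load, never torn. */
static uint64_t read_runtime(void)
{
	return atomic_load_explicit(&sum_atomic, memory_order_relaxed);
}
```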

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 include/linux/sched.h          | 39 +++++++++++++++++++++++++++++++++++++--
 kernel/sched/core.c            | 17 +++++------------
 kernel/sched/cputime.c         | 22 +---------------------
 kernel/sched/deadline.c        |  2 +-
 kernel/sched/fair.c            |  2 +-
 kernel/sched/rt.c              |  2 +-
 kernel/time/posix-cpu-timers.c |  3 ++-
 7 files changed, 48 insertions(+), 39 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eb64fcd..65714bc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,10 @@ struct sched_entity {
 
 	u64			nr_migrations;
 
+#ifndef CONFIG_64BIT
+	seqcount_t		sum_exec_runtime_seqcount;
+#endif
+
 #ifdef CONFIG_SCHEDSTATS
 	struct sched_statistics statistics;
 #endif
@@ -2505,8 +2509,39 @@ static inline void enable_sched_clock_irqtime(void) {}
 static inline void disable_sched_clock_irqtime(void) {}
 #endif
 
-extern unsigned long long
-task_sched_runtime(struct task_struct *task);
+extern void update_sched_runtime(struct task_struct *task);
+
+#ifdef CONFIG_64BIT
+static inline void update_sum_exec_runtime(struct sched_entity *se, u64 delta)
+{
+	se->sum_exec_runtime += delta;
+}
+
+static inline u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	return t->se.sum_exec_runtime;
+}
+#else
+static inline void update_sum_exec_runtime(struct sched_entity *se, u64 delta)
+{
+	write_seqcount_begin(&se->sum_exec_runtime_seqcount);
+	se->sum_exec_runtime += delta;
+	write_seqcount_end(&se->sum_exec_runtime_seqcount);
+}
+
+static inline u64 read_sum_exec_runtime(struct task_struct *t)
+{
+	u64 ns;
+	int seq;
+
+	do {
+		seq = read_seqcount_begin(&t->se.sum_exec_runtime_seqcount);
+		ns = t->se.sum_exec_runtime;
+	} while (read_seqcount_retry(&t->se.sum_exec_runtime_seqcount, seq));
+
+	return ns;
+}
+#endif
 
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 556cb07..9f87e0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2991,20 +2991,17 @@ static inline void prefetch_curr_exec_start(struct task_struct *p)
 }
 
 /*
- * Return accounted runtime for the task.
- * In case the task is currently running, return the runtime plus current's
+ * In case the task is currently running, update scheduler stats to include
  * pending runtime that have not been accounted yet.
  */
-unsigned long long task_sched_runtime(struct task_struct *p)
+void update_sched_runtime(struct task_struct *p)
 {
 	struct rq_flags rf;
 	struct rq *rq;
-	u64 ns;
 
-#if defined(CONFIG_64BIT) && defined(CONFIG_SMP)
 	/*
-	 * 64-bit doesn't need locks to atomically read a 64bit value.
-	 * So we have a optimization chance when the task's delta_exec is 0.
+	 * Optimization to avoid taking the task lock.
+	 *
 	 * Reading ->on_cpu is racy, but this is ok.
 	 *
 	 * If we race with it leaving cpu, we'll take a lock. So we're correct.
@@ -3014,8 +3011,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 	 * been accounted, so we're correct here as well.
 	 */
 	if (!p->on_cpu || !task_on_rq_queued(p))
-		return p->se.sum_exec_runtime;
-#endif
+		return;
 
 	rq = task_rq_lock(p, &rf);
 	/*
@@ -3028,10 +3024,7 @@ unsigned long long task_sched_runtime(struct task_struct *p)
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
 	}
-	ns = p->se.sum_exec_runtime;
 	task_rq_unlock(rq, p, &rf);
-
-	return ns;
 }
 
 /*
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b93c72d..026c1c4 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -306,26 +306,6 @@ static inline cputime_t account_other_time(cputime_t max)
 	return accounted;
 }
 
-#ifdef CONFIG_64BIT
-static inline u64 read_sum_exec_runtime(struct task_struct *t)
-{
-	return t->se.sum_exec_runtime;
-}
-#else
-static u64 read_sum_exec_runtime(struct task_struct *t)
-{
-	u64 ns;
-	struct rq_flags rf;
-	struct rq *rq;
-
-	rq = task_rq_lock(t, &rf);
-	ns = t->se.sum_exec_runtime;
-	task_rq_unlock(rq, t, &rf);
-
-	return ns;
-}
-#endif
-
 /*
  * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
  * tasks (sum on group iteration) belonging to @tsk's group.
@@ -347,7 +327,7 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
 	 * other scheduler action.
 	 */
 	if (same_thread_group(current, tsk))
-		(void) task_sched_runtime(current);
+		update_sched_runtime(current);
 
 	rcu_read_lock();
 	/* Attempt a lockless read on the first round. */
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d091f4a..3e75329 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -742,7 +742,7 @@ static void update_curr_dl(struct rq *rq)
 	schedstat_set(curr->se.statistics.exec_max,
 		      max(curr->se.statistics.exec_max, delta_exec));
 
-	curr->se.sum_exec_runtime += delta_exec;
+	update_sum_exec_runtime(&curr->se, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d4854..2e88bb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -802,7 +802,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
 	schedstat_set(curr->statistics.exec_max,
 		      max(delta_exec, curr->statistics.exec_max));
 
-	curr->sum_exec_runtime += delta_exec;
+	update_sum_exec_runtime(curr, delta_exec);
 	schedstat_add(cfs_rq, exec_clock, delta_exec);
 
 	curr->vruntime += calc_delta_fair(delta_exec, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d5690b7..74990a5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -964,7 +964,7 @@ static void update_curr_rt(struct rq *rq)
 	schedstat_set(curr->se.statistics.exec_max,
 		      max(curr->se.statistics.exec_max, delta_exec));
 
-	curr->se.sum_exec_runtime += delta_exec;
+	update_sum_exec_runtime(&curr->se, delta_exec);
 	account_group_exec_runtime(curr, delta_exec);
 
 	curr->se.exec_start = rq_clock_task(rq);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 39008d7..4d8466b 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -190,7 +190,8 @@ static int cpu_clock_sample(const clockid_t which_clock, struct task_struct *p,
 		*sample = virt_ticks(p);
 		break;
 	case CPUCLOCK_SCHED:
-		*sample = task_sched_runtime(p);
+		update_sched_runtime(p);
+		*sample = read_sum_exec_runtime(p);
 		break;
 	}
 	return 0;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] sched/cputime: Make nr_migrations u32 on 32 bit cpus
  2016-09-01  9:27 [PATCH 0/3] sched/cputime: sum_exec_runtime fixes for 32-bit cpus Stanislaw Gruszka
  2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
@ 2016-09-01  9:27 ` Stanislaw Gruszka
  2016-09-01  9:27 ` [PATCH 3/3] sched/cputime: Protect other sum_exec_runtime reads " Stanislaw Gruszka
  2 siblings, 0 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2016-09-01  9:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Giovanni Gherdovich, Linus Torvalds, Mel Gorman, Mike Galbraith,
	Paolo Bonzini, Peter Zijlstra, Rik van Riel, Thomas Gleixner,
	Wanpeng Li, Ingo Molnar, Stanislaw Gruszka

My previous patch:

("sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus")

added a new sum_exec_runtime_seqcount field to the sched_entity struct.

This changed the alignment of the other fields of the sched_entity
struct. To preserve the previous alignment, which keeps variables that
are commonly used together in the same cache line, make the
nr_migrations field u32 on 32 bit architectures.

This field is used only for debugging, and nr_migrations should not
grow to an enormous number anyway.
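
The alignment argument can be sketched as follows (simplified field
layout and stand-in seqcount_t; assumptions mine): shrinking
nr_migrations to u32 frees exactly the 4 bytes the 4-byte seqcount
needs, so the fields after it keep their original offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { unsigned int sequence; } seqcount_t;	/* stand-in */

/* Layout before the series: u64 nr_migrations, no seqcount. */
struct se_before {
	uint64_t prev_sum_exec_runtime;
	uint64_t nr_migrations;
	uint64_t next_field;
};

/* Layout after patches 1+2 on 32-bit: u32 + 4-byte seqcount. */
struct se_after {
	uint64_t prev_sum_exec_runtime;
	uint32_t nr_migrations;
	seqcount_t sum_exec_runtime_seqcount;
	uint64_t next_field;
};

_Static_assert(offsetof(struct se_before, next_field) ==
	       offsetof(struct se_after, next_field),
	       "later fields keep their alignment");
```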

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 include/linux/sched.h | 5 +++--
 kernel/sched/debug.c  | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 65714bc..b9d9333 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1332,9 +1332,10 @@ struct sched_entity {
 	u64			vruntime;
 	u64			prev_sum_exec_runtime;
 
+#ifdef CONFIG_64BIT
 	u64			nr_migrations;
-
-#ifndef CONFIG_64BIT
+#else
+	u32			nr_migrations;
 	seqcount_t		sum_exec_runtime_seqcount;
 #endif
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2a0a999..db85816 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -921,8 +921,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 
 		avg_per_cpu = p->se.sum_exec_runtime;
 		if (p->se.nr_migrations) {
-			avg_per_cpu = div64_u64(avg_per_cpu,
-						p->se.nr_migrations);
+			avg_per_cpu = div64_ul(avg_per_cpu,
+					       p->se.nr_migrations);
 		} else {
 			avg_per_cpu = -1LL;
 		}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] sched/cputime: Protect other sum_exec_runtime reads on 32 bit cpus
  2016-09-01  9:27 [PATCH 0/3] sched/cputime: sum_exec_runtime fixes for 32-bit cpus Stanislaw Gruszka
  2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
  2016-09-01  9:27 ` [PATCH 2/3] sched/cputime: Make nr_migrations u32 " Stanislaw Gruszka
@ 2016-09-01  9:27 ` Stanislaw Gruszka
  2 siblings, 0 replies; 11+ messages in thread
From: Stanislaw Gruszka @ 2016-09-01  9:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Giovanni Gherdovich, Linus Torvalds, Mel Gorman, Mike Galbraith,
	Paolo Bonzini, Peter Zijlstra, Rik van Riel, Thomas Gleixner,
	Wanpeng Li, Ingo Molnar, Stanislaw Gruszka

We protect the sum_exec_runtime variable against inconsistency when
reading it in thread_group_cputime(), but we also read it in other
places that are not called with task_rq_lock() taken.

This patch changes that, except in kernel/sched/debug.c, where we also
read some other u64 variables (vruntime, exec_start) without protection.
But since this is for debugging purposes and an inconsistency is very
improbable, we can ignore it.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
---
 fs/proc/base.c                 | 2 +-
 kernel/delayacct.c             | 2 +-
 kernel/exit.c                  | 2 +-
 kernel/sched/cputime.c         | 2 +-
 kernel/time/posix-cpu-timers.c | 4 ++--
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e9ff186..8118d97 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -505,7 +505,7 @@ static int proc_pid_schedstat(struct seq_file *m, struct pid_namespace *ns,
 		seq_printf(m, "0 0 0\n");
 	else
 		seq_printf(m, "%llu %llu %lu\n",
-		   (unsigned long long)task->se.sum_exec_runtime,
+		   (unsigned long long)read_sum_exec_runtime(task),
 		   (unsigned long long)task->sched_info.run_delay,
 		   task->sched_info.pcount);
 
diff --git a/kernel/delayacct.c b/kernel/delayacct.c
index 435c14a..023b2ca 100644
--- a/kernel/delayacct.c
+++ b/kernel/delayacct.c
@@ -104,7 +104,7 @@ int __delayacct_add_tsk(struct taskstats *d, struct task_struct *tsk)
 	 */
 	t1 = tsk->sched_info.pcount;
 	t2 = tsk->sched_info.run_delay;
-	t3 = tsk->se.sum_exec_runtime;
+	t3 = read_sum_exec_runtime(tsk);
 
 	d->cpu_count += t1;
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 2f974ae..a46f96f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -134,7 +134,7 @@ static void __exit_signal(struct task_struct *tsk)
 	sig->inblock += task_io_get_inblock(tsk);
 	sig->oublock += task_io_get_oublock(tsk);
 	task_io_accounting_add(&sig->ioac, &tsk->ioac);
-	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
+	sig->sum_sched_runtime += read_sum_exec_runtime(tsk);
 	sig->nr_threads--;
 	__unhash_process(tsk, group_dead);
 	write_sequnlock(&sig->stats_lock);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 026c1c4..cd1dd43 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -682,7 +682,7 @@ out:
 void task_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime_t *st)
 {
 	struct task_cputime cputime = {
-		.sum_exec_runtime = p->se.sum_exec_runtime,
+		.sum_exec_runtime = read_sum_exec_runtime(p),
 	};
 
 	task_cputime(p, &cputime.utime, &cputime.stime);
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 4d8466b..b1e7275 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -849,7 +849,7 @@ static void check_thread_timers(struct task_struct *tsk,
 	tsk_expires->virt_exp = expires_to_cputime(expires);
 
 	tsk_expires->sched_exp = check_timers_list(++timers, firing,
-						   tsk->se.sum_exec_runtime);
+						   read_sum_exec_runtime(tsk));
 
 	/*
 	 * Check for the special case thread timers.
@@ -1116,7 +1116,7 @@ static inline int fastpath_timer_check(struct task_struct *tsk)
 		struct task_cputime task_sample;
 
 		task_cputime(tsk, &task_sample.utime, &task_sample.stime);
-		task_sample.sum_exec_runtime = tsk->se.sum_exec_runtime;
+		task_sample.sum_exec_runtime = read_sum_exec_runtime(tsk);
 		if (task_cputime_expired(&task_sample, &tsk->cputime_expires))
 			return 1;
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
@ 2016-09-01  9:49   ` Peter Zijlstra
  2016-09-01 10:07     ` Stanislaw Gruszka
  2016-09-01  9:55   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-09-01  9:49 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Giovanni Gherdovich, Linus Torvalds, Mel Gorman,
	Mike Galbraith, Paolo Bonzini, Rik van Riel, Thomas Gleixner,
	Wanpeng Li, Ingo Molnar

On Thu, Sep 01, 2016 at 11:27:42AM +0200, Stanislaw Gruszka wrote:
> My previous commit:
> 
>   a1eb1411b4e4 ("sched/cputime: Improve scalability by not accounting thread group tasks pending runtime")
> 
> helped to achieve good performance of SYS_times() and
> SYS_clock_gettime(CLOCK_PROCESS_CPUTIME_ID) on 64 bit architectures.
> However, taking task_rq_lock() when reading t->se.sum_exec_runtime on
> 32 bit architectures still makes those syscalls slow.
> 
> The reason we take the lock is to keep the 64-bit sum_exec_runtime
> variable consistent. While an inconsistency scenario is very unlikely,
> I assume it may still happen, at least on some 32 bit architectures.
> 
> To protect the variable I introduced a new seqcount lock. Performance
> improvements on a machine with 32 cores (32-bit cpus), measured by the
> benchmarks described in commit:

No,.. running 32bit kernels on a machine with 32 cores is insane, full
stop.

You're now making rather hot paths slower to benefit a rather slow path,
that too is backwards.

[ also, seqcount is not a lock ].

Really, people should not expect process wide numbers to be fast.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
  2016-09-01  9:49   ` Peter Zijlstra
@ 2016-09-01  9:55   ` kbuild test robot
  2016-09-01 10:05   ` kbuild test robot
  2016-09-01 11:18   ` kbuild test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2016-09-01  9:55 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: kbuild-all, linux-kernel, Giovanni Gherdovich, Linus Torvalds,
	Mel Gorman, Mike Galbraith, Paolo Bonzini, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Wanpeng Li, Ingo Molnar,
	Stanislaw Gruszka

[-- Attachment #1: Type: text/plain, Size: 2493 bytes --]

Hi Stanislaw,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20160825]
[cannot apply to v4.8-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Stanislaw-Gruszka/sched-cputime-sum_exec_runtime-fixes-for-32-bit-cpus/20160901-173835
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   kernel/sched/core.c: In function 'update_sched_runtime':
>> kernel/sched/core.c:3013:8: error: 'struct task_struct' has no member named 'on_cpu'
     if (!p->on_cpu || !task_on_rq_queued(p))
           ^

vim +3013 kernel/sched/core.c

911b2898 Peter Zijlstra    2013-11-11  3007  	 * If we race with it leaving cpu, we'll take a lock. So we're correct.
911b2898 Peter Zijlstra    2013-11-11  3008  	 * If we race with it entering cpu, unaccounted time is 0. This is
911b2898 Peter Zijlstra    2013-11-11  3009  	 * indistinguishable from the read occurring a few cycles earlier.
4036ac15 Mike Galbraith    2014-06-24  3010  	 * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4036ac15 Mike Galbraith    2014-06-24  3011  	 * been accounted, so we're correct here as well.
911b2898 Peter Zijlstra    2013-11-11  3012  	 */
da0c1e65 Kirill Tkhai      2014-08-20 @3013  	if (!p->on_cpu || !task_on_rq_queued(p))
d6a8f295 Stanislaw Gruszka 2016-09-01  3014  		return;
911b2898 Peter Zijlstra    2013-11-11  3015  
eb580751 Peter Zijlstra    2015-07-31  3016  	rq = task_rq_lock(p, &rf);

:::::: The code at line 3013 was first introduced by commit
:::::: da0c1e65b51a289540159663aa4b90ba2366bc21 sched: Add wrapper for checking task_struct::on_rq

:::::: TO: Kirill Tkhai <ktkhai@parallels.com>
:::::: CC: Ingo Molnar <mingo@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 46034 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
  2016-09-01  9:49   ` Peter Zijlstra
  2016-09-01  9:55   ` kbuild test robot
@ 2016-09-01 10:05   ` kbuild test robot
  2016-09-01 11:18   ` kbuild test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2016-09-01 10:05 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: kbuild-all, linux-kernel, Giovanni Gherdovich, Linus Torvalds,
	Mel Gorman, Mike Galbraith, Paolo Bonzini, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Wanpeng Li, Ingo Molnar,
	Stanislaw Gruszka

[-- Attachment #1: Type: text/plain, Size: 2378 bytes --]

Hi Stanislaw,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20160825]
[cannot apply to v4.8-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Stanislaw-Gruszka/sched-cputime-sum_exec_runtime-fixes-for-32-bit-cpus/20160901-173835
config: x86_64-randconfig-x015-201635 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   kernel/sched/core.c: In function 'update_sched_runtime':
>> kernel/sched/core.c:3013:8: error: 'struct task_struct' has no member named 'on_cpu'; did you mean 'on_rq'?
     if (!p->on_cpu || !task_on_rq_queued(p))
           ^~

vim +3013 kernel/sched/core.c

911b2898 Peter Zijlstra    2013-11-11  3007  	 * If we race with it leaving cpu, we'll take a lock. So we're correct.
911b2898 Peter Zijlstra    2013-11-11  3008  	 * If we race with it entering cpu, unaccounted time is 0. This is
911b2898 Peter Zijlstra    2013-11-11  3009  	 * indistinguishable from the read occurring a few cycles earlier.
4036ac15 Mike Galbraith    2014-06-24  3010  	 * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4036ac15 Mike Galbraith    2014-06-24  3011  	 * been accounted, so we're correct here as well.
911b2898 Peter Zijlstra    2013-11-11  3012  	 */
da0c1e65 Kirill Tkhai      2014-08-20 @3013  	if (!p->on_cpu || !task_on_rq_queued(p))
d6a8f295 Stanislaw Gruszka 2016-09-01  3014  		return;
911b2898 Peter Zijlstra    2013-11-11  3015  
eb580751 Peter Zijlstra    2015-07-31  3016  	rq = task_rq_lock(p, &rf);

:::::: The code at line 3013 was first introduced by commit
:::::: da0c1e65b51a289540159663aa4b90ba2366bc21 sched: Add wrapper for checking task_struct::on_rq

:::::: TO: Kirill Tkhai <ktkhai@parallels.com>
:::::: CC: Ingo Molnar <mingo@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 21684 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01  9:49   ` Peter Zijlstra
@ 2016-09-01 10:07     ` Stanislaw Gruszka
  2016-09-01 10:29       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Stanislaw Gruszka @ 2016-09-01 10:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Giovanni Gherdovich, Linus Torvalds, Mel Gorman,
	Mike Galbraith, Paolo Bonzini, Rik van Riel, Thomas Gleixner,
	Wanpeng Li, Ingo Molnar

On Thu, Sep 01, 2016 at 11:49:06AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 11:27:42AM +0200, Stanislaw Gruszka wrote:
> > My previous commit:
> > 
> >   a1eb1411b4e4 ("sched/cputime: Improve scalability by not accounting thread group tasks pending runtime")
> > 
> > helped to achieve good performance of SYS_times() and
> > SYS_clock_gettime(CLOCK_PROCESS_CPUTIME_ID) on 64 bit architectures.
> > However, taking task_rq_lock() when reading t->se.sum_exec_runtime on
> > 32 bit architectures still makes those syscalls slow.
> > 
> > The reason we take the lock is to keep the 64-bit sum_exec_runtime
> > variable consistent. While an inconsistency scenario is very unlikely,
> > I assume it may still happen, at least on some 32 bit architectures.
> > 
> > To protect the variable I introduced a new seqcount lock. Performance
> > improvements on a machine with 32 cores (32-bit cpus), measured by the
> > benchmarks described in commit:
> 
> No,.. running 32bit kernels on a machine with 32 cores is insane, full
> stop.

I agree with that. But I also ran the benchmark on a 4-core armv7l and
saw good improvements there too.

> You're now making rather hot paths slower to benefit a rather slow path,
> that too is backwards.

Ok, you are right, I made update_curr() slower (a bit, I think, since
the new seqcount primitive should be in the same cache line as other
things).

But don't we care about inconsistency when accessing a 64 bit variable
on 32 bit processors (see patch 3)? I know an inconsistency is an
unlikely scenario, but I assume it's still possible, or not?

If not, I can get rid of read_sum_exec_runtime() and just read
sum_exec_runtime without task_rq_lock() protection in
thread_group_cputime(). That would make the benchmark happy.

Stanislaw

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01 10:07     ` Stanislaw Gruszka
@ 2016-09-01 10:29       ` Peter Zijlstra
  2016-09-04 18:46         ` Giovanni Gherdovich
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2016-09-01 10:29 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: linux-kernel, Giovanni Gherdovich, Linus Torvalds, Mel Gorman,
	Mike Galbraith, Paolo Bonzini, Rik van Riel, Thomas Gleixner,
	Wanpeng Li, Ingo Molnar

On Thu, Sep 01, 2016 at 12:07:34PM +0200, Stanislaw Gruszka wrote:
> On Thu, Sep 01, 2016 at 11:49:06AM +0200, Peter Zijlstra wrote:
> > You're now making rather hot paths slower to benefit a rather slow path,
> > that too is backwards.
> 
> Ok, you are right, I made update_curr() slower (a bit, I think, since
> the new seqcount primitive should be in the same cache line as other
> things).

seqcount adds 2 smp_wmb()s, which are not free on ARM (it is possible
to do it with just 1, FWIW).

> But don't we care about inconsistent accesses of a 64-bit variable
> on 32-bit processors (see patch 3)? I know an inconsistency is an
> unlikely scenario, but I assume it's still possible, or not?

It's actually quite possible. We've observed it a fair few times. 64-bit
variables are 2 32-bit stores/loads, and getting interleaved data is quite
possible.

> If not, I can get rid of read_sum_exec_runtime() and just read
> sum_exec_runtime without task_rq_lock() protection in
> thread_group_cputime(). That would make the benchmark happy.

I think this benchmark is misguided. Just accept that O(nr_threads) is
expensive, same with process-wide itimers, and just don't use them when
you care about performance.


* Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
                     ` (2 preceding siblings ...)
  2016-09-01 10:05   ` kbuild test robot
@ 2016-09-01 11:18   ` kbuild test robot
  3 siblings, 0 replies; 11+ messages in thread
From: kbuild test robot @ 2016-09-01 11:18 UTC (permalink / raw)
  To: Stanislaw Gruszka
  Cc: kbuild-all, linux-kernel, Giovanni Gherdovich, Linus Torvalds,
	Mel Gorman, Mike Galbraith, Paolo Bonzini, Peter Zijlstra,
	Rik van Riel, Thomas Gleixner, Wanpeng Li, Ingo Molnar,
	Stanislaw Gruszka

[-- Attachment #1: Type: text/plain, Size: 5641 bytes --]

Hi Stanislaw,

[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20160825]
[cannot apply to v4.8-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Stanislaw-Gruszka/sched-cputime-sum_exec_runtime-fixes-for-32-bit-cpus/20160901-173835
config: i386-randconfig-x014-201635 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/uapi/linux/capability.h:16,
                    from include/linux/capability.h:15,
                    from include/linux/sched.h:15,
                    from include/linux/kasan.h:4,
                    from kernel/sched/core.c:29:
   kernel/sched/core.c: In function 'update_sched_runtime':
   kernel/sched/core.c:3013:8: error: 'struct task_struct' has no member named 'on_cpu'; did you mean 'on_rq'?
     if (!p->on_cpu || !task_on_rq_queued(p))
           ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> kernel/sched/core.c:3013:2: note: in expansion of macro 'if'
     if (!p->on_cpu || !task_on_rq_queued(p))
     ^~
   kernel/sched/core.c:3013:8: error: 'struct task_struct' has no member named 'on_cpu'; did you mean 'on_rq'?
     if (!p->on_cpu || !task_on_rq_queued(p))
           ^
   include/linux/compiler.h:149:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^~~~
>> kernel/sched/core.c:3013:2: note: in expansion of macro 'if'
     if (!p->on_cpu || !task_on_rq_queued(p))
     ^~
   kernel/sched/core.c:3013:8: error: 'struct task_struct' has no member named 'on_cpu'; did you mean 'on_rq'?
     if (!p->on_cpu || !task_on_rq_queued(p))
           ^
   include/linux/compiler.h:160:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^~~~
>> kernel/sched/core.c:3013:2: note: in expansion of macro 'if'
     if (!p->on_cpu || !task_on_rq_queued(p))
     ^~

vim +/if +3013 kernel/sched/core.c

d6a8f295 kernel/sched/core.c Stanislaw Gruszka 2016-09-01  2997  void update_sched_runtime(struct task_struct *p)
c5f8d995 kernel/sched.c      Hidetoshi Seto    2009-03-31  2998  {
eb580751 kernel/sched/core.c Peter Zijlstra    2015-07-31  2999  	struct rq_flags rf;
c5f8d995 kernel/sched.c      Hidetoshi Seto    2009-03-31  3000  	struct rq *rq;
c5f8d995 kernel/sched.c      Hidetoshi Seto    2009-03-31  3001  
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3002  	/*
d6a8f295 kernel/sched/core.c Stanislaw Gruszka 2016-09-01  3003  	 * Optimization to to avoid taking task lock.
d6a8f295 kernel/sched/core.c Stanislaw Gruszka 2016-09-01  3004  	 *
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3005  	 * Reading ->on_cpu is racy, but this is ok.
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3006  	 *
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3007  	 * If we race with it leaving cpu, we'll take a lock. So we're correct.
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3008  	 * If we race with it entering cpu, unaccounted time is 0. This is
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3009  	 * indistinguishable from the read occurring a few cycles earlier.
4036ac15 kernel/sched/core.c Mike Galbraith    2014-06-24  3010  	 * If we see ->on_cpu without ->on_rq, the task is leaving, and has
4036ac15 kernel/sched/core.c Mike Galbraith    2014-06-24  3011  	 * been accounted, so we're correct here as well.
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3012  	 */
da0c1e65 kernel/sched/core.c Kirill Tkhai      2014-08-20 @3013  	if (!p->on_cpu || !task_on_rq_queued(p))
d6a8f295 kernel/sched/core.c Stanislaw Gruszka 2016-09-01  3014  		return;
911b2898 kernel/sched/core.c Peter Zijlstra    2013-11-11  3015  
eb580751 kernel/sched/core.c Peter Zijlstra    2015-07-31  3016  	rq = task_rq_lock(p, &rf);
6e998916 kernel/sched/core.c Stanislaw Gruszka 2014-11-12  3017  	/*
6e998916 kernel/sched/core.c Stanislaw Gruszka 2014-11-12  3018  	 * Must be ->curr _and_ ->on_rq.  If dequeued, we would
6e998916 kernel/sched/core.c Stanislaw Gruszka 2014-11-12  3019  	 * project cycles that may never be accounted to this
6e998916 kernel/sched/core.c Stanislaw Gruszka 2014-11-12  3020  	 * thread, breaking clock_gettime().
6e998916 kernel/sched/core.c Stanislaw Gruszka 2014-11-12  3021  	 */

:::::: The code at line 3013 was first introduced by commit
:::::: da0c1e65b51a289540159663aa4b90ba2366bc21 sched: Add wrapper for checking task_struct::on_rq

:::::: TO: Kirill Tkhai <ktkhai@parallels.com>
:::::: CC: Ingo Molnar <mingo@kernel.org>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 22473 bytes --]


* Re: [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus
  2016-09-01 10:29       ` Peter Zijlstra
@ 2016-09-04 18:46         ` Giovanni Gherdovich
  0 siblings, 0 replies; 11+ messages in thread
From: Giovanni Gherdovich @ 2016-09-04 18:46 UTC (permalink / raw)
  To: Peter Zijlstra, Stanislaw Gruszka
  Cc: linux-kernel, Linus Torvalds, Mel Gorman, Mike Galbraith,
	Paolo Bonzini, Rik van Riel, Thomas Gleixner, Wanpeng Li,
	Ingo Molnar

On Thu, 2016-09-01 at 12:29 +0200, Peter Zijlstra wrote:
> On Thu, Sep 01, 2016 at 12:07:34PM +0200, Stanislaw Gruszka wrote:
> > 
> > On Thu, Sep 01, 2016 at 11:49:06AM +0200, Peter Zijlstra wrote:
> > > 
> > > You're now making rather hot paths slower to benefit a rather
> > > slow path, that too is backwards.
> > 
> > Ok, you are right, I made update_curr() slower (only a bit, I
> > think, since this new seqcount primitive should be in the same
> > cache line as other things).
> 
> seqcount adds 2 smp_wmb()s, which are not free on ARM (it is
> possible to do it with just 1, FWIW).
> 
> > 
> > But don't we care about inconsistent accesses of a 64-bit
> > variable on 32-bit processors (see patch 3)? I know an
> > inconsistency is an unlikely scenario, but I assume it's still
> > possible, or not?
> 
> It's actually quite possible. We've observed it a fair few
> times. 64-bit variables are 2 32-bit stores/loads, and getting
> interleaved data is quite possible.
> 

I think leaving the 32-bit benchmark numbers where they are, in the
interest of not perturbing the update_curr() path, is the right call
here. task_rq_lock() may hurt the thread_group_cputime() path, but the
alternative seqcount strategy could impact other scheduler-related
workloads.

> > 
> > If not, I can get rid of read_sum_exec_runtime() and just read
> > sum_exec_runtime without task_rq_lock() protection in
> > thread_group_cputime(). That would make the benchmark happy.
> 
> I think this benchmark is misguided. Just accept that O(nr_threads)
> is expensive, same with process-wide itimers, and just don't use
> them when you care about performance.

As you say, the results of the "poundtime" benchmark have to be taken
with a grain of salt, and I should probably put them in
perspective. In a sentence: a low number of threads represents
real-world scenarios more faithfully, obviously. We run it in a
framework (Mel Gorman's MMTests) which stresses the box from 2 to
4*num_cpus threads, as it does with many other workloads where the
number of threads is a parameter.

We're spraying all over the input space just to see if anything
interesting happens. If we see a regression in some obscure corner
case, that's not necessarily a bug -- sometimes it's just not
interesting, or the trade-offs make it not worth fixing.

"poundtime" first appeared on LKML in 2009 as a test case for a
functional bug where a process' time wasn't monotonic; it was then
reused by Rik van Riel in 2014 as a performance workload, see
https://marc.info/?i=1408133138-22048-1-git-send-email-riel@redhat.com

The slightly edited version we use at SUSE in MMTests is in the
changelog of 6075620b0590 "sched/cputime: Mitigate performance
regression in times()/clock_gettime()".


Giovanni


end of thread, other threads:[~2016-09-04 18:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-01  9:27 [PATCH 0/3] sched/cputime: sum_exec_runtime fixes for 32-bit cpus Stanislaw Gruszka
2016-09-01  9:27 ` [PATCH 1/3] sched/cputime: Improve scalability of times()/clock_gettime() on 32 bit cpus Stanislaw Gruszka
2016-09-01  9:49   ` Peter Zijlstra
2016-09-01 10:07     ` Stanislaw Gruszka
2016-09-01 10:29       ` Peter Zijlstra
2016-09-04 18:46         ` Giovanni Gherdovich
2016-09-01  9:55   ` kbuild test robot
2016-09-01 10:05   ` kbuild test robot
2016-09-01 11:18   ` kbuild test robot
2016-09-01  9:27 ` [PATCH 2/3] sched/cputime: Make nr_migrations u32 " Stanislaw Gruszka
2016-09-01  9:27 ` [PATCH 3/3] sched/cputime: Protect other sum_exec_runtime reads " Stanislaw Gruszka
