* [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40%
@ 2016-02-02 17:19 riel
  2016-02-02 17:19 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: riel @ 2016-02-02 17:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

(v5: address comments by Frederic & Peter, fix bug found by Eric)

Running with nohz_full introduces a fair amount of overhead.
Specifically, various things that are usually done from the
timer interrupt are now done at syscall, irq, and guest
entry and exit times.

However, some of the code that is now called on every one of
those transitions has only ever worked at jiffy resolution. The
code in __acct_update_integrals was also doing some unnecessary
calculations.

Getting rid of the unnecessary calculations, without changing
any of the functionality in __acct_update_integrals, gets us
about an 11% win.

Not calling the time statistics updating code more than
once per jiffy, as is done on housekeeping CPUs and on
all the CPUs of a non-nohz_full system, shaves off a
further 30%.
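
The once-per-jiffy gating is implemented in patch 4/4 with a small
per-task helper; a simplified sketch of the idea (essentially the
helper from the patch below, with an added comment):

static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
{
	if (tsk->vtime_jiffies == now)
		return false;	/* still in the same jiffy: skip the update */

	tsk->vtime_jiffies = now;
	return true;
}

Callers pass the current jiffies value and only fall through to the
expensive fine-grained accounting when this returns true.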

I tested this series with a microbenchmark calling
an invalid syscall number ten million times in a row
on a nohz_full CPU.
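
For reference, a minimal sketch of such a microbenchmark; the actual
test program was not posted, so the specific invalid syscall number
used here (600) and the program structure are assumptions:

#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	int i;

	/*
	 * 600 is not a valid syscall number on x86-64, so every call
	 * returns -1 with errno set to ENOSYS. The loop count matches
	 * the ten million calls described above.
	 */
	for (i = 0; i < 10000000; i++)
		syscall(600);

	return 0;
}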

Run times for the microbenchmark:

4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds

Same test on a non-NOHZ_FULL, non-housekeeping CPU:

all kernels			1.86 seconds

* [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-02 17:19 [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% riel
@ 2016-02-02 17:19 ` riel
  2016-02-02 17:19 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: riel @ 2016-02-02 17:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

When running a microbenchmark calling an invalid syscall number
in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
time in __acct_update_integrals.

This function converts cputime_t to jiffies, then to a timeval, only to
convert the timeval straight back to microseconds before discarding it.

This patch leaves __acct_update_integrals functionally equivalent,
but speeds things up by about 12%, with 10 million calls to an
invalid syscall number dropping from 3.7 to 3.25 seconds.
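
As a sanity check on the units (a sketch using the constants from the
diff below: acct_rss_mem1 now accumulates pages * nsec / 1024, and
xacct_add_tsk later divides by 1000 * KB):

\[
\frac{\Delta_{\mathrm{ns}} \cdot \mathrm{rss}_{\mathrm{pages}}}{1024}
\cdot \frac{\mathrm{PAGE\_SIZE}}{1000 \cdot \mathrm{KB}}
= \mathrm{rss}_{\mathrm{pages}} \cdot \mathrm{PAGE\_SIZE}
  \cdot \frac{\Delta_{\mathrm{ns}}/1000}{1024 \cdot \mathrm{KB}}
= \mathrm{rss}_{\mathrm{bytes}} \cdot \frac{\Delta_{\mu\mathrm{s}}}{\mathrm{MB}}
\]

so the value reported to userspace is still in Mbyte-usecs, as before.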

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..460ee2bbfef3 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -93,9 +93,11 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {
 	struct mm_struct *mm;
 
-	/* convert pages-usec to Mbyte-usec */
-	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
-	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
+	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
+	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE;
+	do_div(stats->coremem, 1000 * KB);
+	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE;
+	do_div(stats->virtmem, 1000 * KB);
 	mm = get_task_mm(p);
 	if (mm) {
 		/* adjust to KB unit */
@@ -125,22 +127,26 @@ static void __acct_update_integrals(struct task_struct *tsk,
 {
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
-		struct timeval value;
 		unsigned long flags;
 		u64 delta;
 
 		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
-		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
-		delta = value.tv_sec;
-		delta = delta * USEC_PER_SEC + value.tv_usec;
+		/* Avoid division: cputime_t is often in nanoseconds already. */
+		delta = cputime_to_nsecs(dtime);
 
-		if (delta == 0)
+		if (delta < TICK_NSEC)
 			goto out;
+
 		tsk->acct_timexpd = time;
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
+		/*
+		 * Divide by 1024 to avoid overflow, and to avoid division.
+		 * The final unit reported to userspace is Mbyte-usecs,
+		 * the rest of the math is done in xacct_add_tsk.
+		 */
+		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
 	out:
 		local_irq_restore(flags);
 	}
-- 
2.5.0

* [PATCH 2/4] acct,time: change indentation in __acct_update_integrals
  2016-02-02 17:19 [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-02 17:19 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
@ 2016-02-02 17:19 ` riel
  2016-02-02 17:19 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 17+ messages in thread
From: riel @ 2016-02-02 17:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

Change the indentation in __acct_update_integrals to make the function
a little easier to read.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
---
 kernel/tsacct.c | 51 ++++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 460ee2bbfef3..d12e815b7bcd 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -125,31 +125,32 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
-	if (likely(tsk->mm)) {
-		cputime_t time, dtime;
-		unsigned long flags;
-		u64 delta;
-
-		local_irq_save(flags);
-		time = stime + utime;
-		dtime = time - tsk->acct_timexpd;
-		/* Avoid division: cputime_t is often in nanoseconds already. */
-		delta = cputime_to_nsecs(dtime);
-
-		if (delta < TICK_NSEC)
-			goto out;
-
-		tsk->acct_timexpd = time;
-		/*
-		 * Divide by 1024 to avoid overflow, and to avoid division.
-		 * The final unit reported to userspace is Mbyte-usecs,
-		 * the rest of the math is done in xacct_add_tsk.
-		 */
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-	out:
-		local_irq_restore(flags);
-	}
+	cputime_t time, dtime;
+	unsigned long flags;
+	u64 delta;
+
+	if (!likely(tsk->mm))
+		return;
+
+	local_irq_save(flags);
+	time = stime + utime;
+	dtime = time - tsk->acct_timexpd;
+	/* Avoid division: cputime_t is often in nanoseconds already. */
+	delta = cputime_to_nsecs(dtime);
+
+	if (delta < TICK_NSEC)
+		goto out;
+
+	tsk->acct_timexpd = time;
+	/*
+	 * Divide by 1024 to avoid overflow, and to avoid division.
+	 * The final unit reported to userspace is Mbyte-usecs,
+	 * the rest of the math is done in xacct_add_tsk.
+	 */
+	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
+out:
+	local_irq_restore(flags);
 }
 
 /**
-- 
2.5.0

* [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals
  2016-02-02 17:19 [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-02 17:19 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
  2016-02-02 17:19 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
@ 2016-02-02 17:19 ` riel
  2016-02-02 17:19 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
  2016-02-08 21:07 ` [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% Rik van Riel
  4 siblings, 0 replies; 17+ messages in thread
From: riel @ 2016-02-02 17:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

It looks like all the call paths that lead to __acct_update_integrals
already have irqs disabled, and __acct_update_integrals does not need
to disable irqs itself.

This is very convenient since about half the CPU time left in this
function was spent in local_irq_save alone.

Performance of a microbenchmark that calls an invalid syscall
ten million times in a row on a nohz_full CPU improves by 21%
over 4.5-rc1 when both the removal of divisions from
__acct_update_integrals and this patch are applied, with the
runtime dropping from 3.7 to 2.9 seconds.

With these patches applied, the highest remaining CPU user in
the trace is native_sched_clock, which is addressed in the next
patch.

For testing purposes I stuck a WARN_ON(!irqs_disabled()) test
in __acct_update_integrals. It did not trigger.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index d12e815b7bcd..f8e26ab963ed 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -126,20 +126,18 @@ static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
 	cputime_t time, dtime;
-	unsigned long flags;
 	u64 delta;
 
 	if (!likely(tsk->mm))
 		return;
 
-	local_irq_save(flags);
 	time = stime + utime;
 	dtime = time - tsk->acct_timexpd;
 	/* Avoid division: cputime_t is often in nanoseconds already. */
 	delta = cputime_to_nsecs(dtime);
 
 	if (delta < TICK_NSEC)
-		goto out;
+		return;
 
 	tsk->acct_timexpd = time;
 	/*
@@ -149,8 +147,6 @@ static void __acct_update_integrals(struct task_struct *tsk,
 	 */
 	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
 	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-out:
-	local_irq_restore(flags);
 }
 
 /**
@@ -160,9 +156,12 @@ static void __acct_update_integrals(struct task_struct *tsk,
 void acct_update_integrals(struct task_struct *tsk)
 {
 	cputime_t utime, stime;
+	unsigned long flags;
 
+	local_irq_save(flags);
 	task_cputime(tsk, &utime, &stime);
 	__acct_update_integrals(tsk, utime, stime);
+	local_irq_restore(flags);
 }
 
 /**
-- 
2.5.0

* [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-02 17:19 [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% riel
                   ` (2 preceding siblings ...)
  2016-02-02 17:19 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
@ 2016-02-02 17:19 ` riel
  2016-02-09 17:11   ` Frederic Weisbecker
  2016-02-08 21:07 ` [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% Rik van Riel
  4 siblings, 1 reply; 17+ messages in thread
From: riel @ 2016-02-02 17:19 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

After removing __acct_update_integrals from the profile,
native_sched_clock remains as the top CPU user. This can be
reduced by only calling account_{user,sys,guest,idle}_time
once per jiffy for long running tasks on nohz_full CPUs.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and kvm guest entry & exit.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds

Same test on a non-NOHZ_FULL CPU (not the housekeeping CPU):

all kernels			1.86 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h  |  1 +
 kernel/sched/cputime.c | 35 +++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..019c3af98503 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1532,6 +1532,7 @@ struct task_struct {
 	struct prev_cputime prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t vtime_seqcount;
+	unsigned long vtime_jiffies;
 	unsigned long long vtime_snap;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ffb1adc..6ab25719d05d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,6 +668,15 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
+{
+	if (tsk->vtime_jiffies == now)
+		return false;
+
+	tsk->vtime_jiffies = now;
+	return true;
+}
+
 static unsigned long long vtime_delta(struct task_struct *tsk)
 {
 	unsigned long long clock;
@@ -699,6 +708,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +719,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +731,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_jiffies_changed(tsk, jiffies)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +758,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -759,8 +776,12 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta(tsk);
+	cputime_t delta_cpu;
+
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
 
+	delta_cpu = get_vtime_delta(tsk);
 	account_idle_time(delta_cpu);
 }
 
@@ -773,6 +794,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
 	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_jiffies = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -784,6 +806,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
 	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_jiffies = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
-- 
2.5.0

* Re: [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40%
  2016-02-02 17:19 [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% riel
                   ` (3 preceding siblings ...)
  2016-02-02 17:19 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
@ 2016-02-08 21:07 ` Rik van Riel
  2016-02-09  2:13   ` Frederic Weisbecker
  4 siblings, 1 reply; 17+ messages in thread
From: Rik van Riel @ 2016-02-08 21:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

On 02/02/2016 12:19 PM, riel@redhat.com wrote:
> (v5: address comments by Frederic & Peter, fix bug found by Eric)
> 
> Running with nohz_full introduces a fair amount of overhead.
> Specifically, various things that are usually done from the
> timer interrupt are now done at syscall, irq, and guest
> entry and exit times.

It looks like nobody has objections to my v5
series.

Is there anything else I need to do before these
patches can get merged?

Also, who is going to merge these patches into
their tree?

Frederic? Thomas? Ingo?


-- 
All rights reversed

* Re: [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40%
  2016-02-08 21:07 ` [PATCH 0/4 v5] sched,time: reduce nohz_full syscall overhead 40% Rik van Riel
@ 2016-02-09  2:13   ` Frederic Weisbecker
  0 siblings, 0 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2016-02-09  2:13 UTC (permalink / raw)
  To: Rik van Riel; +Cc: linux-kernel, tglx, mingo, luto, peterz, clark, eric.dumazet

On Mon, Feb 08, 2016 at 04:07:24PM -0500, Rik van Riel wrote:
> On 02/02/2016 12:19 PM, riel@redhat.com wrote:
> > (v5: address comments by Frederic & Peter, fix bug found by Eric)
> > 
> > Running with nohz_full introduces a fair amount of overhead.
> > Specifically, various things that are usually done from the
> > timer interrupt are now done at syscall, irq, and guest
> > entry and exit times.
> 
> It looks like nobody has objections to my v5
> series.
> 
> Is there anything else I need to do before these
> patches can get merged?

Sorry for the delay, I'm going to review and respond to it tomorrow.

> 
> Also, who is going to merge these patches into
> their tree?
> 
> Frederic? Thomas? Ingo?

I think this should ultimately go through Peterz/Ingo, as it's scheduler code.

Thanks.

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-02 17:19 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
@ 2016-02-09 17:11   ` Frederic Weisbecker
  2016-02-09 18:13     ` Rik van Riel
  2016-02-09 21:20     ` Rik van Riel
  0 siblings, 2 replies; 17+ messages in thread
From: Frederic Weisbecker @ 2016-02-09 17:11 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, tglx, mingo, luto, peterz, clark, eric.dumazet

On Tue, Feb 02, 2016 at 12:19:46PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> After removing __acct_update_integrals from the profile,
> native_sched_clock remains as the top CPU user. This can be
> reduced by only calling account_{user,sys,guest,idle}_time
> once per jiffy for long running tasks on nohz_full CPUs.
> 
> This will reduce timing accuracy on nohz_full CPUs to jiffy
> based sampling, just like on normal CPUs.

I wonder if that assumption is actually right.

With tick-based sampling, we indeed have statistical accounting
whose precision is based on HZ. Now the time accounted when the tick
fires is always a single unit: 1 jiffy. So we have a well-distributed
accounting value, because it's constant and based on the probability
of a periodic event.

So for any T_slice, a given CPU timeslice (in seconds) executed between
two ring switches (user <-> kernel), we are going to account 1 * P(T_slice*HZ)
(P() stands for probability here).

Now after this patch, the scenario is rather different. We are accounting the
real time spent in a slice with a similar probability.
This becomes: T_slice * P(T_slice*HZ).

So it seems it could result in logarithmic accounting: timeslices of one second
will be accounted correctly, whereas repeating tiny timeslices may result in much
lower values than expected.
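
In other words, a back-of-the-envelope expectation for a slice of
length T seconds (T < 1/HZ), assuming a tick lands in the slice with
probability T*HZ:

\[
E[\mathrm{tick\ sampling}] = \frac{1}{HZ} \cdot T \cdot HZ = T
\qquad \mathrm{vs.} \qquad
E[\mathrm{after\ this\ patch}] = T \cdot T \cdot HZ = T^2 \cdot HZ \ll T
\]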

To fix this we should instead account jiffies_to_nsecs(jiffies - t->vtime_jiffies).
Well, that would drop the use of the fine-grained clock, and even the need for
nsec-based cputime. But why not, if we still get acceptable results for much better
performance.

I don't know if all the above actually makes sense. I suck at maths so I may well be
wrong.

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-09 17:11   ` Frederic Weisbecker
@ 2016-02-09 18:13     ` Rik van Riel
  2016-02-09 21:20     ` Rik van Riel
  1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2016-02-09 18:13 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, tglx, mingo, luto, peterz, clark, eric.dumazet

On Tue, 2016-02-09 at 18:11 +0100, Frederic Weisbecker wrote:
> On Tue, Feb 02, 2016 at 12:19:46PM -0500, riel@redhat.com wrote:
> > From: Rik van Riel <riel@redhat.com>
> > 
> > After removing __acct_update_integrals from the profile,
> > native_sched_clock remains as the top CPU user. This can be
> > reduced by only calling account_{user,sys,guest,idle}_time
> > once per jiffy for long running tasks on nohz_full CPUs.
> > 
> > This will reduce timing accuracy on nohz_full CPUs to jiffy
> > based sampling, just like on normal CPUs.
> 
> Now after this patch, the scenario is rather different. We are
> accounting the real time spent in a slice with a similar probability.
> This becomes: T_slice * P(T_slice*HZ).
> 
> So it seems it could result in logarithmic accounting: timeslices of
> one second will be accounted correctly, whereas repeating tiny
> timeslices may result in much lower values than expected.

You are right that this code does not handle
short timeslices well.

However, I believe it does not have to, because
the scheduler already takes care of that.

At context switch time, the scheduler will call
vtime_common_task_switch, which calls
arch_vtime_task_switch, which handles precise
time accounting at task switch time.

The call chain is like this:

arch_vtime_task_switch
vtime_common_task_switch
vtime_task_switch
finish_task_switch
context_switch
__schedule

As you can see, the time accounting for shorter
running tasks is already handled by other code,
which means my patch should be ok.

-- 
All rights reversed

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-09 17:11   ` Frederic Weisbecker
  2016-02-09 18:13     ` Rik van Riel
@ 2016-02-09 21:20     ` Rik van Riel
  1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2016-02-09 21:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, tglx, mingo, luto, peterz, clark, eric.dumazet

On Tue, 2016-02-09 at 18:11 +0100, Frederic Weisbecker wrote:
> 
> So for any T_slice, a given CPU timeslice (in seconds) executed
> between two ring switches (user <-> kernel), we are going to account
> 1 * P(T_slice*HZ) (P() stands for probability here).
> 
> Now after this patch, the scenario is rather different. We are
> accounting the real time spent in a slice with a similar probability.
> This becomes: T_slice * P(T_slice*HZ).
> 
> So it seems it could result in logarithmic accounting: timeslices of
> one second will be accounted correctly, whereas repeating tiny
> timeslices may result in much lower values than expected.
> 
> To fix this we should instead account
> jiffies_to_nsecs(jiffies - t->vtime_jiffies).
> Well, that would drop the use of the fine-grained clock, and even the
> need for nsec-based cputime. But why not, if we still get acceptable
> results for much better performance.

Looking over the code some more, you are right.

My changes to vtime_account_idle and
vtime_account_system will cause them to do
nothing a lot of the time, when called from
vtime_common_task_switch.

This causes a discrepancy between the time
accounted at task switch time, and the time
delta accounted later on.

I see two ways to fix this:
1) Always do fine-granularity accounting at
   task switch time, and revert to coarser
   granularity at syscall time only. This may
   or may not be more accurate (not sure how
   much, or whether we care).
2) Always account at coarser granularity,
   as you suggest above (see the sketch below).
   This has the advantage of leading to faster
   context switch times (no TSC reads).

I am happy to implement either.

Frederic, Thomas, Ingo, do you have a
preference between these approaches?

Would you like me to opt for the higher
performance option, or go for the potentially
higher accuracy one?
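
For what it's worth, a rough sketch of what option 2 could look like
in vtime_delta() (illustrative only, not a tested patch; it uses the
vtime_jiffies field from patch 4/4 and the existing jiffies_to_nsecs()
helper):

static unsigned long long vtime_delta(struct task_struct *tsk)
{
	/*
	 * Option 2 sketch: jiffy-granular delta, no sched_clock() read.
	 * The vtime_jiffies snapshot would have to be advanced after
	 * this delta is consumed, rather than up front the way
	 * vtime_jiffies_changed() does it in patch 4/4.
	 */
	return jiffies_to_nsecs(jiffies - tsk->vtime_jiffies);
}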


As an aside, I am wondering whether the call
to vtime_account_user() from
vtime_common_task_switch() ever does anything.

After all, the preceding call to vtime_account_system
would have already accounted all the CPU time that
passed to system time, and there will be no time left
to account to userspace.

Unless I am missing something, I suspect that line
can just go.

-- 
All rights reversed

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01 20:00   ` Eric Dumazet
@ 2016-02-01 20:08     ` Rik van Riel
  0 siblings, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2016-02-01 20:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, tglx, peterz, fweisbec, clark, luto, mingo

On 02/01/2016 03:00 PM, Eric Dumazet wrote:
> On Mon, 2016-02-01 at 14:21 -0500, riel@redhat.com wrote:
>> From: Rik van Riel <riel@redhat.com>
>>
> 
>>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
>> +static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
>> +{
>> +	if (tsk->vtime_jiffies == jiffies)
>> +		return false;
>> +
>> +	tsk->vtime_jiffies = jiffies;
>> +	return true;
>> +}
> 
> I am guessing you intended to use 'now' instead of jiffies?

Yes indeed. Good catch.

I have fixed this up in my local tree, and will wait for
more people's comments before sending out a v5 (tomorrow?).

-- 
All rights reversed

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01 19:21 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
@ 2016-02-01 20:00   ` Eric Dumazet
  2016-02-01 20:08     ` Rik van Riel
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2016-02-01 20:00 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, tglx, peterz, fweisbec, clark, luto, mingo

On Mon, 2016-02-01 at 14:21 -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 

>  #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
> +static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
> +{
> +	if (tsk->vtime_jiffies == jiffies)
> +		return false;
> +
> +	tsk->vtime_jiffies = jiffies;
> +	return true;
> +}

I am guessing you intended to use 'now' instead of jiffies?

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01  9:29   ` Peter Zijlstra
@ 2016-02-01 19:23     ` Rik van Riel
  0 siblings, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2016-02-01 19:23 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, fweisbec, tglx, mingo, luto, clark

On 02/01/2016 04:29 AM, Peter Zijlstra wrote:
> On Sun, Jan 31, 2016 at 09:12:31PM -0500, riel@redhat.com wrote:
>> Run times for the microbenchmark:
>>
>> 4.4				3.8 seconds
>> 4.5-rc1				3.7 seconds
>> 4.5-rc1 + first patch		3.3 seconds
>> 4.5-rc1 + first 3 patches	3.1 seconds
>> 4.5-rc1 + all patches		2.3 seconds
> 
> Would be good to compare that also to a !NOHZ_FULL bloated path.

On a non-nohz_full, non-housekeeping CPU, I see 1.86
seconds run time, or about 20% faster than on a nohz_full
CPU with all patches applied.

-- 
All rights reversed

* [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01 19:21 [PATCH 0/4 v4] " riel
@ 2016-02-01 19:21 ` riel
  2016-02-01 20:00   ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: riel @ 2016-02-01 19:21 UTC (permalink / raw)
  To: linux-kernel; +Cc: tglx, peterz, fweisbec, clark, luto, mingo

From: Rik van Riel <riel@redhat.com>

After removing __acct_update_integrals from the profile,
native_sched_clock remains as the top CPU user. This can be
reduced by only calling account_{user,sys,guest,idle}_time
once per jiffy for long running tasks on nohz_full CPUs.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and kvm guest entry & exit.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h  |  1 +
 kernel/sched/cputime.c | 35 +++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..019c3af98503 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1532,6 +1532,7 @@ struct task_struct {
 	struct prev_cputime prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t vtime_seqcount;
+	unsigned long vtime_jiffies;
 	unsigned long long vtime_snap;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ffb1adc..923c110319b1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,6 +668,15 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
+{
+	if (tsk->vtime_jiffies == jiffies)
+		return false;
+
+	tsk->vtime_jiffies = jiffies;
+	return true;
+}
+
 static unsigned long long vtime_delta(struct task_struct *tsk)
 {
 	unsigned long long clock;
@@ -699,6 +708,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +719,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +731,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_jiffies_changed(tsk, jiffies)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +758,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -759,8 +776,12 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta(tsk);
+	cputime_t delta_cpu;
+
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
 
+	delta_cpu = get_vtime_delta(tsk);
 	account_idle_time(delta_cpu);
 }
 
@@ -773,6 +794,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
 	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_jiffies = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -784,6 +806,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
 	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_jiffies = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
-- 
2.5.0

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01  2:12 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
@ 2016-02-01  9:29   ` Peter Zijlstra
  2016-02-01 19:23     ` Rik van Riel
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2016-02-01  9:29 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, fweisbec, tglx, mingo, luto, clark

On Sun, Jan 31, 2016 at 09:12:31PM -0500, riel@redhat.com wrote:
> Run times for the microbenchmark:
> 
> 4.4				3.8 seconds
> 4.5-rc1				3.7 seconds
> 4.5-rc1 + first patch		3.3 seconds
> 4.5-rc1 + first 3 patches	3.1 seconds
> 4.5-rc1 + all patches		2.3 seconds

Would be good to compare that also to a !NOHZ_FULL bloated path.

* [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01  2:12 [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40% riel
@ 2016-02-01  2:12 ` riel
  2016-02-01  9:29   ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: riel @ 2016-02-01  2:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark

From: Rik van Riel <riel@redhat.com>

After removing __acct_update_integrals from the profile,
native_sched_clock remains as the top CPU user. This can be
reduced by only calling account_{user,sys,guest,idle}_time
once per jiffy for long running tasks on nohz_full CPUs.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and kvm guest entry & exit.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h  |  1 +
 kernel/sched/cputime.c | 35 +++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..019c3af98503 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1532,6 +1532,7 @@ struct task_struct {
 	struct prev_cputime prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t vtime_seqcount;
+	unsigned long vtime_jiffies;
 	unsigned long long vtime_snap;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ffb1adc..923c110319b1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,6 +668,15 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
+{
+	if (tsk->vtime_jiffies == jiffies)
+		return false;
+
+	tsk->vtime_jiffies = jiffies;
+	return true;
+}
+
 static unsigned long long vtime_delta(struct task_struct *tsk)
 {
 	unsigned long long clock;
@@ -699,6 +708,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +719,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +731,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_jiffies_changed(tsk, jiffies)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +758,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -759,8 +776,12 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta(tsk);
+	cputime_t delta_cpu;
+
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
 
+	delta_cpu = get_vtime_delta(tsk);
 	account_idle_time(delta_cpu);
 }
 
@@ -773,6 +794,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
 	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_jiffies = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -784,6 +806,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
 	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_jiffies = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
-- 
2.5.0

* [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-01-30  3:36 [PATCH 0/2] sched,time: reduce nohz_full syscall overhead 40% riel
@ 2016-01-30  3:36 ` riel
  0 siblings, 0 replies; 17+ messages in thread
From: riel @ 2016-01-30  3:36 UTC (permalink / raw)
  To: linux-kernel; +Cc: tglx, mingo, luto, fweisbec, peterz, clark

From: Rik van Riel <riel@redhat.com>

After removing __acct_update_integrals from the profile,
native_sched_clock remains as the top CPU user. This can be
reduced by only calling account_{user,sys,guest,idle}_time
once per jiffy for long running tasks on nohz_full CPUs.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and kvm guest entry & exit.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h  |  1 +
 kernel/sched/cputime.c | 35 +++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..019c3af98503 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1532,6 +1532,7 @@ struct task_struct {
 	struct prev_cputime prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t vtime_seqcount;
+	unsigned long vtime_jiffies;
 	unsigned long long vtime_snap;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ffb1adc..923c110319b1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,6 +668,15 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
+{
+	if (tsk->vtime_jiffies == jiffies)
+		return false;
+
+	tsk->vtime_jiffies = jiffies;
+	return true;
+}
+
 static unsigned long long vtime_delta(struct task_struct *tsk)
 {
 	unsigned long long clock;
@@ -699,6 +708,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +719,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +731,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_jiffies_changed(tsk, jiffies)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +758,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -759,8 +776,12 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta(tsk);
+	cputime_t delta_cpu;
+
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
 
+	delta_cpu = get_vtime_delta(tsk);
 	account_idle_time(delta_cpu);
 }
 
@@ -773,6 +794,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
 	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_jiffies = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -784,6 +806,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
 	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_jiffies = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
-- 
2.5.0
