* [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40%
@ 2016-02-11  1:08 riel
  2016-02-11  1:08 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
From: riel @ 2016-02-11  1:08 UTC
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

(v6: make VIRT_CPU_ACCOUNTING_GEN jiffy granularity)

Running with nohz_full introduces a fair amount of overhead.
Specifically, various things that are usually done from the
timer interrupt are now done at syscall, irq, and guest
entry and exit times.

However, some of the code that is called every single time
has only ever worked at jiffy resolution. The code in
__acct_update_integrals was also doing some unnecessary
calculations.

Getting rid of the unnecessary calculations, without
changing any of the functionality in __acct_update_integrals
gets us about an 11% win.

Not calling the time statistics updating code more than
once per jiffy, as is done on housekeeping CPUs and on
all the CPUs of a non-nohz_full system, shaves off a
further 30%.

I tested this series with a microbenchmark calling
an invalid syscall number ten million times in a row,
on a nohz_full cpu.

    Run times for the microbenchmark:
    
4.4                             3.8 seconds
4.5-rc1                         3.7 seconds
4.5-rc1 + first patch           3.3 seconds
4.5-rc1 + first 3 patches       3.1 seconds
4.5-rc1 + all patches           2.3 seconds

   Same test on a non-NOHZ_FULL, non-housekeeping CPU:
all kernels                     1.86 seconds
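
For illustration, a minimal sketch of such a microbenchmark (an
assumption; the actual test program is not included in this thread,
and syscall number 599 is simply assumed to be invalid here):

#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	long i;

	/* Each call traps into the kernel and fails with -ENOSYS,
	 * exercising only the syscall entry/exit path. */
	for (i = 0; i < 10 * 1000 * 1000; i++)
		syscall(599);

	return 0;
}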


* [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-11  1:08 [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% riel
@ 2016-02-11  1:08 ` riel
  2016-02-29 11:17   ` [tip:sched/core] sched, time: Remove non-power-of-two divides from __acct_update_integrals() tip-bot for Rik van Riel
  2016-02-11  1:08 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
From: riel @ 2016-02-11  1:08 UTC
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

When running a microbenchmark calling an invalid syscall number
in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
time in __acct_update_integrals.

This function converts cputime_t to jiffies, then to a timeval, only
to convert the timeval to microseconds before discarding it.

This patch leaves __acct_update_integrals functionally equivalent,
but speeds things up by about 12%, with 10 million calls to an
invalid syscall number dropping from 3.7 to 3.25 seconds.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..460ee2bbfef3 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -93,9 +93,11 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {
 	struct mm_struct *mm;
 
-	/* convert pages-usec to Mbyte-usec */
-	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
-	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
+	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
+	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE;
+	do_div(stats->coremem, 1000 * KB);
+	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE;
+	do_div(stats->virtmem, 1000 * KB);
 	mm = get_task_mm(p);
 	if (mm) {
 		/* adjust to KB unit */
@@ -125,22 +127,26 @@ static void __acct_update_integrals(struct task_struct *tsk,
 {
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
-		struct timeval value;
 		unsigned long flags;
 		u64 delta;
 
 		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
-		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
-		delta = value.tv_sec;
-		delta = delta * USEC_PER_SEC + value.tv_usec;
+		/* Avoid division: cputime_t is often in nanoseconds already. */
+		delta = cputime_to_nsecs(dtime);
 
-		if (delta == 0)
+		if (delta < TICK_NSEC)
 			goto out;
+
 		tsk->acct_timexpd = time;
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
+		/*
+		 * Divide by 1024 to avoid overflow, and to avoid division.
+		 * The final unit reported to userspace is Mbyte-usecs,
+		 * the rest of the math is done in xacct_add_tsk.
+		 */
+		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
 	out:
 		local_irq_restore(flags);
 	}
-- 
2.5.0
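
A side note on the unit bookkeeping above: the function now accumulates
pages-nsec/1024, and xacct_add_tsk does the remaining scaling. A
standalone sketch checking that this matches the old pages-usec math
(an illustration under assumptions, not code from the patch; 4K pages
assumed):

#include <stdio.h>
#include <stdint.h>

#define KB	1024ULL
#define MB	(1024 * KB)
#define PAGESZ	4096ULL		/* assumed 4K pages */

int main(void)
{
	uint64_t delta_nsec = 5000000;	/* 5 ms of CPU time */
	uint64_t rss = 25600;		/* resident pages (100 MB at 4K) */

	/* old path: pages-usec accumulated, then * PAGE_SIZE / MB */
	uint64_t old = (delta_nsec / 1000) * rss * PAGESZ / MB;

	/* new path: pages-nsec >> 10 accumulated, then * PAGE_SIZE / (1000 * KB) */
	uint64_t new = ((delta_nsec * rss) >> 10) * PAGESZ / (1000 * KB);

	/* both print 500000 Mbyte-usec */
	printf("old %llu, new %llu\n",
	       (unsigned long long)old, (unsigned long long)new);
	return 0;
}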


* [PATCH 2/4] acct,time: change indentation in __acct_update_integrals
  2016-02-11  1:08 [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-11  1:08 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
@ 2016-02-11  1:08 ` riel
  2016-02-11  1:23   ` Joe Perches
  2016-02-29 11:18   ` [tip:sched/core] acct, time: Change indentation in __acct_update_integrals() tip-bot for Rik van Riel
  2016-02-11  1:08 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
From: riel @ 2016-02-11  1:08 UTC
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

Change the indentation in __acct_update_integrals to make the function
a little easier to read.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
---
 kernel/tsacct.c | 51 ++++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 460ee2bbfef3..d12e815b7bcd 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -125,31 +125,32 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
-	if (likely(tsk->mm)) {
-		cputime_t time, dtime;
-		unsigned long flags;
-		u64 delta;
-
-		local_irq_save(flags);
-		time = stime + utime;
-		dtime = time - tsk->acct_timexpd;
-		/* Avoid division: cputime_t is often in nanoseconds already. */
-		delta = cputime_to_nsecs(dtime);
-
-		if (delta < TICK_NSEC)
-			goto out;
-
-		tsk->acct_timexpd = time;
-		/*
-		 * Divide by 1024 to avoid overflow, and to avoid division.
-		 * The final unit reported to userspace is Mbyte-usecs,
-		 * the rest of the math is done in xacct_add_tsk.
-		 */
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-	out:
-		local_irq_restore(flags);
-	}
+	cputime_t time, dtime;
+	unsigned long flags;
+	u64 delta;
+
+	if (!likely(tsk->mm))
+		return;
+
+	local_irq_save(flags);
+	time = stime + utime;
+	dtime = time - tsk->acct_timexpd;
+	/* Avoid division: cputime_t is often in nanoseconds already. */
+	delta = cputime_to_nsecs(dtime);
+
+	if (delta < TICK_NSEC)
+		goto out;
+
+	tsk->acct_timexpd = time;
+	/*
+	 * Divide by 1024 to avoid overflow, and to avoid division.
+	 * The final unit reported to userspace is Mbyte-usecs,
+	 * the rest of the math is done in xacct_add_tsk.
+	 */
+	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
+out:
+	local_irq_restore(flags);
 }
 
 /**
-- 
2.5.0


* [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals
  2016-02-11  1:08 [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-11  1:08 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
  2016-02-11  1:08 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
@ 2016-02-11  1:08 ` riel
  2016-02-29 11:18   ` [tip:sched/core] time, acct: Drop irq save & restore from __acct_update_integrals() tip-bot for Rik van Riel
  2016-02-11  1:08 ` [PATCH 4/4] sched,time: switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity riel
From: riel @ 2016-02-11  1:08 UTC
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

It looks like all the call paths that lead to __acct_update_integrals
already have irqs disabled, and __acct_update_integrals does not need
to disable irqs itself.

This is very convenient since about half the CPU time left in this
function was spent in local_irq_save alone.

Performance of a microbenchmark that calls an invalid syscall
ten million times in a row on a nohz_full CPU improves 21% vs.
4.5-rc1 with both the removal of divisions from __acct_update_integrals
and this patch, with runtime dropping from 3.7 to 2.9 seconds.

With these patches applied, the highest remaining cpu user in
the trace is native_sched_clock, which is addressed in the next
patch.

For testing purposes I stuck a WARN_ON(!irqs_disabled()) test
in __acct_update_integrals. It did not trigger.
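
A sketch of how such a temporary debug hunk might have looked (an
assumption; the actual test hunk is not included in this thread):

--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ static void __acct_update_integrals(struct task_struct *tsk,
 	cputime_t time, dtime;
 	u64 delta;
 
+	WARN_ON(!irqs_disabled());	/* for testing only */
+
 	if (!likely(tsk->mm))
 		return;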

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index d12e815b7bcd..f8e26ab963ed 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -126,20 +126,18 @@ static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
 	cputime_t time, dtime;
-	unsigned long flags;
 	u64 delta;
 
 	if (!likely(tsk->mm))
 		return;
 
-	local_irq_save(flags);
 	time = stime + utime;
 	dtime = time - tsk->acct_timexpd;
 	/* Avoid division: cputime_t is often in nanoseconds already. */
 	delta = cputime_to_nsecs(dtime);
 
 	if (delta < TICK_NSEC)
-		goto out;
+		return;
 
 	tsk->acct_timexpd = time;
 	/*
@@ -149,8 +147,6 @@ static void __acct_update_integrals(struct task_struct *tsk,
 	 */
 	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
 	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-out:
-	local_irq_restore(flags);
 }
 
 /**
@@ -160,9 +156,12 @@ static void __acct_update_integrals(struct task_struct *tsk,
 void acct_update_integrals(struct task_struct *tsk)
 {
 	cputime_t utime, stime;
+	unsigned long flags;
 
+	local_irq_save(flags);
 	task_cputime(tsk, &utime, &stime);
 	__acct_update_integrals(tsk, utime, stime);
+	local_irq_restore(flags);
 }
 
 /**
-- 
2.5.0


* [PATCH 4/4] sched,time: switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
  2016-02-11  1:08 [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-11  1:08 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
@ 2016-02-11  1:08 ` riel
  2016-02-29 11:18   ` [tip:sched/core] sched, time: Switch " tip-bot for Rik van Riel
  2016-02-15  9:01 ` [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% Mike Galbraith
  2016-02-24 11:16 ` Thomas Gleixner
From: riel @ 2016-02-11  1:08 UTC
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

From: Rik van Riel <riel@redhat.com>

After removing __acct_update_integrals from the profile,
native_sched_clock remains as the top CPU user. This can be
reduced by moving VIRT_CPU_ACCOUNTING_GEN to jiffy
granularity.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and kvm guest entry & exit.

Additionally, only call the more expensive functions (and
advance the seqlock) when jiffies actually changed.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds

A non-NOHZ_FULL cpu (not the housekeeping CPU):
all kernels			1.86 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/sched/cputime.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ffb1adc..01d9898bc9a2 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,26 +668,25 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static unsigned long long vtime_delta(struct task_struct *tsk)
+static cputime_t vtime_delta(struct task_struct *tsk)
 {
-	unsigned long long clock;
+	unsigned long now = READ_ONCE(jiffies);
 
-	clock = local_clock();
-	if (clock < tsk->vtime_snap)
+	if (time_before(now, (unsigned long)tsk->vtime_snap))
 		return 0;
 
-	return clock - tsk->vtime_snap;
+	return jiffies_to_cputime(now - tsk->vtime_snap);
 }
 
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
-	unsigned long long delta = vtime_delta(tsk);
+	unsigned long now = READ_ONCE(jiffies);
+	unsigned long delta = now - tsk->vtime_snap;
 
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-	tsk->vtime_snap += delta;
+	tsk->vtime_snap = now;
 
-	/* CHECKME: always safe to convert nsecs to cputime? */
-	return nsecs_to_cputime(delta);
+	return jiffies_to_cputime(delta);
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -699,6 +698,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_delta(tsk))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +709,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_delta(tsk))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +721,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_delta(tsk)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_delta(tsk))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +748,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_delta(tsk))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -772,7 +779,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
-	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_snap = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -783,7 +790,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_save(flags);
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
-	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_snap = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
-- 
2.5.0


* Re: [PATCH 2/4] acct,time: change indentation in __acct_update_integrals
  2016-02-11  1:08 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
@ 2016-02-11  1:23   ` Joe Perches
  2016-02-29 11:18   ` [tip:sched/core] acct, time: Change indentation in __acct_update_integrals() tip-bot for Rik van Riel
From: Joe Perches @ 2016-02-11  1:23 UTC
  To: riel, linux-kernel
  Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

On Wed, 2016-02-10 at 20:08 -0500, riel@redhat.com wrote:
> Change the indentation in __acct_update_integrals to make the function
> a little easier to read.

trivia:

> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
[]
> @@ -125,31 +125,32 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
[]
> +	if (!likely(tsk->mm))
> +		return;

Using

	if (unlikely(!tsk->mm))
		return;

would be a lot more common.

(~150:1 in the kernel sources)


* Re: [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40%
  2016-02-11  1:08 [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-11  1:08 ` [PATCH 4/4] sched,time: switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity riel
@ 2016-02-15  9:01 ` Mike Galbraith
  2016-02-24 11:16 ` Thomas Gleixner
From: Mike Galbraith @ 2016-02-15  9:01 UTC
  To: riel, linux-kernel
  Cc: fweisbec, tglx, mingo, luto, peterz, clark, eric.dumazet

Hi Rik,

On Wed, 2016-02-10 at 20:08 -0500, riel@redhat.com wrote:

> I tested this series with a microbenchmark calling
> an invalid syscall number ten million times in a row,
> on a nohz_full cpu.
> 
>     Run times for the microbenchmark:
>     
> 4.4                             3.8 seconds
> 4.5-rc1                         3.7 seconds
> 4.5-rc1 + first patch           3.3 seconds
> 4.5-rc1 + first 3 patches       3.1 seconds
> 4.5-rc1 + all patches           2.3 seconds
> 
>    Same test on a non-NOHZ_FULL, non-housekeeping CPU:
> all kernels                     1.86 seconds

I tested 10M stat(".", &buf) calls, and saw a win of ~20% on a
nohz_full cpu.  Below are nopreempt vs nohz_full+patches overhead
numbers from my box.
                                                        avg     ratio
4.4.1-nopreempt        0m1.652s   0m1.633s   0m1.635s   1.640   1.000

nohz_full + patches
nohz_full inactive     0m1.642s   0m1.631s   0m1.629s   1.634    .996
housekeeper CPU        0m2.013s   0m2.012s   0m2.033s   2.019   1.231
nohz_full CPU          0m2.247s   0m2.233s   0m2.239s   2.239   1.365
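
A minimal sketch of that stat() loop (an assumption; the actual test
program is not shown in the thread):

#include <sys/stat.h>

int main(void)
{
	struct stat buf;
	long i;

	/* 10M stat() calls; run under "time" on the CPU of interest */
	for (i = 0; i < 10 * 1000 * 1000; i++)
		stat(".", &buf);

	return 0;
}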

It still ain't free ;-) but between this set, and all the other work
that has gone in ~recently, it looks one hell of a lot better.  That's
not too scary a pricetag.

	-Mike


* Re: [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40%
  2016-02-11  1:08 [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-15  9:01 ` [PATCH 0/4 v6] sched,time: reduce nohz_full syscall overhead 40% Mike Galbraith
@ 2016-02-24 11:16 ` Thomas Gleixner
From: Thomas Gleixner @ 2016-02-24 11:16 UTC
  To: riel; +Cc: linux-kernel, fweisbec, mingo, luto, peterz, clark, eric.dumazet

On Wed, 10 Feb 2016, riel@redhat.com wrote:

> (v6: make VIRT_CPU_ACCOUNTING_GEN jiffy granularity)
> 
> Running with nohz_full introduces a fair amount of overhead.
> Specifically, various things that are usually done from the
> timer interrupt are now done at syscall, irq, and guest
> entry and exit times.
> 
> However, some of the code that is called every single time
> has only ever worked at jiffy resolution. The code in
> __acct_update_integrals was also doing some unnecessary
> calculations.
> 
> Getting rid of the unnecessary calculations, without
> changing any of the functionality in __acct_update_integrals
> gets us about an 11% win.
> 
> Not calling the time statistics updating code more than
> once per jiffy, as is done on housekeeping CPUs and on
> all the CPUs of a non-nohz_full system, shaves off a
> further 30%.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>


* [tip:sched/core] sched, time: Remove non-power-of-two divides from __acct_update_integrals()
  2016-02-11  1:08 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
@ 2016-02-29 11:17   ` tip-bot for Rik van Riel
From: tip-bot for Rik van Riel @ 2016-02-29 11:17 UTC
  To: linux-tip-commits
  Cc: torvalds, tglx, hpa, peterz, mingo, linux-kernel, riel, efault

Commit-ID:  382c2fe994321d503647ce8ee329b9420dc7c1f9
Gitweb:     http://git.kernel.org/tip/382c2fe994321d503647ce8ee329b9420dc7c1f9
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 10 Feb 2016 20:08:24 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:53:08 +0100

sched, time: Remove non-power-of-two divides from __acct_update_integrals()

When running a microbenchmark calling an invalid syscall number
in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
time in __acct_update_integrals().

This function converts cputime_t to jiffies, then to a timeval, only
to convert the timeval to microseconds before discarding it.

This patch leaves __acct_update_integrals() functionally equivalent,
but speeds things up by about 12%, with 10 million calls to an
invalid syscall number dropping from 3.7 to 3.25 seconds.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/tsacct.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49..460ee2b 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -93,9 +93,11 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {
 	struct mm_struct *mm;
 
-	/* convert pages-usec to Mbyte-usec */
-	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
-	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
+	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
+	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE;
+	do_div(stats->coremem, 1000 * KB);
+	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE;
+	do_div(stats->virtmem, 1000 * KB);
 	mm = get_task_mm(p);
 	if (mm) {
 		/* adjust to KB unit */
@@ -125,22 +127,26 @@ static void __acct_update_integrals(struct task_struct *tsk,
 {
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
-		struct timeval value;
 		unsigned long flags;
 		u64 delta;
 
 		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
-		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
-		delta = value.tv_sec;
-		delta = delta * USEC_PER_SEC + value.tv_usec;
+		/* Avoid division: cputime_t is often in nanoseconds already. */
+		delta = cputime_to_nsecs(dtime);
 
-		if (delta == 0)
+		if (delta < TICK_NSEC)
 			goto out;
+
 		tsk->acct_timexpd = time;
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
+		/*
+		 * Divide by 1024 to avoid overflow, and to avoid division.
+		 * The final unit reported to userspace is Mbyte-usecs,
+		 * the rest of the math is done in xacct_add_tsk.
+		 */
+		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
 	out:
 		local_irq_restore(flags);
 	}


* [tip:sched/core] acct, time: Change indentation in __acct_update_integrals()
  2016-02-11  1:08 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
  2016-02-11  1:23   ` Joe Perches
@ 2016-02-29 11:18   ` tip-bot for Rik van Riel
From: tip-bot for Rik van Riel @ 2016-02-29 11:18 UTC
  To: linux-tip-commits
  Cc: peterz, mingo, riel, torvalds, linux-kernel, tglx, efault, hpa, fweisbec

Commit-ID:  b2add86edd3bc050af350515e6ba26f4622c38f3
Gitweb:     http://git.kernel.org/tip/b2add86edd3bc050af350515e6ba26f4622c38f3
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 10 Feb 2016 20:08:25 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:53:09 +0100

acct, time: Change indentation in __acct_update_integrals()

Change the indentation in __acct_update_integrals() to make the function
a little easier to read.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/tsacct.c | 51 ++++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 460ee2b..d12e815 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -125,31 +125,32 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
-	if (likely(tsk->mm)) {
-		cputime_t time, dtime;
-		unsigned long flags;
-		u64 delta;
-
-		local_irq_save(flags);
-		time = stime + utime;
-		dtime = time - tsk->acct_timexpd;
-		/* Avoid division: cputime_t is often in nanoseconds already. */
-		delta = cputime_to_nsecs(dtime);
-
-		if (delta < TICK_NSEC)
-			goto out;
-
-		tsk->acct_timexpd = time;
-		/*
-		 * Divide by 1024 to avoid overflow, and to avoid division.
-		 * The final unit reported to userspace is Mbyte-usecs,
-		 * the rest of the math is done in xacct_add_tsk.
-		 */
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-	out:
-		local_irq_restore(flags);
-	}
+	cputime_t time, dtime;
+	unsigned long flags;
+	u64 delta;
+
+	if (!likely(tsk->mm))
+		return;
+
+	local_irq_save(flags);
+	time = stime + utime;
+	dtime = time - tsk->acct_timexpd;
+	/* Avoid division: cputime_t is often in nanoseconds already. */
+	delta = cputime_to_nsecs(dtime);
+
+	if (delta < TICK_NSEC)
+		goto out;
+
+	tsk->acct_timexpd = time;
+	/*
+	 * Divide by 1024 to avoid overflow, and to avoid division.
+	 * The final unit reported to userspace is Mbyte-usecs,
+	 * the rest of the math is done in xacct_add_tsk.
+	 */
+	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
+out:
+	local_irq_restore(flags);
 }
 
 /**


* [tip:sched/core] time, acct: Drop irq save & restore from __acct_update_integrals()
  2016-02-11  1:08 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
@ 2016-02-29 11:18   ` tip-bot for Rik van Riel
From: tip-bot for Rik van Riel @ 2016-02-29 11:18 UTC
  To: linux-tip-commits
  Cc: tglx, mingo, torvalds, linux-kernel, peterz, hpa, riel, efault

Commit-ID:  9344c92c2e72e495f695caef8364b3dd73af0eab
Gitweb:     http://git.kernel.org/tip/9344c92c2e72e495f695caef8364b3dd73af0eab
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 10 Feb 2016 20:08:26 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:53:09 +0100

time, acct: Drop irq save & restore from __acct_update_integrals()

It looks like all the call paths that lead to __acct_update_integrals()
already have irqs disabled, and __acct_update_integrals() does not need
to disable irqs itself.

This is very convenient since about half the CPU time left in this
function was spent in local_irq_save alone.

Performance of a microbenchmark that calls an invalid syscall
ten million times in a row on a nohz_full CPU improves 21% vs.
4.5-rc1 with both the removal of divisions from __acct_update_integrals()
and this patch, with runtime dropping from 3.7 to 2.9 seconds.

With these patches applied, the highest remaining cpu user in
the trace is native_sched_clock, which is addressed in the next
patch.

For testing purposes I stuck a WARN_ON(!irqs_disabled()) test
in __acct_update_integrals(). It did not trigger.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/tsacct.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index d12e815..f8e26ab 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -126,20 +126,18 @@ static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
 	cputime_t time, dtime;
-	unsigned long flags;
 	u64 delta;
 
 	if (!likely(tsk->mm))
 		return;
 
-	local_irq_save(flags);
 	time = stime + utime;
 	dtime = time - tsk->acct_timexpd;
 	/* Avoid division: cputime_t is often in nanoseconds already. */
 	delta = cputime_to_nsecs(dtime);
 
 	if (delta < TICK_NSEC)
-		goto out;
+		return;
 
 	tsk->acct_timexpd = time;
 	/*
@@ -149,8 +147,6 @@ static void __acct_update_integrals(struct task_struct *tsk,
 	 */
 	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
 	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-out:
-	local_irq_restore(flags);
 }
 
 /**
@@ -160,9 +156,12 @@ out:
 void acct_update_integrals(struct task_struct *tsk)
 {
 	cputime_t utime, stime;
+	unsigned long flags;
 
+	local_irq_save(flags);
 	task_cputime(tsk, &utime, &stime);
 	__acct_update_integrals(tsk, utime, stime);
+	local_irq_restore(flags);
 }
 
 /**


* [tip:sched/core] sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
  2016-02-11  1:08 ` [PATCH 4/4] sched,time: switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity riel
@ 2016-02-29 11:18   ` tip-bot for Rik van Riel
  2016-02-29 15:31     ` Frederic Weisbecker
From: tip-bot for Rik van Riel @ 2016-02-29 11:18 UTC
  To: linux-tip-commits
  Cc: riel, hpa, tglx, efault, torvalds, peterz, linux-kernel, mingo

Commit-ID:  ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
Gitweb:     http://git.kernel.org/tip/ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Wed, 10 Feb 2016 20:08:27 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 29 Feb 2016 09:53:10 +0100

sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity

When profiling syscall overhead on nohz-full kernels,
after removing __acct_update_integrals() from the profile,
native_sched_clock() remains as the top CPU user. This can be
reduced by moving VIRT_CPU_ACCOUNTING_GEN to jiffy granularity.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and KVM guest entry & exit.

Additionally, only call the more expensive functions (and
advance the seqlock) when jiffies actually changed.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

 4.4				3.8 seconds
 4.5-rc1			3.7 seconds
 4.5-rc1 + first patch		3.3 seconds
 4.5-rc1 + first 3 patches	3.1 seconds
 4.5-rc1 + all patches		2.3 seconds

A non-NOHZ_FULL cpu (not the housekeeping CPU):

 all kernels			1.86 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: clark@redhat.com
Cc: eric.dumazet@gmail.com
Cc: fweisbec@gmail.com
Cc: luto@amacapital.net
Link: http://lkml.kernel.org/r/1455152907-18495-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/cputime.c | 39 +++++++++++++++++++++++----------------
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ff..01d9898 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,26 +668,25 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static unsigned long long vtime_delta(struct task_struct *tsk)
+static cputime_t vtime_delta(struct task_struct *tsk)
 {
-	unsigned long long clock;
+	unsigned long now = READ_ONCE(jiffies);
 
-	clock = local_clock();
-	if (clock < tsk->vtime_snap)
+	if (time_before(now, (unsigned long)tsk->vtime_snap))
 		return 0;
 
-	return clock - tsk->vtime_snap;
+	return jiffies_to_cputime(now - tsk->vtime_snap);
 }
 
 static cputime_t get_vtime_delta(struct task_struct *tsk)
 {
-	unsigned long long delta = vtime_delta(tsk);
+	unsigned long now = READ_ONCE(jiffies);
+	unsigned long delta = now - tsk->vtime_snap;
 
 	WARN_ON_ONCE(tsk->vtime_snap_whence == VTIME_INACTIVE);
-	tsk->vtime_snap += delta;
+	tsk->vtime_snap = now;
 
-	/* CHECKME: always safe to convert nsecs to cputime? */
-	return nsecs_to_cputime(delta);
+	return jiffies_to_cputime(delta);
 }
 
 static void __vtime_account_system(struct task_struct *tsk)
@@ -699,6 +698,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_delta(tsk))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +709,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_delta(tsk))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +721,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_delta(tsk)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_delta(tsk))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +748,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_delta(tsk))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -772,7 +779,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
-	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_snap = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -783,7 +790,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	local_irq_save(flags);
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
-	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_snap = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }


* Re: [tip:sched/core] sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
  2016-02-29 11:18   ` [tip:sched/core] sched, time: Switch " tip-bot for Rik van Riel
@ 2016-02-29 15:31     ` Frederic Weisbecker
  2016-03-01 15:35       ` Frederic Weisbecker
From: Frederic Weisbecker @ 2016-02-29 15:31 UTC
  To: Rik van Riel, Thomas Gleixner, H. Peter Anvin, Linus Torvalds,
	Mike Galbraith, Ingo Molnar, LKML, Peter Zijlstra
  Cc: linux-tip-commits

2016-02-29 12:18 GMT+01:00 tip-bot for Rik van Riel <tipbot@zytor.com>:
> Commit-ID:  ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
> Gitweb:     http://git.kernel.org/tip/ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
> Author:     Rik van Riel <riel@redhat.com>
> AuthorDate: Wed, 10 Feb 2016 20:08:27 -0500
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Mon, 29 Feb 2016 09:53:10 +0100
>
> sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
>
> When profiling syscall overhead on nohz-full kernels,
> after removing __acct_update_integrals() from the profile,
> native_sched_clock() remains as the top CPU user. This can be
> reduced by moving VIRT_CPU_ACCOUNTING_GEN to jiffy granularity.
>
> This will reduce timing accuracy on nohz_full CPUs to jiffy
> based sampling, just like on normal CPUs. It results in
> totally removing native_sched_clock from the profile, and
> significantly speeding up the syscall entry and exit path,
> as well as irq entry and exit, and KVM guest entry & exit.
>
> Additionally, only call the more expensive functions (and
> advance the seqlock) when jiffies actually changed.
>
> This code relies on another CPU advancing jiffies when the
> system is busy. On a nohz_full system, this is done by a
> housekeeping CPU.
>
> A microbenchmark calling an invalid syscall number 10 million
> times in a row speeds up an additional 30% over the numbers
> with just the previous patches, for a total speedup of about
> 40% over 4.4 and 4.5-rc1.
>
> Run times for the microbenchmark:
>
>  4.4                            3.8 seconds
>  4.5-rc1                        3.7 seconds
>  4.5-rc1 + first patch          3.3 seconds
>  4.5-rc1 + first 3 patches      3.1 seconds
>  4.5-rc1 + all patches          2.3 seconds
>
> A non-NOHZ_FULL cpu (not the housekeeping CPU):
>
>  all kernels                    1.86 seconds
>
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: clark@redhat.com
> Cc: eric.dumazet@gmail.com
> Cc: fweisbec@gmail.com

It seems the tip bot doesn't parse the Cc tags correctly, as I
wasn't cc'ed on this commit.

Also I wish I had had a chance to test and ack this patch before it
got applied. I guess I should have mentioned that I was on vacation
for the last few weeks.

I'm going to run it through some tests.

Thanks.


* Re: [tip:sched/core] sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
  2016-02-29 15:31     ` Frederic Weisbecker
@ 2016-03-01 15:35       ` Frederic Weisbecker
From: Frederic Weisbecker @ 2016-03-01 15:35 UTC
  To: Rik van Riel, Thomas Gleixner, H. Peter Anvin, Linus Torvalds,
	Mike Galbraith, Ingo Molnar, LKML, Peter Zijlstra
  Cc: linux-tip-commits

2016-02-29 16:31 GMT+01:00 Frederic Weisbecker <fweisbec@gmail.com>:
> 2016-02-29 12:18 GMT+01:00 tip-bot for Rik van Riel <tipbot@zytor.com>:
>> Commit-ID:  ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
>> Gitweb:     http://git.kernel.org/tip/ff9a9b4c4334b53b52ee9279f30bd5dd92ea9bdd
>> Author:     Rik van Riel <riel@redhat.com>
>> AuthorDate: Wed, 10 Feb 2016 20:08:27 -0500
>> Committer:  Ingo Molnar <mingo@kernel.org>
>> CommitDate: Mon, 29 Feb 2016 09:53:10 +0100
>>
>> sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
>>
>> When profiling syscall overhead on nohz-full kernels,
>> after removing __acct_update_integrals() from the profile,
>> native_sched_clock() remains as the top CPU user. This can be
>> reduced by moving VIRT_CPU_ACCOUNTING_GEN to jiffy granularity.
>>
>> This will reduce timing accuracy on nohz_full CPUs to jiffy
>> based sampling, just like on normal CPUs. It results in
>> totally removing native_sched_clock from the profile, and
>> significantly speeding up the syscall entry and exit path,
>> as well as irq entry and exit, and KVM guest entry & exit.
>>
>> Additionally, only call the more expensive functions (and
>> advance the seqlock) when jiffies actually changed.
>>
>> This code relies on another CPU advancing jiffies when the
>> system is busy. On a nohz_full system, this is done by a
>> housekeeping CPU.
>>
>> A microbenchmark calling an invalid syscall number 10 million
>> times in a row speeds up an additional 30% over the numbers
>> with just the previous patches, for a total speedup of about
>> 40% over 4.4 and 4.5-rc1.
>>
>> Run times for the microbenchmark:
>>
>>  4.4                            3.8 seconds
>>  4.5-rc1                        3.7 seconds
>>  4.5-rc1 + first patch          3.3 seconds
>>  4.5-rc1 + first 3 patches      3.1 seconds
>>  4.5-rc1 + all patches          2.3 seconds
>>
>> A non-NOHZ_FULL cpu (not the housekeeping CPU):
>>
>>  all kernels                    1.86 seconds
>>
>> Signed-off-by: Rik van Riel <riel@redhat.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: clark@redhat.com
>> Cc: eric.dumazet@gmail.com
>> Cc: fweisbec@gmail.com
>
> It seems the tip bot doesn't parse correctly the Cc tags as I wasn't
> cc'ed on this commit.
>
> Also I wish I had a chance to test and ack this patch before it got
> applied. I guess I should have told I was in vacation for the last
> weeks.
>
> I'm going to run it through some tests.

Ok I did some simple tests (kernel loops, user loops) and it seems to
account the cputime accurately. The kernel loop consists of brk()
calls (borrowed from an old test from Steve), and it also accounted
for the small fragments of user time spent between syscall calls.
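
A minimal sketch of such a brk()-style kernel loop (an assumption;
Steve's original test is not included here):

#include <unistd.h>

int main(void)
{
	long i;

	/* each iteration spends most of its time in the kernel,
	 * with small fragments of user time in between */
	for (i = 0; i < 1000 * 1000; i++) {
		sbrk(4096);	/* grow the heap */
		sbrk(-4096);	/* shrink it back */
	}
	return 0;
}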

So it looks like a very nice improvement, thanks a lot Rik!
