* [PATCH 0/2] sched,time: reduce nohz_full syscall overhead 40%
From: riel @ 2016-01-29 22:22 UTC
  To: linux-kernel; +Cc: tglx, mingo, peterz, luto, fweisbec

Running with nohz_full introduces a fair amount of overhead.
Specifically, various things that are usually done from the
timer interrupt are now done at syscall, irq, and guest
entry and exit times.

However, some of the code that now runs at every one of those
entry and exit points has only ever worked at jiffy resolution.
The code in __acct_update_integrals was also doing some
unnecessary calculations.

Getting rid of the unnecessary calculations, without changing
any of the functionality in __acct_update_integrals, gets us
about a 10% win.

Not calling __acct_update_integrals and related code unless
jiffies changed (__acct_update_integrals does not do anything
with smaller time intervals anyway) shaves off a further 30%.
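
The gate itself is a single per-task jiffies comparison. Patch 2
below implements it as a small helper along these lines (quoted here
for orientation; it caches the last-seen jiffies value in the task
struct):

	static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
	{
		/* Only report a change the first time we run in a new jiffy. */
		if (tsk->vtime_jiffies == now)
			return false;

		tsk->vtime_jiffies = now;
		return true;
	}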

Run times for the microbenchmark:

4.4                             3.8 seconds
4.5-rc1                         3.7 seconds
4.5-rc1 + first patch           3.3 seconds
4.5-rc1 + both patches          2.3 seconds
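
The benchmark source was not included in this posting; a minimal
reconstruction of "calling an invalid syscall number in a loop" might
look like the sketch below. The syscall number 555555 is an arbitrary
invalid one, the loop count matches the 10 million calls mentioned in
patch 1, and the process would be pinned to a nohz_full CPU (e.g. with
taskset):

	#include <unistd.h>

	int main(void)
	{
		long i;

		/* Each iteration takes a full kernel entry/exit and
		 * fails with ENOSYS. */
		for (i = 0; i < 10000000; i++)
			syscall(555555);

		return 0;
	}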


* [PATCH 1/2] sched,time: remove pointless divides from __acct_update_integrals
From: riel @ 2016-01-29 22:22 UTC
  To: linux-kernel; +Cc: tglx, mingo, peterz, luto, fweisbec

From: Rik van Riel <riel@redhat.com>

When running a microbenchmark calling an invalid syscall number
in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
time in __acct_update_integrals.

This function converts cputime_t to jiffies, to a timeval, only to
convert the timeval back to microseconds before discarding it.

This patch leaves __acct_update_integrals functionally equivalent,
but speeds things up by about 11%, with 10 million calls to an
invalid syscall number dropping from 3.7 to 3.3 seconds.

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..afb5cf8ecc5f 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -125,22 +125,22 @@ static void __acct_update_integrals(struct task_struct *tsk,
 {
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
-		struct timeval value;
 		unsigned long flags;
-		u64 delta;
+		u64 delta, usecs;
 
 		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
-		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
-		delta = value.tv_sec;
-		delta = delta * USEC_PER_SEC + value.tv_usec;
+		delta = cputime_to_jiffies(dtime);
 
 		if (delta == 0)
 			goto out;
+
+		usecs = jiffies_to_usecs(delta);
+
 		tsk->acct_timexpd = time;
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
+		tsk->acct_rss_mem1 += usecs * get_mm_rss(tsk->mm);
+		tsk->acct_vm_mem1 += usecs * tsk->mm->total_vm;
 	out:
 		local_irq_restore(flags);
 	}
-- 
2.5.0


* [PATCH 2/2] sched,time: call __acct_update_integrals once a jiffy
From: riel @ 2016-01-29 22:23 UTC
  To: linux-kernel; +Cc: tglx, mingo, peterz, luto, fweisbec

From: Rik van Riel <riel@redhat.com>

Because __acct_update_integrals does nothing unless the time
interval in question exceeds a jiffy, there is no real reason
to call it more than once a jiffy from the syscall, irq, and
guest entry & exit paths.

If tasks get rescheduled frequently, the scheduler will still
update their time statistics normally. This patch only impacts
longer running tasks.

This speeds up

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h  |  1 +
 kernel/sched/cputime.c | 35 +++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..019c3af98503 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1532,6 +1532,7 @@ struct task_struct {
 	struct prev_cputime prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t vtime_seqcount;
+	unsigned long vtime_jiffies;
 	unsigned long long vtime_snap;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ffb1adc..923c110319b1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,6 +668,15 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
+{
+	if (tsk->vtime_jiffies == now)
+		return false;
+
+	tsk->vtime_jiffies = now;
+	return true;
+}
+
 static unsigned long long vtime_delta(struct task_struct *tsk)
 {
 	unsigned long long clock;
@@ -699,6 +708,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +719,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +731,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_jiffies_changed(tsk, jiffies)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +758,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -759,8 +776,12 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta(tsk);
+	cputime_t delta_cpu;
+
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
 
+	delta_cpu = get_vtime_delta(tsk);
 	account_idle_time(delta_cpu);
 }
 
@@ -773,6 +794,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
 	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_jiffies = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -784,6 +806,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
 	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_jiffies = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
-- 
2.5.0


* Re: [PATCH 2/2] sched,time: call __acct_update_integrals once a jiffy
From: Rik van Riel @ 2016-01-29 22:43 UTC
  To: linux-kernel; +Cc: tglx, mingo, peterz, luto, fweisbec, Clark Williams

On 01/29/2016 05:23 PM, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>

> This speeds up

... ok, that changelog got truncated :(

Here is the full version:


Because __acct_update_integrals does nothing unless the time
interval in question exceeds a jiffy, there is no real reason
to call it more than once a jiffy from the syscall, irq, and
guest entry & exit paths.

If tasks get rescheduled frequently, the scheduler will still
update their time statistics normally.

However, longer running tasks with frequent syscall, irq,
or guest entry & exit see a difference with this patch.

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patch, for a total speedup of about 40%
over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

4.4                             3.8 seconds
4.5-rc1                         3.7 seconds
4.5-rc1 + first patch           3.3 seconds
4.5-rc1 + both patches          2.3 seconds


-- 
All rights reversed


* Re: [PATCH 1/2] sched,time: remove pointless divides from __acct_update_integrals
From: Peter Zijlstra @ 2016-01-29 23:10 UTC
  To: riel; +Cc: linux-kernel, tglx, mingo, luto, fweisbec

On Fri, Jan 29, 2016 at 05:22:59PM -0500, riel@redhat.com wrote:
> From: Rik van Riel <riel@redhat.com>
> 
> When running a microbenchmark calling an invalid syscall number
> in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
> time in __acct_update_integrals.
> 
> This function converts cputime_t to jiffies, to a timeval, only to
> convert the timeval back to microseconds before discarding it.
> 
> This patch leaves __acct_update_integrals functionally equivalent,
> but speeds things up by about 11%, with 10 million calls to an
> invalid syscall number dropping from 3.7 to 3.3 seconds.

WTH is this taskstat crap anyway? Who uses it and can't we kill it?

There seems to be an endless reserve of accounting crap.

> +++ b/kernel/tsacct.c
> @@ -125,22 +125,22 @@ static void __acct_update_integrals(struct task_struct *tsk,
>  {
>  	if (likely(tsk->mm)) {
>  		cputime_t time, dtime;
>  		unsigned long flags;
> +		u64 delta, usecs;
>  
>  		local_irq_save(flags);
>  		time = stime + utime;
>  		dtime = time - tsk->acct_timexpd;
> +		delta = cputime_to_jiffies(dtime);
>  
>  		if (delta == 0)
>  			goto out;

> +
> +		usecs = jiffies_to_usecs(delta);
> +
>  		tsk->acct_timexpd = time;
> +		tsk->acct_rss_mem1 += usecs * get_mm_rss(tsk->mm);
> +		tsk->acct_vm_mem1 += usecs * tsk->mm->total_vm;
>  	out:
>  		local_irq_restore(flags);
>  	}

Hurch, what horrible code.

Can't you at least drop the pointless indent level while you're there
anyway?

Also, it looks like the call path through acct_account_cputime() should
already guarantee IRQs are disabled, so if you move the irq_save /
irq_restore business into acct_update_integrals() you can remove them
from the fast path too.

And I think you can kill that last divide too, if you do something like:


	nsecs = cputime_to_nsecs(dtime);
	if (nsecs < TICK_NSEC)
		goto out;

	usecs = nsecs >> 10;

	...

And if people really care, you can correct the 1024 != 1000 thing in
xacct_add_tsk() which seems to be the consumer side and thus should be
exceedingly rare.
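
Taken together, those suggestions (drop the indent level, rely on the
callers having IRQs disabled, shift instead of divide) might combine
into something like the following untested sketch; treat it as an
illustration of the idea, not a tested patch:

	static void __acct_update_integrals(struct task_struct *tsk,
					    cputime_t utime, cputime_t stime)
	{
		cputime_t time;
		u64 nsecs, usecs;

		if (unlikely(!tsk->mm))
			return;

		/* Callers on the acct_account_cputime() path are assumed
		 * to already run with IRQs disabled. */
		time = stime + utime;
		nsecs = cputime_to_nsecs(time - tsk->acct_timexpd);
		if (nsecs < TICK_NSEC)
			return;

		/* 2^10 is close enough to 10^3: shift instead of divide;
		 * the 1024 != 1000 error is left for xacct_add_tsk(). */
		usecs = nsecs >> 10;

		tsk->acct_timexpd = time;
		tsk->acct_rss_mem1 += usecs * get_mm_rss(tsk->mm);
		tsk->acct_vm_mem1 += usecs * tsk->mm->total_vm;
	}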


* Re: [PATCH 1/2] sched,time: remove pointless divides from __acct_update_integrals
From: Frederic Weisbecker @ 2016-01-30  3:36 UTC
  To: Peter Zijlstra; +Cc: riel, linux-kernel, tglx, mingo, luto

On Sat, Jan 30, 2016 at 12:10:18AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 29, 2016 at 05:22:59PM -0500, riel@redhat.com wrote:
> > From: Rik van Riel <riel@redhat.com>
> > 
> > When running a microbenchmark calling an invalid syscall number
> > in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
> > time in __acct_update_integrals.
> > 
> > This function converts cputime_t to jiffies, to a timeval, only to
> > convert the timeval back to microseconds before discarding it.
> > 
> > This patch leaves __acct_update_integrals functionally equivalent,
> > but speeds things up by about 11%, with 10 million calls to an
> > invalid syscall number dropping from 3.7 to 3.3 seconds.
> 
> WTH is this taskstat crap anyway? Who uses it and can't we kill it?

I have no idea what it's used for; it seems to be related to taskstats
over netlink. I'm not even sure it's actually used. There doesn't seem
to be a runtime off switch, and I bet distros enable it. So that stuff does
some work every millisecond on millions of machines while it probably
has very few users.

SGI introduced it in 2006, and it seems their last contribution there was
in 2008. The rest is kernel maintenance and fixes.

If there are still users of it, then at least we should disable it at runtime by
default.


* Re: [PATCH 1/2] sched,time: remove pointless divides from __acct_update_integrals
From: Rik van Riel @ 2016-01-30  3:47 UTC
  To: Frederic Weisbecker, Peter Zijlstra; +Cc: linux-kernel, tglx, mingo, luto

On 01/29/2016 10:36 PM, Frederic Weisbecker wrote:
> On Sat, Jan 30, 2016 at 12:10:18AM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 29, 2016 at 05:22:59PM -0500, riel@redhat.com wrote:
>>> From: Rik van Riel <riel@redhat.com>
>>>
>>> When running a microbenchmark calling an invalid syscall number
>>> in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
>>> time in __acct_update_integrals.
>>>
>>> This function converts cputime_t to jiffies, to a timeval, only to
>>> convert the timeval back to microseconds before discarding it.
>>>
>>> This patch leaves __acct_update_integrals functionally equivalent,
>>> but speeds things up by about 11%, with 10 million calls to an
>>> invalid syscall number dropping from 3.7 to 3.3 seconds.
>>
>> WTH is this taskstat crap anyway? Who uses it and can't we kill it?
> 
> I have no idea what it's used for; it seems to be related to taskstats
> over netlink. I'm not even sure it's actually used. There doesn't seem
> to be a runtime off switch, and I bet distros enable it. So that stuff does
> some work every millisecond on millions of machines while it probably
> has very few users.
> 
> SGI introduced it in 2006, and it seems their last contribution there was
> in 2008. The rest is kernel maintenance and fixes.
> 
> If there are still users of it, then at least we should disable it at runtime by
> default.

With all the non-power-of-2 divides removed, __acct_update_integrals
disappears from the profile, even running many times per millisecond.

At that point, native_sched_clock takes over the top of the profile,
and the only way I can think of getting rid of that one is making
sure it is not called twice for every syscall, irq, and kvm guest
entry/exit.

-- 
All rights reversed


* Re: [PATCH 2/2] sched,time: call __acct_update_integrals once a jiffy
From: Frederic Weisbecker @ 2016-01-30 14:20 UTC
  To: Rik van Riel; +Cc: linux-kernel, tglx, mingo, peterz, luto, Clark Williams

On Fri, Jan 29, 2016 at 05:43:28PM -0500, Rik van Riel wrote:
> On 01/29/2016 05:23 PM, riel@redhat.com wrote:
> > From: Rik van Riel <riel@redhat.com>
> 
> > This speeds up
> 
> ... ok, that changelog got truncated :(
> 
> Here is the full version:
> 
> 
> Because __acct_update_integrals does nothing unless the time
> interval in question exceeds a jiffy, there is no real reason
> to call it more than once a jiffy from the syscall, irq, and
> guest entry & exit paths.
> 
> If tasks get rescheduled frequently, the scheduler will still
> update their time statistics normally.
> 
> However, longer running tasks with frequent syscall, irq,
> or guest entry & exit see a difference with this patch.
> 
> A microbenchmark calling an invalid syscall number 10 million
> times in a row speeds up an additional 30% over the numbers
> with just the previous patch, for a total speedup of about 40%
> over 4.4 and 4.5-rc1.
> 
> Run times for the microbenchmark:
> 
> 4.4                             3.8 seconds
> 4.5-rc1                         3.7 seconds
> 4.5-rc1 + first patch           3.3 seconds
> 4.5-rc1 + both patches          2.3 seconds

Very nice improvement!


* Re: [PATCH 2/2] sched,time: call __acct_update_integrals once a jiffy
From: Mike Galbraith @ 2016-01-30 17:53 UTC
  To: Frederic Weisbecker, Rik van Riel
  Cc: linux-kernel, tglx, mingo, peterz, luto, Clark Williams

On Sat, 2016-01-30 at 15:20 +0100, Frederic Weisbecker wrote:
> On Fri, Jan 29, 2016 at 05:43:28PM -0500, Rik van Riel wrote:

> > Run times for the microbenchmark:
> > 
> > 4.4                             3.8 seconds
> > 4.5-rc1                         3.7 seconds
> > 4.5-rc1 + first patch           3.3 seconds
> > 4.5-rc1 + both patches          2.3 seconds
> 
> Very nice improvement!

Tasty indeed.

When nohz_full CPUs are not isolated, i.e. are being used as generic
CPUs, get_nohz_timer_target() is a problem with things like tbench.

tbench 8 with Rik's patches applied:
nohz_full=empty
Throughput 3204.69 MB/sec  1.000
nohz_full=1-3,5-7 
Throughput 1354.99 MB/sec   .422  1.000
nohz_full=1-3,5-7 + club below 
Throughput 2762.22 MB/sec   .861  2.038

With Rik's patches and a club, tbench becomes nearly acceptable.
---
 include/linux/tick.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -184,7 +184,7 @@ static inline const struct cpumask *hous
 static inline bool is_housekeeping_cpu(int cpu)
 {
 #ifdef CONFIG_NO_HZ_FULL
-	if (tick_nohz_full_enabled())
+	if (tick_nohz_full_enabled() && runqueue_is_isolated(cpu))
 		return cpumask_test_cpu(cpu, housekeeping_mask);
 #endif
 	return true;


* Re: [PATCH 2/2] sched,time: call __acct_update_integrals once a jiffy
From: Frederic Weisbecker @ 2016-01-30 20:36 UTC
  To: Mike Galbraith
  Cc: Rik van Riel, linux-kernel, tglx, mingo, peterz, luto, Clark Williams

On Sat, Jan 30, 2016 at 06:53:05PM +0100, Mike Galbraith wrote:
> On Sat, 2016-01-30 at 15:20 +0100, Frederic Weisbecker wrote:
> > On Fri, Jan 29, 2016 at 05:43:28PM -0500, Rik van Riel wrote:
> 
> > > Run times for the microbenchmark:
> > > 
> > > 4.4                             3.8 seconds
> > > 4.5-rc1                         3.7 seconds
> > > 4.5-rc1 + first patch           3.3 seconds
> > > 4.5-rc1 + both patches          2.3 seconds
> > 
> > Very nice improvement!
> 
> Tasty indeed.
> 
> When nohz_full CPUs are not isolated, i.e. are being used as generic
> CPUs, get_nohz_timer_target() is a problem with things like tbench.

So by isolated CPU you mean those that are part of the isolcpus= boot option, right?

> 
> tbench 8 with Rik's patches applied:
> nohz_full=empty
> Throughput 3204.69 MB/sec  1.000
> nohz_full=1-3,5-7 
> Throughput 1354.99 MB/sec   .422  1.000
> nohz_full=1-3,5-7 + club below 
> Throughput 2762.22 MB/sec   .861  2.038
> 
> With Rik's patches and a club, tbench becomes nearly acceptable.
> ---
>  include/linux/tick.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -184,7 +184,7 @@ static inline const struct cpumask *hous
>  static inline bool is_housekeeping_cpu(int cpu)
>  {
>  #ifdef CONFIG_NO_HZ_FULL
> -	if (tick_nohz_full_enabled())
> +	if (tick_nohz_full_enabled() && runqueue_is_isolated(cpu))
>  		return cpumask_test_cpu(cpu, housekeeping_mask);

This confuses me. How does forcing timers to CPUs in isolcpus= give
better results?

>  #endif
>  	return true;


* Re: [PATCH 2/2] sched,time: call __acct_update_integrals once a jiffy
From: Mike Galbraith @ 2016-01-31  2:52 UTC
  To: Frederic Weisbecker
  Cc: Rik van Riel, linux-kernel, tglx, mingo, peterz, luto, Clark Williams

On Sat, 2016-01-30 at 21:36 +0100, Frederic Weisbecker wrote:
> On Sat, Jan 30, 2016 at 06:53:05PM +0100, Mike Galbraith wrote:
> > On Sat, 2016-01-30 at 15:20 +0100, Frederic Weisbecker wrote:
> > > On Fri, Jan 29, 2016 at 05:43:28PM -0500, Rik van Riel wrote:
> > 
> > > > Run times for the microbenchmark:
> > > > 
> > > > 4.4                             3.8 seconds
> > > > 4.5-rc1                         3.7 seconds
> > > > 4.5-rc1 + first patch           3.3 seconds
> > > > 4.5-rc1 + both patches          2.3 seconds
> > > 
> > > Very nice improvement!
> > 
> > Tasty indeed.
> > 
> > When nohz_full CPUs are not isolated, ie are being used as generic
> > CPUs, get_nohz_timer_target() is a problem with things like tbench.
> 
> So by isolated CPU you mean those that are part of the isolcpus= boot
> option, right?

Yes, isolated in the scheduler sense, either via isolcpus= or cpusets.
If the CPU is part of a scheduler domain, it is by definition part of
the generic work crew.
 
> > tbench 8 with Rik's patches applied:
> > nohz_full=empty
> > Throughput 3204.69 MB/sec  1.000
> > nohz_full=1-3,5-7 
> > Throughput 1354.99 MB/sec   .422  1.000
> > nohz_full=1-3,5-7 + club below 
> > Throughput 2762.22 MB/sec   .861  2.038
> > 
> > With Rik's patches and a club, tbench becomes nearly acceptable.
> > ---
> >  include/linux/tick.h |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > --- a/include/linux/tick.h
> > +++ b/include/linux/tick.h
> > @@ -184,7 +184,7 @@ static inline const struct cpumask *hous
> >  static inline bool is_housekeeping_cpu(int cpu)
> >  {
> >  #ifdef CONFIG_NO_HZ_FULL
> > -	if (tick_nohz_full_enabled())
> > +	if (tick_nohz_full_enabled() && runqueue_is_isolated(cpu))
> >  		return cpumask_test_cpu(cpu, housekeeping_mask);
> 
> This confuses me. How does forcing timers to CPUs in isolcpus= give
> better results?

It doesn't, it's shutting get_nohz_timer_target() down for those
nohz_full CPUs that are NOT currently isolated and are thus generic CPUs
with the capability to _become_ elite solo artists on demand.

	-Mike


* Re: [PATCH 2/2] sched,time: call __acct_update_integrals once a jiffy
From: Mike Galbraith @ 2016-01-31  5:37 UTC
  To: Frederic Weisbecker
  Cc: Rik van Riel, linux-kernel, tglx, mingo, peterz, luto, Clark Williams

On Sun, 2016-01-31 at 03:52 +0100, Mike Galbraith wrote:
> On Sat, 2016-01-30 at 21:36 +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 30, 2016 at 06:53:05PM +0100, Mike Galbraith wrote:
> > > On Sat, 2016-01-30 at 15:20 +0100, Frederic Weisbecker wrote:
> > > > On Fri, Jan 29, 2016 at 05:43:28PM -0500, Rik van Riel wrote:
> > > 
> > > > > Run times for the microbenchmark:
> > > > > 
> > > > > 4.4                             3.8 seconds
> > > > > 4.5-rc1                         3.7 seconds
> > > > > 4.5-rc1 + first patch           3.3 seconds
> > > > > 4.5-rc1 + both patches          2.3 seconds
> > > > 
> > > > Very nice improvement!
> > > 
> > > Tasty indeed.
> > > 
> > > When nohz_full CPUs are not isolated, i.e. are being used as generic
> > > CPUs, get_nohz_timer_target() is a problem with things like tbench.
> > 
> > So by isolated CPU you mean those that are part of the isolcpus= boot
> > option, right?
> 
> Yes, isolated in the scheduler sense, either via isolcpus= or cpusets.
> If the CPU is part of a scheduler domain, it is by definition part of
> the generic work crew.
>  
> > > tbench 8 with Rik's patches applied:
> > > nohz_full=empty
> > > Throughput 3204.69 MB/sec  1.000
> > > nohz_full=1-3,5-7 
> > > Throughput 1354.99 MB/sec   .422  1.000
> > > nohz_full=1-3,5-7 + club below 
> > > Throughput 2762.22 MB/sec   .861  2.038
> > > 
> > > With Rik's patches and a club, tbench becomes nearly acceptable.
> > > ---
> > >  include/linux/tick.h |    2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > --- a/include/linux/tick.h
> > > +++ b/include/linux/tick.h
> > > @@ -184,7 +184,7 @@ static inline const struct cpumask *hous
> > >  static inline bool is_housekeeping_cpu(int cpu)
> > >  {
> > >  #ifdef CONFIG_NO_HZ_FULL
> > > -	if (tick_nohz_full_enabled())
> > > +	if (tick_nohz_full_enabled() && runqueue_is_isolated(cpu))
> > >  		return cpumask_test_cpu(cpu, housekeeping_mask);
> > 
> > This confuses me. How does forcing timers to CPUs in isolcpus= give
> > better results?
> 
> It doesn't, it's shutting get_nohz_timer_target() down for those
> nohz_full CPUs that are NOT currently isolated and are thus generic CPUs
> with the capability to _become_ elite solo artists on demand.

Damn, I'm gonna have to ask...

If an isolated elite task sets the alarm, why would it want it to go
off somewhere else, maybe in the middle of a death metal concert?
Seems it must induce jitter.  Why is this a good thing?

	-Mike
