* [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40%
@ 2016-02-01  2:12 riel
  2016-02-01  2:12 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: riel @ 2016-02-01  2:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark

(v3: address comments raised by Frederic)

Running with nohz_full introduces a fair amount of overhead.
Specifically, various things that are usually done from the
timer interrupt are now done at syscall, irq, and guest
entry and exit times.

However, some of the code that is called every single time
has only ever worked at jiffy resolution. The code in
__acct_update_integrals was also doing some unnecessary
calculations.

Getting rid of the unnecessary calculations, without
changing any of the functionality in __acct_update_integrals
gets us about an 11% win.

Not calling the time statistics updating code more than
once per jiffy, like is done on housekeeping CPUs and on
all the CPUs of a non-nohz_full system, shaves off a
further 30%.

I tested this series with a microbenchmark calling
an invalid syscall number ten million times in a row,
on a nohz_full cpu.
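
The benchmark source itself is not included in this posting; a
minimal sketch of an equivalent loop follows (the syscall number
500000 is an arbitrary out-of-range value, any invalid number just
fails with -1/ENOSYS), run pinned to a nohz_full CPU with taskset:

	#define _GNU_SOURCE
	#include <unistd.h>

	int main(void)
	{
		int i;

		/* Each iteration does one kernel entry + exit. */
		for (i = 0; i < 10 * 1000 * 1000; i++)
			syscall(500000); /* out of range => -1/ENOSYS */

		return 0;
	}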

    Run times for the microbenchmark:
    
4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds


* [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-01  2:12 [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40% riel
@ 2016-02-01  2:12 ` riel
  2016-02-01  4:46   ` kbuild test robot
  2016-02-01  8:37   ` Thomas Gleixner
  2016-02-01  2:12 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: riel @ 2016-02-01  2:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark

From: Rik van Riel <riel@redhat.com>

When running a microbenchmark calling an invalid syscall number
in a loop, on a nohz_full CPU, we spend a full 9% of our CPU
time in __acct_update_integrals.

This function converts cputime_t to jiffies, then to a timeval, only
to convert the timeval right back to microseconds before discarding it.

This patch leaves __acct_update_integrals functionally equivalent,
but speeds things up by about 12%, with 10 million calls to an
invalid syscall number dropping from 3.7 to 3.25 seconds.
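
As a unit check (not part of the patch itself): acct_rss_mem1 now
accumulates delta_ns * rss_pages / 1024. Multiplying by PAGE_SIZE in
xacct_add_tsk() gives byte-nsecs/1024, and dividing by
1000 * KB = 1024000 turns that into (bytes/2^20) * (nsecs/1000),
i.e. the Mbyte-usecs reported to userspace.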

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..1b121a2f1c55 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -93,9 +93,9 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 {
 	struct mm_struct *mm;
 
-	/* convert pages-usec to Mbyte-usec */
-	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
-	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
+	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
+	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / (1000 * KB);
+	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / (1000 * KB);
 	mm = get_task_mm(p);
 	if (mm) {
 		/* adjust to KB unit */
@@ -125,22 +125,26 @@ static void __acct_update_integrals(struct task_struct *tsk,
 {
 	if (likely(tsk->mm)) {
 		cputime_t time, dtime;
-		struct timeval value;
 		unsigned long flags;
 		u64 delta;
 
 		local_irq_save(flags);
 		time = stime + utime;
 		dtime = time - tsk->acct_timexpd;
-		jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
-		delta = value.tv_sec;
-		delta = delta * USEC_PER_SEC + value.tv_usec;
+		/* Avoid division: cputime_t is often in nanoseconds already. */
+		delta = cputime_to_nsecs(dtime);
 
-		if (delta == 0)
+		if (delta < TICK_NSEC)
 			goto out;
+
 		tsk->acct_timexpd = time;
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
+		/*
+		 * Divide by 1024 to avoid overflow, and to avoid division.
+		 * The final unit reported to userspace is Mbyte-usecs,
+		 * the rest of the math is done in xacct_add_tsk.
+		 */
+		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
 	out:
 		local_irq_restore(flags);
 	}
-- 
2.5.0


* [PATCH 2/4] acct,time: change indentation in __acct_update_integrals
  2016-02-01  2:12 [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-01  2:12 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
@ 2016-02-01  2:12 ` riel
  2016-02-01  2:12 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: riel @ 2016-02-01  2:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark

From: Rik van Riel <riel@redhat.com>

Change the indentation in __acct_update_integrals to make the function
a little easier to read.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
---
 kernel/tsacct.c | 51 ++++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 1b121a2f1c55..9c23584c76c4 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -123,31 +123,32 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
 static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
-	if (likely(tsk->mm)) {
-		cputime_t time, dtime;
-		unsigned long flags;
-		u64 delta;
-
-		local_irq_save(flags);
-		time = stime + utime;
-		dtime = time - tsk->acct_timexpd;
-		/* Avoid division: cputime_t is often in nanoseconds already. */
-		delta = cputime_to_nsecs(dtime);
-
-		if (delta < TICK_NSEC)
-			goto out;
-
-		tsk->acct_timexpd = time;
-		/*
-		 * Divide by 1024 to avoid overflow, and to avoid division.
-		 * The final unit reported to userspace is Mbyte-usecs,
-		 * the rest of the math is done in xacct_add_tsk.
-		 */
-		tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
-		tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-	out:
-		local_irq_restore(flags);
-	}
+	cputime_t time, dtime;
+	unsigned long flags;
+	u64 delta;
+
+	if (!likely(tsk->mm))
+		return;
+
+	local_irq_save(flags);
+	time = stime + utime;
+	dtime = time - tsk->acct_timexpd;
+	/* Avoid division: cputime_t is often in nanoseconds already. */
+	delta = cputime_to_nsecs(dtime);
+
+	if (delta < TICK_NSEC)
+		goto out;
+
+	tsk->acct_timexpd = time;
+	/*
+	 * Divide by 1024 to avoid overflow, and to avoid division.
+	 * The final unit reported to userspace is Mbyte-usecs,
+	 * the rest of the math is done in xacct_add_tsk.
+	 */
+	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
+	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
+out:
+	local_irq_restore(flags);
 }
 
 /**
-- 
2.5.0


* [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals
  2016-02-01  2:12 [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40% riel
  2016-02-01  2:12 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
  2016-02-01  2:12 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
@ 2016-02-01  2:12 ` riel
  2016-02-01  9:28   ` Peter Zijlstra
  2016-02-01  2:12 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
  2016-02-01  7:41 ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Ingo Molnar
  4 siblings, 1 reply; 23+ messages in thread
From: riel @ 2016-02-01  2:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark

From: Rik van Riel <riel@redhat.com>

It looks like all the call paths that lead to __acct_update_integrals
already have irqs disabled, and __acct_update_integrals does not need
to disable irqs itself.

This is very convenient since about half the CPU time left in this
function was spent in local_irq_save alone.

Performance of a microbenchmark that calls an invalid syscall
ten million times in a row on a nohz_full CPU improves 21% vs.
4.5-rc1 with both the removal of divisions from __acct_update_integrals
and this patch, with runtime dropping from 3.7 to 2.9 seconds.

With these patches applied, the highest remaining cpu user in
the trace is native_sched_clock, which is addressed in the next
patch.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
---
 kernel/tsacct.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 9c23584c76c4..31fb6c9746d4 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -124,20 +124,18 @@ static void __acct_update_integrals(struct task_struct *tsk,
 				    cputime_t utime, cputime_t stime)
 {
 	cputime_t time, dtime;
-	unsigned long flags;
 	u64 delta;
 
 	if (!likely(tsk->mm))
 		return;
 
-	local_irq_save(flags);
 	time = stime + utime;
 	dtime = time - tsk->acct_timexpd;
 	/* Avoid division: cputime_t is often in nanoseconds already. */
 	delta = cputime_to_nsecs(dtime);
 
 	if (delta < TICK_NSEC)
-		goto out;
+		return;
 
 	tsk->acct_timexpd = time;
 	/*
@@ -147,8 +145,6 @@ static void __acct_update_integrals(struct task_struct *tsk,
 	 */
 	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
 	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
-out:
-	local_irq_restore(flags);
 }
 
 /**
-- 
2.5.0


* [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01  2:12 [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40% riel
                   ` (2 preceding siblings ...)
  2016-02-01  2:12 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
@ 2016-02-01  2:12 ` riel
  2016-02-01  9:29   ` Peter Zijlstra
  2016-02-01  7:41 ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Ingo Molnar
  4 siblings, 1 reply; 23+ messages in thread
From: riel @ 2016-02-01  2:12 UTC (permalink / raw)
  To: linux-kernel; +Cc: fweisbec, tglx, mingo, luto, peterz, clark

From: Rik van Riel <riel@redhat.com>

After removing __acct_update_integrals from the profile,
native_sched_clock remains as the top CPU user. This can be
reduced by only calling account_{user,sys,guest,idle}_time
once per jiffy for long running tasks on nohz_full CPUs.

This will reduce timing accuracy on nohz_full CPUs to jiffy
based sampling, just like on normal CPUs. It results in
totally removing native_sched_clock from the profile, and
significantly speeding up the syscall entry and exit path,
as well as irq entry and exit, and kvm guest entry & exit.

This code relies on another CPU advancing jiffies when the
system is busy. On a nohz_full system, this is done by a
housekeeping CPU.
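(With e.g. nohz_full=1-7 on an 8-CPU machine, CPU 0 stays a
housekeeping CPU and keeps jiffies advancing.)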

A microbenchmark calling an invalid syscall number 10 million
times in a row speeds up an additional 30% over the numbers
with just the previous patches, for a total speedup of about
40% over 4.4 and 4.5-rc1.

Run times for the microbenchmark:

4.4				3.8 seconds
4.5-rc1				3.7 seconds
4.5-rc1 + first patch		3.3 seconds
4.5-rc1 + first 3 patches	3.1 seconds
4.5-rc1 + all patches		2.3 seconds

Signed-off-by: Rik van Riel <riel@redhat.com>
---
 include/linux/sched.h  |  1 +
 kernel/sched/cputime.c | 35 +++++++++++++++++++++++++++++------
 2 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a94cc3..019c3af98503 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1532,6 +1532,7 @@ struct task_struct {
 	struct prev_cputime prev_cputime;
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 	seqcount_t vtime_seqcount;
+	unsigned long vtime_jiffies;
 	unsigned long long vtime_snap;
 	enum {
 		/* Task is sleeping or running in a CPU with VTIME inactive */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b2ab2ffb1adc..923c110319b1 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -668,6 +668,15 @@ void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, cputime
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
+static bool vtime_jiffies_changed(struct task_struct *tsk, unsigned long now)
+{
+	if (tsk->vtime_jiffies == now)
+		return false;
+
+	tsk->vtime_jiffies = now;
+	return true;
+}
+
 static unsigned long long vtime_delta(struct task_struct *tsk)
 {
 	unsigned long long clock;
@@ -699,6 +708,9 @@ static void __vtime_account_system(struct task_struct *tsk)
 
 void vtime_account_system(struct task_struct *tsk)
 {
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
+
 	write_seqcount_begin(&tsk->vtime_seqcount);
 	__vtime_account_system(tsk);
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -707,7 +719,8 @@ void vtime_account_system(struct task_struct *tsk)
 void vtime_gen_account_irq_exit(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	if (context_tracking_in_user())
 		tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
@@ -718,16 +731,19 @@ void vtime_account_user(struct task_struct *tsk)
 	cputime_t delta_cpu;
 
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	delta_cpu = get_vtime_delta(tsk);
 	tsk->vtime_snap_whence = VTIME_SYS;
-	account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	if (vtime_jiffies_changed(tsk, jiffies)) {
+		delta_cpu = get_vtime_delta(tsk);
+		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
+	}
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
 
 void vtime_user_enter(struct task_struct *tsk)
 {
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	tsk->vtime_snap_whence = VTIME_USER;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -742,7 +758,8 @@ void vtime_guest_enter(struct task_struct *tsk)
 	 * that can thus safely catch up with a tickless delta.
 	 */
 	write_seqcount_begin(&tsk->vtime_seqcount);
-	__vtime_account_system(tsk);
+	if (vtime_jiffies_changed(tsk, jiffies))
+		__vtime_account_system(tsk);
 	current->flags |= PF_VCPU;
 	write_seqcount_end(&tsk->vtime_seqcount);
 }
@@ -759,8 +776,12 @@ EXPORT_SYMBOL_GPL(vtime_guest_exit);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-	cputime_t delta_cpu = get_vtime_delta(tsk);
+	cputime_t delta_cpu;
+
+	if (!vtime_jiffies_changed(tsk, jiffies))
+		return;
 
+	delta_cpu = get_vtime_delta(tsk);
 	account_idle_time(delta_cpu);
 }
 
@@ -773,6 +794,7 @@ void arch_vtime_task_switch(struct task_struct *prev)
 	write_seqcount_begin(&current->vtime_seqcount);
 	current->vtime_snap_whence = VTIME_SYS;
 	current->vtime_snap = sched_clock_cpu(smp_processor_id());
+	current->vtime_jiffies = jiffies;
 	write_seqcount_end(&current->vtime_seqcount);
 }
 
@@ -784,6 +806,7 @@ void vtime_init_idle(struct task_struct *t, int cpu)
 	write_seqcount_begin(&t->vtime_seqcount);
 	t->vtime_snap_whence = VTIME_SYS;
 	t->vtime_snap = sched_clock_cpu(cpu);
+	t->vtime_jiffies = jiffies;
 	write_seqcount_end(&t->vtime_seqcount);
 	local_irq_restore(flags);
 }
-- 
2.5.0


* Re: [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-01  2:12 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
@ 2016-02-01  4:46   ` kbuild test robot
  2016-02-01  8:37   ` Thomas Gleixner
  1 sibling, 0 replies; 23+ messages in thread
From: kbuild test robot @ 2016-02-01  4:46 UTC (permalink / raw)
  To: riel; +Cc: kbuild-all, linux-kernel, fweisbec, tglx, mingo, luto, peterz, clark

[-- Attachment #1: Type: text/plain, Size: 2952 bytes --]

Hi Rik,

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.5-rc2 next-20160129]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/riel-redhat-com/sched-time-reduce-nohz_full-syscall-overhead-40/20160201-101609
config: sh-sh7757lcr_defconfig (attached as .config)
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=sh 

All errors (new ones prefixed by >>):

   kernel/built-in.o: In function `get_mm_hiwater_vm':
>> include/linux/mm.h:1377: undefined reference to `__udivdi3'

vim +1377 include/linux/mm.h

172703b0 Matt Fleming      2011-05-24  1361  	atomic_long_dec(&mm->rss_stat.count[member]);
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1362  }
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1363  
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1364  static inline unsigned long get_mm_rss(struct mm_struct *mm)
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1365  {
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1366  	return get_mm_counter(mm, MM_FILEPAGES) +
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1367  		get_mm_counter(mm, MM_ANONPAGES);
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1368  }
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1369  
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1370  static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1371  {
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1372  	return max(mm->hiwater_rss, get_mm_rss(mm));
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1373  }
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1374  
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1375  static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1376  {
d559db08 KAMEZAWA Hiroyuki 2010-03-05 @1377  	return max(mm->hiwater_vm, mm->total_vm);
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1378  }
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1379  
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1380  static inline void update_hiwater_rss(struct mm_struct *mm)
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1381  {
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1382  	unsigned long _rss = get_mm_rss(mm);
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1383  
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1384  	if ((mm)->hiwater_rss < _rss)
d559db08 KAMEZAWA Hiroyuki 2010-03-05  1385  		(mm)->hiwater_rss = _rss;

:::::: The code at line 1377 was first introduced by commit
:::::: d559db086ff5be9bcc259e5aa50bf3d881eaf1d1 mm: clean up mm_counter

:::::: TO: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
:::::: CC: Linus Torvalds <torvalds@linux-foundation.org>
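
The undefined __udivdi3 reference is 32-bit fallout from the new u64
divides in xacct_add_tsk(): gcc turns a 64-bit '/' into a libgcc call
that the kernel does not provide. A sketch of one way out (not
necessarily the fix that was eventually posted) is the division
helper from <linux/math64.h>:

	stats->coremem = div_u64(p->acct_rss_mem1 * PAGE_SIZE, 1000 * KB);
	stats->virtmem = div_u64(p->acct_vm_mem1 * PAGE_SIZE, 1000 * KB);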

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 11875 bytes --]


* [PATCH] perf tooling: Add 'perf bench syscall' benchmark
  2016-02-01  2:12 [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40% riel
                   ` (3 preceding siblings ...)
  2016-02-01  2:12 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
@ 2016-02-01  7:41 ` Ingo Molnar
  2016-02-01  7:48   ` [PATCH] perf tooling: Simplify 'perf bench syscall' Ingo Molnar
  2016-02-01 15:41   ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Andy Lutomirski
  4 siblings, 2 replies; 23+ messages in thread
From: Ingo Molnar @ 2016-02-01  7:41 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, fweisbec, tglx, luto, peterz, clark,
	Arnaldo Carvalho de Melo, Peter Zijlstra


* riel@redhat.com <riel@redhat.com> wrote:

> (v3: address comments raised by Frederic)
> 
> Running with nohz_full introduces a fair amount of overhead.
> Specifically, various things that are usually done from the
> timer interrupt are now done at syscall, irq, and guest
> entry and exit times.
> 
> However, some of the code that is called every single time
> has only ever worked at jiffy resolution. The code in
> __acct_update_integrals was also doing some unnecessary
> calculations.
> 
> Getting rid of the unnecessary calculations, without
> changing any of the functionality in __acct_update_integrals
> gets us about an 11% win.
> 
> Not calling the time statistics updating code more than
> once per jiffy, like is done on housekeeping CPUs and on
> all the CPUs of a non-nohz_full system, shaves off a
> further 30%.
> 
> I tested this series with a microbenchmark calling
> an invalid syscall number ten million times in a row,
> on a nohz_full cpu.
> 
>     Run times for the microbenchmark:
>     
> 4.4				3.8 seconds
> 4.5-rc1				3.7 seconds
> 4.5-rc1 + first patch		3.3 seconds
> 4.5-rc1 + first 3 patches	3.1 seconds
> 4.5-rc1 + all patches		2.3 seconds

Another suggestion (beyond fixing the 32-bit build ;-), could you please stick 
your syscall microbenchmark into 'perf bench', so that we have a standardized way 
of checking such numbers?

In fact I'd suggest we introduce an entirely new sub-tool for system call 
performance measurement - and this might be the first functionality of it.

I've attached a quick patch that is basically a copy of 'perf bench numa' and 
which measures getppid() performance (simple syscall where the result is not 
cached by glibc).

I kept the process, threading and memory allocation bits of numa.c, just in case 
we need them to measure more complex syscalls. Maybe we could keep the threading 
bits and remove the memory allocation parameters, to simplify the benchmark?

Anyway, this could be a good base to start off on.
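
A run might then look something like this (hypothetical invocation
built from the options above; note that init() currently insists on
at least one of -G/-P/-T being given):

	$ perf bench syscall -p 2 -t 1 -T 1 -l 10000000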

Thanks,

	Ingo

=====================>
From ea072c9053555e71dec5eae5bb91bd6f9cc33723 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Mon, 1 Feb 2016 08:36:31 +0100
Subject: [PATCH] perf bench: Add 'syscall' benchmark utility

Just a basic 'perf bench syscall null' variant, which measures
getppid() performance.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/perf/bench/Build     |    1 +
 tools/perf/bench/bench.h   |    1 +
 tools/perf/bench/syscall.c | 1598 ++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c |   16 +-
 4 files changed, 1612 insertions(+), 4 deletions(-)

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 60bf11943047..f7ad43f4871a 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -1,3 +1,4 @@
+perf-y += syscall.o
 perf-y += sched-messaging.o
 perf-y += sched-pipe.o
 perf-y += mem-functions.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index a50df86f2b9b..f4e674042467 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -26,6 +26,7 @@
 #endif
 
 extern int bench_numa(int argc, const char **argv, const char *prefix);
+extern int bench_syscall(int argc, const char **argv, const char *prefix);
 extern int bench_sched_messaging(int argc, const char **argv, const char *prefix);
 extern int bench_sched_pipe(int argc, const char **argv, const char *prefix);
 extern int bench_mem_memcpy(int argc, const char **argv,
diff --git a/tools/perf/bench/syscall.c b/tools/perf/bench/syscall.c
new file mode 100644
index 000000000000..5a4ef02176d1
--- /dev/null
+++ b/tools/perf/bench/syscall.c
@@ -0,0 +1,1598 @@
+/*
+ * syscall.c
+ *
+ * syscall: Measure various aspects of Linux system call performance
+ */
+
+#include "../perf.h"
+#include "../builtin.h"
+#include "../util/util.h"
+#include <subcmd/parse-options.h>
+#include "../util/cloexec.h"
+
+#include "bench.h"
+
+#include <errno.h>
+#include <sched.h>
+#include <stdio.h>
+#include <assert.h>
+#include <malloc.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sys/mman.h>
+#include <sys/time.h>
+#include <sys/resource.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <sys/types.h>
+
+#include <numa.h>
+#include <numaif.h>
+
+/*
+ * Regular printout to the terminal, suppressed if -q is specified:
+ */
+#define tprintf(x...) do { if (g && g->p.show_details >= 0) printf(x); } while (0)
+
+/*
+ * Debug printf:
+ */
+#define dprintf(x...) do { if (g && g->p.show_details >= 1) printf(x); } while (0)
+
+struct thread_data {
+	int			curr_cpu;
+	cpu_set_t		bind_cpumask;
+	int			bind_node;
+	u8			*process_data;
+	int			process_nr;
+	int			thread_nr;
+	int			task_nr;
+	unsigned int		loops_done;
+	u64			val;
+	u64			runtime_ns;
+	u64			system_time_ns;
+	u64			user_time_ns;
+	double			speed_gbs;
+	pthread_mutex_t		*process_lock;
+};
+
+/* Parameters set by options: */
+
+struct params {
+	/* Startup synchronization: */
+	bool			serialize_startup;
+
+	/* Task hierarchy: */
+	int			nr_proc;
+	int			nr_threads;
+
+	/* Working set sizes: */
+	const char		*mb_global_str;
+	const char		*mb_proc_str;
+	const char		*mb_proc_locked_str;
+	const char		*mb_thread_str;
+
+	double			mb_global;
+	double			mb_proc;
+	double			mb_proc_locked;
+	double			mb_thread;
+
+	/* Access patterns to the working set: */
+	bool			data_reads;
+	bool			data_writes;
+	bool			data_backwards;
+	bool			data_zero_memset;
+	bool			data_rand_walk;
+	u32			nr_loops;
+	u32			nr_secs;
+	u32			sleep_usecs;
+
+	/* Working set initialization: */
+	bool			init_zero;
+	bool			init_random;
+	bool			init_cpu0;
+
+	/* Misc options: */
+	int			show_details;
+	int			run_all;
+	int			thp;
+
+	long			bytes_global;
+	long			bytes_process;
+	long			bytes_process_locked;
+	long			bytes_thread;
+
+	int			nr_tasks;
+	bool			show_quiet;
+
+	bool			show_convergence;
+	bool			measure_convergence;
+
+	int			perturb_secs;
+	int			nr_cpus;
+	int			nr_nodes;
+
+	/* Affinity options -C and -N: */
+	char			*cpu_list_str;
+	char			*node_list_str;
+};
+
+
+/* Global, read-writable area, accessible to all processes and threads: */
+
+struct global_info {
+	u8			*data;
+
+	pthread_mutex_t		startup_mutex;
+	int			nr_tasks_started;
+
+	pthread_mutex_t		startup_done_mutex;
+
+	pthread_mutex_t		start_work_mutex;
+	int			nr_tasks_working;
+
+	pthread_mutex_t		stop_work_mutex;
+	u64			bytes_done;
+
+	struct thread_data	*threads;
+
+	/* Convergence latency measurement: */
+	bool			all_converged;
+	bool			stop_work;
+
+	int			print_once;
+
+	struct params		p;
+};
+
+static struct global_info	*g = NULL;
+
+static int parse_cpus_opt(const struct option *opt, const char *arg, int unset);
+static int parse_nodes_opt(const struct option *opt, const char *arg, int unset);
+
+struct params p0;
+
+static const struct option options[] = {
+	OPT_INTEGER('p', "nr_proc"	, &p0.nr_proc,		"number of processes"),
+	OPT_INTEGER('t', "nr_threads"	, &p0.nr_threads,	"number of threads per process"),
+
+	OPT_STRING('G', "mb_global"	, &p0.mb_global_str,	"MB", "global  memory (MBs)"),
+	OPT_STRING('P', "mb_proc"	, &p0.mb_proc_str,	"MB", "process memory (MBs)"),
+	OPT_STRING('L', "mb_proc_locked", &p0.mb_proc_locked_str,"MB", "process serialized/locked memory access (MBs), <= process_memory"),
+	OPT_STRING('T', "mb_thread"	, &p0.mb_thread_str,	"MB", "thread  memory (MBs)"),
+
+	OPT_UINTEGER('l', "nr_loops"	, &p0.nr_loops,		"max number of loops to run (default: 10,000,000)"),
+	OPT_UINTEGER('s', "nr_secs"	, &p0.nr_secs,		"max number of seconds to run (default: 5 secs)"),
+	OPT_UINTEGER('u', "usleep"	, &p0.sleep_usecs,	"usecs to sleep per loop iteration"),
+
+	OPT_BOOLEAN('R', "data_reads"	, &p0.data_reads,	"access the data via reads (can be mixed with -W)"),
+	OPT_BOOLEAN('W', "data_writes"	, &p0.data_writes,	"access the data via writes (can be mixed with -R)"),
+	OPT_BOOLEAN('B', "data_backwards", &p0.data_backwards,	"access the data backwards as well"),
+	OPT_BOOLEAN('Z', "data_zero_memset", &p0.data_zero_memset,"access the data via glibc bzero only"),
+	OPT_BOOLEAN('r', "data_rand_walk", &p0.data_rand_walk,	"access the data with random (32bit LFSR) walk"),
+
+
+	OPT_BOOLEAN('z', "init_zero"	, &p0.init_zero,	"bzero the initial allocations"),
+	OPT_BOOLEAN('I', "init_random"	, &p0.init_random,	"randomize the contents of the initial allocations"),
+	OPT_BOOLEAN('0', "init_cpu0"	, &p0.init_cpu0,	"do the initial allocations on CPU#0"),
+	OPT_INTEGER('x', "perturb_secs", &p0.perturb_secs,	"perturb thread 0/0 every X secs, to test convergence stability"),
+
+	OPT_INCR   ('d', "show_details"	, &p0.show_details,	"Show details"),
+	OPT_INCR   ('a', "all"		, &p0.run_all,		"Run all tests in the suite"),
+	OPT_INTEGER('H', "thp"		, &p0.thp,		"MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE"),
+	OPT_BOOLEAN('c', "show_convergence", &p0.show_convergence, "show convergence details"),
+	OPT_BOOLEAN('m', "measure_convergence",	&p0.measure_convergence, "measure convergence latency"),
+	OPT_BOOLEAN('q', "quiet"	, &p0.show_quiet,	"quiet mode"),
+	OPT_BOOLEAN('S', "serialize-startup", &p0.serialize_startup,"serialize thread startup"),
+
+	/* Special option string parsing callbacks: */
+        OPT_CALLBACK('C', "cpus", NULL, "cpu[,cpu2,...cpuN]",
+			"bind the first N tasks to these specific cpus (the rest is unbound)",
+			parse_cpus_opt),
+        OPT_CALLBACK('M', "memnodes", NULL, "node[,node2,...nodeN]",
+			"bind the first N tasks to these specific memory nodes (the rest is unbound)",
+			parse_nodes_opt),
+	OPT_END()
+};
+
+static const char * const bench_syscall_usage[] = {
+	"perf bench syscall <options>",
+	NULL
+};
+
+static const char * const syscall_usage[] = {
+	"perf bench syscall mem [<options>]",
+	NULL
+};
+
+static cpu_set_t bind_to_cpu(int target_cpu)
+{
+	cpu_set_t orig_mask, mask;
+	int ret;
+
+	ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
+	BUG_ON(ret);
+
+	CPU_ZERO(&mask);
+
+	if (target_cpu == -1) {
+		int cpu;
+
+		for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+			CPU_SET(cpu, &mask);
+	} else {
+		BUG_ON(target_cpu < 0 || target_cpu >= g->p.nr_cpus);
+		CPU_SET(target_cpu, &mask);
+	}
+
+	ret = sched_setaffinity(0, sizeof(mask), &mask);
+	BUG_ON(ret);
+
+	return orig_mask;
+}
+
+static cpu_set_t bind_to_node(int target_node)
+{
+	int cpus_per_node = g->p.nr_cpus/g->p.nr_nodes;
+	cpu_set_t orig_mask, mask;
+	int cpu;
+	int ret;
+
+	BUG_ON(cpus_per_node*g->p.nr_nodes != g->p.nr_cpus);
+	BUG_ON(!cpus_per_node);
+
+	ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
+	BUG_ON(ret);
+
+	CPU_ZERO(&mask);
+
+	if (target_node == -1) {
+		for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+			CPU_SET(cpu, &mask);
+	} else {
+		int cpu_start = (target_node + 0) * cpus_per_node;
+		int cpu_stop  = (target_node + 1) * cpus_per_node;
+
+		BUG_ON(cpu_stop > g->p.nr_cpus);
+
+		for (cpu = cpu_start; cpu < cpu_stop; cpu++)
+			CPU_SET(cpu, &mask);
+	}
+
+	ret = sched_setaffinity(0, sizeof(mask), &mask);
+	BUG_ON(ret);
+
+	return orig_mask;
+}
+
+static void bind_to_cpumask(cpu_set_t mask)
+{
+	int ret;
+
+	ret = sched_setaffinity(0, sizeof(mask), &mask);
+	BUG_ON(ret);
+}
+
+static void mempol_restore(void)
+{
+	int ret;
+
+	ret = set_mempolicy(MPOL_DEFAULT, NULL, g->p.nr_nodes-1);
+
+	BUG_ON(ret);
+}
+
+static void bind_to_memnode(int node)
+{
+	unsigned long nodemask;
+	int ret;
+
+	if (node == -1)
+		return;
+
+	BUG_ON(g->p.nr_nodes > (int)sizeof(nodemask));
+	nodemask = 1L << node;
+
+	ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask)*8);
+	dprintf("binding to node %d, mask: %016lx => %d\n", node, nodemask, ret);
+
+	BUG_ON(ret);
+}
+
+#define HPSIZE (2*1024*1024)
+
+#define set_taskname(fmt...)				\
+do {							\
+	char name[20];					\
+							\
+	snprintf(name, 20, fmt);			\
+	prctl(PR_SET_NAME, name);			\
+} while (0)
+
+static u8 *alloc_data(ssize_t bytes0, int map_flags,
+		      int init_zero, int init_cpu0, int thp, int init_random)
+{
+	cpu_set_t orig_mask;
+	ssize_t bytes;
+	u8 *buf;
+	int ret;
+
+	if (!bytes0)
+		return NULL;
+
+	/* Allocate and initialize all memory on CPU#0: */
+	if (init_cpu0) {
+		orig_mask = bind_to_node(0);
+		bind_to_memnode(0);
+	}
+
+	bytes = bytes0 + HPSIZE;
+
+	buf = (void *)mmap(0, bytes, PROT_READ|PROT_WRITE, MAP_ANON|map_flags, -1, 0);
+	BUG_ON(buf == (void *)-1);
+
+	if (map_flags == MAP_PRIVATE) {
+		if (thp > 0) {
+			ret = madvise(buf, bytes, MADV_HUGEPAGE);
+			if (ret && !g->print_once) {
+				g->print_once = 1;
+				printf("WARNING: Could not enable THP - do: 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled'\n");
+			}
+		}
+		if (thp < 0) {
+			ret = madvise(buf, bytes, MADV_NOHUGEPAGE);
+			if (ret && !g->print_once) {
+				g->print_once = 1;
+				printf("WARNING: Could not disable THP: run a CONFIG_TRANSPARENT_HUGEPAGE kernel?\n");
+			}
+		}
+	}
+
+	if (init_zero) {
+		bzero(buf, bytes);
+	} else {
+		/* Initialize random contents, different in each word: */
+		if (init_random) {
+			u64 *wbuf = (void *)buf;
+			long off = rand();
+			long i;
+
+			for (i = 0; i < bytes/8; i++)
+				wbuf[i] = i + off;
+		}
+	}
+
+	/* Align to 2MB boundary: */
+	buf = (void *)(((unsigned long)buf + HPSIZE-1) & ~(HPSIZE-1));
+
+	/* Restore affinity: */
+	if (init_cpu0) {
+		bind_to_cpumask(orig_mask);
+		mempol_restore();
+	}
+
+	return buf;
+}
+
+static void free_data(void *data, ssize_t bytes)
+{
+	int ret;
+
+	if (!data)
+		return;
+
+	ret = munmap(data, bytes);
+	BUG_ON(ret);
+}
+
+/*
+ * Create a shared memory buffer that can be shared between processes, zeroed:
+ */
+static void * zalloc_shared_data(ssize_t bytes)
+{
+	return alloc_data(bytes, MAP_SHARED, 1, g->p.init_cpu0,  g->p.thp, g->p.init_random);
+}
+
+/*
+ * Create a shared memory buffer that can be shared between processes:
+ */
+static void * setup_shared_data(ssize_t bytes)
+{
+	return alloc_data(bytes, MAP_SHARED, 0, g->p.init_cpu0,  g->p.thp, g->p.init_random);
+}
+
+/*
+ * Allocate process-local memory - this will either be shared between
+ * threads of this process, or only be accessed by this thread:
+ */
+static void * setup_private_data(ssize_t bytes)
+{
+	return alloc_data(bytes, MAP_PRIVATE, 0, g->p.init_cpu0,  g->p.thp, g->p.init_random);
+}
+
+/*
+ * Return a process-shared (global) mutex:
+ */
+static void init_global_mutex(pthread_mutex_t *mutex)
+{
+	pthread_mutexattr_t attr;
+
+	pthread_mutexattr_init(&attr);
+	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
+	pthread_mutex_init(mutex, &attr);
+}
+
+static int parse_cpu_list(const char *arg)
+{
+	p0.cpu_list_str = strdup(arg);
+
+	dprintf("got CPU list: {%s}\n", p0.cpu_list_str);
+
+	return 0;
+}
+
+static int parse_setup_cpu_list(void)
+{
+	struct thread_data *td;
+	char *str0, *str;
+	int t;
+
+	if (!g->p.cpu_list_str)
+		return 0;
+
+	dprintf("g->p.nr_tasks: %d\n", g->p.nr_tasks);
+
+	str0 = str = strdup(g->p.cpu_list_str);
+	t = 0;
+
+	BUG_ON(!str);
+
+	tprintf("# binding tasks to CPUs:\n");
+	tprintf("#  ");
+
+	while (true) {
+		int bind_cpu, bind_cpu_0, bind_cpu_1;
+		char *tok, *tok_end, *tok_step, *tok_len, *tok_mul;
+		int bind_len;
+		int step;
+		int mul;
+
+		tok = strsep(&str, ",");
+		if (!tok)
+			break;
+
+		tok_end = strstr(tok, "-");
+
+		dprintf("\ntoken: {%s}, end: {%s}\n", tok, tok_end);
+		if (!tok_end) {
+			/* Single CPU specified: */
+			bind_cpu_0 = bind_cpu_1 = atol(tok);
+		} else {
+			/* CPU range specified (for example: "5-11"): */
+			bind_cpu_0 = atol(tok);
+			bind_cpu_1 = atol(tok_end + 1);
+		}
+
+		step = 1;
+		tok_step = strstr(tok, "#");
+		if (tok_step) {
+			step = atol(tok_step + 1);
+			BUG_ON(step <= 0 || step >= g->p.nr_cpus);
+		}
+
+		/*
+		 * Mask length.
+		 * Eg: "--cpus 8_4-16#4" means: '--cpus 8_4,12_4,16_4',
+		 * where the _4 means the next 4 CPUs are allowed.
+		 */
+		bind_len = 1;
+		tok_len = strstr(tok, "_");
+		if (tok_len) {
+			bind_len = atol(tok_len + 1);
+			BUG_ON(bind_len <= 0 || bind_len > g->p.nr_cpus);
+		}
+
+		/* Multiplicator shortcut, "0x8" is a shortcut for: "0,0,0,0,0,0,0,0" */
+		mul = 1;
+		tok_mul = strstr(tok, "x");
+		if (tok_mul) {
+			mul = atol(tok_mul + 1);
+			BUG_ON(mul <= 0);
+		}
+
+		dprintf("CPUs: %d_%d-%d#%dx%d\n", bind_cpu_0, bind_len, bind_cpu_1, step, mul);
+
+		if (bind_cpu_0 >= g->p.nr_cpus || bind_cpu_1 >= g->p.nr_cpus) {
+			printf("\nTest not applicable, system has only %d CPUs.\n", g->p.nr_cpus);
+			return -1;
+		}
+
+		BUG_ON(bind_cpu_0 < 0 || bind_cpu_1 < 0);
+		BUG_ON(bind_cpu_0 > bind_cpu_1);
+
+		for (bind_cpu = bind_cpu_0; bind_cpu <= bind_cpu_1; bind_cpu += step) {
+			int i;
+
+			for (i = 0; i < mul; i++) {
+				int cpu;
+
+				if (t >= g->p.nr_tasks) {
+					printf("\n# NOTE: ignoring bind CPUs starting at CPU#%d\n #", bind_cpu);
+					goto out;
+				}
+				td = g->threads + t;
+
+				if (t)
+					tprintf(",");
+				if (bind_len > 1) {
+					tprintf("%2d/%d", bind_cpu, bind_len);
+				} else {
+					tprintf("%2d", bind_cpu);
+				}
+
+				CPU_ZERO(&td->bind_cpumask);
+				for (cpu = bind_cpu; cpu < bind_cpu+bind_len; cpu++) {
+					BUG_ON(cpu < 0 || cpu >= g->p.nr_cpus);
+					CPU_SET(cpu, &td->bind_cpumask);
+				}
+				t++;
+			}
+		}
+	}
+out:
+
+	tprintf("\n");
+
+	if (t < g->p.nr_tasks)
+		printf("# NOTE: %d tasks bound, %d tasks unbound\n", t, g->p.nr_tasks - t);
+
+	free(str0);
+	return 0;
+}
+
+static int parse_cpus_opt(const struct option *opt __maybe_unused,
+			  const char *arg, int unset __maybe_unused)
+{
+	if (!arg)
+		return -1;
+
+	return parse_cpu_list(arg);
+}
+
+static int parse_node_list(const char *arg)
+{
+	p0.node_list_str = strdup(arg);
+
+	dprintf("got NODE list: {%s}\n", p0.node_list_str);
+
+	return 0;
+}
+
+static int parse_setup_node_list(void)
+{
+	struct thread_data *td;
+	char *str0, *str;
+	int t;
+
+	if (!g->p.node_list_str)
+		return 0;
+
+	dprintf("g->p.nr_tasks: %d\n", g->p.nr_tasks);
+
+	str0 = str = strdup(g->p.node_list_str);
+	t = 0;
+
+	BUG_ON(!str);
+
+	tprintf("# binding tasks to NODEs:\n");
+	tprintf("# ");
+
+	while (true) {
+		int bind_node, bind_node_0, bind_node_1;
+		char *tok, *tok_end, *tok_step, *tok_mul;
+		int step;
+		int mul;
+
+		tok = strsep(&str, ",");
+		if (!tok)
+			break;
+
+		tok_end = strstr(tok, "-");
+
+		dprintf("\ntoken: {%s}, end: {%s}\n", tok, tok_end);
+		if (!tok_end) {
+			/* Single NODE specified: */
+			bind_node_0 = bind_node_1 = atol(tok);
+		} else {
+			/* NODE range specified (for example: "5-11"): */
+			bind_node_0 = atol(tok);
+			bind_node_1 = atol(tok_end + 1);
+		}
+
+		step = 1;
+		tok_step = strstr(tok, "#");
+		if (tok_step) {
+			step = atol(tok_step + 1);
+			BUG_ON(step <= 0 || step >= g->p.nr_nodes);
+		}
+
+		/* Multiplicator shortcut, "0x8" is a shortcut for: "0,0,0,0,0,0,0,0" */
+		mul = 1;
+		tok_mul = strstr(tok, "x");
+		if (tok_mul) {
+			mul = atol(tok_mul + 1);
+			BUG_ON(mul <= 0);
+		}
+
+		dprintf("NODEs: %d-%d #%d\n", bind_node_0, bind_node_1, step);
+
+		if (bind_node_0 >= g->p.nr_nodes || bind_node_1 >= g->p.nr_nodes) {
+			printf("\nTest not applicable, system has only %d nodes.\n", g->p.nr_nodes);
+			return -1;
+		}
+
+		BUG_ON(bind_node_0 < 0 || bind_node_1 < 0);
+		BUG_ON(bind_node_0 > bind_node_1);
+
+		for (bind_node = bind_node_0; bind_node <= bind_node_1; bind_node += step) {
+			int i;
+
+			for (i = 0; i < mul; i++) {
+				if (t >= g->p.nr_tasks) {
+					printf("\n# NOTE: ignoring bind NODEs starting at NODE#%d\n", bind_node);
+					goto out;
+				}
+				td = g->threads + t;
+
+				if (!t)
+					tprintf(" %2d", bind_node);
+				else
+					tprintf(",%2d", bind_node);
+
+				td->bind_node = bind_node;
+				t++;
+			}
+		}
+	}
+out:
+
+	tprintf("\n");
+
+	if (t < g->p.nr_tasks)
+		printf("# NOTE: %d tasks mem-bound, %d tasks unbound\n", t, g->p.nr_tasks - t);
+
+	free(str0);
+	return 0;
+}
+
+static int parse_nodes_opt(const struct option *opt __maybe_unused,
+			  const char *arg, int unset __maybe_unused)
+{
+	if (!arg)
+		return -1;
+
+	return parse_node_list(arg);
+}
+
+/*
+ * The worker thread. For system calls this is trivial, for the time being:
+ */
+static u64 do_work(u8 *__data __maybe_unused, long bytes __maybe_unused, int nr __maybe_unused, int nr_max __maybe_unused, int loop __maybe_unused, u64 val __maybe_unused)
+{
+	getppid();
+
+	return 0;
+}
+
+static void update_curr_cpu(int task_nr, unsigned long bytes_worked)
+{
+	unsigned int cpu;
+
+	cpu = sched_getcpu();
+
+	g->threads[task_nr].curr_cpu = cpu;
+	prctl(0, bytes_worked);
+}
+
+#define MAX_NR_NODES	64
+
+/*
+ * Count the number of nodes a process's threads
+ * are spread out on.
+ *
+ * A count of 1 means that the process is compressed
+ * to a single node. A count of g->p.nr_nodes means it's
+ * spread out on the whole system.
+ */
+static int count_process_nodes(int process_nr)
+{
+	char node_present[MAX_NR_NODES] = { 0, };
+	int nodes;
+	int n, t;
+
+	for (t = 0; t < g->p.nr_threads; t++) {
+		struct thread_data *td;
+		int task_nr;
+		int node;
+
+		task_nr = process_nr*g->p.nr_threads + t;
+		td = g->threads + task_nr;
+
+		node = numa_node_of_cpu(td->curr_cpu);
+		if (node < 0) /* curr_cpu was likely still -1 */
+			return 0;
+
+		node_present[node] = 1;
+	}
+
+	nodes = 0;
+
+	for (n = 0; n < MAX_NR_NODES; n++)
+		nodes += node_present[n];
+
+	return nodes;
+}
+
+/*
+ * Count the number of distinct process-threads a node contains.
+ *
+ * A count of 1 means that the node contains only a single
+ * process. If all nodes on the system contain at most one
+ * process then we are well-converged.
+ */
+static int count_node_processes(int node)
+{
+	int processes = 0;
+	int t, p;
+
+	for (p = 0; p < g->p.nr_proc; p++) {
+		for (t = 0; t < g->p.nr_threads; t++) {
+			struct thread_data *td;
+			int task_nr;
+			int n;
+
+			task_nr = p*g->p.nr_threads + t;
+			td = g->threads + task_nr;
+
+			n = numa_node_of_cpu(td->curr_cpu);
+			if (n == node) {
+				processes++;
+				break;
+			}
+		}
+	}
+
+	return processes;
+}
+
+static void calc_convergence_compression(int *strong)
+{
+	unsigned int nodes_min, nodes_max;
+	int p;
+
+	nodes_min = -1;
+	nodes_max =  0;
+
+	for (p = 0; p < g->p.nr_proc; p++) {
+		unsigned int nodes = count_process_nodes(p);
+
+		if (!nodes) {
+			*strong = 0;
+			return;
+		}
+
+		nodes_min = min(nodes, nodes_min);
+		nodes_max = max(nodes, nodes_max);
+	}
+
+	/* Strong convergence: all threads compress on a single node: */
+	if (nodes_min == 1 && nodes_max == 1) {
+		*strong = 1;
+	} else {
+		*strong = 0;
+		tprintf(" {%d-%d}", nodes_min, nodes_max);
+	}
+}
+
+static void calc_convergence(double runtime_ns_max, double *convergence)
+{
+	unsigned int loops_done_min, loops_done_max;
+	int process_groups;
+	int nodes[MAX_NR_NODES];
+	int distance;
+	int nr_min;
+	int nr_max;
+	int strong;
+	int sum;
+	int nr;
+	int node;
+	int cpu;
+	int t;
+
+	if (!g->p.show_convergence && !g->p.measure_convergence)
+		return;
+
+	for (node = 0; node < g->p.nr_nodes; node++)
+		nodes[node] = 0;
+
+	loops_done_min = -1;
+	loops_done_max = 0;
+
+	for (t = 0; t < g->p.nr_tasks; t++) {
+		struct thread_data *td = g->threads + t;
+		unsigned int loops_done;
+
+		cpu = td->curr_cpu;
+
+		/* Not all threads have written it yet: */
+		if (cpu < 0)
+			continue;
+
+		node = numa_node_of_cpu(cpu);
+
+		nodes[node]++;
+
+		loops_done = td->loops_done;
+		loops_done_min = min(loops_done, loops_done_min);
+		loops_done_max = max(loops_done, loops_done_max);
+	}
+
+	nr_max = 0;
+	nr_min = g->p.nr_tasks;
+	sum = 0;
+
+	for (node = 0; node < g->p.nr_nodes; node++) {
+		nr = nodes[node];
+		nr_min = min(nr, nr_min);
+		nr_max = max(nr, nr_max);
+		sum += nr;
+	}
+	BUG_ON(nr_min > nr_max);
+
+	BUG_ON(sum > g->p.nr_tasks);
+
+	if (0 && (sum < g->p.nr_tasks))
+		return;
+
+	/*
+	 * Count the number of distinct process groups present
+	 * on nodes - when we are converged this will decrease
+	 * to g->p.nr_proc:
+	 */
+	process_groups = 0;
+
+	for (node = 0; node < g->p.nr_nodes; node++) {
+		int processes = count_node_processes(node);
+
+		nr = nodes[node];
+		tprintf(" %2d/%-2d", nr, processes);
+
+		process_groups += processes;
+	}
+
+	distance = nr_max - nr_min;
+
+	tprintf(" [%2d/%-2d]", distance, process_groups);
+
+	tprintf(" l:%3d-%-3d (%3d)",
+		loops_done_min, loops_done_max, loops_done_max-loops_done_min);
+
+	if (loops_done_min && loops_done_max) {
+		double skew = 1.0 - (double)loops_done_min/loops_done_max;
+
+		tprintf(" [%4.1f%%]", skew * 100.0);
+	}
+
+	calc_convergence_compression(&strong);
+
+	if (strong && process_groups == g->p.nr_proc) {
+		if (!*convergence) {
+			*convergence = runtime_ns_max;
+			tprintf(" (%6.1fs converged)\n", *convergence/1e9);
+			if (g->p.measure_convergence) {
+				g->all_converged = true;
+				g->stop_work = true;
+			}
+		}
+	} else {
+		if (*convergence) {
+			tprintf(" (%6.1fs de-converged)", runtime_ns_max/1e9);
+			*convergence = 0;
+		}
+		tprintf("\n");
+	}
+}
+
+static void show_summary(double runtime_ns_max, int l, double *convergence)
+{
+	tprintf("\r #  %5.1f%%  [%.1f mins]",
+		(double)(l+1)/g->p.nr_loops*100.0, runtime_ns_max/1e9 / 60.0);
+
+	calc_convergence(runtime_ns_max, convergence);
+
+	if (g->p.show_details >= 0)
+		fflush(stdout);
+}
+
+static void *worker_thread(void *__tdata)
+{
+	struct thread_data *td = __tdata;
+	struct timeval start0, start, stop, diff;
+	int process_nr = td->process_nr;
+	int thread_nr = td->thread_nr;
+	unsigned long last_perturbance;
+	int task_nr = td->task_nr;
+	int details = g->p.show_details;
+	int first_task, last_task;
+	double convergence = 0;
+	u64 val = td->val;
+	double runtime_ns_max;
+	u8 *global_data;
+	u8 *process_data;
+	u8 *thread_data;
+	u64 bytes_done;
+	long work_done;
+	u32 l;
+	struct rusage rusage;
+
+	bind_to_cpumask(td->bind_cpumask);
+	bind_to_memnode(td->bind_node);
+
+	set_taskname("thread %d/%d", process_nr, thread_nr);
+
+	global_data = g->data;
+	process_data = td->process_data;
+	thread_data = setup_private_data(g->p.bytes_thread);
+
+	bytes_done = 0;
+
+	last_task = 0;
+	if (process_nr == g->p.nr_proc-1 && thread_nr == g->p.nr_threads-1)
+		last_task = 1;
+
+	first_task = 0;
+	if (process_nr == 0 && thread_nr == 0)
+		first_task = 1;
+
+	if (details >= 2) {
+		printf("#  thread %2d / %2d global mem: %p, process mem: %p, thread mem: %p\n",
+			process_nr, thread_nr, global_data, process_data, thread_data);
+	}
+
+	if (g->p.serialize_startup) {
+		pthread_mutex_lock(&g->startup_mutex);
+		g->nr_tasks_started++;
+		pthread_mutex_unlock(&g->startup_mutex);
+
+		/* Here we will wait for the main process to start us all at once: */
+		pthread_mutex_lock(&g->start_work_mutex);
+		g->nr_tasks_working++;
+
+		/* The last one wakes the main process: */
+		if (g->nr_tasks_working == g->p.nr_tasks)
+			pthread_mutex_unlock(&g->startup_done_mutex);
+
+		pthread_mutex_unlock(&g->start_work_mutex);
+	}
+
+	gettimeofday(&start0, NULL);
+
+	start = stop = start0;
+	last_perturbance = start.tv_sec;
+
+	for (l = 0; l < g->p.nr_loops; l++) {
+		start = stop;
+
+		if (g->stop_work)
+			break;
+
+		val += do_work(global_data,  g->p.bytes_global,  process_nr, g->p.nr_proc,	l, val);
+		val += do_work(process_data, g->p.bytes_process, thread_nr,  g->p.nr_threads,	l, val);
+		val += do_work(thread_data,  g->p.bytes_thread,  0,          1,		l, val);
+
+		if (g->p.sleep_usecs) {
+			pthread_mutex_lock(td->process_lock);
+			usleep(g->p.sleep_usecs);
+			pthread_mutex_unlock(td->process_lock);
+		}
+		/*
+		 * Amount of work to be done under a process-global lock:
+		 */
+		if (g->p.bytes_process_locked) {
+			pthread_mutex_lock(td->process_lock);
+			val += do_work(process_data, g->p.bytes_process_locked, thread_nr,  g->p.nr_threads,	l, val);
+			pthread_mutex_unlock(td->process_lock);
+		}
+
+		work_done = g->p.bytes_global + g->p.bytes_process +
+			    g->p.bytes_process_locked + g->p.bytes_thread;
+
+		update_curr_cpu(task_nr, work_done);
+		bytes_done += work_done;
+
+		if (details < 0 && !g->p.perturb_secs && !g->p.measure_convergence && !g->p.nr_secs)
+			continue;
+
+		td->loops_done = l;
+
+		gettimeofday(&stop, NULL);
+
+		/* Check whether our max runtime timed out: */
+		if (g->p.nr_secs) {
+			timersub(&stop, &start0, &diff);
+			if ((u32)diff.tv_sec >= g->p.nr_secs) {
+				g->stop_work = true;
+				break;
+			}
+		}
+
+		/* Update the summary at most once per second: */
+		if (start.tv_sec == stop.tv_sec)
+			continue;
+
+		/*
+		 * Perturb the first task's equilibrium every g->p.perturb_secs seconds,
+		 * by migrating to CPU#0:
+		 */
+		if (first_task && g->p.perturb_secs && (int)(stop.tv_sec - last_perturbance) >= g->p.perturb_secs) {
+			cpu_set_t orig_mask;
+			int target_cpu;
+			int this_cpu;
+
+			last_perturbance = stop.tv_sec;
+
+			/*
+			 * Depending on where we are running, move into
+			 * the other half of the system, to create some
+			 * real disturbance:
+			 */
+			this_cpu = g->threads[task_nr].curr_cpu;
+			if (this_cpu < g->p.nr_cpus/2)
+				target_cpu = g->p.nr_cpus-1;
+			else
+				target_cpu = 0;
+
+			orig_mask = bind_to_cpu(target_cpu);
+
+			/* Here we are running on the target CPU already */
+			if (details >= 1)
+				printf(" (injecting perturbalance, moved to CPU#%d)\n", target_cpu);
+
+			bind_to_cpumask(orig_mask);
+		}
+
+		if (details >= 3) {
+			timersub(&stop, &start, &diff);
+			runtime_ns_max = diff.tv_sec * 1000000000;
+			runtime_ns_max += diff.tv_usec * 1000;
+
+			if (details >= 0) {
+				printf(" #%2d / %2d: %14.2lf nsecs/op [val: %016"PRIx64"]\n",
+					process_nr, thread_nr, runtime_ns_max / bytes_done, val);
+			}
+			fflush(stdout);
+		}
+		if (!last_task)
+			continue;
+
+		timersub(&stop, &start0, &diff);
+		runtime_ns_max = diff.tv_sec * 1000000000ULL;
+		runtime_ns_max += diff.tv_usec * 1000ULL;
+
+		show_summary(runtime_ns_max, l, &convergence);
+	}
+
+	gettimeofday(&stop, NULL);
+	timersub(&stop, &start0, &diff);
+	td->runtime_ns = diff.tv_sec * 1000000000ULL;
+	td->runtime_ns += diff.tv_usec * 1000ULL;
+	td->speed_gbs = bytes_done / (td->runtime_ns / 1e9) / 1e9;
+
+	getrusage(RUSAGE_THREAD, &rusage);
+	td->system_time_ns = rusage.ru_stime.tv_sec * 1000000000ULL;
+	td->system_time_ns += rusage.ru_stime.tv_usec * 1000ULL;
+	td->user_time_ns = rusage.ru_utime.tv_sec * 1000000000ULL;
+	td->user_time_ns += rusage.ru_utime.tv_usec * 1000ULL;
+
+	free_data(thread_data, g->p.bytes_thread);
+
+	pthread_mutex_lock(&g->stop_work_mutex);
+	g->bytes_done += bytes_done;
+	pthread_mutex_unlock(&g->stop_work_mutex);
+
+	return NULL;
+}
+
+/*
+ * A worker process starts a couple of threads:
+ */
+static void worker_process(int process_nr)
+{
+	pthread_mutex_t process_lock;
+	struct thread_data *td;
+	pthread_t *pthreads;
+	u8 *process_data;
+	int task_nr;
+	int ret;
+	int t;
+
+	pthread_mutex_init(&process_lock, NULL);
+	set_taskname("process %d", process_nr);
+
+	/*
+	 * Pick up the memory policy and the CPU binding of our first thread,
+	 * so that we initialize memory accordingly:
+	 */
+	task_nr = process_nr*g->p.nr_threads;
+	td = g->threads + task_nr;
+
+	bind_to_memnode(td->bind_node);
+	bind_to_cpumask(td->bind_cpumask);
+
+	pthreads = zalloc(g->p.nr_threads * sizeof(pthread_t));
+	process_data = setup_private_data(g->p.bytes_process);
+
+	if (g->p.show_details >= 3) {
+		printf(" # process %2d global mem: %p, process mem: %p\n",
+			process_nr, g->data, process_data);
+	}
+
+	for (t = 0; t < g->p.nr_threads; t++) {
+		task_nr = process_nr*g->p.nr_threads + t;
+		td = g->threads + task_nr;
+
+		td->process_data = process_data;
+		td->process_nr   = process_nr;
+		td->thread_nr    = t;
+		td->task_nr	 = task_nr;
+		td->val          = rand();
+		td->curr_cpu	 = -1;
+		td->process_lock = &process_lock;
+
+		ret = pthread_create(pthreads + t, NULL, worker_thread, td);
+		BUG_ON(ret);
+	}
+
+	for (t = 0; t < g->p.nr_threads; t++) {
+                ret = pthread_join(pthreads[t], NULL);
+		BUG_ON(ret);
+	}
+
+	free_data(process_data, g->p.bytes_process);
+	free(pthreads);
+}
+
+static void print_summary(void)
+{
+	if (g->p.show_details < 0)
+		return;
+
+	printf("\n ###\n");
+	printf(" # %d %s will execute (on %d nodes, %d CPUs):\n",
+		g->p.nr_tasks, g->p.nr_tasks == 1 ? "task" : "tasks", g->p.nr_nodes, g->p.nr_cpus);
+	printf(" #      %5dx %5ldMB global  shared mem operations\n",
+			g->p.nr_loops, g->p.bytes_global/1024/1024);
+	printf(" #      %5dx %5ldMB process shared mem operations\n",
+			g->p.nr_loops, g->p.bytes_process/1024/1024);
+	printf(" #      %5dx %5ldMB thread  local  mem operations\n",
+			g->p.nr_loops, g->p.bytes_thread/1024/1024);
+
+	printf(" ###\n");
+
+	printf("\n ###\n"); fflush(stdout);
+}
+
+static void init_thread_data(void)
+{
+	ssize_t size = sizeof(*g->threads)*g->p.nr_tasks;
+	int t;
+
+	g->threads = zalloc_shared_data(size);
+
+	for (t = 0; t < g->p.nr_tasks; t++) {
+		struct thread_data *td = g->threads + t;
+		int cpu;
+
+		/* Allow all nodes by default: */
+		td->bind_node = -1;
+
+		/* Allow all CPUs by default: */
+		CPU_ZERO(&td->bind_cpumask);
+		for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
+			CPU_SET(cpu, &td->bind_cpumask);
+	}
+}
+
+static void deinit_thread_data(void)
+{
+	ssize_t size = sizeof(*g->threads)*g->p.nr_tasks;
+
+	free_data(g->threads, size);
+}
+
+static int init(void)
+{
+	g = (void *)alloc_data(sizeof(*g), MAP_SHARED, 1, 0, 0 /* THP */, 0);
+
+	/* Copy over options: */
+	g->p = p0;
+
+	g->p.nr_cpus = numa_num_configured_cpus();
+
+	g->p.nr_nodes = numa_max_node() + 1;
+
+	/* char array in count_process_nodes(): */
+	BUG_ON(g->p.nr_nodes > MAX_NR_NODES || g->p.nr_nodes < 0);
+
+	if (g->p.show_quiet && !g->p.show_details)
+		g->p.show_details = -1;
+
+	/* Some memory should be specified: */
+	if (!g->p.mb_global_str && !g->p.mb_proc_str && !g->p.mb_thread_str)
+		return -1;
+
+	if (g->p.mb_global_str) {
+		g->p.mb_global = atof(g->p.mb_global_str);
+		BUG_ON(g->p.mb_global < 0);
+	}
+
+	if (g->p.mb_proc_str) {
+		g->p.mb_proc = atof(g->p.mb_proc_str);
+		BUG_ON(g->p.mb_proc < 0);
+	}
+
+	if (g->p.mb_proc_locked_str) {
+		g->p.mb_proc_locked = atof(g->p.mb_proc_locked_str);
+		BUG_ON(g->p.mb_proc_locked < 0);
+		BUG_ON(g->p.mb_proc_locked > g->p.mb_proc);
+	}
+
+	if (g->p.mb_thread_str) {
+		g->p.mb_thread = atof(g->p.mb_thread_str);
+		BUG_ON(g->p.mb_thread < 0);
+	}
+
+	BUG_ON(g->p.nr_threads <= 0);
+	BUG_ON(g->p.nr_proc <= 0);
+
+	g->p.nr_tasks = g->p.nr_proc*g->p.nr_threads;
+
+	g->p.bytes_global		= g->p.mb_global	*1024L*1024L;
+	g->p.bytes_process		= g->p.mb_proc		*1024L*1024L;
+	g->p.bytes_process_locked	= g->p.mb_proc_locked	*1024L*1024L;
+	g->p.bytes_thread		= g->p.mb_thread	*1024L*1024L;
+
+	g->data = setup_shared_data(g->p.bytes_global);
+
+	/* Startup serialization: */
+	init_global_mutex(&g->start_work_mutex);
+	init_global_mutex(&g->startup_mutex);
+	init_global_mutex(&g->startup_done_mutex);
+	init_global_mutex(&g->stop_work_mutex);
+
+	init_thread_data();
+
+	tprintf("#\n");
+	if (parse_setup_cpu_list() || parse_setup_node_list())
+		return -1;
+	tprintf("#\n");
+
+	print_summary();
+
+	return 0;
+}
+
+static void deinit(void)
+{
+	free_data(g->data, g->p.bytes_global);
+	g->data = NULL;
+
+	deinit_thread_data();
+
+	free_data(g, sizeof(*g));
+	g = NULL;
+}
+
+/*
+ * Print a short or long result, depending on the verbosity setting:
+ */
+static void print_res(const char *name, double val,
+		      const char *txt_unit, const char *txt_short, const char *txt_long)
+{
+	if (!name)
+		name = "main,";
+
+	if (!g->p.show_quiet)
+		printf(" %-30s %15.3f, %-15s %s\n", name, val, txt_unit, txt_short);
+	else
+		printf(" %14.3f %s\n", val, txt_long);
+}
+
+static int __bench_syscall(const char *name)
+{
+	struct timeval start, stop, diff;
+	u64 runtime_ns_min, runtime_ns_sum;
+	pid_t *pids, pid, wpid;
+	double delta_runtime;
+	double runtime_avg;
+	double runtime_sec_max;
+	double runtime_sec_min;
+	int wait_stat;
+	double bytes;
+	int i, t, p;
+
+	if (init())
+		return -1;
+
+	pids = zalloc(g->p.nr_proc * sizeof(*pids));
+	pid = -1;
+
+	/* All threads try to acquire it, this way we can wait for them to start up: */
+	pthread_mutex_lock(&g->start_work_mutex);
+
+	if (g->p.serialize_startup) {
+		tprintf(" #\n");
+		tprintf(" # Startup synchronization: ..."); fflush(stdout);
+	}
+
+	gettimeofday(&start, NULL);
+
+	for (i = 0; i < g->p.nr_proc; i++) {
+		pid = fork();
+		dprintf(" # process %2d: PID %d\n", i, pid);
+
+		BUG_ON(pid < 0);
+		if (!pid) {
+			/* Child process: */
+			worker_process(i);
+
+			exit(0);
+		}
+		pids[i] = pid;
+
+	}
+	/* Wait for all the threads to start up: */
+	while (g->nr_tasks_started != g->p.nr_tasks)
+		usleep(1000);
+
+	BUG_ON(g->nr_tasks_started != g->p.nr_tasks);
+
+	if (g->p.serialize_startup) {
+		double startup_sec;
+
+		pthread_mutex_lock(&g->startup_done_mutex);
+
+		/* This will start all threads: */
+		pthread_mutex_unlock(&g->start_work_mutex);
+
+		/* This mutex is locked - the last started thread will wake us: */
+		pthread_mutex_lock(&g->startup_done_mutex);
+
+		gettimeofday(&stop, NULL);
+
+		timersub(&stop, &start, &diff);
+
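+		/* timeval delta -> nanoseconds, then down to seconds: */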
+		startup_sec = diff.tv_sec * 1000000000.0;
+		startup_sec += diff.tv_usec * 1000.0;
+		startup_sec /= 1e9;
+
+		tprintf(" threads initialized in %.6f seconds.\n", startup_sec);
+		tprintf(" #\n");
+
+		start = stop;
+		pthread_mutex_unlock(&g->startup_done_mutex);
+	} else {
+		gettimeofday(&start, NULL);
+	}
+
+	/* Parent process: wait for all child processes to exit: */
+
+	for (i = 0; i < g->p.nr_proc; i++) {
+		wpid = waitpid(pids[i], &wait_stat, 0);
+		BUG_ON(wpid < 0);
+		BUG_ON(!WIFEXITED(wait_stat));
+	}
+
+	runtime_ns_sum = 0;
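+	/* -1 wraps to U64_MAX, so the first thread runtime always wins min(): */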
+	runtime_ns_min = -1LL;
+
+	for (t = 0; t < g->p.nr_tasks; t++) {
+		u64 thread_runtime_ns = g->threads[t].runtime_ns;
+
+		runtime_ns_sum += thread_runtime_ns;
+		runtime_ns_min = min(thread_runtime_ns, runtime_ns_min);
+	}
+
+	gettimeofday(&stop, NULL);
+	timersub(&stop, &start, &diff);
+
+	BUG_ON(bench_format != BENCH_FORMAT_DEFAULT);
+
+	tprintf("\n ###\n");
+	tprintf("\n");
+
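+	/* Same timeval -> nanoseconds -> seconds conversion for the total runtime: */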
+	runtime_sec_max = diff.tv_sec * 1000000000.0;
+	runtime_sec_max += diff.tv_usec * 1000.0;
+	runtime_sec_max /= 1e9;
+
+	runtime_sec_min = runtime_ns_min/1e9;
+
+	bytes = g->bytes_done;
+	runtime_avg = (double)runtime_ns_sum / g->p.nr_tasks / 1e9;
+
+	if (g->p.measure_convergence) {
+		print_res(name, runtime_sec_max,
+			"secs,", "NUMA-convergence-latency", "secs latency to NUMA-converge");
+	}
+
+	print_res(name, runtime_sec_max,
+		"secs,", "runtime-max/thread",	"secs slowest (max) thread-runtime");
+
+	print_res(name, runtime_sec_min,
+		"secs,", "runtime-min/thread",	"secs fastest (min) thread-runtime");
+
+	print_res(name, runtime_avg,
+		"secs,", "runtime-avg/thread",	"secs average thread-runtime");
+
+	delta_runtime = (runtime_sec_max - runtime_sec_min)/2.0;
+	print_res(name, delta_runtime / runtime_sec_max * 100.0,
+		"%,", "spread-runtime/thread",	"% difference between max/avg runtime");
+
+	print_res(name, bytes / g->p.nr_tasks / 1e9,
+		"GB,", "data/thread",		"GB data processed, per thread");
+
+	print_res(name, bytes / 1e9,
+		"GB,", "data-total",		"GB data processed, total");
+
+	print_res(name, runtime_sec_max * 1e9 / (bytes / g->p.nr_tasks),
+		"nsecs,", "runtime/byte/thread","nsecs/byte/thread runtime");
+
+	print_res(name, bytes / g->p.nr_tasks / 1e9 / runtime_sec_max,
+		"GB/sec,", "thread-speed",	"GB/sec/thread speed");
+
+	print_res(name, bytes / runtime_sec_max / 1e9,
+		"GB/sec,", "total-speed",	"GB/sec total speed");
+
+	if (g->p.show_details >= 2) {
+		char tname[32];
+		struct thread_data *td;
+		for (p = 0; p < g->p.nr_proc; p++) {
+			for (t = 0; t < g->p.nr_threads; t++) {
+				memset(tname, 0, 32);
+				td = g->threads + p*g->p.nr_threads + t;
+				snprintf(tname, 32, "process%d:thread%d", p, t);
+				print_res(tname, td->speed_gbs,
+					"GB/sec",	"thread-speed", "GB/sec/thread speed");
+				print_res(tname, td->system_time_ns / 1e9,
+					"secs",	"thread-system-time", "system CPU time/thread");
+				print_res(tname, td->user_time_ns / 1e9,
+					"secs",	"thread-user-time", "user CPU time/thread");
+			}
+		}
+	}
+
+	free(pids);
+
+	deinit();
+
+	return 0;
+}
+
+#define MAX_ARGS 50
+
+static int command_size(const char **argv)
+{
+	int size = 0;
+
+	while (*argv) {
+		size++;
+		argv++;
+	}
+
+	BUG_ON(size >= MAX_ARGS);
+
+	return size;
+}
+
+static void init_params(struct params *p, const char *name, int argc, const char **argv)
+{
+	int i;
+
+	printf("\n # Running %s \"perf bench syscall", name);
+
+	for (i = 0; i < argc; i++)
+		printf(" %s", argv[i]);
+
+	printf("\"\n");
+
+	memset(p, 0, sizeof(*p));
+
+	/* Initialize nonzero defaults: */
+
+	p->serialize_startup		= 1;
+	p->data_reads			= true;
+	p->data_writes			= true;
+	p->data_backwards		= true;
+	p->data_rand_walk		= true;
+	p->nr_loops			= 10000000;
+	p->init_random			= true;
+	p->mb_global_str		= "1";
+	p->nr_proc			= 1;
+	p->nr_threads			= 1;
+	p->nr_secs			= 5;
+	p->run_all			= argc == 1;
+}
+
+static int run_bench_syscall(const char *name, const char **argv)
+{
+	int argc = command_size(argv);
+
+	init_params(&p0, name, argc, argv);
+	argc = parse_options(argc, argv, options, bench_syscall_usage, 0);
+	if (argc)
+		goto err;
+
+	if (__bench_syscall(name))
+		goto err;
+
+	return 0;
+
+err:
+	return -1;
+}
+
+#define OPT_BW_NOTHP		OPT_BW,			"--thp", "-1"
+
+/*
+ * The built-in test-suite executed by "perf bench syscall -a".
+ */
+static const char *tests[][MAX_ARGS] = {
+   /* Basic null-syscall latency measurements: */
+   { "NULL-fail,",	  "null" },
+   { "NULL-fail,",	  "null", "-t", "2" }
+};
+
+static int bench_all(void)
+{
+	int nr = ARRAY_SIZE(tests);
+	int ret;
+	int i;
+
+	ret = system("echo ' #'; echo ' # Running test on: '$(uname -a); echo ' #'");
+	BUG_ON(ret < 0);
+
+	for (i = 0; i < nr; i++) {
+		run_bench_syscall(tests[i][0], tests[i] + 1);
+	}
+
+	printf("\n");
+
+	return 0;
+}
+
+int bench_syscall(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+	init_params(&p0, "main,", argc, argv);
+	argc = parse_options(argc, argv, options, bench_syscall_usage, 0);
+	if (argc)
+		goto err;
+
+	if (p0.run_all)
+		return bench_all();
+
+	if (__bench_syscall(NULL))
+		goto err;
+
+	return 0;
+
+err:
+	usage_with_options(syscall_usage, options);
+	return -1;
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index a1cddc6bbf0f..2600c031435c 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -9,10 +9,11 @@
 /*
  * Available benchmark collection list:
  *
- *  sched ... scheduler and IPC performance
- *  mem   ... memory access performance
- *  numa  ... NUMA scheduling and MM performance
- *  futex ... Futex performance
+ *  syscall ... syscall performance
+ *  sched   ... scheduler and IPC performance
+ *  mem     ... memory access performance
+ *  numa    ... NUMA scheduling and MM performance
+ *  futex   ... Futex performance
  */
 #include "perf.h"
 #include "util/util.h"
@@ -33,6 +34,12 @@ struct bench {
 	bench_fn_t	fn;
 };
 
+static struct bench syscall_benchmarks[] = {
+	{ "null",	"Benchmark for NULL syscall workloads",		bench_syscall		},
+	{ "all",	"Run all syscall benchmarks",			NULL			},
+	{ NULL,		NULL,						NULL			}
+};
+
 #ifdef HAVE_LIBNUMA_SUPPORT
 static struct bench numa_benchmarks[] = {
 	{ "mem",	"Benchmark for NUMA workloads",			bench_numa		},
@@ -73,6 +80,7 @@ struct collection {
 };
 
 static struct collection collections[] = {
+	{ "syscall",	"syscall benchmarks",				syscall_benchmarks	},
 	{ "sched",	"Scheduler and IPC benchmarks",			sched_benchmarks	},
 	{ "mem",	"Memory access benchmarks",			mem_benchmarks		},
 #ifdef HAVE_LIBNUMA_SUPPORT

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH] perf tooling: Simplify 'perf bench syscall'
  2016-02-01  7:41 ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Ingo Molnar
@ 2016-02-01  7:48   ` Ingo Molnar
  2016-02-01 15:41   ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Andy Lutomirski
  1 sibling, 0 replies; 23+ messages in thread
From: Ingo Molnar @ 2016-02-01  7:48 UTC (permalink / raw)
  To: riel
  Cc: linux-kernel, fweisbec, tglx, luto, peterz, clark,
	Arnaldo Carvalho de Melo, Peter Zijlstra


* Ingo Molnar <mingo@kernel.org> wrote:

> [...]
> 
> I kept the process, threading and memory allocation bits of numa.c, just in case 
> we need them to measure more complex syscalls. Maybe we could keep the threading 
> bits and remove the memory allocation parameters, to simplify the benchmark?

So the patch below removes the NUMA details: convergence measurement and memory
access pattern handling. This reduces the line count by about 30%. It should
probably be combined with the previous patch.

Thanks,

	Ingo

==================>
>From a992aecebe12a195ffa74e09fcbe6b48db4430e3 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Mon, 1 Feb 2016 08:46:39 +0100
Subject: [PATCH] perf tooling: Simplify 'perf bench syscall'

Remove NUMA legacies.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/perf/bench/syscall.c | 316 +--------------------------------------------
 1 file changed, 5 insertions(+), 311 deletions(-)

diff --git a/tools/perf/bench/syscall.c b/tools/perf/bench/syscall.c
index 5a4ef02176d1..fabac462bde5 100644
--- a/tools/perf/bench/syscall.c
+++ b/tools/perf/bench/syscall.c
@@ -81,11 +81,6 @@ struct params {
 	double			mb_thread;
 
 	/* Access patterns to the working set: */
-	bool			data_reads;
-	bool			data_writes;
-	bool			data_backwards;
-	bool			data_zero_memset;
-	bool			data_rand_walk;
 	u32			nr_loops;
 	u32			nr_secs;
 	u32			sleep_usecs;
@@ -108,10 +103,6 @@ struct params {
 	int			nr_tasks;
 	bool			show_quiet;
 
-	bool			show_convergence;
-	bool			measure_convergence;
-
-	int			perturb_secs;
 	int			nr_cpus;
 	int			nr_nodes;
 
@@ -139,8 +130,6 @@ struct global_info {
 
 	struct thread_data	*threads;
 
-	/* Convergence latency measurement: */
-	bool			all_converged;
 	bool			stop_work;
 
 	int			print_once;
@@ -168,23 +157,13 @@ static const struct option options[] = {
 	OPT_UINTEGER('s', "nr_secs"	, &p0.nr_secs,		"max number of seconds to run (default: 5 secs)"),
 	OPT_UINTEGER('u', "usleep"	, &p0.sleep_usecs,	"usecs to sleep per loop iteration"),
 
-	OPT_BOOLEAN('R', "data_reads"	, &p0.data_reads,	"access the data via writes (can be mixed with -W)"),
-	OPT_BOOLEAN('W', "data_writes"	, &p0.data_writes,	"access the data via writes (can be mixed with -R)"),
-	OPT_BOOLEAN('B', "data_backwards", &p0.data_backwards,	"access the data backwards as well"),
-	OPT_BOOLEAN('Z', "data_zero_memset", &p0.data_zero_memset,"access the data via glibc bzero only"),
-	OPT_BOOLEAN('r', "data_rand_walk", &p0.data_rand_walk,	"access the data with random (32bit LFSR) walk"),
-
-
 	OPT_BOOLEAN('z', "init_zero"	, &p0.init_zero,	"bzero the initial allocations"),
 	OPT_BOOLEAN('I', "init_random"	, &p0.init_random,	"randomize the contents of the initial allocations"),
 	OPT_BOOLEAN('0', "init_cpu0"	, &p0.init_cpu0,	"do the initial allocations on CPU#0"),
-	OPT_INTEGER('x', "perturb_secs", &p0.perturb_secs,	"perturb thread 0/0 every X secs, to test convergence stability"),
 
 	OPT_INCR   ('d', "show_details"	, &p0.show_details,	"Show details"),
 	OPT_INCR   ('a', "all"		, &p0.run_all,		"Run all tests in the suite"),
 	OPT_INTEGER('H', "thp"		, &p0.thp,		"MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE"),
-	OPT_BOOLEAN('c', "show_convergence", &p0.show_convergence, "show convergence details"),
-	OPT_BOOLEAN('m', "measure_convergence",	&p0.measure_convergence, "measure convergence latency"),
 	OPT_BOOLEAN('q', "quiet"	, &p0.show_quiet,	"quiet mode"),
 	OPT_BOOLEAN('S', "serialize-startup", &p0.serialize_startup,"serialize thread startup"),
 
@@ -208,32 +187,6 @@ static const char * const syscall_usage[] = {
 	NULL
 };
 
-static cpu_set_t bind_to_cpu(int target_cpu)
-{
-	cpu_set_t orig_mask, mask;
-	int ret;
-
-	ret = sched_getaffinity(0, sizeof(orig_mask), &orig_mask);
-	BUG_ON(ret);
-
-	CPU_ZERO(&mask);
-
-	if (target_cpu == -1) {
-		int cpu;
-
-		for (cpu = 0; cpu < g->p.nr_cpus; cpu++)
-			CPU_SET(cpu, &mask);
-	} else {
-		BUG_ON(target_cpu < 0 || target_cpu >= g->p.nr_cpus);
-		CPU_SET(target_cpu, &mask);
-	}
-
-	ret = sched_setaffinity(0, sizeof(mask), &mask);
-	BUG_ON(ret);
-
-	return orig_mask;
-}
-
 static cpu_set_t bind_to_node(int target_node)
 {
 	int cpus_per_node = g->p.nr_cpus/g->p.nr_nodes;
@@ -699,222 +652,11 @@ static void update_curr_cpu(int task_nr, unsigned long bytes_worked)
 	prctl(0, bytes_worked);
 }
 
-#define MAX_NR_NODES	64
-
-/*
- * Count the number of nodes a process's threads
- * are spread out on.
- *
- * A count of 1 means that the process is compressed
- * to a single node. A count of g->p.nr_nodes means it's
- * spread out on the whole system.
- */
-static int count_process_nodes(int process_nr)
-{
-	char node_present[MAX_NR_NODES] = { 0, };
-	int nodes;
-	int n, t;
-
-	for (t = 0; t < g->p.nr_threads; t++) {
-		struct thread_data *td;
-		int task_nr;
-		int node;
-
-		task_nr = process_nr*g->p.nr_threads + t;
-		td = g->threads + task_nr;
-
-		node = numa_node_of_cpu(td->curr_cpu);
-		if (node < 0) /* curr_cpu was likely still -1 */
-			return 0;
-
-		node_present[node] = 1;
-	}
-
-	nodes = 0;
-
-	for (n = 0; n < MAX_NR_NODES; n++)
-		nodes += node_present[n];
-
-	return nodes;
-}
-
-/*
- * Count the number of distinct process-threads a node contains.
- *
- * A count of 1 means that the node contains only a single
- * process. If all nodes on the system contain at most one
- * process then we are well-converged.
- */
-static int count_node_processes(int node)
-{
-	int processes = 0;
-	int t, p;
-
-	for (p = 0; p < g->p.nr_proc; p++) {
-		for (t = 0; t < g->p.nr_threads; t++) {
-			struct thread_data *td;
-			int task_nr;
-			int n;
-
-			task_nr = p*g->p.nr_threads + t;
-			td = g->threads + task_nr;
-
-			n = numa_node_of_cpu(td->curr_cpu);
-			if (n == node) {
-				processes++;
-				break;
-			}
-		}
-	}
-
-	return processes;
-}
-
-static void calc_convergence_compression(int *strong)
-{
-	unsigned int nodes_min, nodes_max;
-	int p;
-
-	nodes_min = -1;
-	nodes_max =  0;
-
-	for (p = 0; p < g->p.nr_proc; p++) {
-		unsigned int nodes = count_process_nodes(p);
-
-		if (!nodes) {
-			*strong = 0;
-			return;
-		}
-
-		nodes_min = min(nodes, nodes_min);
-		nodes_max = max(nodes, nodes_max);
-	}
-
-	/* Strong convergence: all threads compress on a single node: */
-	if (nodes_min == 1 && nodes_max == 1) {
-		*strong = 1;
-	} else {
-		*strong = 0;
-		tprintf(" {%d-%d}", nodes_min, nodes_max);
-	}
-}
-
-static void calc_convergence(double runtime_ns_max, double *convergence)
-{
-	unsigned int loops_done_min, loops_done_max;
-	int process_groups;
-	int nodes[MAX_NR_NODES];
-	int distance;
-	int nr_min;
-	int nr_max;
-	int strong;
-	int sum;
-	int nr;
-	int node;
-	int cpu;
-	int t;
-
-	if (!g->p.show_convergence && !g->p.measure_convergence)
-		return;
-
-	for (node = 0; node < g->p.nr_nodes; node++)
-		nodes[node] = 0;
-
-	loops_done_min = -1;
-	loops_done_max = 0;
-
-	for (t = 0; t < g->p.nr_tasks; t++) {
-		struct thread_data *td = g->threads + t;
-		unsigned int loops_done;
-
-		cpu = td->curr_cpu;
-
-		/* Not all threads have written it yet: */
-		if (cpu < 0)
-			continue;
-
-		node = numa_node_of_cpu(cpu);
-
-		nodes[node]++;
-
-		loops_done = td->loops_done;
-		loops_done_min = min(loops_done, loops_done_min);
-		loops_done_max = max(loops_done, loops_done_max);
-	}
-
-	nr_max = 0;
-	nr_min = g->p.nr_tasks;
-	sum = 0;
-
-	for (node = 0; node < g->p.nr_nodes; node++) {
-		nr = nodes[node];
-		nr_min = min(nr, nr_min);
-		nr_max = max(nr, nr_max);
-		sum += nr;
-	}
-	BUG_ON(nr_min > nr_max);
-
-	BUG_ON(sum > g->p.nr_tasks);
-
-	if (0 && (sum < g->p.nr_tasks))
-		return;
-
-	/*
-	 * Count the number of distinct process groups present
-	 * on nodes - when we are converged this will decrease
-	 * to g->p.nr_proc:
-	 */
-	process_groups = 0;
-
-	for (node = 0; node < g->p.nr_nodes; node++) {
-		int processes = count_node_processes(node);
-
-		nr = nodes[node];
-		tprintf(" %2d/%-2d", nr, processes);
-
-		process_groups += processes;
-	}
-
-	distance = nr_max - nr_min;
-
-	tprintf(" [%2d/%-2d]", distance, process_groups);
-
-	tprintf(" l:%3d-%-3d (%3d)",
-		loops_done_min, loops_done_max, loops_done_max-loops_done_min);
-
-	if (loops_done_min && loops_done_max) {
-		double skew = 1.0 - (double)loops_done_min/loops_done_max;
-
-		tprintf(" [%4.1f%%]", skew * 100.0);
-	}
-
-	calc_convergence_compression(&strong);
-
-	if (strong && process_groups == g->p.nr_proc) {
-		if (!*convergence) {
-			*convergence = runtime_ns_max;
-			tprintf(" (%6.1fs converged)\n", *convergence/1e9);
-			if (g->p.measure_convergence) {
-				g->all_converged = true;
-				g->stop_work = true;
-			}
-		}
-	} else {
-		if (*convergence) {
-			tprintf(" (%6.1fs de-converged)", runtime_ns_max/1e9);
-			*convergence = 0;
-		}
-		tprintf("\n");
-	}
-}
-
-static void show_summary(double runtime_ns_max, int l, double *convergence)
+static void show_summary(double runtime_ns_max, int l)
 {
 	tprintf("\r #  %5.1f%%  [%.1f mins]",
 		(double)(l+1)/g->p.nr_loops*100.0, runtime_ns_max/1e9 / 60.0);
 
-	calc_convergence(runtime_ns_max, convergence);
-
 	if (g->p.show_details >= 0)
 		fflush(stdout);
 }
@@ -925,11 +667,9 @@ static void *worker_thread(void *__tdata)
 	struct timeval start0, start, stop, diff;
 	int process_nr = td->process_nr;
 	int thread_nr = td->thread_nr;
-	unsigned long last_perturbance;
 	int task_nr = td->task_nr;
 	int details = g->p.show_details;
-	int first_task, last_task;
-	double convergence = 0;
+	int last_task;
 	u64 val = td->val;
 	double runtime_ns_max;
 	u8 *global_data;
@@ -955,10 +695,6 @@ static void *worker_thread(void *__tdata)
 	if (process_nr == g->p.nr_proc-1 && thread_nr == g->p.nr_threads-1)
 		last_task = 1;
 
-	first_task = 0;
-	if (process_nr == 0 && thread_nr == 0)
-		first_task = 1;
-
 	if (details >= 2) {
 		printf("#  thread %2d / %2d global mem: %p, process mem: %p, thread mem: %p\n",
 			process_nr, thread_nr, global_data, process_data, thread_data);
@@ -983,7 +719,6 @@ static void *worker_thread(void *__tdata)
 	gettimeofday(&start0, NULL);
 
 	start = stop = start0;
-	last_perturbance = start.tv_sec;
 
 	for (l = 0; l < g->p.nr_loops; l++) {
 		start = stop;
@@ -1015,7 +750,7 @@ static void *worker_thread(void *__tdata)
 		update_curr_cpu(task_nr, work_done);
 		bytes_done += work_done;
 
-		if (details < 0 && !g->p.perturb_secs && !g->p.measure_convergence && !g->p.nr_secs)
+		if (details < 0 && !g->p.nr_secs)
 			continue;
 
 		td->loops_done = l;
@@ -1035,37 +770,6 @@ static void *worker_thread(void *__tdata)
 		if (start.tv_sec == stop.tv_sec)
 			continue;
 
-		/*
-		 * Perturb the first task's equilibrium every g->p.perturb_secs seconds,
-		 * by migrating to CPU#0:
-		 */
-		if (first_task && g->p.perturb_secs && (int)(stop.tv_sec - last_perturbance) >= g->p.perturb_secs) {
-			cpu_set_t orig_mask;
-			int target_cpu;
-			int this_cpu;
-
-			last_perturbance = stop.tv_sec;
-
-			/*
-			 * Depending on where we are running, move into
-			 * the other half of the system, to create some
-			 * real disturbance:
-			 */
-			this_cpu = g->threads[task_nr].curr_cpu;
-			if (this_cpu < g->p.nr_cpus/2)
-				target_cpu = g->p.nr_cpus-1;
-			else
-				target_cpu = 0;
-
-			orig_mask = bind_to_cpu(target_cpu);
-
-			/* Here we are running on the target CPU already */
-			if (details >= 1)
-				printf(" (injecting perturbalance, moved to CPU#%d)\n", target_cpu);
-
-			bind_to_cpumask(orig_mask);
-		}
-
 		if (details >= 3) {
 			timersub(&stop, &start, &diff);
 			runtime_ns_max = diff.tv_sec * 1000000000;
@@ -1084,7 +788,7 @@ static void *worker_thread(void *__tdata)
 		runtime_ns_max = diff.tv_sec * 1000000000ULL;
 		runtime_ns_max += diff.tv_usec * 1000ULL;
 
-		show_summary(runtime_ns_max, l, &convergence);
+		show_summary(runtime_ns_max, l);
 	}
 
 	gettimeofday(&stop, NULL);
@@ -1226,8 +930,7 @@ static int init(void)
 
 	g->p.nr_nodes = numa_max_node() + 1;
 
-	/* char array in count_process_nodes(): */
-	BUG_ON(g->p.nr_nodes > MAX_NR_NODES || g->p.nr_nodes < 0);
+	BUG_ON(g->p.nr_nodes < 0);
 
 	if (g->p.show_quiet && !g->p.show_details)
 		g->p.show_details = -1;
@@ -1427,11 +1130,6 @@ static int __bench_syscall(const char *name)
 	bytes = g->bytes_done;
 	runtime_avg = (double)runtime_ns_sum / g->p.nr_tasks / 1e9;
 
-	if (g->p.measure_convergence) {
-		print_res(name, runtime_sec_max,
-			"secs,", "NUMA-convergence-latency", "secs latency to NUMA-converge");
-	}
-
 	print_res(name, runtime_sec_max,
 		"secs,", "runtime-max/thread",	"secs slowest (max) thread-runtime");
 
@@ -1517,10 +1215,6 @@ static void init_params(struct params *p, const char *name, int argc, const char
 	/* Initialize nonzero defaults: */
 
 	p->serialize_startup		= 1;
-	p->data_reads			= true;
-	p->data_writes			= true;
-	p->data_backwards		= true;
-	p->data_rand_walk		= true;
 	p->nr_loops			= 10000000;
 	p->init_random			= true;
 	p->mb_global_str		= "1";

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-01  2:12 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
  2016-02-01  4:46   ` kbuild test robot
@ 2016-02-01  8:37   ` Thomas Gleixner
  2016-02-01  9:22     ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2016-02-01  8:37 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, fweisbec, mingo, luto, peterz, clark

On Sun, 31 Jan 2016, riel@redhat.com wrote:
> @@ -93,9 +93,9 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
>  {
>  	struct mm_struct *mm;
>  
> -	/* convert pages-usec to Mbyte-usec */
> -	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
> -	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
> +	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
> +	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / (1000 * KB);
> +	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / (1000 * KB);

You replace "/ (1024 * 1024)" by "/ (1000 * 1024)". So that's introducing a
non-power-of-2 division instead of removing one, and it won't compile on
systems which do not have a 64/32 division in hardware.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-01  8:37   ` Thomas Gleixner
@ 2016-02-01  9:22     ` Peter Zijlstra
  2016-02-01  9:31       ` Thomas Gleixner
  2016-02-01 13:44       ` Rik van Riel
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Zijlstra @ 2016-02-01  9:22 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: riel, linux-kernel, fweisbec, mingo, luto, clark

On Mon, Feb 01, 2016 at 09:37:00AM +0100, Thomas Gleixner wrote:
> On Sun, 31 Jan 2016, riel@redhat.com wrote:
> > @@ -93,9 +93,9 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
> >  {
> >  	struct mm_struct *mm;
> >  
> > -	/* convert pages-usec to Mbyte-usec */
> > -	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
> > -	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
> > +	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
> > +	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / (1000 * KB);
> > +	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / (1000 * KB);
> 
> You replace "/ (1024 * 1024)" by "/ (1000 * 1024)". So that's introducing a
> non-power-of-2 division instead of removing one, and it won't compile on
> systems which do not have a 64/32 division in hardware.

Yep, so that needs to be fixed to use do_div(). But the reason for this
is that this is the consumer side of these stats and therefore rarely
executed.

This patch effectively moves a div out of the fast path into the slow
path.
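
A sketch of what that fix could look like (do_div() divides a u64 in
place by a u32 divisor and returns the remainder, so it builds on
32-bit as well):

	u64 coremem, virtmem;

	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
	coremem = p->acct_rss_mem1 * PAGE_SIZE;
	do_div(coremem, 1000 * KB);
	stats->coremem = coremem;

	virtmem = p->acct_vm_mem1 * PAGE_SIZE;
	do_div(virtmem, 1000 * KB);
	stats->virtmem = virtmem;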

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals
  2016-02-01  2:12 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
@ 2016-02-01  9:28   ` Peter Zijlstra
  2016-02-01 19:22     ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2016-02-01  9:28 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, fweisbec, tglx, mingo, luto, clark

On Sun, Jan 31, 2016 at 09:12:30PM -0500, riel@redhat.com wrote:
>  kernel/tsacct.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/kernel/tsacct.c b/kernel/tsacct.c
> index 9c23584c76c4..31fb6c9746d4 100644
> --- a/kernel/tsacct.c
> +++ b/kernel/tsacct.c
> @@ -124,20 +124,18 @@ static void __acct_update_integrals(struct task_struct *tsk,
>  				    cputime_t utime, cputime_t stime)
>  {
>  	cputime_t time, dtime;
> -	unsigned long flags;
>  	u64 delta;
>  
>  	if (!likely(tsk->mm))
>  		return;
>  
> -	local_irq_save(flags);
>  	time = stime + utime;
>  	dtime = time - tsk->acct_timexpd;
>  	/* Avoid division: cputime_t is often in nanoseconds already. */
>  	delta = cputime_to_nsecs(dtime);
>  
>  	if (delta < TICK_NSEC)
> -		goto out;
> +		return;
>  
>  	tsk->acct_timexpd = time;
>  	/*
> @@ -147,8 +145,6 @@ static void __acct_update_integrals(struct task_struct *tsk,
>  	 */
>  	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
>  	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
> -out:
> -	local_irq_restore(flags);
>  }

Like Frederic said on your last posting, at the very least
do_exit()->acct_update_integrals() will not have IRQs disabled.

It's just the acct_account_cputime() callers that I suspect will all have
IRQs disabled, but it would still be good to verify that with a
WARN_ON(!irqs_disabled()) test in there for at least one test run, and
then include that you did this in the Changelog.
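
I.e. something like this for the test run (a sketch against the function
as it stands after this patch; WARN_ON_ONCE avoids spamming the log):

	static void __acct_update_integrals(struct task_struct *tsk,
					    cputime_t utime, cputime_t stime)
	{
		/* Verification only - remove after one test run: */
		WARN_ON_ONCE(!irqs_disabled());

		if (!likely(tsk->mm))
			return;

		/* ... rest of the function unchanged ... */
	}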

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01  2:12 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
@ 2016-02-01  9:29   ` Peter Zijlstra
  2016-02-01 19:23     ` Rik van Riel
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2016-02-01  9:29 UTC (permalink / raw)
  To: riel; +Cc: linux-kernel, fweisbec, tglx, mingo, luto, clark

On Sun, Jan 31, 2016 at 09:12:31PM -0500, riel@redhat.com wrote:
> Run times for the microbenchmark:
> 
> 4.4				3.8 seconds
> 4.5-rc1				3.7 seconds
> 4.5-rc1 + first patch		3.3 seconds
> 4.5-rc1 + first 3 patches	3.1 seconds
> 4.5-rc1 + all patches		2.3 seconds

Would be good to compare that also to a !NOHZ_FULL bloated path.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-01  9:22     ` Peter Zijlstra
@ 2016-02-01  9:31       ` Thomas Gleixner
  2016-02-01 13:44       ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2016-02-01  9:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: riel, linux-kernel, fweisbec, mingo, luto, clark

On Mon, 1 Feb 2016, Peter Zijlstra wrote:

> On Mon, Feb 01, 2016 at 09:37:00AM +0100, Thomas Gleixner wrote:
> > On Sun, 31 Jan 2016, riel@redhat.com wrote:
> > > @@ -93,9 +93,9 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
> > >  {
> > >  	struct mm_struct *mm;
> > >  
> > > -	/* convert pages-usec to Mbyte-usec */
> > > -	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
> > > -	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
> > > +	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
> > > +	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / (1000 * KB);
> > > +	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / (1000 * KB);
> > 
> > > You replace "/ (1024 * 1024)" by "/ (1000 * 1024)". So that's introducing a
> > > non-power-of-2 division instead of removing one, and it won't compile on
> > > systems which do not have a 64/32 division in hardware.
> 
> Yep, so that needs to be fixed to use do_div(). But the reason for this
> is that this is the consumer side of these stats and therefore rarely
> executed.
> 
> This patch effectively moves a div out of the fast path into the slow
> path.

Yeah, noticed after hitting Send :)
 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-01  9:22     ` Peter Zijlstra
  2016-02-01  9:31       ` Thomas Gleixner
@ 2016-02-01 13:44       ` Rik van Riel
  2016-02-01 13:51         ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2016-02-01 13:44 UTC (permalink / raw)
  To: Peter Zijlstra, Thomas Gleixner
  Cc: linux-kernel, fweisbec, mingo, luto, clark

On 02/01/2016 04:22 AM, Peter Zijlstra wrote:
> On Mon, Feb 01, 2016 at 09:37:00AM +0100, Thomas Gleixner wrote:
>> On Sun, 31 Jan 2016, riel@redhat.com wrote:
>>> @@ -93,9 +93,9 @@ void xacct_add_tsk(struct taskstats *stats, struct task_struct *p)
>>>  {
>>>  	struct mm_struct *mm;
>>>  
>>> -	/* convert pages-usec to Mbyte-usec */
>>> -	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / MB;
>>> -	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / MB;
>>> +	/* convert pages-nsec/1024 to Mbyte-usec, see __acct_update_integrals */
>>> +	stats->coremem = p->acct_rss_mem1 * PAGE_SIZE / (1000 * KB);
>>> +	stats->virtmem = p->acct_vm_mem1 * PAGE_SIZE / (1000 * KB);
>>
>> You replace "/ (1024 * 1024)" by "/ (1000 * 1024)". So that's introducing a
>> non-power-of-2 division instead of removing one, and it won't compile on
>> systems which do not have a 64/32 division in hardware.
> 
> Yep, so that needs to be fixed to use do_div(). But the reason for this
> is that this is the consumer side of these stats and therefore rarely
> executed.

Before I send in a v4 with do_div in that location, are there
any other changes you would like me to make to the code?

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals
  2016-02-01 13:44       ` Rik van Riel
@ 2016-02-01 13:51         ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2016-02-01 13:51 UTC (permalink / raw)
  To: Rik van Riel; +Cc: Thomas Gleixner, linux-kernel, fweisbec, mingo, luto, clark

On Mon, Feb 01, 2016 at 08:44:44AM -0500, Rik van Riel wrote:
> Before I send in a v4 with do_div in that location, are there
> any other changes you would like me to make to the code?

The WARN_ON(!irqs_disabled()) run and adding irq_save/restore to
acct_update_integrals() would be good :-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] perf tooling: Add 'perf bench syscall' benchmark
  2016-02-01  7:41 ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Ingo Molnar
  2016-02-01  7:48   ` [PATCH] perf tooling: Simplify 'perf bench syscall' Ingo Molnar
@ 2016-02-01 15:41   ` Andy Lutomirski
  2016-02-03 10:22     ` Ingo Molnar
  1 sibling, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2016-02-01 15:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Arnaldo Carvalho de Melo, Frederic Weisbecker, linux-kernel,
	Thomas Gleixner, Rik van Riel, Peter Zijlstra, clark,
	Peter Zijlstra

On Jan 31, 2016 11:42 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
>
>
> * riel@redhat.com <riel@redhat.com> wrote:
>
> > (v3: address comments raised by Frederic)
> >
> > Running with nohz_full introduces a fair amount of overhead.
> > Specifically, various things that are usually done from the
> > timer interrupt are now done at syscall, irq, and guest
> > entry and exit times.
> >
> > However, some of the code that is called every single time
> > has only ever worked at jiffy resolution. The code in
> > __acct_update_integrals was also doing some unnecessary
> > calculations.
> >
> > Getting rid of the unnecessary calculations, without
> > changing any of the functionality in __acct_update_integrals
> > gets us about an 11% win.
> >
> > Not calling the time statistics updating code more than
> > once per jiffy, like is done on housekeeping CPUs and on
> > all the CPUs of a non-nohz_full system, shaves off a
> > further 30%.
> >
> > I tested this series with a microbenchmark calling
> > an invalid syscall number ten million times in a row,
> > on a nohz_full cpu.
> >
> >     Run times for the microbenchmark:
> >
> > 4.4                           3.8 seconds
> > 4.5-rc1                               3.7 seconds
> > 4.5-rc1 + first patch         3.3 seconds
> > 4.5-rc1 + first 3 patches     3.1 seconds
> > 4.5-rc1 + all patches         2.3 seconds
>
> Another suggestion (beyond fixing the 32-bit build ;-), could you please stick
> your syscall microbenchmark into 'perf bench', so that we have a standardized way
> of checking such numbers?
>
> In fact I'd suggest we introduce an entirely new sub-tool for system call
> performance measurement - and this might be the first functionality of it.
>
> I've attached a quick patch that is basically a copy of 'perf bench numa' and
> which measures getppid() performance (simple syscall where the result is not
> cached by glibc).
>
> I kept the process, threading and memory allocation bits of numa.c, just in case
> we need them to measure more complex syscalls. Maybe we could keep the threading
> bits and remove the memory allocation parameters, to simplify the benchmark?
>
> Anyway, this could be a good base to start off on.

So much code...

I'll try to take a look this week.  It shouldn't be so hard to port my
rdpmc-based widget over to this.


--Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals
  2016-02-01  9:28   ` Peter Zijlstra
@ 2016-02-01 19:22     ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2016-02-01 19:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, fweisbec, tglx, mingo, luto, clark

On 02/01/2016 04:28 AM, Peter Zijlstra wrote:

> It's just the acct_account_cputime() callers that I suspect will all have
> IRQs disabled, but it would still be good to verify that with a
> WARN_ON(!irqs_disabled()) test in there for at least one test run, and
> then include that you did this in the Changelog.

I verified that the WARN_ON did not trigger, but I forgot to include
that in the changelog for v4.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy
  2016-02-01  9:29   ` Peter Zijlstra
@ 2016-02-01 19:23     ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2016-02-01 19:23 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, fweisbec, tglx, mingo, luto, clark

On 02/01/2016 04:29 AM, Peter Zijlstra wrote:
> On Sun, Jan 31, 2016 at 09:12:31PM -0500, riel@redhat.com wrote:
>> Run times for the microbenchmark:
>>
>> 4.4				3.8 seconds
>> 4.5-rc1				3.7 seconds
>> 4.5-rc1 + first patch		3.3 seconds
>> 4.5-rc1 + first 3 patches	3.1 seconds
>> 4.5-rc1 + all patches		2.3 seconds
> 
> Would be good to compare that also to a !NOHZ_FULL bloated path.

On a non-nohz_full, non-housekeeping CPU, I see 1.86
seconds run time, or about 20% faster than on a nohz_full
CPU with all patches applied.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] perf tooling: Add 'perf bench syscall' benchmark
  2016-02-01 15:41   ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Andy Lutomirski
@ 2016-02-03 10:22     ` Ingo Molnar
  2016-06-20 18:00       ` [PATCH] perf: add 'perf bench syscall' Josh Poimboeuf
  0 siblings, 1 reply; 23+ messages in thread
From: Ingo Molnar @ 2016-02-03 10:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Arnaldo Carvalho de Melo, Frederic Weisbecker, linux-kernel,
	Thomas Gleixner, Rik van Riel, Peter Zijlstra, clark,
	Peter Zijlstra


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Jan 31, 2016 11:42 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
> >
> >
> > * riel@redhat.com <riel@redhat.com> wrote:
> >
> > > (v3: address comments raised by Frederic)
> > >
> > > Running with nohz_full introduces a fair amount of overhead.
> > > Specifically, various things that are usually done from the
> > > timer interrupt are now done at syscall, irq, and guest
> > > entry and exit times.
> > >
> > > However, some of the code that is called every single time
> > > has only ever worked at jiffy resolution. The code in
> > > __acct_update_integrals was also doing some unnecessary
> > > calculations.
> > >
> > > Getting rid of the unnecessary calculations, without
> > > changing any of the functionality in __acct_update_integrals
> > > gets us about an 11% win.
> > >
> > > Not calling the time statistics updating code more than
> > > once per jiffy, like is done on housekeeping CPUs and on
> > > all the CPUs of a non-nohz_full system, shaves off a
> > > further 30%.
> > >
> > > I tested this series with a microbenchmark calling
> > > an invalid syscall number ten million times in a row,
> > > on a nohz_full cpu.
> > >
> > >     Run times for the microbenchmark:
> > >
> > > 4.4                           3.8 seconds
> > > 4.5-rc1                               3.7 seconds
> > > 4.5-rc1 + first patch         3.3 seconds
> > > 4.5-rc1 + first 3 patches     3.1 seconds
> > > 4.5-rc1 + all patches         2.3 seconds
> >
> > Another suggestion (beyond fixing the 32-bit build ;-), could you please stick
> > your syscall microbenchmark into 'perf bench', so that we have a standardized way
> > of checking such numbers?
> >
> > In fact I'd suggest we introduce an entirely new sub-tool for system call
> > performance measurement - and this might be the first functionality of it.
> >
> > I've attached a quick patch that is basically a copy of 'perf bench numa' and
> > which measures getppid() performance (simple syscall where the result is not
> > cached by glibc).
> >
> > I kept the process, threading and memory allocation bits of numa.c, just in case
> > we need them to measure more complex syscalls. Maybe we could keep the threading
> > bits and remove the memory allocation parameters, to simplify the benchmark?
> >
> > Anyway, this could be a good base to start off on.
> 
> So much code...

Arguably 90% of that should be factored out, as it's now a duplicate between 
bench/numa.c and bench/syscall.c.

Technically, for a minimum benchmark, something like this would already be 
functional for tools/perf/bench/syscall.c:

#include "../perf.h"
#include "../util/util.h"
#include "../builtin.h"
#include "bench.h"

static void run_syscall_benchmark(void)
{
	[ .... your benchmark loop as-is ... ]
}

int bench_syscall(int argc __maybe_unused, const char **argv __maybe_unused, const char *prefix __maybe_unused)
{
	run_syscall_benchmark();

        switch (bench_format) {
        case BENCH_FORMAT_DEFAULT:
                printf("print results in human-readable format\n");
                break;
        case BENCH_FORMAT_SIMPLE:
                printf("print results in machine-parseable format\n");
                break;
        default:
		BUG_ON(1);
        }

	return 0;
}

Plus the small amount of glue for bench_sycall() I sent in the first patch.

Completely untested.

If the loop is long enough then even without any timing measurement this would be 
usable via:

	perf stat --null --repeat 10 perf bench syscall

as the 'perf stat' will do the timing and statistics.

> I'll try to take a look this week.  It shouldn't be so hard to port my 
> rdpmc-based widget over to this.

Sounds great to me!

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH] perf: add 'perf bench syscall'
  2016-02-03 10:22     ` Ingo Molnar
@ 2016-06-20 18:00       ` Josh Poimboeuf
  2016-06-20 19:16         ` Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Josh Poimboeuf @ 2016-06-20 18:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	linux-kernel, Thomas Gleixner, Rik van Riel, Peter Zijlstra,
	clark, Peter Zijlstra

On Wed, Feb 03, 2016 at 11:22:47AM +0100, Ingo Molnar wrote:
> 
> * Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > On Jan 31, 2016 11:42 PM, "Ingo Molnar" <mingo@kernel.org> wrote:
> > >
> > >
> > > * riel@redhat.com <riel@redhat.com> wrote:
> > >
> > > > (v3: address comments raised by Frederic)
> > > >
> > > > Running with nohz_full introduces a fair amount of overhead.
> > > > Specifically, various things that are usually done from the
> > > > timer interrupt are now done at syscall, irq, and guest
> > > > entry and exit times.
> > > >
> > > > However, some of the code that is called every single time
> > > > has only ever worked at jiffy resolution. The code in
> > > > __acct_update_integrals was also doing some unnecessary
> > > > calculations.
> > > >
> > > > Getting rid of the unnecessary calculations, without
> > > > changing any of the functionality in __acct_update_integrals
> > > > gets us about an 11% win.
> > > >
> > > > Not calling the time statistics updating code more than
> > > > once per jiffy, like is done on housekeeping CPUs and on
> > > > all the CPUs of a non-nohz_full system, shaves off a
> > > > further 30%.
> > > >
> > > > I tested this series with a microbenchmark calling
> > > > an invalid syscall number ten million times in a row,
> > > > on a nohz_full cpu.
> > > >
> > > >     Run times for the microbenchmark:
> > > >
> > > > 4.4                           3.8 seconds
> > > > 4.5-rc1                               3.7 seconds
> > > > 4.5-rc1 + first patch         3.3 seconds
> > > > 4.5-rc1 + first 3 patches     3.1 seconds
> > > > 4.5-rc1 + all patches         2.3 seconds
> > >
> > > Another suggestion (beyond fixing the 32-bit build ;-), could you please stick
> > > your syscall microbenchmark into 'perf bench', so that we have a standardized way
> > > of checking such numbers?
> > >
> > > In fact I'd suggest we introduce an entirely new sub-tool for system call
> > > performance measurement - and this might be the first functionality of it.
> > >
> > > I've attached a quick patch that is basically a copy of 'perf bench numa' and
> > > which measures getppid() performance (simple syscall where the result is not
> > > cached by glibc).
> > >
> > > I kept the process, threading and memory allocation bits of numa.c, just in case
> > > we need them to measure more complex syscalls. Maybe we could keep the threading
> > > bits and remove the memory allocation parameters, to simplify the benchmark?
> > >
> > > Anyway, this could be a good base to start off on.
> > 
> > So much code...
> 
> Arguably 90% of that should be factored out, as it's now a duplicate between 
> bench/numa.c and bench/syscall.c.
> 
> Technically, for a minimum benchmark, something like this would already be 
> functional for tools/perf/bench/syscall.c:
> 
> #include "../perf.h"
> #include "../util/util.h"
> #include "../builtin.h"
> #include "bench.h"
> 
> static void run_syscall_benchmark(void)
> {
> 	[ .... your benchmark loop as-is ... ]
> }
> 
> int bench_syscall(int argc __maybe_unused, const char **argv __maybe_unused, const char *prefix __maybe_unused)
> {
> 	run_syscall_benchmark();
> 
>         switch (bench_format) {
>         case BENCH_FORMAT_DEFAULT:
>                 printf("print results in human-readable format\n");
>                 break;
>         case BENCH_FORMAT_SIMPLE:
>                 printf("print results in machine-parseable format\n");
>                 break;
>         default:
> 		BUG_ON(1);
>         }
> 
> 	return 0;
> }
> 
> Plus the small amount of glue for bench_sycall() I sent in the first patch.
> 
> Completely untested.
> 
> If the loop is long enough then even without any timing measurement this would be 
> usable via:
> 
> 	perf stat --null --repeat 10 perf bench syscall
> 
> as the 'perf stat' will do the timing and statistics.

How about this:

--------

From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH] perf: add 'perf bench syscall'

Add a basic 'perf bench syscall' benchmark which does a getppid() system
call in a tight loop.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 tools/perf/bench/Build     |  1 +
 tools/perf/bench/bench.h   |  1 +
 tools/perf/bench/syscall.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c | 18 ++++++++---
 4 files changed, 95 insertions(+), 5 deletions(-)
 create mode 100644 tools/perf/bench/syscall.c

diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 60bf119..0b7395d 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -1,5 +1,6 @@
 perf-y += sched-messaging.o
 perf-y += sched-pipe.o
+perf-y += syscall.o
 perf-y += mem-functions.o
 perf-y += futex-hash.o
 perf-y += futex-wake.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index 579a592..bdd6cdc 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -28,6 +28,7 @@
 int bench_numa(int argc, const char **argv, const char *prefix);
 int bench_sched_messaging(int argc, const char **argv, const char *prefix);
 int bench_sched_pipe(int argc, const char **argv, const char *prefix);
+int bench_syscall_basic(int argc, const char **argv, const char *prefix);
 int bench_mem_memcpy(int argc, const char **argv, const char *prefix);
 int bench_mem_memset(int argc, const char **argv, const char *prefix);
 int bench_futex_hash(int argc, const char **argv, const char *prefix);
diff --git a/tools/perf/bench/syscall.c b/tools/perf/bench/syscall.c
new file mode 100644
index 0000000..0dc782a
--- /dev/null
+++ b/tools/perf/bench/syscall.c
@@ -0,0 +1,80 @@
+/*
+ *
+ * syscall.c
+ *
+ * syscall: Benchmark for system call performance
+ */
+#include "../perf.h"
+#include "../util/util.h"
+#include <subcmd/parse-options.h>
+#include "../builtin.h"
+#include "bench.h"
+
+#include <stdio.h>
+#include <locale.h>
+#include <sys/time.h>
+#include <sys/syscall.h>
+
+#define LOOPS_DEFAULT 10000000
+static int loops = LOOPS_DEFAULT;
+
+static const struct option options[] = {
+	OPT_INTEGER('l', "loop",	&loops,		"Specify number of loops"),
+	OPT_END()
+};
+
+static const char * const bench_syscall_usage[] = {
+	"perf bench syscall <options>",
+	NULL
+};
+
+int bench_syscall_basic(int argc, const char **argv, const char *prefix __maybe_unused)
+{
+	struct timeval start, stop, diff;
+	unsigned long long result_usec = 0;
+	int i;
+
+	argc = parse_options(argc, argv, options, bench_syscall_usage, 0);
+
+	gettimeofday(&start, NULL);
+
+	for (i = 0; i < loops; i++)
+		getppid();
+
+	gettimeofday(&stop, NULL);
+	timersub(&stop, &start, &diff);
+
+	switch (bench_format) {
+	case BENCH_FORMAT_DEFAULT:
+		setlocale(LC_NUMERIC, "");
+		printf("# Executed %'d getppid() calls\n", loops);
+
+		result_usec = diff.tv_sec * 1000000;
+		result_usec += diff.tv_usec;
+
+		printf(" %14s: %lu.%03lu [sec]\n\n", "Total time",
+		       diff.tv_sec,
+		       (unsigned long) (diff.tv_usec/1000));
+
+		printf(" %14lf usecs/op\n",
+		       (double)result_usec / (double)loops);
+		printf(" %'14d ops/sec\n",
+		       (int)((double)loops /
+			     ((double)result_usec / (double)1000000)));
+		break;
+
+	case BENCH_FORMAT_SIMPLE:
+		printf("%lu.%03lu\n",
+		       diff.tv_sec,
+		       (unsigned long) (diff.tv_usec / 1000));
+		break;
+
+	default:
+		/* reaching here would be a disaster - unknown format */
+		fprintf(stderr, "Unknown format:%d\n", bench_format);
+		exit(1);
+		break;
+	}
+
+	return 0;
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index a1cddc6..f32a503 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -9,10 +9,11 @@
 /*
  * Available benchmark collection list:
  *
- *  sched ... scheduler and IPC performance
- *  mem   ... memory access performance
- *  numa  ... NUMA scheduling and MM performance
- *  futex ... Futex performance
+ *  sched   ... Scheduler and IPC performance
+ *  syscall ... System call performance
+ *  mem     ... Memory access performance
+ *  numa    ... NUMA scheduling and MM performance
+ *  futex   ... Futex performance
  */
 #include "perf.h"
 #include "util/util.h"
@@ -44,10 +45,16 @@ static struct bench numa_benchmarks[] = {
 static struct bench sched_benchmarks[] = {
 	{ "messaging",	"Benchmark for scheduling and IPC",		bench_sched_messaging	},
 	{ "pipe",	"Benchmark for pipe() between two processes",	bench_sched_pipe	},
-	{ "all",	"Run all scheduler benchmarks",		NULL			},
+	{ "all",	"Run all scheduler benchmarks",			NULL			},
 	{ NULL,		NULL,						NULL			}
 };
 
+static struct bench syscall_benchmarks[] = {
+	{ "basic",	"Benchmark for basic getppid() system calls",	bench_syscall_basic	},
+	{ "all",	"Run all syscall benchmarks",			NULL			},
+	{ NULL,		NULL,						NULL			},
+};
+
 static struct bench mem_benchmarks[] = {
 	{ "memcpy",	"Benchmark for memcpy() functions",		bench_mem_memcpy	},
 	{ "memset",	"Benchmark for memset() functions",		bench_mem_memset	},
@@ -74,6 +81,7 @@ struct collection {
 
 static struct collection collections[] = {
 	{ "sched",	"Scheduler and IPC benchmarks",			sched_benchmarks	},
+	{ "syscall",	"System call benchmarks",			syscall_benchmarks	},
 	{ "mem",	"Memory access benchmarks",			mem_benchmarks		},
 #ifdef HAVE_LIBNUMA_SUPPORT
 	{ "numa",	"NUMA scheduling and MM benchmarks",		numa_benchmarks		},
-- 
2.4.11
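
With this applied, the new benchmark would be invoked as e.g.:

	perf bench syscall basic -l 10000000

(the -l/--loop option is defined above; 10000000 is also the default),
or under 'perf stat --null --repeat 10' as Ingo suggested earlier.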

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] perf: add 'perf bench syscall'
  2016-06-20 18:00       ` [PATCH] perf: add 'perf bench syscall' Josh Poimboeuf
@ 2016-06-20 19:16         ` Andy Lutomirski
  2016-06-21 14:55           ` Josh Poimboeuf
  0 siblings, 1 reply; 23+ messages in thread
From: Andy Lutomirski @ 2016-06-20 19:16 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	linux-kernel, Thomas Gleixner, Rik van Riel, Peter Zijlstra,
	clark, Peter Zijlstra

On Mon, Jun 20, 2016 at 11:00 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:

>
> From: Josh Poimboeuf <jpoimboe@redhat.com>
> Subject: [PATCH] perf: add 'perf bench syscall'
>
> Add a basic 'perf bench syscall' benchmark which does a getppid() system
> call in a tight loop.
>

My one suggestion would be to use a different syscall than getppid(),
as getppid() can have unusual performance characteristics on kernels
with pid namespaces enabled.  I'm a fan of prctl(-1, 0, 0, 0, 0).
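
I.e. the loop body would become something like (a sketch reusing the
'loops' counter from the patch; -1 is an invalid option, so the kernel
just returns -EINVAL):

	#include <sys/prctl.h>

	for (i = 0; i < loops; i++)
		prctl(-1, 0, 0, 0, 0);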

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] perf: add 'perf bench syscall'
  2016-06-20 19:16         ` Andy Lutomirski
@ 2016-06-21 14:55           ` Josh Poimboeuf
  2016-06-21 16:31             ` Andy Lutomirski
  0 siblings, 1 reply; 23+ messages in thread
From: Josh Poimboeuf @ 2016-06-21 14:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Frederic Weisbecker,
	linux-kernel, Thomas Gleixner, Rik van Riel, Peter Zijlstra,
	clark, Peter Zijlstra

On Mon, Jun 20, 2016 at 12:16:22PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 20, 2016 at 11:00 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> >
> > From: Josh Poimboeuf <jpoimboe@redhat.com>
> > Subject: [PATCH] perf: add 'perf bench syscall'
> >
> > Add a basic 'perf bench syscall' benchmark which does a getppid() system
> > call in a tight loop.
> >
> 
> My one suggestion would be to use a different syscall than getppid(),
> as getppid() can have unusual performance characteristics on kernels
> with pid namespaces enabled.  I'm a fan of prctl(-1, 0, 0, 0, 0).

Hm, can you elaborate on the unusual performance characteristics of
getppid()?  The code seems pretty minimal.
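
For reference, the whole thing is roughly this (kernel/sys.c, modulo
kernel version details):

	SYSCALL_DEFINE0(getppid)
	{
		int pid;

		rcu_read_lock();
		pid = task_tgid_vnr(rcu_dereference(current->real_parent));
		rcu_read_unlock();

		return pid;
	}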

prctl() actually seems much worse to me, because of all the security
module cruft.

-- 
Josh

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] perf: add 'perf bench syscall'
  2016-06-21 14:55           ` Josh Poimboeuf
@ 2016-06-21 16:31             ` Andy Lutomirski
  0 siblings, 0 replies; 23+ messages in thread
From: Andy Lutomirski @ 2016-06-21 16:31 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Frederic Weisbecker, Thomas Gleixner, Arnaldo Carvalho de Melo,
	clark, Peter Zijlstra, Ingo Molnar, linux-kernel, Peter Zijlstra,
	Rik van Riel

On Jun 21, 2016 7:55 AM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>
> On Mon, Jun 20, 2016 at 12:16:22PM -0700, Andy Lutomirski wrote:
> > On Mon, Jun 20, 2016 at 11:00 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >
> > >
> > > From: Josh Poimboeuf <jpoimboe@redhat.com>
> > > Subject: [PATCH] perf: add 'perf bench syscall'
> > >
> > > Add a basic 'perf bench syscall' benchmark which does a getppid() system
> > > call in a tight loop.
> > >
> >
> > My one suggestion would be to use a different syscall than getppid(),
> > as getppid() can have unusual performance characteristics on kernels
> > with pid namespaces enabled.  I'm a fan of prctl(-1, 0, 0, 0, 0).
>
> Hm, can you elaborate on the unusual performance characteristics of
> getppid()?  The code seems pretty minimal.

task_tgid_vnr -> pid_vnr -> pid_nr_ns
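
where the last step has to look up the pid number at the right
namespace level (roughly, from kernel/pid.c):

	pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
	{
		struct upid *upid;
		pid_t nr = 0;

		if (pid && ns->level <= pid->level) {
			upid = &pid->numbers[ns->level];
			if (upid->ns == ns)
				nr = upid->nr;
		}
		return nr;
	}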

There's probably something better than prctl, though.

--Andy

^ permalink raw reply	[flat|nested] 23+ messages in thread

Thread overview: 23+ messages
2016-02-01  2:12 [PATCH 0/4 v3] sched,time: reduce nohz_full syscall overhead 40% riel
2016-02-01  2:12 ` [PATCH 1/4] sched,time: remove non-power-of-two divides from __acct_update_integrals riel
2016-02-01  4:46   ` kbuild test robot
2016-02-01  8:37   ` Thomas Gleixner
2016-02-01  9:22     ` Peter Zijlstra
2016-02-01  9:31       ` Thomas Gleixner
2016-02-01 13:44       ` Rik van Riel
2016-02-01 13:51         ` Peter Zijlstra
2016-02-01  2:12 ` [PATCH 2/4] acct,time: change indentation in __acct_update_integrals riel
2016-02-01  2:12 ` [PATCH 3/4] time,acct: drop irq save & restore from __acct_update_integrals riel
2016-02-01  9:28   ` Peter Zijlstra
2016-02-01 19:22     ` Rik van Riel
2016-02-01  2:12 ` [PATCH 4/4] sched,time: only call account_{user,sys,guest,idle}_time once a jiffy riel
2016-02-01  9:29   ` Peter Zijlstra
2016-02-01 19:23     ` Rik van Riel
2016-02-01  7:41 ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Ingo Molnar
2016-02-01  7:48   ` [PATCH] perf tooling: Simplify 'perf bench syscall' Ingo Molnar
2016-02-01 15:41   ` [PATCH] perf tooling: Add 'perf bench syscall' benchmark Andy Lutomirski
2016-02-03 10:22     ` Ingo Molnar
2016-06-20 18:00       ` [PATCH] perf: add 'perf bench syscall' Josh Poimboeuf
2016-06-20 19:16         ` Andy Lutomirski
2016-06-21 14:55           ` Josh Poimboeuf
2016-06-21 16:31             ` Andy Lutomirski
