* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
[not found] <5500b987.kerYYCYfIffruy3Z%akpm@linux-foundation.org>
@ 2015-03-13 12:49 ` Alexey Dobriyan
2015-03-13 12:56 ` Alexey Dobriyan
2015-03-13 23:53 ` Rasmus Villemoes
0 siblings, 2 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-13 12:49 UTC (permalink / raw)
To: Linux Kernel, Andrew Morton
Cc: linux, Peter Zijlstra, Tejun Heo, Denis Vlasenko, KAMEZAWA Hiroyuki
On Thu, Mar 12, 2015 at 12:54 AM, <akpm@linux-foundation.org> wrote:
> Subject: lib/vsprintf.c: even faster binary to decimal conversion
I spent some time to microbenchmark changes in userspace (audience: fool!).
Results are below.
Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
Great care was taken to remove interrupt noise.
Number of measurements is 100 millions per line.
CPU is Intel Core 2 Duo E6550 in 64-bit mode.
3.19.1:
0 98.015369 +- 0.512937 91-616
42 116.000193 +- 3.523826 112-868
27182 137.009008 +- 3.515725 133-1043
65535 137.008262 +- 3.521761 133-840
4294967295 201.019966 +- 3.278608 196-1050
3141592653589793238 289.996882 +- 3.489376 287-1148
18446744073709551615 295.065274 +- 2.860187 287-1029
-----------------------------------------------------
3.19.1+patch
0 94.444063 +- 3.518922 84-630
42 116.428533 +- 18.539093 105-1036
42 116.316904 +- 18.234484 105-833
27182 136.172398 +- 3.737113 133-980
65535 136.014742 +- 3.537882 133-714
4294967295 172.009618 +- 3.507473 168-826
3141592653589793238 207.001114 +- 3.492724 196-1120
18446744073709551615 208.018154 +- 3.220185 203-1246
-----------------------------------------------------
New code is somewhat faster for huge numbers.
But top and ps don't normally show huge numbers --
they are either PIDs (up to 2^16) or moderately large numbers in the range of
millions (see /proc/stat).
* variance for the new code is bigger.
I even tried N=42 twice because I thought the 18.5 variance was a glitch,
but it is not.
New code uses a lookup table, which implies cache misses.
Current code is pure code, no data tables.
So I don't think the new printing code will really change anything.
> On a larger scale, perf shows that top, one of the big consumers of /proc
> data, uses 0.5-1.0% fewer cpu cycles.
perf(1) also shows the variance next to the average; what was it?
In my experience, everything perf measures has single-digit percent variance
(and this is just 1 sigma!), so you can't say the new code is faster.
Also, the average can vary between runs by more than the variance (yuck!)
The first number-printing improvement patch measured ~30% speedups:
commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
Now it is 1%.
> Microbenchmarking shows improvements
> ranging from -50% (for numbers uniformly distributed in [0, 2^64-1]) to
> -25% (for numbers heavily biased toward the smaller end, a more realistic
> distribution).
-25%? Mmm, no, see table above.
I think any further improvements to number printing code should be rejected
on philosophical grounds:
The kernel should ship numbers to ps(1) and top(1) in BINARY,
so it would take exactly 1 MOV instruction, which takes exactly 1 cycle
to execute.
Currently it is: 1) the kernel converts binary to text, 2) userspace
converts text to binary,
3) userspace converts binary to text and shows it to the user, 4) people
optimize #1.
But only the final conversion needs to happen, because that is communication
between a human and a program. Programs can very well talk in binary.
So, if you really want to speed up top(1), design a binary interface for
shipping numbers
and enjoy 1000% speedups; leave the text files in /proc for those
undemanding shell scripts.
Alexey
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 12:49 ` + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree Alexey Dobriyan
@ 2015-03-13 12:56 ` Alexey Dobriyan
2015-03-13 13:00 ` Alexey Dobriyan
2015-03-13 23:53 ` Rasmus Villemoes
1 sibling, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-13 12:56 UTC (permalink / raw)
To: Linux Kernel, Andrew Morton
Cc: linux, Peter Zijlstra, Tejun Heo, Denis Vlasenko, KAMEZAWA Hiroyuki
On Fri, Mar 13, 2015 at 3:49 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> Great care was taken to remove interrupt noise.
To reproduce the numbers:
* apply the patch which adds the sys_nr_irq() system call,
* recompile without cpufreq, reboot with isolcpus=,
* shift interrupts off the free CPU,
* use the new system call:
static inline uint64_t sys_nr_irq(void)
{
	uint64_t rv;

	asm volatile ("syscall" : "=a" (rv) : "0" (323) : "memory",
		      "cc", "rcx", "r11");
	return rv;
}
* use the measurement code:
static inline void rdtsc(uint32_t *edx, uint32_t *eax)
{
	asm volatile ("rdtsc" : "=d" (*edx), "=a" (*eax));
}

static inline void lfence(void)
{
	asm volatile ("lfence" ::: "memory");
}

static inline uint64_t time_init(uint64_t *x)
{
	uint32_t edx, eax;

	*x = sys_nr_irq();
	lfence();
	rdtsc(&edx, &eax);
	return ((uint64_t)edx << 32) | eax;
}

static inline uint64_t time_fini(uint64_t *x)
{
	uint32_t edx, eax;

	lfence();
	rdtsc(&edx, &eax);
	*x = sys_nr_irq();
	return ((uint64_t)edx << 32) | eax;
}
----------------------------------------
* drop measurements where an interrupt interfered:
i = 0;
while (i < N) {
	uint64_t t0, t1;
	uint64_t x0, x1;

	t0 = time_init(&x0);
	f();
	t1 = time_fini(&x1);
	if (x1 != x0)
		continue;
	T[i] = t1 - t0;
	i++;
}
------------------------------------
* average the results the way you were taught in school :-)
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 12:56 ` Alexey Dobriyan
@ 2015-03-13 13:00 ` Alexey Dobriyan
0 siblings, 0 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-13 13:00 UTC (permalink / raw)
To: Linux Kernel, Andrew Morton
Cc: linux, Peter Zijlstra, Tejun Heo, Denis Vlasenko, KAMEZAWA Hiroyuki
[-- Attachment #1: Type: text/plain, Size: 373 bytes --]
On Fri, Mar 13, 2015 at 3:56 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Fri, Mar 13, 2015 at 3:49 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
>> Great care was taken to remove interrupt noise.
>
> To reproduce numbers:
> * apply the patch which adds sys_nr_irq() system call.
attached
[-- Attachment #2: sys-nr-irq-3.19.diff --]
[-- Type: text/plain, Size: 2915 bytes --]
commit 4a79e45e0a03843489de0987eb6be84222201c10
Author: Aliaksei Dabryian <aliaksei_dabryian@epam.com>
Date:   Fri Mar 13 12:56:37 2015 +0300

    sys_nr_irq()

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 547e344..cc4d656 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -35,6 +35,7 @@ struct thread_info {
 	void __user *sysenter_return;
 	unsigned int sig_on_uaccess_error:1;
 	unsigned int uaccess_err:1;	/* uaccess failed */
+	__u64 nr_irq;
 };
 
 #define INIT_THREAD_INFO(tsk) \
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 9f6b934..1b2158d 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -32,6 +32,7 @@ void common(void) {
 	OFFSET(TI_flags, thread_info, flags);
 	OFFSET(TI_status, thread_info, status);
 	OFFSET(TI_addr_limit, thread_info, addr_limit);
+	OFFSET(TI_nr_irq, thread_info, nr_irq);
 
 	BLANK();
 	OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63..c4ee124 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -808,6 +808,7 @@ ret_from_intr:
 
 exit_intr:
 	GET_THREAD_INFO(%rcx)
+	addq $1, TI_nr_irq(%rcx)
 	testl $3,CS-ARGOFFSET(%rsp)
 	je retint_kernel
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..f61a380 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	nr_irq			sys_nr_irq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..811b0dc 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -205,6 +205,8 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	} \
 	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
 
+asmlinkage long sys_nr_irq(void);
+
 asmlinkage long sys32_quotactl(unsigned int cmd, const char __user *special,
 			       qid_t id, void __user *addr);
 asmlinkage long sys_time(time_t __user *tloc);
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..3d3d425 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -27,6 +27,8 @@ obj-y += printk/
 obj-y += irq/
 obj-y += rcu/
 
+obj-y += sys_nr_irq.o
+
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
diff --git a/kernel/sys_nr_irq.c b/kernel/sys_nr_irq.c
new file mode 100644
index 0000000..b57f482
--- /dev/null
+++ b/kernel/sys_nr_irq.c
@@ -0,0 +1,7 @@
+#include <linux/syscalls.h>
+#include <asm/thread_info.h>
+
+SYSCALL_DEFINE0(nr_irq)
+{
+	return current_thread_info()->nr_irq;
+}
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 12:49 ` + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree Alexey Dobriyan
2015-03-13 12:56 ` Alexey Dobriyan
@ 2015-03-13 23:53 ` Rasmus Villemoes
2015-03-14 9:21 ` Alexey Dobriyan
1 sibling, 1 reply; 10+ messages in thread
From: Rasmus Villemoes @ 2015-03-13 23:53 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Linux Kernel, Andrew Morton, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Fri, Mar 13 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Thu, Mar 12, 2015 at 12:54 AM, <akpm@linux-foundation.org> wrote:
>> Subject: lib/vsprintf.c: even faster binary to decimal conversion
>
> I spent some time to microbenchmark changes in userspace (audience: fool!).
> Results are below.
>
> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> Great care was taken to remove interrupt noise.
>
> Number of measurements is 100 millions per line.
> CPU is Intel Core 2 Duo E6550 in 64-bit mode.
>
> 3.19.1:
>
> 0 98.015369 +- 0.512937 91-616
> 42 116.000193 +- 3.523826 112-868
> 27182 137.009008 +- 3.515725 133-1043
> 65535 137.008262 +- 3.521761 133-840
> 4294967295 201.019966 +- 3.278608 196-1050
> 3141592653589793238 289.996882 +- 3.489376 287-1148
> 18446744073709551615 295.065274 +- 2.860187 287-1029
> -----------------------------------------------------
> 3.19.1+patch
> 0 94.444063 +- 3.518922 84-630
> 42 116.428533 +- 18.539093 105-1036
> 42 116.316904 +- 18.234484 105-833
> 27182 136.172398 +- 3.737113 133-980
> 65535 136.014742 +- 3.537882 133-714
> 4294967295 172.009618 +- 3.507473 168-826
> 3141592653589793238 207.001114 +- 3.492724 196-1120
> 18446744073709551615 208.018154 +- 3.220185 203-1246
> -----------------------------------------------------
This seems to measure lfence+rdtsc overhead more than anything else. On
my presumably rather similar Core2 Duo T5870, I get an average of 22
cycles for the old code and 11 cycles for the new when converting 42 two
million times in a loop, and I'm not even trying to take interrupts into
account.
Since you seem to be dirtying 800 MB of memory, I'm guessing you get
quite a few page faults/TLB misses, which might explain the rather huge
max numbers.
> New code is somewhat faster for huge numbers.
> But top and ps don't show huge numbers normally --
> it is either PIDs (2^16) or moderately high numbers in a range of millions
> (see /proc/stat)
I said much the same thing in the commit log, and accordingly I've done
(micro)benchmarks with distributions biased to various degrees towards
smaller numbers, all of which showed 25+% improvement.
> * variance for new code is bigger
Seems to depend on how you measure...
> I even tried N=42 twice because I thought 18.5 variance is a glitch
> but it is not.
That does seem odd. But I think your numbers are caused by the huge
memory use. In any case, I modified my test program to record the
cycle count for each individual call (using lfence+rdtsc), but I used a
frequency table instead of a gigantic array, ignoring excessively large
cycle counts (I used > 1023). On the Core 2, I then get
90 90 90 80 80 80 80 80 80 80 80 80 90 80 80
Distribution Function cycles/conv std.dev. (ignored)
uniform([10, 2^64-1]) linux_put_dec 224.83 9.80 (156)
uniform([10, 2^64-1]) rv_put_dec 147.87 7.44 (86)
3 + neg_binom(0.05) linux_put_dec 138.97 39.48 (87)
3 + neg_binom(0.05) rv_put_dec 123.76 27.33 (77)
3 + neg_binom(0.10) linux_put_dec 115.49 27.27 (84)
3 + neg_binom(0.10) rv_put_dec 108.22 20.14 (71)
3 + neg_binom(0.15) linux_put_dec 105.21 20.52 (59)
3 + neg_binom(0.15) rv_put_dec 101.75 17.21 (54)
3 + neg_binom(0.20) linux_put_dec 100.79 17.25 (65)
3 + neg_binom(0.20) rv_put_dec 98.34 16.22 (64)
3 + neg_binom(0.50) linux_put_dec 87.84 7.75 (44)
3 + neg_binom(0.50) rv_put_dec 85.37 8.26 (45)
[first line is just deltas between a few lfence+rdtsc reads in quick
succession, to get a sense of the overhead]. For each distribution I'm
generating 2048 random numbers and then iterate over that 1000 times. So
almost none of the ~2M observations are being ignored. Here, the new
code is always faster (but computing a percentage from numbers including
the rdtsc overhead is meaningless), and in all but the last case (where
the numbers are almost exclusively 2-digit) the std. deviation is also
smaller. On an Intel Xeon, I get
48 32 32 32 32 32 32 32 32 32 32 32 32 32 32
Distribution Function cycles/conv std.dev. (ignored)
uniform([10, 2^64-1]) linux_put_dec 152.58 8.54 (26)
uniform([10, 2^64-1]) rv_put_dec 89.33 3.02 (16)
3 + neg_binom(0.05) linux_put_dec 91.88 34.46 (17)
3 + neg_binom(0.05) rv_put_dec 71.59 21.19 (10)
3 + neg_binom(0.10) linux_put_dec 72.50 25.30 (12)
3 + neg_binom(0.10) rv_put_dec 60.10 17.51 (11)
3 + neg_binom(0.15) linux_put_dec 63.81 20.68 (8)
3 + neg_binom(0.15) rv_put_dec 55.57 15.74 (6)
3 + neg_binom(0.20) linux_put_dec 57.18 16.50 (7)
3 + neg_binom(0.20) rv_put_dec 51.15 13.58 (12)
3 + neg_binom(0.50) linux_put_dec 45.06 6.39 (4)
3 + neg_binom(0.50) rv_put_dec 41.16 6.51 (6)
> New code uses lookup table which implies cache misses. Current code
> is purely code.
Code can miss the cache also, and then it needs to be decoded again. The
new _code_ is slightly smaller, although the total .text+.rodata does
increase by ~150 bytes. Yes, overall the new code will probably
touch one or two extra cache lines compared to the old - so there's a
tradeoff between one-shot and bulk decimal conversions.
>> On a larger scale, perf shows that top, one of the big consumers of /proc
>> data, uses 0.5-1.0% fewer cpu cycles.
>
> perf(1) also shows variance next to average, what was it?
Not in the output I got - that just showed lines such as
2.35% top [kernel.kallsyms] [k] num_to_str
But I don't have much perf-fu, so maybe I should have invoked it differently.
> First number printing improvement patch was measuring ~30% speedups:
> commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
> vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
>
> Now it is 1%.
That's comparing apples and oranges. The ~30% were obtained from a top
which was modified to do nothing but read /proc/pid/stat, the 1% is out
of all the cycles top actually spends.
> I think any further improvements to number printing code should be rejected
> on philosophical grounds:
Hm, "perfect is the enemy of good" and all that.
> Kernel should ship numbers to ps(1) and top(1) in BINARY,
> so it would take exactly 1 MOV instruction which takes exactly 1 cycle
> to execute.
> Currently it is 1) kernel converts binary to text, 2) userspace
> converts text to binary,
> 3) userspace converts binary to text and shows the user. 4) people optimizing #1
I agree this is somewhat silly, but that's what we have, and it is
unlikely to change anytime soon. Adding a parallel binary interface
would be a maintenance nightmare.
Rasmus
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 23:53 ` Rasmus Villemoes
@ 2015-03-14 9:21 ` Alexey Dobriyan
2015-03-16 15:19 ` Alexey Dobriyan
0 siblings, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-14 9:21 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Linux Kernel, Andrew Morton, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Sat, Mar 14, 2015 at 12:53:20AM +0100, Rasmus Villemoes wrote:
> On Fri, Mar 13 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> > On Thu, Mar 12, 2015 at 12:54 AM, <akpm@linux-foundation.org> wrote:
> >> Subject: lib/vsprintf.c: even faster binary to decimal conversion
> >
> > I spent some time to microbenchmark changes in userspace (audience: fool!).
> > Results are below.
> >
> > Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> > Great care was taken to remove interrupt noise.
> >
> > Number of measurements is 100 millions per line.
> > CPU is Intel Core 2 Duo E6550 in 64-bit mode.
> >
> > 3.19.1:
> >
> > 0 98.015369 +- 0.512937 91-616
> > 42 116.000193 +- 3.523826 112-868
> > 27182 137.009008 +- 3.515725 133-1043
> > 65535 137.008262 +- 3.521761 133-840
> > 4294967295 201.019966 +- 3.278608 196-1050
> > 3141592653589793238 289.996882 +- 3.489376 287-1148
> > 18446744073709551615 295.065274 +- 2.860187 287-1029
> > -----------------------------------------------------
> > 3.19.1+patch
> > 0 94.444063 +- 3.518922 84-630
> > 42 116.428533 +- 18.539093 105-1036
> > 42 116.316904 +- 18.234484 105-833
> > 27182 136.172398 +- 3.737113 133-980
> > 65535 136.014742 +- 3.537882 133-714
> > 4294967295 172.009618 +- 3.507473 168-826
> > 3141592653589793238 207.001114 +- 3.492724 196-1120
> > 18446744073709551615 208.018154 +- 3.220185 203-1246
> > -----------------------------------------------------
>
> This seems to measure lfence+rdtsc overhead more than anything else. On
> my presumably rather similar Core2 Duo T5870, I get an average of 22
> cycles for the old code and 11 cycles for the new when converting 42 two
> million times in a loop, and I'm not even trying to take interrupts into
> account.
>
> Since you seem to be dirtying 800 MB of memory, I'm guessing you get
> quite a few page faults/TLB misses, which might explain the rather huge
> max numbers.
Huge max numbers come from RDTSC itself. If you measure the tiniest possible
RDTSC;RDTSC loop, it will reliably show huge results once in a while;
on that box it is in the 70-500 range, IIRC.
> > New code is somewhat faster for huge numbers.
> > But top and ps don't show huge numbers normally --
> > it is either PIDs (2^16) or moderately high numbers in a range of millions
> > (see /proc/stat)
>
> I said much the same thing in the commit log, and accordingly I've done
> (micro)benchmarks with distributions biased to various degrees towards
> smaller numbers, all of which showed 25+% improvement.
>
> > * variance for new code is bigger
>
> Seems to depend on how you measure...
>
> > I even tried N=42 twice because I thought 18.5 variance is a glitch
> > but it is not.
>
> That does seem odd. But I think your numbers are caused by the huge
> memory use. In any case, I modified my test program to record the
> cycle count for each individual call (using lfence+rdtsc), but I used a
> frequency table instead of a gigantic array, ignoring excessively large
> cycle counts (I used > 1023). On the Core 2, I then get
>
> 90 90 90 80 80 80 80 80 80 80 80 80 90 80 80
> Distribution Function cycles/conv std.dev. (ignored)
> uniform([10, 2^64-1]) linux_put_dec 224.83 9.80 (156)
> uniform([10, 2^64-1]) rv_put_dec 147.87 7.44 (86)
> 3 + neg_binom(0.05) linux_put_dec 138.97 39.48 (87)
> 3 + neg_binom(0.05) rv_put_dec 123.76 27.33 (77)
> 3 + neg_binom(0.10) linux_put_dec 115.49 27.27 (84)
> 3 + neg_binom(0.10) rv_put_dec 108.22 20.14 (71)
> 3 + neg_binom(0.15) linux_put_dec 105.21 20.52 (59)
> 3 + neg_binom(0.15) rv_put_dec 101.75 17.21 (54)
> 3 + neg_binom(0.20) linux_put_dec 100.79 17.25 (65)
> 3 + neg_binom(0.20) rv_put_dec 98.34 16.22 (64)
> 3 + neg_binom(0.50) linux_put_dec 87.84 7.75 (44)
> 3 + neg_binom(0.50) rv_put_dec 85.37 8.26 (45)
>
> [first line is just deltas between a few lfence+rdtsc reads in quick
> succession, to get a sense of the overhead]. For each distribution I'm
> generating 2048 random numbers and then iterate over that 1000 times. So
> almost none of the ~2M observations are being ignored. Here, the new
> code is always faster (but computing a percentage from numbers including
> the rdtsc overhead is meaningless), and in all but the last case (where
> the numbers are almost exclusively 2-digit) the std. deviation is also
> smaller. On an Intel Xeon, I get
>
> 48 32 32 32 32 32 32 32 32 32 32 32 32 32 32
> Distribution Function cycles/conv std.dev. (ignored)
> uniform([10, 2^64-1]) linux_put_dec 152.58 8.54 (26)
> uniform([10, 2^64-1]) rv_put_dec 89.33 3.02 (16)
> 3 + neg_binom(0.05) linux_put_dec 91.88 34.46 (17)
> 3 + neg_binom(0.05) rv_put_dec 71.59 21.19 (10)
> 3 + neg_binom(0.10) linux_put_dec 72.50 25.30 (12)
> 3 + neg_binom(0.10) rv_put_dec 60.10 17.51 (11)
> 3 + neg_binom(0.15) linux_put_dec 63.81 20.68 (8)
> 3 + neg_binom(0.15) rv_put_dec 55.57 15.74 (6)
> 3 + neg_binom(0.20) linux_put_dec 57.18 16.50 (7)
> 3 + neg_binom(0.20) rv_put_dec 51.15 13.58 (12)
> 3 + neg_binom(0.50) linux_put_dec 45.06 6.39 (4)
> 3 + neg_binom(0.50) rv_put_dec 41.16 6.51 (6)
>
> > New code uses lookup table which implies cache misses. Current code
> > is purely code.
>
> Code can miss the cache also, and then it needs to be decoded again. The
> new _code_ is slightly smaller, although the total .text+.rodata does
> increase by ~150 bytes. Yes, overall the new code will probably
> touch one or two extra cache lines compared to the old - so there's a
> tradeoff between one-shot and bulk decimal conversions.
>
> >> On a larger scale, perf shows that top, one of the big consumers of /proc
> >> data, uses 0.5-1.0% fewer cpu cycles.
> >
> > perf(1) also shows variance next to average, what was it?
>
> Not in the output I got - that just showed lines such as
>
> 2.35% top [kernel.kallsyms] [k] num_to_str
>
> But I don't have much perf-fu, so maybe I should have invoked it differently.
I thought you did the macrobenchmark with "perf stat -r"; then it shows the variance.
> > First number printing improvement patch was measuring ~30% speedups:
> > commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
> > vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
> >
> > Now it is 1%.
>
> That's comparing apples and oranges. The ~30% were obtained from a top
> which was modified to do nothing but read /proc/pid/stat, the 1% is out
> of all the cycles top actually spends.
> > I think any further improvements to number printing code should be rejected
> > on philosophical grounds:
>
> Hm, "perfect is the enemy of good" and all that.
>
> > Kernel should ship numbers to ps(1) and top(1) in BINARY,
> > so it would take exactly 1 MOV instruction which takes exactly 1 cycle
> > to execute.
> > Currently it is 1) kernel converts binary to text, 2) userspace
> > converts text to binary,
> > 3) userspace converts binary to text and shows the user. 4) people optimizing #1
>
> I agree this is somewhat silly, but that's what we have, and it is
> unlikely to change anytime soon. Adding a parallel binary interface
> would be a maintenance nightmare.
Hm, such an interface already exists: it's called CONFIG_TASKSTATS.
procps and top of course don't use it.
Alexey
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-14 9:21 ` Alexey Dobriyan
@ 2015-03-16 15:19 ` Alexey Dobriyan
2015-03-17 22:04 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-16 15:19 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Linux Kernel, Andrew Morton, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
[-- Attachment #1: Type: text/plain, Size: 2534 bytes --]
On Sat, Mar 14, 2015 at 12:21 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
Rasmus, I redid the benchmarks:
* time is folded into average/variance immediately;
it didn't change anything of importance,
* conversion is batched in a loop, still removing results where an interrupt
intervened.
It became worse :-)
On this box some numbers are converted _slower_ with your changes.
Numbers are stable (average, min, max).
New code becomes noticeably faster around ~100000 and up.
But these speedups are diluted because num_to_str() takes ~2% of
top(1)'s CPU time with 5000 processes.
# 3.19.1
$ taskset -c 1 ./bb1 1000000 1000
0 13114.776199 +- 114.942115 13097-15981
1 13107.875655 +- 69.779029 13097-16275
10 31091.129391 +- 4.638450 31087-32529
100 34098.031121 +- 3.870317 34097-34615
255 34098.028937 +- 3.714863 34090-34622
1000 37101.041166 +- 4.003837 37100-38276
10000 50760.840697 +- 30.528328 50750-52143
65535 50760.825472 +- 30.126764 50750-52101
100000 59608.431078 +- 8.414339 59605-60144
1000000 77101.574543 +- 6.524058 77077-77637
10000000 88638.113739 +- 723.057589 88095-90125
100000000 86092.934347 +- 6265.582344 84105-106407
4294967295 118092.229892 +- 55.223289 110152-118699
18446744073709551615 210586.971532 +- 62.836890 209125-211120
# 3.19.1+diff
$ taskset -c 1 ./bb2 1000000 1000
0 12093.654853 +- 9.896411 12089-12789
1 12093.352964 +- 8.099249 12089-12838
10 31093.928544 +- 62.655080 28091-34587
100 * 40688.594709 +- 10.637530 39256-41195
255 * 40688.571161 +- 10.667900 39991-41195
1000 * 42095.428977 +- 75.946830 39704-42623
10000 * 51724.895173 +- 6.051896 51135-52248
65535 * 51724.839467 +- 6.205925 51114-52241
100000 53103.076152 +- 20.832946 53081-54103
1000000 59602.515756 +- 10.729477 59598-60137
10000000 61104.239490 +- 23.073365 61096-62629
100000000 70857.128601 +- 31.596224 70854-72660
4294967295 79101.747725 +- 32.571835 79093-81102
18446744073709551615 132183.241400 +- 871.759411 122451-134869
I'm attaching the full tarball so there is no ambiguity about what is measured
and what is not.
Alexey
[-- Attachment #2: num-to-str.tar.gz --]
[-- Type: application/x-gzip, Size: 6804 bytes --]
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-16 15:19 ` Alexey Dobriyan
@ 2015-03-17 22:04 ` Andrew Morton
2015-03-19 12:17 ` Alexey Dobriyan
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-03-17 22:04 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Rasmus Villemoes, Linux Kernel, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Mon, 16 Mar 2015 18:19:41 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
> Rasmus, I redid benchmarks:
tl;dr ;) Is this an ack or a nack?
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-17 22:04 ` Andrew Morton
@ 2015-03-19 12:17 ` Alexey Dobriyan
2015-03-19 22:15 ` Rasmus Villemoes
2015-03-22 17:29 ` Denys Vlasenko
0 siblings, 2 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-19 12:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Rasmus Villemoes, Linux Kernel, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Wed, Mar 18, 2015 at 1:04 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 16 Mar 2015 18:19:41 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
>> Rasmus, I redid benchmarks:
>
> tl;dr ;) Is this an ack or a nack?
New code executes slower for some inputs on one CPU I've benchmarked,
both with -O2 and -Os (Core 2 Duo E6550). The slowdown is in the 2-20% range,
depending on the exact number being converted, and is reproducible.
With -O2 the "bad" range is roughly 100-70000; with -Os it is 100-1000 and
around 100000.
On another CPU (a more modern Core i5-something), the new code is uniformly
faster, giving the advertised 10-20-35-40% speedups (yay!)
The ideal situation is still:
a) one system call to push PIDs into userspace in bulk
(memcpy directly from pid_ns->pidmap[i]),
b) one system call to fetch data in binary form to userspace, given a PID or
a set of PIDs,
without all of this string and /proc overhead.
Alexey
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-19 12:17 ` Alexey Dobriyan
@ 2015-03-19 22:15 ` Rasmus Villemoes
2015-03-22 17:29 ` Denys Vlasenko
1 sibling, 0 replies; 10+ messages in thread
From: Rasmus Villemoes @ 2015-03-19 22:15 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, Linux Kernel, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Thu, Mar 19 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Wed, Mar 18, 2015 at 1:04 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Mon, 16 Mar 2015 18:19:41 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
>>
>>> Rasmus, I redid benchmarks:
>>
>> tl;dr ;) Is this an ack or a nack?
>
> New code executes slower for some input on one CPU I've benchmarked,
> both with -O2 and -Os (Core 2 Duo E6550).
Running Alexey's code on my Core 2 Duo, I can confirm that. However, my
own benchmark did show the claimed 25-50% improvement, depending on
distribution. One difference between our benchmarks is that in Alexey's
case all branches are perfectly predictable - whether that matters I
can't tell [is there a "flush branch prediction" instruction?]. Also, I
found a somewhat subtle flaw in his benchmark [1] which gave the old
code a small (1-2 cycles) advantage. Fixing that and applying the small
tweak I just sent out [2], Alexey's benchmark no longer shows any
difference between the old and new code on the Core 2 Duo.
Rasmus
[1] put_dec was inlined into num_to_str in the old code -- in the actual
kernel code it is not and was not, since it has another caller. I somehow
just cargo-culted the noinline_for_stack annotations all over, so it
also wasn't inlined in the benchmark of the new code.
[2] https://lkml.org/lkml/2015/3/19/802
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-19 12:17 ` Alexey Dobriyan
2015-03-19 22:15 ` Rasmus Villemoes
@ 2015-03-22 17:29 ` Denys Vlasenko
1 sibling, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2015-03-22 17:29 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, Rasmus Villemoes, Linux Kernel, Peter Zijlstra,
Tejun Heo, KAMEZAWA Hiroyuki
On Thu, Mar 19, 2015 at 1:17 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> The ideal situation is still being
> a) one system call to push PIDs into userspace in bulk
> (memcpy directly from pid_ns->pidmap[i]),
> b) one system call to fetch data in binary to userspace given PID or
> set of PIDs,
This is a good point wrt top/ps.
But.
We will never get rid of vsprintf printing numbers in the kernel.
And inevitably there are scenarios where, for one reason or another,
someone needs to print LOTS AND LOTS of them.
It makes perfect sense to optimize that, even after
top/ps start using a more efficient channel to get their data.