* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
[not found] <5500b987.kerYYCYfIffruy3Z%akpm@linux-foundation.org>
@ 2015-03-13 12:49 ` Alexey Dobriyan
2015-03-13 12:56 ` Alexey Dobriyan
2015-03-13 23:53 ` Rasmus Villemoes
0 siblings, 2 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-13 12:49 UTC (permalink / raw)
To: Linux Kernel, Andrew Morton
Cc: linux, Peter Zijlstra, Tejun Heo, Denis Vlasenko, KAMEZAWA Hiroyuki
On Thu, Mar 12, 2015 at 12:54 AM, <akpm@linux-foundation.org> wrote:
> Subject: lib/vsprintf.c: even faster binary to decimal conversion
I spent some time to microbenchmark changes in userspace (audience: fool!).
Results are below.
Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
Great care was taken to remove interrupt noise.
Number of measurements is 100 millions per line.
CPU is Intel Core 2 Duo E6550 in 64-bit mode.
3.19.1:
0 98.015369 +- 0.512937 91-616
42 116.000193 +- 3.523826 112-868
27182 137.009008 +- 3.515725 133-1043
65535 137.008262 +- 3.521761 133-840
4294967295 201.019966 +- 3.278608 196-1050
3141592653589793238 289.996882 +- 3.489376 287-1148
18446744073709551615 295.065274 +- 2.860187 287-1029
-----------------------------------------------------
3.19.1+patch
0 94.444063 +- 3.518922 84-630
42 116.428533 +- 18.539093 105-1036
42 116.316904 +- 18.234484 105-833
27182 136.172398 +- 3.737113 133-980
65535 136.014742 +- 3.537882 133-714
4294967295 172.009618 +- 3.507473 168-826
3141592653589793238 207.001114 +- 3.492724 196-1120
18446744073709551615 208.018154 +- 3.220185 203-1246
-----------------------------------------------------
New code is somewhat faster for huge numbers.
But top and ps don't normally show huge numbers --
they are either PIDs (up to 2^16) or moderately large numbers in the range of
millions (see /proc/stat).
* variance for the new code is bigger.
I even tried N=42 twice because I thought the 18.5 variance was a glitch,
but it is not.
New code uses a lookup table, which implies cache misses.
Current code is pure code, no data tables.
So I don't think the new printing code will really change anything.
> On a larger scale, perf shows that top, one of the big consumers of /proc
> data, uses 0.5-1.0% fewer cpu cycles.
perf(1) also shows the variance next to the average; what was it?
In my experience, everything perf measures has single-digit percent variance
(and this is just 1 sigma!), so you can't say the new code is faster.
Also, the average can vary between runs by more than the variance (yuck!)
The first number-printing improvement patch measured ~30% speedups:
commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
Now it is 1%.
> Microbenchmarking shows improvements
> ranging from -50% (for numbers uniformly distributed in [0, 2^64-1]) to
> -25% (for numbers heavily biased toward the smaller end, a more realistic
> distribution).
-25%? Mmm, no, see table above.
I think any further improvements to number printing code should be rejected
on philosophical grounds:
The kernel should ship numbers to ps(1) and top(1) in BINARY,
so it would take exactly 1 MOV instruction, which takes exactly 1 cycle
to execute.
Currently it is: 1) the kernel converts binary to text, 2) userspace
converts text to binary,
3) userspace converts binary to text and shows it to the user, 4) people
optimize #1.
But only the final conversion needs to happen, because that is communication
between a human and a program. Programs can very well talk in binary.
So, if you really want to speed up top(1), design a binary interface for
shipping numbers
and enjoy 1000% speedups; leave the text files in /proc for those
undemanding shell scripts.
Alexey
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 12:49 ` + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree Alexey Dobriyan
@ 2015-03-13 12:56 ` Alexey Dobriyan
2015-03-13 13:00 ` Alexey Dobriyan
2015-03-13 23:53 ` Rasmus Villemoes
1 sibling, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-13 12:56 UTC (permalink / raw)
To: Linux Kernel, Andrew Morton
Cc: linux, Peter Zijlstra, Tejun Heo, Denis Vlasenko, KAMEZAWA Hiroyuki
On Fri, Mar 13, 2015 at 3:49 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> Great care was taken to remove interrupt noise.
To reproduce the numbers:
* apply the patch which adds the sys_nr_irq() system call,
* recompile without cpufreq, reboot with isolcpus=,
* shift interrupts off the free CPU,
* use the new system call:
static inline uint64_t sys_nr_irq(void)
{
	uint64_t rv;

	asm volatile ("syscall" : "=a" (rv) : "0" (323) : "memory",
		      "cc", "rcx", "r11");
	return rv;
}
* use the measurement code:
static inline void rdtsc(uint32_t *edx, uint32_t *eax)
{
	asm volatile ("rdtsc" : "=d" (*edx), "=a" (*eax));
}

static inline void lfence(void)
{
	asm volatile ("lfence" ::: "memory");
}

static inline uint64_t time_init(uint64_t *x)
{
	uint32_t edx, eax;

	*x = sys_nr_irq();
	lfence();
	rdtsc(&edx, &eax);
	return ((uint64_t)edx << 32) | eax;
}

static inline uint64_t time_fini(uint64_t *x)
{
	uint32_t edx, eax;

	lfence();
	rdtsc(&edx, &eax);
	*x = sys_nr_irq();
	return ((uint64_t)edx << 32) | eax;
}
----------------------------------------
* drop measurements where an interrupt interfered:
i = 0;
while (i < N) {
	uint64_t t0, t1;
	uint64_t x0, x1;

	t0 = time_init(&x0);
	f();
	t1 = time_fini(&x1);
	if (x1 != x0)
		continue;
	T[i] = t1 - t0;
	i++;
}
------------------------------------
* average the results the way you were taught in school :-)
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 12:56 ` Alexey Dobriyan
@ 2015-03-13 13:00 ` Alexey Dobriyan
0 siblings, 0 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-13 13:00 UTC (permalink / raw)
To: Linux Kernel, Andrew Morton
Cc: linux, Peter Zijlstra, Tejun Heo, Denis Vlasenko, KAMEZAWA Hiroyuki
[-- Attachment #1: Type: text/plain, Size: 373 bytes --]
On Fri, Mar 13, 2015 at 3:56 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Fri, Mar 13, 2015 at 3:49 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
>> Great care was taken to remove interrupt noise.
>
> To reproduce numbers:
> * apply the patch which adds sys_nr_irq() system call.
attached
[-- Attachment #2: sys-nr-irq-3.19.diff --]
[-- Type: text/plain, Size: 2915 bytes --]
commit 4a79e45e0a03843489de0987eb6be84222201c10
Author: Aliaksei Dabryian <aliaksei_dabryian@epam.com>
Date:   Fri Mar 13 12:56:37 2015 +0300

    sys_nr_irq()

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 547e344..cc4d656 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -35,6 +35,7 @@ struct thread_info {
 	void __user *sysenter_return;
 	unsigned int sig_on_uaccess_error:1;
 	unsigned int uaccess_err:1;	/* uaccess failed */
+	__u64 nr_irq;
 };
 
 #define INIT_THREAD_INFO(tsk) \
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 9f6b934..1b2158d 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -32,6 +32,7 @@ void common(void) {
 	OFFSET(TI_flags, thread_info, flags);
 	OFFSET(TI_status, thread_info, status);
 	OFFSET(TI_addr_limit, thread_info, addr_limit);
+	OFFSET(TI_nr_irq, thread_info, nr_irq);
 
 	BLANK();
 	OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63..c4ee124 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -808,6 +808,7 @@ ret_from_intr:
 
 exit_intr:
 	GET_THREAD_INFO(%rcx)
+	addq $1, TI_nr_irq(%rcx)
 	testl $3,CS-ARGOFFSET(%rsp)
 	je retint_kernel
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..f61a380 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	64	nr_irq			sys_nr_irq
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 85893d7..811b0dc 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -205,6 +205,8 @@ extern struct trace_event_functions exit_syscall_print_funcs;
 	} \
 	static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
 
+asmlinkage long sys_nr_irq(void);
+
 asmlinkage long sys32_quotactl(unsigned int cmd, const char __user *special,
 			       qid_t id, void __user *addr);
 asmlinkage long sys_time(time_t __user *tloc);
diff --git a/kernel/Makefile b/kernel/Makefile
index a59481a..3d3d425 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -27,6 +27,8 @@ obj-y += printk/
 obj-y += irq/
 obj-y += rcu/
 
+obj-y += sys_nr_irq.o
+
 obj-$(CONFIG_CHECKPOINT_RESTORE) += kcmp.o
 obj-$(CONFIG_FREEZER) += freezer.o
 obj-$(CONFIG_PROFILING) += profile.o
diff --git a/kernel/sys_nr_irq.c b/kernel/sys_nr_irq.c
new file mode 100644
index 0000000..b57f482
--- /dev/null
+++ b/kernel/sys_nr_irq.c
@@ -0,0 +1,7 @@
+#include <linux/syscalls.h>
+#include <asm/thread_info.h>
+
+SYSCALL_DEFINE0(nr_irq)
+{
+	return current_thread_info()->nr_irq;
+}
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 12:49 ` + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree Alexey Dobriyan
2015-03-13 12:56 ` Alexey Dobriyan
@ 2015-03-13 23:53 ` Rasmus Villemoes
2015-03-14 9:21 ` Alexey Dobriyan
1 sibling, 1 reply; 10+ messages in thread
From: Rasmus Villemoes @ 2015-03-13 23:53 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Linux Kernel, Andrew Morton, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Fri, Mar 13 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Thu, Mar 12, 2015 at 12:54 AM, <akpm@linux-foundation.org> wrote:
>> Subject: lib/vsprintf.c: even faster binary to decimal conversion
>
> I spent some time to microbenchmark changes in userspace (audience: fool!).
> Results are below.
>
> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> Great care was taken to remove interrupt noise.
>
> Number of measurements is 100 millions per line.
> CPU is Intel Core 2 Duo E6550 in 64-bit mode.
>
> 3.19.1:
>
> 0 98.015369 +- 0.512937 91-616
> 42 116.000193 +- 3.523826 112-868
> 27182 137.009008 +- 3.515725 133-1043
> 65535 137.008262 +- 3.521761 133-840
> 4294967295 201.019966 +- 3.278608 196-1050
> 3141592653589793238 289.996882 +- 3.489376 287-1148
> 18446744073709551615 295.065274 +- 2.860187 287-1029
> -----------------------------------------------------
> 3.19.1+patch
> 0 94.444063 +- 3.518922 84-630
> 42 116.428533 +- 18.539093 105-1036
> 42 116.316904 +- 18.234484 105-833
> 27182 136.172398 +- 3.737113 133-980
> 65535 136.014742 +- 3.537882 133-714
> 4294967295 172.009618 +- 3.507473 168-826
> 3141592653589793238 207.001114 +- 3.492724 196-1120
> 18446744073709551615 208.018154 +- 3.220185 203-1246
> -----------------------------------------------------
This seems to measure lfence+rdtsc overhead more than anything else. On
my presumably rather similar Core2 Duo T5870, I get an average of 22
cycles for the old code and 11 cycles for the new when converting 42 two
million times in a loop, and I'm not even trying to take interrupts into
account.
Since you seem to be dirtying 800 MB of memory, I'm guessing you get
quite a few page faults/TLB misses, which might explain the rather huge
max numbers.
> New code is somewhat faster for huge numbers.
> But top and ps don't show huge numbers normally --
> it is either PIDs (2^16) or moderately high numbers in a range of millions
> (see /proc/stat)
I said much the same thing in the commit log, and accordingly I've done
(micro)benchmarks with distributions biased to various degrees towards
smaller numbers, all of which showed 25+% improvement.
> * variance for new code is bigger
Seems to depend on how you measure...
> I even tried N=42 twice because I thought 18.5 variance is a glitch
> but it is not.
That does seem odd. But I think your numbers are caused by the huge
memory use. In any case, I modified my test program to record the
cycle count for each individual call (using lfence+rdtsc), but I used a
frequency table instead of a gigantic array, ignoring excessively large
cycle counts (I used > 1023). On the Core 2, I then get
90 90 90 80 80 80 80 80 80 80 80 80 90 80 80
Distribution Function cycles/conv std.dev. (ignored)
uniform([10, 2^64-1]) linux_put_dec 224.83 9.80 (156)
uniform([10, 2^64-1]) rv_put_dec 147.87 7.44 (86)
3 + neg_binom(0.05) linux_put_dec 138.97 39.48 (87)
3 + neg_binom(0.05) rv_put_dec 123.76 27.33 (77)
3 + neg_binom(0.10) linux_put_dec 115.49 27.27 (84)
3 + neg_binom(0.10) rv_put_dec 108.22 20.14 (71)
3 + neg_binom(0.15) linux_put_dec 105.21 20.52 (59)
3 + neg_binom(0.15) rv_put_dec 101.75 17.21 (54)
3 + neg_binom(0.20) linux_put_dec 100.79 17.25 (65)
3 + neg_binom(0.20) rv_put_dec 98.34 16.22 (64)
3 + neg_binom(0.50) linux_put_dec 87.84 7.75 (44)
3 + neg_binom(0.50) rv_put_dec 85.37 8.26 (45)
[first line is just deltas between a few lfence+rdtsc reads in quick
succession, to get a sense of the overhead]. For each distribution I'm
generating 2048 random numbers and then iterate over that 1000 times. So
almost none of the ~2M observations are being ignored. Here, the new
code is always faster (but computing a percentage from numbers including
the rdtsc overhead is meaningless), and in all but the last case (where
the numbers are almost exclusively 2-digit) the std. deviation is also
smaller. On an Intel Xeon, I get
48 32 32 32 32 32 32 32 32 32 32 32 32 32 32
Distribution Function cycles/conv std.dev. (ignored)
uniform([10, 2^64-1]) linux_put_dec 152.58 8.54 (26)
uniform([10, 2^64-1]) rv_put_dec 89.33 3.02 (16)
3 + neg_binom(0.05) linux_put_dec 91.88 34.46 (17)
3 + neg_binom(0.05) rv_put_dec 71.59 21.19 (10)
3 + neg_binom(0.10) linux_put_dec 72.50 25.30 (12)
3 + neg_binom(0.10) rv_put_dec 60.10 17.51 (11)
3 + neg_binom(0.15) linux_put_dec 63.81 20.68 (8)
3 + neg_binom(0.15) rv_put_dec 55.57 15.74 (6)
3 + neg_binom(0.20) linux_put_dec 57.18 16.50 (7)
3 + neg_binom(0.20) rv_put_dec 51.15 13.58 (12)
3 + neg_binom(0.50) linux_put_dec 45.06 6.39 (4)
3 + neg_binom(0.50) rv_put_dec 41.16 6.51 (6)
> New code uses lookup table which implies cache misses. Current code
> is purely code.
Code can miss the cache also, and then it needs to be decoded again. The
new _code_ is slightly smaller, although the total .text+.rodata does
increase by ~150 bytes. Yes, overall the new code will probably
touch one or two extra cache lines compared to the old - so there's a
tradeoff between one-shot and bulk decimal conversions.
>> On a larger scale, perf shows that top, one of the big consumers of /proc
>> data, uses 0.5-1.0% fewer cpu cycles.
>
> perf(1) also shows variance next to average, what was it?
Not in the output I got - that just showed lines such as
2.35% top [kernel.kallsyms] [k] num_to_str
But I don't have much perf-fu, so maybe I should have invoked it differently.
> First number printing improvement patch was measuring ~30% speedups:
> commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
> vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
>
> Now it is 1%.
That's comparing apples and oranges. The ~30% were obtained from a top
which was modified to do nothing but read /proc/pid/stat, the 1% is out
of all the cycles top actually spends.
> I think any further improvements to number printing code should be rejected
> on philosophical grounds:
Hm, "perfect is the enemy of good" and all that.
> Kernel should ship numbers to ps(1) and top(1) in BINARY,
> so it would take exactly 1 MOV instruction which takes exactly 1 cycle
> to execute.
> Currently it is 1) kernel converts binary to text, 2) userspace
> converts text to binary,
> 3) userspace converts binary to text and shows the user. 4) people optimizing #1
I agree this is somewhat silly, but that's what we have, and it is
unlikely to change anytime soon. Adding a parallel binary interface
would be a maintenance nightmare.
Rasmus
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-13 23:53 ` Rasmus Villemoes
@ 2015-03-14 9:21 ` Alexey Dobriyan
2015-03-16 15:19 ` Alexey Dobriyan
0 siblings, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-14 9:21 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Linux Kernel, Andrew Morton, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Sat, Mar 14, 2015 at 12:53:20AM +0100, Rasmus Villemoes wrote:
> On Fri, Mar 13 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> > On Thu, Mar 12, 2015 at 12:54 AM, <akpm@linux-foundation.org> wrote:
> >> Subject: lib/vsprintf.c: even faster binary to decimal conversion
> >
> > I spent some time to microbenchmark changes in userspace (audience: fool!).
> > Results are below.
> >
> > Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> > Great care was taken to remove interrupt noise.
> >
> > Number of measurements is 100 millions per line.
> > CPU is Intel Core 2 Duo E6550 in 64-bit mode.
> >
> > 3.19.1:
> >
> > 0 98.015369 +- 0.512937 91-616
> > 42 116.000193 +- 3.523826 112-868
> > 27182 137.009008 +- 3.515725 133-1043
> > 65535 137.008262 +- 3.521761 133-840
> > 4294967295 201.019966 +- 3.278608 196-1050
> > 3141592653589793238 289.996882 +- 3.489376 287-1148
> > 18446744073709551615 295.065274 +- 2.860187 287-1029
> > -----------------------------------------------------
> > 3.19.1+patch
> > 0 94.444063 +- 3.518922 84-630
> > 42 116.428533 +- 18.539093 105-1036
> > 42 116.316904 +- 18.234484 105-833
> > 27182 136.172398 +- 3.737113 133-980
> > 65535 136.014742 +- 3.537882 133-714
> > 4294967295 172.009618 +- 3.507473 168-826
> > 3141592653589793238 207.001114 +- 3.492724 196-1120
> > 18446744073709551615 208.018154 +- 3.220185 203-1246
> > -----------------------------------------------------
>
> This seems to measure lfence+rdtsc overhead more than anything else. On
> my presumably rather similar Core2 Duo T5870, I get an average of 22
> cycles for the old code and 11 cycles for the new when converting 42 two
> million times in a loop, and I'm not even trying to take interrupts into
> account.
>
> Since you seem to be dirtying 800 MB of memory, I'm guessing you get
> quite a few page faults/TLB misses, which might explain the rather huge
> max numbers.
Huge max numbers come from RDTSC itself. If you measure the tiniest possible
RDTSC;RDTSC loop, it will reliably show huge results once in a while;
on that box it is in the 70-500 range, IIRC.
> > New code is somewhat faster for huge numbers.
> > But top and ps don't show huge numbers normally --
> > it is either PIDs (2^16) or moderately high numbers in a range of millions
> > (see /proc/stat)
>
> I said much the same thing in the commit log, and accordingly I've done
> (micro)benchmarks with distributions biased to various degrees towards
> smaller numbers, all of which showed 25+% improvement.
>
> > * variance for new code is bigger
>
> Seems to depend on how you measure...
>
> > I even tried N=42 twice because I thought 18.5 variance is a glitch
> > but it is not.
>
> That does seem odd. But I think your numbers are caused by the huge
> memory use. In any case, I modified my test program to record the
> cycle count for each individual call (using lfence+rdtsc), but I used a
> frequency table instead of a gigantic array, ignoring excessively large
> cycle counts (I used > 1023). On the Core 2, I then get
>
> 90 90 90 80 80 80 80 80 80 80 80 80 90 80 80
> Distribution Function cycles/conv std.dev. (ignored)
> uniform([10, 2^64-1]) linux_put_dec 224.83 9.80 (156)
> uniform([10, 2^64-1]) rv_put_dec 147.87 7.44 (86)
> 3 + neg_binom(0.05) linux_put_dec 138.97 39.48 (87)
> 3 + neg_binom(0.05) rv_put_dec 123.76 27.33 (77)
> 3 + neg_binom(0.10) linux_put_dec 115.49 27.27 (84)
> 3 + neg_binom(0.10) rv_put_dec 108.22 20.14 (71)
> 3 + neg_binom(0.15) linux_put_dec 105.21 20.52 (59)
> 3 + neg_binom(0.15) rv_put_dec 101.75 17.21 (54)
> 3 + neg_binom(0.20) linux_put_dec 100.79 17.25 (65)
> 3 + neg_binom(0.20) rv_put_dec 98.34 16.22 (64)
> 3 + neg_binom(0.50) linux_put_dec 87.84 7.75 (44)
> 3 + neg_binom(0.50) rv_put_dec 85.37 8.26 (45)
>
> [first line is just deltas between a few lfence+rdtsc reads in quick
> succession, to get a sense of the overhead]. For each distribution I'm
> generating 2048 random numbers and then iterate over that 1000 times. So
> almost none of the ~2M observations are being ignored. Here, the new
> code is always faster (but computing a percentage from numbers including
> the rdtsc overhead is meaningless), and in all but the last case (where
> the numbers are almost exclusively 2-digit) the std. deviation is also
> smaller. On an Intel Xeon, I get
>
> 48 32 32 32 32 32 32 32 32 32 32 32 32 32 32
> Distribution Function cycles/conv std.dev. (ignored)
> uniform([10, 2^64-1]) linux_put_dec 152.58 8.54 (26)
> uniform([10, 2^64-1]) rv_put_dec 89.33 3.02 (16)
> 3 + neg_binom(0.05) linux_put_dec 91.88 34.46 (17)
> 3 + neg_binom(0.05) rv_put_dec 71.59 21.19 (10)
> 3 + neg_binom(0.10) linux_put_dec 72.50 25.30 (12)
> 3 + neg_binom(0.10) rv_put_dec 60.10 17.51 (11)
> 3 + neg_binom(0.15) linux_put_dec 63.81 20.68 (8)
> 3 + neg_binom(0.15) rv_put_dec 55.57 15.74 (6)
> 3 + neg_binom(0.20) linux_put_dec 57.18 16.50 (7)
> 3 + neg_binom(0.20) rv_put_dec 51.15 13.58 (12)
> 3 + neg_binom(0.50) linux_put_dec 45.06 6.39 (4)
> 3 + neg_binom(0.50) rv_put_dec 41.16 6.51 (6)
>
> > New code uses lookup table which implies cache misses. Current code
> > is purely code.
>
> Code can miss the cache also, and then it needs to be decoded again. The
> new _code_ is slightly smaller, although the total .text+.rodata does
> increase by ~150 bytes. Yes, overall the new code will probably
> touch one or two extra cache lines compared to the old - so there's a
> tradeoff between one-shot and bulk decimal conversions.
>
> >> On a larger scale, perf shows that top, one of the big consumers of /proc
> >> data, uses 0.5-1.0% fewer cpu cycles.
> >
> > perf(1) also shows variance next to average, what was it?
>
> Not in the output I got - that just showed lines such as
>
> 2.35% top [kernel.kallsyms] [k] num_to_str
>
> But I don't have much perf-fu, so maybe I should have invoked it differently.
I thought you did the macrobenchmark with "perf stat -r"; then it shows the variance.
> > First number printing improvement patch was measuring ~30% speedups:
> > commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
> > vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
> >
> > Now it is 1%.
>
> That's comparing apples and oranges. The ~30% were obtained from a top
> which was modified to do nothing but read /proc/pid/stat, the 1% is out
> of all the cycles top actually spends.
> > I think any further improvements to number printing code should be rejected
> > on philosophical grounds:
>
> Hm, "perfect is the enemy of good" and all that.
>
> > Kernel should ship numbers to ps(1) and top(1) in BINARY,
> > so it would take exactly 1 MOV instruction which takes exactly 1 cycle
> > to execute.
> > Currently it is 1) kernel converts binary to text, 2) userspace
> > converts text to binary,
> > 3) userspace converts binary to text and shows the user. 4) people optimizing #1
>
> I agree this is somewhat silly, but that's what we have, and it is
> unlikely to change anytime soon. Adding a parallel binary interface
> would be a maintenance nightmare.
Hm, such an interface already exists: it's called CONFIG_TASKSTATS.
procps and top of course don't use it.
Alexey
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-14 9:21 ` Alexey Dobriyan
@ 2015-03-16 15:19 ` Alexey Dobriyan
2015-03-17 22:04 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-16 15:19 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Linux Kernel, Andrew Morton, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
[-- Attachment #1: Type: text/plain, Size: 2534 bytes --]
On Sat, Mar 14, 2015 at 12:21 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
Rasmus, I redid the benchmarks:
* time is folded into average/variance immediately;
it didn't change anything of importance,
* conversion is batched in a loop, still removing results where an interrupt
intervened.
It became worse :-)
On this box some numbers are converted _slower_ with your changes.
Numbers are stable (average, min, max).
New code becomes noticeably faster around ~100000 and up.
But these speedups are diluted because num_to_str() takes ~2% of
top(1)'s CPU time with 5000 processes.
# 3.19.1
$ taskset -c 1 ./bb1 1000000 1000
0 13114.776199 +- 114.942115 13097-15981
1 13107.875655 +- 69.779029 13097-16275
10 31091.129391 +- 4.638450 31087-32529
100 34098.031121 +- 3.870317 34097-34615
255 34098.028937 +- 3.714863 34090-34622
1000 37101.041166 +- 4.003837 37100-38276
10000 50760.840697 +- 30.528328 50750-52143
65535 50760.825472 +- 30.126764 50750-52101
100000 59608.431078 +- 8.414339 59605-60144
1000000 77101.574543 +- 6.524058 77077-77637
10000000 88638.113739 +- 723.057589 88095-90125
100000000 86092.934347 +- 6265.582344 84105-106407
4294967295 118092.229892 +- 55.223289 110152-118699
18446744073709551615 210586.971532 +- 62.836890 209125-211120
# 3.19.1+diff
$ taskset -c 1 ./bb2 1000000 1000
0 12093.654853 +- 9.896411 12089-12789
1 12093.352964 +- 8.099249 12089-12838
10 31093.928544 +- 62.655080 28091-34587
100 * 40688.594709 +- 10.637530 39256-41195
255 * 40688.571161 +- 10.667900 39991-41195
1000 * 42095.428977 +- 75.946830 39704-42623
10000 * 51724.895173 +- 6.051896 51135-52248
65535 * 51724.839467 +- 6.205925 51114-52241
100000 53103.076152 +- 20.832946 53081-54103
1000000 59602.515756 +- 10.729477 59598-60137
10000000 61104.239490 +- 23.073365 61096-62629
100000000 70857.128601 +- 31.596224 70854-72660
4294967295 79101.747725 +- 32.571835 79093-81102
18446744073709551615 132183.241400 +- 871.759411 122451-134869
I'm attaching the full tarball so there is no ambiguity about what is measured
and what is not.
Alexey
[-- Attachment #2: num-to-str.tar.gz --]
[-- Type: application/x-gzip, Size: 6804 bytes --]
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-16 15:19 ` Alexey Dobriyan
@ 2015-03-17 22:04 ` Andrew Morton
2015-03-19 12:17 ` Alexey Dobriyan
0 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2015-03-17 22:04 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Rasmus Villemoes, Linux Kernel, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Mon, 16 Mar 2015 18:19:41 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
> Rasmus, I redid benchmarks:
tl;dr ;) Is this an ack or a nack?
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-17 22:04 ` Andrew Morton
@ 2015-03-19 12:17 ` Alexey Dobriyan
2015-03-19 22:15 ` Rasmus Villemoes
2015-03-22 17:29 ` Denys Vlasenko
0 siblings, 2 replies; 10+ messages in thread
From: Alexey Dobriyan @ 2015-03-19 12:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Rasmus Villemoes, Linux Kernel, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Wed, Mar 18, 2015 at 1:04 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 16 Mar 2015 18:19:41 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
>> Rasmus, I redid benchmarks:
>
> tl;dr ;) Is this an ack or a nack?
New code executes slower for some inputs on one CPU I've benchmarked,
both with -O2 and -Os (Core 2 Duo E6550). The slowdown is in the 2-20% range,
depending on the exact number being converted, and is reproducible.
With -O2 the "bad" range is roughly 100-70000; with -Os it is 100-1000 and
around 100000.
On another CPU (a more modern Core i5-something), the new code is uniformly
faster, giving the advertised 10-20-35-40% speedups (yay!)
The ideal situation is still:
a) one system call to push PIDs into userspace in bulk
(memcpy directly from pid_ns->pidmap[i]),
b) one system call to fetch data in binary form to userspace, given a PID or
a set of PIDs,
without all of this string and /proc overhead.
Alexey
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-19 12:17 ` Alexey Dobriyan
@ 2015-03-19 22:15 ` Rasmus Villemoes
2015-03-22 17:29 ` Denys Vlasenko
1 sibling, 0 replies; 10+ messages in thread
From: Rasmus Villemoes @ 2015-03-19 22:15 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, Linux Kernel, Peter Zijlstra, Tejun Heo,
Denis Vlasenko, KAMEZAWA Hiroyuki
On Thu, Mar 19 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Wed, Mar 18, 2015 at 1:04 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>> On Mon, 16 Mar 2015 18:19:41 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
>>
>>> Rasmus, I redid benchmarks:
>>
>> tl;dr ;) Is this an ack or a nack?
>
> New code executes slower for some input on one CPU I've benchmarked,
> both with -O2 and -Os (Core 2 Duo E6550).
Running Alexey's code on my Core 2 Duo, I can confirm that. However, my
own benchmark did show the claimed 25-50% improvement, depending on
distribution. One difference between our benchmarks is that in Alexey's
case all branches are perfectly predictable - whether that matters I
can't tell [is there a "flush branch prediction" instruction?]. Also, I
found a somewhat subtle flaw in his benchmark [1] which gave the old
code a small (1-2 cycles) advantage. Fixing that and applying the small
tweak I just sent out [2], Alexey's benchmark no longer shows any
difference between the old and new code on the Core 2 Duo.
Rasmus
[1] put_dec was inlined into num_to_str in the old code -- in the actual
kernel code it is not and was not, since it has another caller. I somehow
just cargo-culted the noinline_for_stack annotations all over, so it
also wasn't inlined in the benchmark of the new code.
[2] https://lkml.org/lkml/2015/3/19/802
* Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
2015-03-19 12:17 ` Alexey Dobriyan
2015-03-19 22:15 ` Rasmus Villemoes
@ 2015-03-22 17:29 ` Denys Vlasenko
1 sibling, 0 replies; 10+ messages in thread
From: Denys Vlasenko @ 2015-03-22 17:29 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, Rasmus Villemoes, Linux Kernel, Peter Zijlstra,
Tejun Heo, KAMEZAWA Hiroyuki
On Thu, Mar 19, 2015 at 1:17 PM, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> The ideal situation is still being
> a) one system call to push PIDs into userspace in bulk
> (memcpy directly from pid_ns->pidmap[i]),
> b) one system call to fetch data in binary to userspace given PID or
> set of PIDs,
This is a good point wrt top/ps.
But.
We will never get rid of vsprintf printing numbers in the kernel.
And inevitably there are scenarios where, for one reason or another,
someone needs to print LOTS AND LOTS of them.
It makes perfect sense to optimize that, even after
top/ps start using a more efficient channel to get their data.