From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
To: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
Denis Vlasenko <vda.linux@googlemail.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
Date: Sat, 14 Mar 2015 00:53:20 +0100 [thread overview]
Message-ID: <87y4n0xxn3.fsf@rasmusvillemoes.dk> (raw)
In-Reply-To: <CACVxJT_enpRNoBf3TizQ2U1Z77viP6sCy3CGTMnhpzpkjDpqcw@mail.gmail.com> (Alexey Dobriyan's message of "Fri, 13 Mar 2015 15:49:45 +0300")
On Fri, Mar 13 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> On Thu, Mar 12, 2015 at 12:54 AM, <akpm@linux-foundation.org> wrote:
>> Subject: lib/vsprintf.c: even faster binary to decimal conversion
>
> I spent some time to microbenchmark changes in userspace (audience: fool!).
> Results are below.
>
> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> Great care was taken to remove interrupt noise.
>
> Number of measurements is 100 millions per line.
> CPU is Intel Core 2 Duo E6550 in 64-bit mode.
>
> 3.19.1:
>
> 0 98.015369 +- 0.512937 91-616
> 42 116.000193 +- 3.523826 112-868
> 27182 137.009008 +- 3.515725 133-1043
> 65535 137.008262 +- 3.521761 133-840
> 4294967295 201.019966 +- 3.278608 196-1050
> 3141592653589793238 289.996882 +- 3.489376 287-1148
> 18446744073709551615 295.065274 +- 2.860187 287-1029
> -----------------------------------------------------
> 3.19.1+patch
> 0 94.444063 +- 3.518922 84-630
> 42 116.428533 +- 18.539093 105-1036
> 42 116.316904 +- 18.234484 105-833
> 27182 136.172398 +- 3.737113 133-980
> 65535 136.014742 +- 3.537882 133-714
> 4294967295 172.009618 +- 3.507473 168-826
> 3141592653589793238 207.001114 +- 3.492724 196-1120
> 18446744073709551615 208.018154 +- 3.220185 203-1246
> -----------------------------------------------------
This seems to measure lfence+rdtsc overhead more than anything else. On
my presumably rather similar Core2 Duo T5870, I get an average of 22
cycles for the old code and 11 cycles for the new when converting 42 two
million times in a loop, and I'm not even trying to take interrupts into
account.
Since you seem to be dirtying 800 MB of memory, I'm guessing you get
quite a few page faults/TLB misses, which might explain the rather huge
max numbers.
> New code is somewhat faster for huge numbers.
> But top and ps don't show huge numbers normally --
> it is either PIDs (2^16) or moderately high numbers in a range of millions
> (see /proc/stat)
I said much the same thing in the commit log, and accordingly I've done
(micro)benchmarks with distributions biased to various degrees towards
smaller numbers, all of which showed 25+% improvement.
> * variance for new code is bigger
Seems to depend on how you measure...
> I even tried N=42 twice because I thought 18.5 variance is a glitch
> but it is not.
That does seem odd. But I think your numbers are caused by the huge
memory use. In any case, I modified my test program to record the
cycle count for each individual call (using lfence+rdtsc), but I used a
frequency table instead of a gigantic array, ignoring excessively large
cycle counts (I used > 1023). On the Core 2, I then get
90 90 90 80 80 80 80 80 80 80 80 80 90 80 80
Distribution Function cycles/conv std.dev. (ignored)
uniform([10, 2^64-1]) linux_put_dec 224.83 9.80 (156)
uniform([10, 2^64-1]) rv_put_dec 147.87 7.44 (86)
3 + neg_binom(0.05) linux_put_dec 138.97 39.48 (87)
3 + neg_binom(0.05) rv_put_dec 123.76 27.33 (77)
3 + neg_binom(0.10) linux_put_dec 115.49 27.27 (84)
3 + neg_binom(0.10) rv_put_dec 108.22 20.14 (71)
3 + neg_binom(0.15) linux_put_dec 105.21 20.52 (59)
3 + neg_binom(0.15) rv_put_dec 101.75 17.21 (54)
3 + neg_binom(0.20) linux_put_dec 100.79 17.25 (65)
3 + neg_binom(0.20) rv_put_dec 98.34 16.22 (64)
3 + neg_binom(0.50) linux_put_dec 87.84 7.75 (44)
3 + neg_binom(0.50) rv_put_dec 85.37 8.26 (45)
[first line is just deltas between a few lfence+rdtsc reads in quick
succession, to get a sense of the overhead]. For each distribution I'm
generating 2048 random numbers and then iterate over that 1000 times. So
almost none of the ~2M observations are being ignored. Here, the new
code is always faster (but computing a percentage from numbers including
the rdtsc overhead is meaningless), and in all but the last case (where
the numbers are almost exclusively 2-digit) the std. deviation is also
smaller. An an Intel Xeon, I get
48 32 32 32 32 32 32 32 32 32 32 32 32 32 32
Distribution Function cycles/conv std.dev. (ignored)
uniform([10, 2^64-1]) linux_put_dec 152.58 8.54 (26)
uniform([10, 2^64-1]) rv_put_dec 89.33 3.02 (16)
3 + neg_binom(0.05) linux_put_dec 91.88 34.46 (17)
3 + neg_binom(0.05) rv_put_dec 71.59 21.19 (10)
3 + neg_binom(0.10) linux_put_dec 72.50 25.30 (12)
3 + neg_binom(0.10) rv_put_dec 60.10 17.51 (11)
3 + neg_binom(0.15) linux_put_dec 63.81 20.68 (8)
3 + neg_binom(0.15) rv_put_dec 55.57 15.74 (6)
3 + neg_binom(0.20) linux_put_dec 57.18 16.50 (7)
3 + neg_binom(0.20) rv_put_dec 51.15 13.58 (12)
3 + neg_binom(0.50) linux_put_dec 45.06 6.39 (4)
3 + neg_binom(0.50) rv_put_dec 41.16 6.51 (6)
> New code uses lookup table which implies cache misses. Current code
> is purely code.
Code can miss the cache also, and then it needs to be decoded again. The
new _code_ is slightly smaller, although the total .text+.rodata does
increase by ~150 bytes. Yes, overall the new code will probably
touch one or two extra cache lines compared to the old - so there's a
tradeoff between one-shot and bulk decimal conversions.
>> On a larger scale, perf shows that top, one of the big consumers of /proc
>> data, uses 0.5-1.0% fewer cpu cycles.
>
> perf(1) also shows variance next to average, what was it?
Not in the output I got - that just showed lines such as
2.35% top [kernel.kallsyms] [k] num_to_str
But I don't have much perf-fu, so maybe I should have invoked it differently.
> First number printing improvement patch was measuring ~30% speedups:
> commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
> vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
>
> Now it is 1%.
That's comparing apples and oranges. The ~30% were obtained from a top
which was modified to do nothing but read /proc/pid/stat, the 1% is out
of all the cycles top actually spends.
> I think any further improvements to number printing code should be rejected
> on philosophical grounds:
Hm, "perfect is the enemy of good" and all that.
> Kernel should ship numbers to ps(1) and top(1) in BINARY,
> so it would take exactly 1 MOV instruction which takes exactly 1 cycle
> to execute.
> Currently it is 1) kernel converts binary to text, 2) usespace
> converts text to binary,
> 3) userspace converts binary to text and shows the user. 4) people optimizing #1
I agree this is somewhat silly, but that's what we have, and it is
unlikely to change anytime soon. Adding a parallel binary interface
would be a maintenance nightmare.
Rasmus
next prev parent reply other threads:[~2015-03-13 23:53 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <5500b987.kerYYCYfIffruy3Z%akpm@linux-foundation.org>
2015-03-13 12:49 ` + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree Alexey Dobriyan
2015-03-13 12:56 ` Alexey Dobriyan
2015-03-13 13:00 ` Alexey Dobriyan
2015-03-13 23:53 ` Rasmus Villemoes [this message]
2015-03-14 9:21 ` Alexey Dobriyan
2015-03-16 15:19 ` Alexey Dobriyan
2015-03-17 22:04 ` Andrew Morton
2015-03-19 12:17 ` Alexey Dobriyan
2015-03-19 22:15 ` Rasmus Villemoes
2015-03-22 17:29 ` Denys Vlasenko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87y4n0xxn3.fsf@rasmusvillemoes.dk \
--to=linux@rasmusvillemoes.dk \
--cc=adobriyan@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=tj@kernel.org \
--cc=vda.linux@googlemail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).