From: Rasmus Villemoes
To: Alexey Dobriyan
Cc: Linux Kernel, Andrew Morton, Peter Zijlstra, Tejun Heo,
 Denis Vlasenko, KAMEZAWA Hiroyuki
Subject: Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
Date: Sat, 14 Mar 2015 00:53:20 +0100
Message-ID: <87y4n0xxn3.fsf@rasmusvillemoes.dk>
In-Reply-To: (Alexey Dobriyan's message of "Fri, 13 Mar 2015 15:49:45 +0300")
References: <5500b987.kerYYCYfIffruy3Z%akpm@linux-foundation.org>

On Fri, Mar 13 2015, Alexey Dobriyan wrote:

> On Thu, Mar 12, 2015 at 12:54 AM, wrote:
>> Subject: lib/vsprintf.c: even faster binary to decimal conversion
>
> I spent some time to microbenchmark changes in userspace (audience: fool!).
> Results are below.
>
> Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> Great care was taken to remove interrupt noise.
>
> Number of measurements is 100 millions per line.
> CPU is Intel Core 2 Duo E6550 in 64-bit mode.
>
> 3.19.1:
>
>                     0   98.015369 +-  0.512937   91-616
>                    42  116.000193 +-  3.523826  112-868
>                 27182  137.009008 +-  3.515725  133-1043
>                 65535  137.008262 +-  3.521761  133-840
>            4294967295  201.019966 +-  3.278608  196-1050
>   3141592653589793238  289.996882 +-  3.489376  287-1148
>  18446744073709551615  295.065274 +-  2.860187  287-1029
> -----------------------------------------------------
> 3.19.1+patch
>                     0   94.444063 +-  3.518922   84-630
>                    42  116.428533 +- 18.539093  105-1036
>                    42  116.316904 +- 18.234484  105-833
>                 27182  136.172398 +-  3.737113  133-980
>                 65535  136.014742 +-  3.537882  133-714
>            4294967295  172.009618 +-  3.507473  168-826
>   3141592653589793238  207.001114 +-  3.492724  196-1120
>  18446744073709551615  208.018154 +-  3.220185  203-1246
> -----------------------------------------------------

This seems to measure lfence+rdtsc overhead more than anything else. On
my presumably rather similar Core2 Duo T5870, I get an average of 22
cycles for the old code and 11 cycles for the new when converting 42
two million times in a loop, and I'm not even trying to take interrupts
into account. Since you seem to be dirtying 800 MB of memory, I'm
guessing you get quite a few page faults/TLB misses, which might
explain the rather huge max numbers.
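(For concreteness, the kind of loop I mean is sketched below. put_dec()
stands for whichever implementation -- old or new -- has been copied
out of lib/vsprintf.c into a userspace harness; the real harness
differs in detail, but this is the shape of it.)

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc(), _mm_lfence() */

/* Whichever implementation (old or new) copied out of lib/vsprintf.c. */
extern char *put_dec(char *buf, unsigned long long n);

int main(void)
{
	char buf[24];
	const unsigned long long iters = 2000000;
	uint64_t start, stop;

	_mm_lfence();
	start = __rdtsc();
	for (unsigned long long i = 0; i < iters; i++)
		put_dec(buf, 42);
	_mm_lfence();
	stop = __rdtsc();

	printf("%.1f cycles/conversion\n", (double)(stop - start) / iters);
	return 0;
}

With the timestamp reads outside the loop, the lfence/rdtsc cost is
amortized away instead of being added to every single call, which is
part of why our numbers look so different.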
> New code is somewhat faster for huge numbers.
> But top and ps don't show huge numbers normally --
> it is either PIDs (2^16) or moderately high numbers in a range of millions
> (see /proc/stat)

I said much the same thing in the commit log, and accordingly I've done
(micro)benchmarks with distributions biased to various degrees towards
smaller numbers, all of which showed 25+% improvement.

> * variance for new code is bigger

Seems to depend on how you measure...

> I even tried N=42 twice because I thought 18.5 variance is a glitch
> but it is not.

That does seem odd. But I think your numbers are caused by the huge
memory use. In any case, I modified my test program to record the cycle
count for each individual call (using lfence+rdtsc), but I used a
frequency table instead of a gigantic array, ignoring excessively large
cycle counts (I used > 1023). On the Core 2, I then get

90 90 90 80 80 80 80 80 80 80 80 80 90 80 80

Distribution           Function        cycles/conv  std.dev.  (ignored)
uniform([10, 2^64-1])  linux_put_dec        224.83      9.80      (156)
uniform([10, 2^64-1])  rv_put_dec           147.87      7.44       (86)
3 + neg_binom(0.05)    linux_put_dec        138.97     39.48       (87)
3 + neg_binom(0.05)    rv_put_dec           123.76     27.33       (77)
3 + neg_binom(0.10)    linux_put_dec        115.49     27.27       (84)
3 + neg_binom(0.10)    rv_put_dec           108.22     20.14       (71)
3 + neg_binom(0.15)    linux_put_dec        105.21     20.52       (59)
3 + neg_binom(0.15)    rv_put_dec           101.75     17.21       (54)
3 + neg_binom(0.20)    linux_put_dec        100.79     17.25       (65)
3 + neg_binom(0.20)    rv_put_dec            98.34     16.22       (64)
3 + neg_binom(0.50)    linux_put_dec         87.84      7.75       (44)
3 + neg_binom(0.50)    rv_put_dec            85.37      8.26       (45)

[The first line is just deltas between a few lfence+rdtsc reads in
quick succession, to get a sense of the overhead.] For each
distribution I'm generating 2048 random numbers and then iterating over
those 1000 times, so almost none of the ~2M observations are being
ignored.

Here, the new code is always faster (though computing a percentage from
numbers that include the rdtsc overhead is meaningless), and in all but
the last case (where the numbers are almost exclusively 2-digit) the
std. deviation is also smaller.

On an Intel Xeon, I get

48 32 32 32 32 32 32 32 32 32 32 32 32 32 32

Distribution           Function        cycles/conv  std.dev.  (ignored)
uniform([10, 2^64-1])  linux_put_dec        152.58      8.54       (26)
uniform([10, 2^64-1])  rv_put_dec            89.33      3.02       (16)
3 + neg_binom(0.05)    linux_put_dec         91.88     34.46       (17)
3 + neg_binom(0.05)    rv_put_dec            71.59     21.19       (10)
3 + neg_binom(0.10)    linux_put_dec         72.50     25.30       (12)
3 + neg_binom(0.10)    rv_put_dec            60.10     17.51       (11)
3 + neg_binom(0.15)    linux_put_dec         63.81     20.68        (8)
3 + neg_binom(0.15)    rv_put_dec            55.57     15.74        (6)
3 + neg_binom(0.20)    linux_put_dec         57.18     16.50        (7)
3 + neg_binom(0.20)    rv_put_dec            51.15     13.58       (12)
3 + neg_binom(0.50)    linux_put_dec         45.06      6.39        (4)
3 + neg_binom(0.50)    rv_put_dec            41.16      6.51        (6)

> New code uses lookup table which implies cache misses. Current code
> is purely code.

Code can miss the cache also, and then it needs to be decoded again.
The new _code_ is slightly smaller, although the total .text+.rodata
does increase by ~150 bytes. Yes, overall the new code will probably
touch one or two extra cache lines compared to the old - so there's a
tradeoff between one-shot and bulk decimal conversions.
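(To make the lookup-table point concrete, a table-driven conversion is
roughly of the shape below. This is only a simplified illustration, not
the actual lib/vsprintf.c code -- in particular the real code is more
careful about how the digit pairs are computed -- but it shows where
the extra table bytes, and hence the possible extra data cache line,
come from.)

/* Simplified sketch of table-driven decimal conversion: emit two
 * digits per iteration via a ~200-byte table of digit pairs, instead
 * of producing one digit per step. */
static const char two_digits[] =
	"0001020304050607080910111213141516171819"
	"2021222324252627282930313233343536373839"
	"4041424344454647484950515253545556575859"
	"6061626364656667686970717273747576777879"
	"8081828384858687888990919293949596979899";

/* Writes the digits of n least-significant first (the same "backwards"
 * convention put_dec() in vsprintf.c uses); returns a pointer just
 * past the last digit written. */
static char *put_dec_table(char *buf, unsigned long long n)
{
	while (n >= 100) {
		unsigned int r = n % 100;

		n /= 100;
		*buf++ = two_digits[2 * r + 1];	/* ones digit of the pair */
		*buf++ = two_digits[2 * r];	/* tens digit of the pair */
	}
	if (n >= 10) {
		*buf++ = two_digits[2 * n + 1];
		*buf++ = two_digits[2 * n];
	} else {
		*buf++ = '0' + n;
	}
	return buf;
}

A single conversion that misses on the table pays for the extra cache
line; bulk conversions (top reading thousands of /proc files) get it
essentially for free, which is the tradeoff I mean above.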
>> On a larger scale, perf shows that top, one of the big consumers of /proc
>> data, uses 0.5-1.0% fewer cpu cycles.
>
> perf(1) also shows variance next to average, what was it?

Not in the output I got - that just showed lines such as

     2.35%  top  [kernel.kallsyms]  [k] num_to_str

But I don't have much perf-fu, so maybe I should have invoked it
differently.

> First number printing improvement patch was measuring ~30% speedups:
> commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
> vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
>
> Now it is 1%.

That's comparing apples and oranges. The ~30% were obtained from a top
which was modified to do nothing but read /proc/pid/stat; the 1% is out
of all the cycles top actually spends.

> I think any further improvements to number printing code should be rejected
> on philosophical grounds:

Hm, "perfect is the enemy of good" and all that.

> Kernel should ship numbers to ps(1) and top(1) in BINARY,
> so it would take exactly 1 MOV instruction which takes exactly 1 cycle
> to execute.
> Currently it is 1) kernel converts binary to text, 2) userspace
> converts text to binary,
> 3) userspace converts binary to text and shows the user. 4) people optimizing #1

I agree this is somewhat silly, but that's what we have, and it is
unlikely to change anytime soon. Adding a parallel binary interface
would be a maintenance nightmare.

Rasmus