From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752785AbbCNJVO (ORCPT <rfc822;w@1wt.eu>);
	Sat, 14 Mar 2015 05:21:14 -0400
Received: from mail-we0-f178.google.com ([74.125.82.178]:32784 "EHLO
	mail-we0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751358AbbCNJVG (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Sat, 14 Mar 2015 05:21:06 -0400
Date: Sat, 14 Mar 2015 12:21:05 +0300
From: Alexey Dobriyan <adobriyan@gmail.com>
To: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Linux Kernel <linux-kernel@vger.kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Peter Zijlstra <peterz@infradead.org>, Tejun Heo <tj@kernel.org>,
        Denis Vlasenko <vda.linux@googlemail.com>,
        KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Subject: Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to
 -mm tree
Message-ID: <20150314092104.GA1674@p183.telecom.by>
References: <5500b987.kerYYCYfIffruy3Z%akpm@linux-foundation.org>
 <CACVxJT_enpRNoBf3TizQ2U1Z77viP6sCy3CGTMnhpzpkjDpqcw@mail.gmail.com>
 <87y4n0xxn3.fsf@rasmusvillemoes.dk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87y4n0xxn3.fsf@rasmusvillemoes.dk>
User-Agent: Mutt/1.5.22 (2013-10-16)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Mar 14, 2015 at 12:53:20AM +0100, Rasmus Villemoes wrote:
> On Fri, Mar 13 2015, Alexey Dobriyan <adobriyan@gmail.com> wrote:
> 
> > On Thu, Mar 12, 2015 at 12:54 AM,  <akpm@linux-foundation.org> wrote:
> >> Subject: lib/vsprintf.c: even faster binary to decimal conversion
> >
> > I spent some time to microbenchmark changes in userspace (audience: fool!).
> > Results are below.
> >
> > Legend is "number avg+-1sigma min-max". Every number is CPU cycles.
> > Great care was taken to remove interrupt noise.
> >
> > Number of measurements is 100 millions per line.
> > CPU is Intel Core 2 Duo E6550 in 64-bit mode.
> >
> > 3.19.1:
> >
> > 0                     98.015369 +- 0.512937   91-616
> > 42                   116.000193 +- 3.523826  112-868
> > 27182                137.009008 +- 3.515725  133-1043
> > 65535                137.008262 +- 3.521761  133-840
> > 4294967295           201.019966 +- 3.278608  196-1050
> > 3141592653589793238  289.996882 +- 3.489376  287-1148
> > 18446744073709551615 295.065274 +- 2.860187  287-1029
> > -----------------------------------------------------
> > 3.19.1+patch
> > 0                     94.444063 +- 3.518922   84-630
> > 42                   116.428533 +- 18.539093 105-1036
> > 42                   116.316904 +- 18.234484 105-833
> > 27182                136.172398 +- 3.737113  133-980
> > 65535                136.014742 +- 3.537882  133-714
> > 4294967295           172.009618 +- 3.507473  168-826
> > 3141592653589793238  207.001114 +- 3.492724  196-1120
> > 18446744073709551615 208.018154 +- 3.220185  203-1246
> > -----------------------------------------------------
> 
> This seems to measure lfence+rdtsc overhead more than anything else. On
> my presumably rather similar Core2 Duo T5870, I get an average of 22
> cycles for the old code and 11 cycles for the new when converting 42 two
> million times in a loop, and I'm not even trying to take interrupts into
> account.
> 
> Since you seem to be dirtying 800 MB of memory, I'm guessing you get
> quite a few page faults/TLB misses, which might explain the rather huge
> max numbers.

Huge max numbers come from RDTSC itself. If you measure tiniest possible
RDTSC;RDTSC loop, it will reliably show huge results once in a while,
on that box it is in 70-500 range IIRC.

> > New code is somewhat faster for huge numbers.
> > But top and ps don't show huge numbers normally --
> > it is either PIDs (2^16) or moderately high numbers in a range of millions
> > (see /proc/stat)
> 
> I said much the same thing in the commit log, and accordingly I've done
> (micro)benchmarks with distributions biased to various degrees towards
> smaller numbers, all of which showed 25+% improvement.
> 
> > * variance for new code is bigger
> 
> Seems to depend on how you measure...
> 
> > I even tried N=42 twice because I thought 18.5 variance is a glitch
> > but it is not.
> 
> That does seem odd. But I think your numbers are caused by the huge
> memory use. In any case, I modified my test program to record the
> cycle count for each individual call (using lfence+rdtsc), but I used a
> frequency table instead of a gigantic array, ignoring excessively large
> cycle counts (I used > 1023). On the Core 2, I then get
> 
> 90 90 90 80 80 80 80 80 80 80 80 80 90 80 80
> Distribution              Function         cycles/conv  std.dev. (ignored) 
> uniform([10, 2^64-1])     linux_put_dec          224.83     9.80 (156)
> uniform([10, 2^64-1])     rv_put_dec             147.87     7.44 (86)
> 3 + neg_binom(0.05)       linux_put_dec          138.97    39.48 (87)
> 3 + neg_binom(0.05)       rv_put_dec             123.76    27.33 (77)
> 3 + neg_binom(0.10)       linux_put_dec          115.49    27.27 (84)
> 3 + neg_binom(0.10)       rv_put_dec             108.22    20.14 (71)
> 3 + neg_binom(0.15)       linux_put_dec          105.21    20.52 (59)
> 3 + neg_binom(0.15)       rv_put_dec             101.75    17.21 (54)
> 3 + neg_binom(0.20)       linux_put_dec          100.79    17.25 (65)
> 3 + neg_binom(0.20)       rv_put_dec              98.34    16.22 (64)
> 3 + neg_binom(0.50)       linux_put_dec           87.84     7.75 (44)
> 3 + neg_binom(0.50)       rv_put_dec              85.37     8.26 (45)
> 
> [first line is just deltas between a few lfence+rdtsc reads in quick
> succession, to get a sense of the overhead]. For each distribution I'm
> generating 2048 random numbers and then iterate over that 1000 times. So
> almost none of the ~2M observations are being ignored. Here, the new
> code is always faster (but computing a percentage from numbers including
> the rdtsc overhead is meaningless), and in all but the last case (where
> the numbers are almost exclusively 2-digit) the std. deviation is also
> smaller. An an Intel Xeon, I get
> 
> 48 32 32 32 32 32 32 32 32 32 32 32 32 32 32
> Distribution              Function         cycles/conv  std.dev. (ignored) 
> uniform([10, 2^64-1])     linux_put_dec          152.58     8.54 (26)
> uniform([10, 2^64-1])     rv_put_dec              89.33     3.02 (16)
> 3 + neg_binom(0.05)       linux_put_dec           91.88    34.46 (17)
> 3 + neg_binom(0.05)       rv_put_dec              71.59    21.19 (10)
> 3 + neg_binom(0.10)       linux_put_dec           72.50    25.30 (12)
> 3 + neg_binom(0.10)       rv_put_dec              60.10    17.51 (11)
> 3 + neg_binom(0.15)       linux_put_dec           63.81    20.68 (8)
> 3 + neg_binom(0.15)       rv_put_dec              55.57    15.74 (6)
> 3 + neg_binom(0.20)       linux_put_dec           57.18    16.50 (7)
> 3 + neg_binom(0.20)       rv_put_dec              51.15    13.58 (12)
> 3 + neg_binom(0.50)       linux_put_dec           45.06     6.39 (4)
> 3 + neg_binom(0.50)       rv_put_dec              41.16     6.51 (6)
> 
> > New code uses lookup table which implies cache misses.  Current code
> > is purely code.
> 
> Code can miss the cache also, and then it needs to be decoded again. The
> new _code_ is slightly smaller, although the total .text+.rodata does
> increase by ~150 bytes. Yes, overall the new code will probably
> touch one or two extra cache lines compared to the old - so there's a
> tradeoff between one-shot and bulk decimal conversions.
> 
> >> On a larger scale, perf shows that top, one of the big consumers of /proc
> >> data, uses 0.5-1.0% fewer cpu cycles.
> >
> > perf(1) also shows variance next to average, what was it?
> 
> Not in the output I got - that just showed lines such as 
> 
>      2.35%  top      [kernel.kallsyms]   [k] num_to_str                 
> 
> But I don't have much perf-fu, so maybe I should have invoked it differently.

I thought you did macrobenchmark with "perf stat -r", then it shows variance.

> > First number printing improvement patch was measuring ~30% speedups:
> > commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
> > vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2
> >
> > Now it is 1%.
> 
> That's comparing apples and oranges. The ~30% were obtained from a top
> which was modified to do nothing but read /proc/pid/stat, the 1% is out
> of all the cycles top actually spends.

> > I think any further improvements to number printing code should be rejected
> > on philosophical grounds:
> 
> Hm, "perfect is the enemy of good" and all that.
> 
> > Kernel should ship numbers to ps(1) and top(1) in BINARY,
> > so it would take exactly 1 MOV instruction which takes exactly 1 cycle
> > to execute.
> > Currently it is 1) kernel converts binary to text, 2) usespace
> > converts text to binary,
> > 3) userspace converts binary to text and shows the user. 4) people optimizing #1
> 
> I agree this is somewhat silly, but that's what we have, and it is
> unlikely to change anytime soon. Adding a parallel binary interface
> would be a maintenance nightmare.

Hm, such interface already exists, it's called CONFIG_TASKSTATS.
procps and top of course don't use it.

	Alexey