From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754973AbbCMMtv (ORCPT <rfc822;w@1wt.eu>);
	Fri, 13 Mar 2015 08:49:51 -0400
Received: from mail-la0-f50.google.com ([209.85.215.50]:33680 "EHLO
	mail-la0-f50.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751391AbbCMMtr (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Fri, 13 Mar 2015 08:49:47 -0400
MIME-Version: 1.0
In-Reply-To: <5500b987.kerYYCYfIffruy3Z%akpm@linux-foundation.org>
References: <5500b987.kerYYCYfIffruy3Z%akpm@linux-foundation.org>
Date: Fri, 13 Mar 2015 15:49:45 +0300
Message-ID: <CACVxJT_enpRNoBf3TizQ2U1Z77viP6sCy3CGTMnhpzpkjDpqcw@mail.gmail.com>
Subject: Re: + lib-vsprintfc-even-faster-decimal-conversion.patch added to -mm tree
From: Alexey Dobriyan <adobriyan@gmail.com>
To: Linux Kernel <linux-kernel@vger.kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>
Cc: linux@rasmusvillemoes.dk, Peter Zijlstra <peterz@infradead.org>,
        Tejun Heo <tj@kernel.org>, Denis Vlasenko <vda.linux@googlemail.com>,
        KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 12, 2015 at 12:54 AM,  <akpm@linux-foundation.org> wrote:
> Subject: lib/vsprintf.c: even faster binary to decimal conversion

I spent some time microbenchmarking the changes in userspace (audience: fool!).
Results are below.

Legend is "number avg+-1sigma min-max". All timings are in CPU cycles.
Great care was taken to remove interrupt noise.

The number of measurements is 100 million per line.
The CPU is an Intel Core 2 Duo E6550 in 64-bit mode.
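
For reference, a per-line summary in the legend's format can be computed in
one pass over the cycle counts. This is only a minimal sketch, not the harness
used for the numbers below; struct cycle_stats and summarize_cycles() are
hypothetical names, and it reports the variance, so sigma is its square root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of one benchmark line; all names here are illustrative. */
struct cycle_stats {
	double mean;       /* average cycle count */
	double variance;   /* population variance; sigma = sqrt(variance) */
	uint64_t min;
	uint64_t max;
};

/*
 * One-pass (Welford) mean/variance plus min/max, so that 100 million
 * samples never need to be summed as squares in a single accumulator.
 */
struct cycle_stats summarize_cycles(const uint64_t *samples, size_t n)
{
	struct cycle_stats s;
	double mean = 0.0, m2 = 0.0;
	size_t i;

	assert(samples != NULL && n > 0);
	s.min = s.max = samples[0];
	for (i = 0; i < n; i++) {
		double x = (double)samples[i];
		double delta = x - mean;

		mean += delta / (double)(i + 1);
		m2 += delta * (x - mean);   /* uses the updated mean */
		if (samples[i] < s.min)
			s.min = samples[i];
		if (samples[i] > s.max)
			s.max = samples[i];
	}
	s.mean = mean;
	s.variance = m2 / (double)n;
	return s;
}
```

A wide min-max range with a small sigma, as in the tables, means the outliers
are rare: the max is set by a single sample, the sigma by all of them.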

3.19.1:

0                     98.015369 +- 0.512937   91-616
42                   116.000193 +- 3.523826  112-868
27182                137.009008 +- 3.515725  133-1043
65535                137.008262 +- 3.521761  133-840
4294967295           201.019966 +- 3.278608  196-1050
3141592653589793238  289.996882 +- 3.489376  287-1148
18446744073709551615 295.065274 +- 2.860187  287-1029
-----------------------------------------------------
3.19.1+patch:

0                     94.444063 +- 3.518922   84-630
42                   116.428533 +- 18.539093 105-1036
42                   116.316904 +- 18.234484 105-833
27182                136.172398 +- 3.737113  133-980
65535                136.014742 +- 3.537882  133-714
4294967295           172.009618 +- 3.507473  168-826
3141592653589793238  207.001114 +- 3.492724  196-1120
18446744073709551615 208.018154 +- 3.220185  203-1246
-----------------------------------------------------

The new code is somewhat faster for huge numbers.
But top and ps don't normally show huge numbers --
they show either PIDs (2^16) or moderately large numbers in the range of millions
(see /proc/stat).

* the variance for the new code is bigger
I even tried N=42 twice because I thought the 18.5 sigma was a glitch,
but it is not.

The new code uses a lookup table, which implies cache misses.
The current code is pure code with no table.
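
To make the difference concrete, here is a rough userspace sketch of the two
styles. It is not the code from lib/vsprintf.c or from the patch: the function
names, the run-time table setup, and the choice of a table of two-digit pairs
are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEC_BUF_LEN 21	/* 20 digits for 2^64-1 plus the terminating NUL */

/* Digits are produced least significant first; copy them out reversed. */
static size_t emit_reversed(char *buf, const char *tmp, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		buf[i] = tmp[len - 1 - i];
	buf[len] = '\0';
	return len;
}

/* "Pure code": one division per digit, no data besides the number. */
size_t put_dec_div(char *buf, uint64_t n)
{
	char tmp[DEC_BUF_LEN];
	size_t len = 0;

	assert(buf != NULL);
	do {
		tmp[len++] = (char)('0' + n % 10);
		n /= 10;
	} while (n != 0);
	return emit_reversed(buf, tmp, len);
}

/* "00".."99"; real code would use a const table, this one is filled lazily. */
static char two_digits[100][2];
static int two_digits_ready;

static void two_digits_init(void)
{
	unsigned int i;

	for (i = 0; i < 100; i++) {
		two_digits[i][0] = (char)('0' + i / 10);
		two_digits[i][1] = (char)('0' + i % 10);
	}
	two_digits_ready = 1;
}

/* Table variant: one division per two digits, plus a 200-byte table load. */
size_t put_dec_table(char *buf, uint64_t n)
{
	char tmp[DEC_BUF_LEN];
	size_t len = 0;

	assert(buf != NULL);
	if (!two_digits_ready)
		two_digits_init();
	while (n >= 100) {
		unsigned int r = (unsigned int)(n % 100);

		n /= 100;
		tmp[len++] = two_digits[r][1];	/* low digit first */
		tmp[len++] = two_digits[r][0];
	}
	if (n >= 10) {
		tmp[len++] = two_digits[n][1];
		tmp[len++] = two_digits[n][0];
	} else {
		tmp[len++] = (char)('0' + n);
	}
	return emit_reversed(buf, tmp, len);
}

/* Returns 1 when both variants produce the same string for n. */
int put_dec_agree(uint64_t n)
{
	char a[DEC_BUF_LEN], b[DEC_BUF_LEN];
	size_t la = put_dec_div(a, n);
	size_t lb = put_dec_table(b, n);

	return la == lb && memcmp(a, b, la + 1) == 0;
}
```

The table variant halves the number of divisions but adds a memory load per
digit pair, and that load is where a cache miss could show up as extra variance.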

So I don't think the new printing code will really change anything.

> On a larger scale, perf shows that top, one of the big consumers of /proc
> data, uses 0.5-1.0% fewer cpu cycles.

perf(1) also shows the variance next to the average; what was it?
In my experience everything perf measures has single-digit percent variance
(and this is just 1 sigma!), so you can't say the new code is faster.
Also, the average can vary between runs by more than the variance (yuck!).

The first number-printing improvement patch measured ~30% speedups:
commit 4277eedd7908a0ca8b66fad46ee76b0ad96e6ef2
vsprintf.c: optimizing, part 2: base 10 conversion speedup, v2

Now it is 1%.

> Microbenchmarking shows improvements
> ranging from -50% (for numbers uniformly distributed in [0, 2^64-1]) to
> -25% (for numbers heavily biased toward the smaller end, a more realistic
> distribution).

-25%? Mmm, no, see table above.

I think any further improvements to number printing code should be rejected
on philosophical grounds:

The kernel should ship numbers to ps(1) and top(1) in BINARY,
so that it takes exactly 1 MOV instruction, which takes exactly 1 cycle
to execute.
Currently it is: 1) the kernel converts binary to text, 2) userspace converts
text to binary, 3) userspace converts binary to text and shows it to the user,
4) people optimize #1.

But only the final conversion needs to happen, because it is communication
between human and program. Programs can very well talk in binary.
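
A toy userspace comparison of the two paths, as a sketch only: no such binary
interface exists in this thread, the fixed 8-byte record is a made-up format,
and producer and consumer are assumed to share the host byte order.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Text path: the producer formats the number (step 1 above) and the
 * consumer parses it back (step 2) before it can do anything with it.
 */
uint64_t roundtrip_text(uint64_t value)
{
	char buf[32];
	int len = snprintf(buf, sizeof(buf), "%" PRIu64 "\n", value);

	assert(len > 0 && (size_t)len < sizeof(buf));
	return strtoull(buf, NULL, 10);
}

/*
 * Binary path: the producer copies the value into a fixed-size record
 * (hypothetical layout) and the consumer copies it back out; no
 * conversion happens on either side.
 */
uint64_t roundtrip_binary(uint64_t value)
{
	unsigned char record[sizeof(uint64_t)];
	uint64_t out;

	memcpy(record, &value, sizeof(value));	/* producer side */
	memcpy(&out, record, sizeof(out));	/* consumer side */
	return out;
}
```

Both functions return the same value; the difference is only in how much work
is done in between, which is the part that steps 1 and 2 spend on text.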

So, if you really want to speed up top(1), design a binary interface for
shipping numbers and enjoy 1000% speedups; leave the text files in /proc for
those undemanding shell scripts.

    Alexey