All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()
@ 2010-08-05 22:38 Michal Nazarewicz
  2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
  2010-08-06  3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
  0 siblings, 2 replies; 14+ messages in thread
From: Michal Nazarewicz @ 2010-08-05 22:38 UTC (permalink / raw)
  To: linux-kernel, linux-kernel
  Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones,
	Denis Vlasenko, Andrew Morton

The put_dec_trunc() and put_dec_full() functions were based on
a code optimised for processors with 8-bit ALU but even then
they failed to satisfy the same constraints and in fact
required at least 16-bit ALU (because at least one number they
operate in can take 9 bits).

This version of those functions proposed by this patch goes
further and uses the full capacity of a 32-bit ALU and instead
of splitting the number into nibbles and operating on them it
performs the obvious algorithm for base conversion expect it
uses optimised code for dividing by ten (ie. no division is
actually performed).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 lib/vsprintf.c |  150 +++++++++++++++++++++++++++----------------------------
 1 files changed, 74 insertions(+), 76 deletions(-)


I did some benchmark on the following three processors:

Phenom: AMD Phenom(tm) II X3 710 Processor   (64-bit)
Atom:   Intel(R) Atom(TM) CPU N270 @ 1.60GHz (32-bit)
ARM:    ARMv7 Processor rev 2 (v7l)          (32-bit)

Here are the results (normalised to the fastest/smallest):

                    :        ARM     Phonem      Intel
-- Speed -------------------------------------------------------
orig_put_dec_full   :   1.078600   1.777800   1.356917  Original
mod1_put_dec_full   :   1.000000   1.117665   1.017742
mod3_put_dec_full   :   1.032507   1.000000   1.000000  Proposed

orig_put_dec_trunc  :   1.092177   1.657014   1.215658  Original
mod1_put_dec_trunc  :   1.006836   1.395088   1.078385
mod3_put_dec_trunc  :   1.000000   1.000000   1.000000  Proposed
-- Size --------------------------------------------------------
orig_put_dec_full   :   1.212766   1.355372   1.310345  Original
mod1_put_dec_full   :   1.021277   1.000000   1.000000
mod3_put_dec_full   :   1.000000   1.049587   1.172414  Proposed

orig_put_dec_trunc  :   1.363636   1.784000   1.317365  Original
mod1_put_dec_trunc  :   1.181818   1.400000   1.275449
mod3_put_dec_trunc  :   1.000000   1.000000   1.000000  Proposed

Source of the benchmark as well as code of all the modified version of
functions is included with the third patch of the benchmark.


As it can be observed from the table, the "mod3" version (proposed by
this patch) is the fastest version with the only exception of
"mod3_put_dec_full" on ARM which is slightly slower then
"mod1_put_dec_full" version.

It is also smaller, in terms of code size, then the original version
even though "mod1" is even smaller.

In the end, I'm proposing "mod3" because the size is not that
important (those are mere bytes) and as of speed, for ARM I have
proposed another solution in the next patch of this patchset.

The function is also shorter in terms of lines of code. ;)

I'm currently running 2.6.35 with this patch applied.  It applies just
fine on -next as well but I haven't tested this kernel and I've run it
with -next on ARM.


diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index b8a2f54..d63fb15 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -278,96 +278,94 @@ int skip_atoi(const char **s)
 	return i;
 }
 
-/* Decimal conversion is by far the most typical, and is used
- * for /proc and /sys data. This directly impacts e.g. top performance
- * with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
-
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
+ * correct results for num < 81920.  Because of this, we check at the
+ * beginning if we are dealing with a number that may cause trouble
+ * and if so, we make it smaller.
+ *
+ * (As a minor note, all operands are always 16 bit so this function
+ * should work well on hardware that cannot multiply 32 bit numbers).
+ *
+ * (Previous a code based on
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
+ * with permission from the author, Douglas W. Jones.)
+ *
+ * Other, possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10:  1100111
+ * (x * 0x34) >> 9:    110100 - same
+ * (x * 0x1a) >> 8:     11010 - same
+ * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+ */
 static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full(char *buf, unsigned q)
 {
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
-
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0'; /* least significant digit */
-	d1 = q + 9*d3 + 5*d2 + d1;
-	if (d1 != 0) {
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0'; /* next digit */
-
-		d2 = q + 2*d2;
-		if ((d2 != 0) || (d3 != 0)) {
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0'; /* next digit */
-
-			d3 = q + 4*d3;
-			if (d3 != 0) {
-				q = (d3 * 0xcd) >> 11;
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';  /* next digit */
-				if (q != 0)
-					*buf++ = q + '0'; /* most sign. digit */
-			}
-		}
+	unsigned r;
+	char a = '0';
+
+	if (q > 0xffff) {
+		a = '6';
+		q -= 60000;
 	}
 
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q   = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r   = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	q   = (r * 0xd) >> 7;
+	*buf++ = (r - 10 * q) + '0';
+
+	*buf++ = q + a;
+
 	return buf;
 }
-/* Same with if's removed. Always emits five digits */
+
+/* Same as above but do not pad with zeros. */
 static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_trunc(char *buf, unsigned q)
 {
-	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
-	/* but anyway, gcc produces better code with full-sized ints */
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
+	unsigned r;
 
 	/*
-	 * Possible ways to approx. divide by 10
-	 * gcc -O2 replaces multiply with shifts and adds
-	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
-	 * (x * 0x67) >> 10:  1100111
-	 * (x * 0x34) >> 9:    110100 - same
-	 * (x * 0x1a) >> 8:     11010 - same
-	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+	 * We need to check if num is < 81920 so we might as well
+	 * check if we can just call the _full version of this
+	 * function.
 	 */
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0';
-	d1 = q + 9*d3 + 5*d2 + d1;
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0';
-
-		d2 = q + 2*d2;
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0';
-
-			d3 = q + 4*d3;
-				q = (d3 * 0xcd) >> 11; /* - shorter code */
-				/* q = (d3 * 0x67) >> 10; - would also work */
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';
-					*buf++ = q + '0';
+	if (q > 9999)
+		return put_dec_full(buf, q);
+
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q   = (r * 0x199a) >> 16;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q) {
+			r   = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			if (r)
+				*buf++ = r + '0';
+		}
+	}
 
 	return buf;
 }
+
 /* No inlining helps gcc to use registers better */
 static noinline_for_stack
 char *put_dec(char *buf, unsigned long long num)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-05 22:38 [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz
@ 2010-08-05 22:38 ` Michal Nazarewicz
  2010-08-05 22:38   ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
  2010-08-06  5:18   ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
  2010-08-06  3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
  1 sibling, 2 replies; 14+ messages in thread
From: Michal Nazarewicz @ 2010-08-05 22:38 UTC (permalink / raw)
  To: linux-kernel, linux-kernel
  Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones,
	Denis Vlasenko, Andrew Morton

Existing put_dec() function uses a do_div() function for
dividing the 64-bit argument.  On 32-bit machines this may be
a costly operation.  This patch, replaces the put_dec()
function on 32-bit processors to one that performs no 64-bit
divisions.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 lib/vsprintf.c |  103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 102 insertions(+), 1 deletions(-)


I did some benchmark on the following two processors:

Atom:   Intel(R) Atom(TM) CPU N270 @ 1.60GHz
ARM:    ARMv7 Processor rev 2 (v7l)

(I'm ommitting Phenom since this code is ment for 32-bit processors).

Here are the results (normalised to the fastest/smallest):

                    :        ARM      Intel
-- Speed ---------------------------------------------------------------
orig_put_dec        :  10.194361   2.190063  Original
mod1_put_dec        :  10.259952   2.025620
mod2_put_dec        :  10.142540   1.970004
mod3_put_dec        :  10.284313   1.961153  Version from previous patch
mod4_put_dec        :  10.164480   1.986127
mod5_put_dec        :  14.529587   2.531521
mod6_put_dec        :   1.000000   1.000000  Proposed
mod7_put_dec        :   1.006486   1.011573
mod8_put_dec        :   8.347325   2.153098
-- Size ----------------------------------------------------------------
orig_put_dec        :   1.000000   1.000000  Original
mod1_put_dec        :   1.000000   1.000000
mod2_put_dec        :   1.361111   1.403226
mod3_put_dec        :   1.000000   1.000000  Version from previous patch
mod4_put_dec        :   1.361111   1.403226
mod5_put_dec        :   1.000000   1.000000
mod6_put_dec        :   2.555556   3.508065  Proposed
mod7_put_dec        :   2.833333   3.911290
mod8_put_dec        :   2.027778   2.258065

Source of the benchmark as well as code of all the modified version of
functions is included with the third patch of the benchmark.


As it can be obsevred, proposed version of the put_dec function is 10
times faster on ARM.  I imagine that it may be similar on other
"embedded" processors.

This may be skewed by the fact that the benchmark is using GCC's
64-bit division operator instead of kernel's do_div but it would
appear that by avoiding 64-bit division something can be gained.

The disadvantage is that the proposed function is 2.5-3.5 bigger.
Those are not big functions though -- we are talking here about
proposed function being below 512.


The drawback of this function is also that the patch adds a bit of
code.  It could be questionable whether it's worth optimising that
much.  Anyway, posting in case someone decides that it is or will be
simply interested. :)


I'm currently running 2.6.35 with this patch applied.  It applies just
fine on -next as well but I haven't tested this kernel (I don't really
have a machine that I wouldn't be afraid to run -next on ;) ).


PS. From Mr. Jones site: "Nonetheless, before relying on the material
here, it would be prudent to check the arithmetic!" hence I checked
all the calculations myself and everything seemed fine.  I've also run
test applitacion several times so it tested a few 64-bit numbers.
Here's a "bc" script which calculates all the numbers:


# You can feed "bc" with this file to check the numbers

# n  =                  1 * n0 +    0 <= n0 <= 65535
#                  6 5536 * n1 +    0 <= n1 <= 65535
#            42 9496 7296 * n2 +    0 <= n2 <= 65535
#      281 4749 7671 0656 * n3      0 <= n3 <= 65535

n0 = 65535
n1 = 65535
n2 = 65535
n3 = 65535

# n  =              10^ 0 * d0 +
#                   10^ 4 * d1 +
#                   10^ 8 * d2 +
#                   10^12 * d3 +
#                   10^16 * d4

a0 =  656 * n3 + 7296 * n2 + 5536 * n1 +   1 * n0
print "0 <= a0 <= ", a0, "\n"
# 0 <= a0 <=   884 001 615

a1 = 7671 * n3 + 9496 * n2 +    6 * n1
print "0 <= a1 <= ", a1, "\n"
# 0 <= a1 <= 1 125 432 555

a2 = 4749 * n3 +   42 * n2
print "0 <= a2 <= ", a2, "\n"
# 0 <= a2 <=   313 978 185

a3 =  281 * n3
print "0 <= a3 <= ", a3, "\n"
# 0 <= a3 <=    18 415 335


b0 = a0
print "0 <= b0 <= ", b0, "\n0 <= c1 <= ", b0 / 10000, "\n"
# 0 <= d0 <=   884 001 615
# 0 <= c1 <=        88 400

b1 = a1 + b0 / 10000
print "0 <= b1 <= ", b1, "\n0 <= c2 <= ", b1 / 10000, "\n"
# 0 <= d1 <= 1 125 520 955
# 0 <= c2 <=       112 552

b2 = a2 + b1 / 10000
print "0 <= b2 <= ", b2, "\n0 <= c3 <= ", b2 / 10000, "\n"
# 0 <= d2 <=   314 090 737
# 0 <= c3 <=        31 409

b3 = a3 + b2 / 10000
print "0 <= b3 <= ", b3, "\n0 <= c4 <= ", b3 / 10000, "\n"
# 0 <= d3 <=    18 446 744
# 0 <= c4 <=         1 844

b4 = a4 + b3 / 10000
print "0 <= b4 <= ", b4, "\n"
# 0 <= b4 <=         1 844


diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index d63fb15..bfbe662 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -278,6 +278,8 @@ int skip_atoi(const char **s)
 	return i;
 }
 
+#if BITS_PER_LONG == 64
+
 /*
  * Decimal conversion is by far the most typical, and is used for
  * /proc and /sys data. This directly impacts e.g. top performance
@@ -366,6 +368,9 @@ char *put_dec_trunc(char *buf, unsigned q)
 	return buf;
 }
 
+/* This is used by ip4_string(). */
+#define put_dec_8bit put_dec_trunc
+
 /* No inlining helps gcc to use registers better */
 static noinline_for_stack
 char *put_dec(char *buf, unsigned long long num)
@@ -379,6 +384,102 @@ char *put_dec(char *buf, unsigned long long num)
 	}
 }
 
+#else
+
+/*
+ * This is similar to the put_dec_full() above expect it handles
+ * numbers from 0 to 9999 (ie. at most four digits).  It is used by
+ * the put_dec() below which is optimised for 32-bit processors.
+ */
+static noinline_for_stack
+char *put_dec_full(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	*buf++ = r + '0';
+
+	return buf;
+}
+
+/*
+ * Similar to above but handles only 8-bit operands and does not pad
+ * with zeros.
+ */
+static noinline_for_stack
+char *put_dec_8bit(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q   = (r * 0xd) >> 7;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q)
+			*buf++ = q + '0';
+	}
+
+	return buf;
+}
+
+/*
+ * Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>.  This
+ * performs no 64-bit division and hence should be faster on 32-bit
+ * machines then the version of the function above.
+ */
+static noinline_for_stack
+char *put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q;
+
+	if (!n) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	d1  = (n >> 16) & 0xFFFF;
+	d2  = (n >> 32) & 0xFFFF;
+	d3  = (n >> 48) & 0xFFFF;
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF);
+
+	buf = put_dec_full(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+	buf = put_dec_full(buf, d1 % 10000);
+	q   = d1 / 10000;
+
+	d2  = q + 4749 * d3 + 42 * d2;
+	buf = put_dec_full(buf, d2 % 10000);
+	q   = d2 / 10000;
+
+	d3  = q + 281 * d3;
+	buf = put_dec_full(buf, d3 % 10000);
+	q   = d3 / 10000;
+
+	buf = put_dec_full(buf, q);
+
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
+}
+
+#endif
+
 #define ZEROPAD	1		/* pad with zero */
 #define SIGN	2		/* unsigned/signed long */
 #define PLUS	4		/* show plus */
@@ -752,7 +853,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
 	}
 	for (i = 0; i < 4; i++) {
 		char temp[3];	/* hold each IP quad in reverse order */
-		int digits = put_dec_trunc(temp, addr[index]) - temp;
+		int digits = put_dec_8bit(temp, addr[index]) - temp;
 		if (leading_zeros) {
 			if (digits < 3)
 				*p++ = '0';
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
  2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
@ 2010-08-05 22:38   ` Michal Nazarewicz
  2010-08-06  5:10     ` Denys Vlasenko
  2010-08-06  5:18   ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Nazarewicz @ 2010-08-05 22:38 UTC (permalink / raw)
  To: linux-kernel, linux-kernel
  Cc: m.nazarewicz, Michal Nazarewicz, Douglas W. Jones,
	Denis Vlasenko, Andrew Morton

This commit adds a test application for the put_dec() and
family of functions that are used by the previous two commits.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
---
 tools/put-dec/Makefile       |    8 +
 tools/put-dec/put-dec-test.c |  725 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 733 insertions(+), 0 deletions(-)
 create mode 100644 tools/put-dec/Makefile
 create mode 100644 tools/put-dec/put-dec-test.c

This is just a benchmark I've used to test various implementations of
the put_dec() and friends.  This is not intended as a patch included
in the kernel -- it's merely a tool for anyone who'd like to check
various versions.

diff --git a/tools/put-dec/Makefile b/tools/put-dec/Makefile
new file mode 100644
index 0000000..a56e3ef
--- /dev/null
+++ b/tools/put-dec/Makefile
@@ -0,0 +1,7 @@
+CFLAGS := -O2
+
+put-dec-test: put-dec-test.c
+	exec $(CC) $(CFLAGS) -o $@ $<
+
+clean:
+	rm -f -- put-dec-test
diff --git a/tools/put-dec/put-dec-test.c b/tools/put-dec/put-dec-test.c
new file mode 100644
index 0000000..b43cf08
--- /dev/null
+++ b/tools/put-dec/put-dec-test.c
@@ -0,0 +1,725 @@
+#define _BSD_SOURCE
+
+#include <stdarg.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/time.h>
+#include <time.h>
+
+
+#  define do_div(n, base) ({			\
+		uint32_t __base = (base);	\
+		uint32_t __rem = (n) % __base;	\
+		(n) /= __base;			\
+		__rem;				\
+	})
+
+
+/****************************** Original versian ******************************/
+
+static char *orig_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned d3, d2, d1, d0;
+	d1 = (q>>4) & 0xf;
+	d2 = (q>>8) & 0xf;
+	d3 = (q>>12);
+
+	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
+	q = (d0 * 0xcd) >> 11;
+	d0 = d0 - 10*q;
+	*buf++ = d0 + '0'; /* least significant digit */
+	d1 = q + 9*d3 + 5*d2 + d1;
+	if (d1 != 0) {
+		q = (d1 * 0xcd) >> 11;
+		d1 = d1 - 10*q;
+		*buf++ = d1 + '0'; /* next digit */
+
+		d2 = q + 2*d2;
+		if ((d2 != 0) || (d3 != 0)) {
+			q = (d2 * 0xd) >> 7;
+			d2 = d2 - 10*q;
+			*buf++ = d2 + '0'; /* next digit */
+
+			d3 = q + 4*d3;
+			if (d3 != 0) {
+				q = (d3 * 0xcd) >> 11;
+				d3 = d3 - 10*q;
+				*buf++ = d3 + '0';  /* next digit */
+				if (q != 0)
+					*buf++ = q + '0'; /* most sign. digit */
+			}
+		}
+	}
+
+	return buf;
+}
+/* Same with if's removed. Always emits five digits */
+static char *orig_put_dec_full(char *buf, unsigned q)
+{
+	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
+	/* but anyway, gcc produces better code with full-sized ints */
+	unsigned d3, d2, d1, d0;
+	d1 = (q>>4) & 0xf;
+	d2 = (q>>8) & 0xf;
+	d3 = (q>>12);
+
+	/*
+	 * Possible ways to approx. divide by 10
+	 * gcc -O2 replaces multiply with shifts and adds
+	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+	 * (x * 0x67) >> 10:  1100111
+	 * (x * 0x34) >> 9:    110100 - same
+	 * (x * 0x1a) >> 8:     11010 - same
+	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+	 */
+	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
+	q = (d0 * 0xcd) >> 11;
+	d0 = d0 - 10*q;
+	*buf++ = d0 + '0';
+	d1 = q + 9*d3 + 5*d2 + d1;
+		q = (d1 * 0xcd) >> 11;
+		d1 = d1 - 10*q;
+		*buf++ = d1 + '0';
+
+		d2 = q + 2*d2;
+			q = (d2 * 0xd) >> 7;
+			d2 = d2 - 10*q;
+			*buf++ = d2 + '0';
+
+			d3 = q + 4*d3;
+				q = (d3 * 0xcd) >> 11; /* - shorter code */
+				/* q = (d3 * 0x67) >> 10; - would also work */
+				d3 = d3 - 10*q;
+				*buf++ = d3 + '0';
+					*buf++ = q + '0';
+
+	return buf;
+}
+
+static __attribute__((noinline))
+char *orig_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 100000)
+			return orig_put_dec_trunc(buf, num);
+		rem = do_div(num, 100000);
+		buf = orig_put_dec_full(buf, rem);
+	}
+}
+
+
+
+/****************************** Modified versions ******************************/
+
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using code based on idea described at:
+ * http://www.cs.uiowa.edu/~jones/bcd/decimal.html (with permission
+ * from the author, Douglas W. Jones).
+ *
+ * The original code was designed for 8-bit ALus but since we can
+ * assume more capable hardware the code has been rewritten to use the
+ * following properties:
+ *
+ * n  =    1 * n0 +                   ( 0 <= n0 <= 1023 )
+ *      1024 * n1                     ( 0 <= n1 <=   97 )
+ * a0 = 0         + 4 * n1 + 1 * n0   ( 0 <= a0 <= 1412 )
+ * a1 = (a0 / 10) + 2 * n1            ( 0 <= a1 <=  335 )
+ * a2 = (a1 / 10) + 0 * n1            ( 0 <= a2 <=   33 )
+ * a3 = (a2 / 10) + 1 * n1            ( 0 <= a3 <=  100 )
+ * d0 = a0 % 10
+ * d1 = a1 % 10
+ * d2 = a2 % 10
+ * d3 = a3 % 10
+ * d4 = a3 / 10
+ *
+ * So instead of dividing the number into four nibles we divide it
+ * into two numbers: first one 10-bit and the other one 7-bit
+ * (argument is 17-bit number from 0 to 99999).
+ *
+ * Moreover, 1024, which is the value second part of the number needs
+ * to be multiplied by, has nice property that each digit is a power
+ * of two or zero -- this helps with multiplications.
+ *
+ * Possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10:  1100111
+ * (x * 0x34) >> 9:    110100 - same
+ * (x * 0x1a) >> 8:     11010 - same
+ * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+ */
+static char *mod1_put_dec_full(char *buf, unsigned q)
+{
+	unsigned p, r;
+	p   = q >> 10;
+
+	q  &= 0x3ff;
+	q  += 4 * p;
+	r   = (q * 0x199A) >> 16;
+	*buf++ = (q - 10 * r) + '0';
+
+	r  += 2 * p;
+	q   = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q)  + '0';
+
+	/* q += 0; */
+	r   = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	r  += p;
+	q   = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q) + '0';
+
+	*buf++ = q + '0';
+
+	return buf;
+}
+
+static char *mod1_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned p, r;
+	p   = q >> 10;
+
+	q  &= 0x3ff;
+	q  += 4 * p;
+	r   = (q * 0x199a) >> 16;
+	*buf++ = (q - 10 * r) + '0';
+
+	r  += 2 * p;
+	if (r) {
+		q   = (r * 0xcd) >> 11;
+		*buf++ = (r - 10 * q)  + '0';
+
+		/* q += 0; */
+		if (q || p) {
+			r   = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			r  += p;
+			if (r) {
+				q   = (r * 0xcd) >> 11;
+				*buf++ = (r - 10 * q) + '0';
+
+				if (q)
+					*buf++ = q + '0';
+			}
+		}
+	}
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod1_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 100000)
+			return mod1_put_dec_trunc(buf, num);
+		rem = do_div(num, 100000);
+		buf = mod1_put_dec_full(buf, rem);
+	}
+}
+
+
+static __attribute__((noinline))
+char *mod2_put_dec(char *buf, unsigned long long num)
+{
+	if (!num) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	while (num >= 100000) {
+		unsigned rem;
+		rem = do_div(num, 100000);
+		buf = mod1_put_dec_full(buf, rem);
+	}
+
+	buf = mod1_put_dec_full(buf, num);
+	while (buf[-1] == '0')
+		--buf;
+	return buf;
+}
+
+
+
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
+ * correct results for num < 81920.  Because of this, we check at the
+ * beginning if we are dealing with a number that may cause trouble
+ * and if so, we make it smaller.
+ *
+ * (As a minor note, all operands are always 16 bit so this function
+ * should work well on hardware that cannot multiply 32 bit numbers).
+ *
+ * (Previous a code based on
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
+ * with permission from the author, Douglas W. Jones.)
+ *
+ * Other, possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10:  1100111
+ * (x * 0x34) >> 9:    110100 - same
+ * (x * 0x1a) >> 8:     11010 - same
+ * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+ */
+static char *mod3_put_dec_full(char *buf, unsigned q)
+{
+	unsigned r;
+	char a = '0';
+
+	if (q > 0xffff) {
+		a = '6';
+		q -= 60000;
+	}
+
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q   = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r   = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	q   = (r * 0xd) >> 7;
+	*buf++ = (r - 10 * q) + '0';
+
+	*buf++ = q + a;
+
+	return buf;
+}
+
+static char *mod3_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned r;
+
+	/*
+	 * We need to check if num is < 81920 so we might as well
+	 * check if we can just call the _full version of this
+	 * function.
+	 */
+	if (q > 9999)
+		return mod3_put_dec_full(buf, q);
+
+	r   = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q   = (r * 0x199a) >> 16;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q) {
+			r   = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			if (r)
+				*buf++ = r + '0';
+		}
+	}
+
+	return buf;
+}
+
+
+static char *mod3_put_dec_8bit(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q   = (r * 0xd) >> 7;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q)
+			*buf++ = q + '0';
+	}
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod3_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 100000)
+			return mod3_put_dec_trunc(buf, num);
+		rem = do_div(num, 100000);
+		buf = mod3_put_dec_full(buf, rem);
+	}
+}
+
+
+static __attribute__((noinline))
+char *mod4_put_dec(char *buf, unsigned long long num)
+{
+	if (!num) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	while (num >= 100000) {
+		unsigned rem;
+		rem = do_div(num, 100000);
+		buf = mod3_put_dec_full(buf, rem);
+	}
+
+	buf = mod3_put_dec_full(buf, num);
+	while (buf[-1] == '0')
+		--buf;
+	return buf;
+}
+
+
+
+/*
+ * Decimal conversion is by far the most typical, and is used for
+ * /proc and /sys data. This directly impacts e.g. top performance
+ * with many processes running.
+ *
+ * We optimize it for speed using ideas described at
+ * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
+ *
+ * (Previous a code based on
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
+ * with permission from the author, Douglas W. Jones.)
+ *
+ * Other, possible ways to approx. divide by 10
+ * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
+ * (x * 0x67) >> 10:  1100111
+ * (x * 0x34) >> 9:    110100 - same
+ * (x * 0x1a) >> 8:     11010 - same
+ * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
+ */
+static char *mod5_put_dec_full(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+
+	*buf++ = r + '0';
+
+	return buf;
+}
+
+static char *mod5_put_dec_trunc(char *buf, unsigned q)
+{
+	unsigned r;
+
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+
+	if (r) {
+		q      = (r * 0x199a) >> 16;
+		*buf++ = (r - 10 * q)  + '0';
+
+		if (q) {
+			r      = (q * 0xcd) >> 11;
+			*buf++ = (q - 10 * r)  + '0';
+
+			if (r)
+				*buf++ = r + '0';
+		}
+	}
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod5_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem;
+		if (num < 10000)
+			return mod5_put_dec_trunc(buf, num);
+		rem = do_div(num, 10000);
+		buf = mod5_put_dec_full(buf, rem);
+	}
+}
+
+/*
+ * Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html>.
+ */
+static __attribute__((noinline))
+char *mod6_put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q;
+
+	if (!n) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	d1  = (n >> 16) & 0xFFFF;
+	d2  = (n >> 32) & 0xFFFF;
+	d3  = (n >> 48) & 0xFFFF;
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF);
+
+	buf = mod5_put_dec_full(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+		buf = mod5_put_dec_full(buf, d1 % 10000);
+		q   = d1 / 10000;
+
+		d2  = q + 4749 * d3 + 42 * d2;
+			buf = mod5_put_dec_full(buf, d2 % 10000);
+			q   = d2 / 10000;
+
+			d3  = q + 281 * d3;
+				buf = mod5_put_dec_full(buf, d3 % 10000);
+				q   = d3 / 10000;
+
+					buf = mod5_put_dec_trunc(buf, q);
+
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
+}
+
+
+/*
+ * Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html>.
+ */
+static __attribute__((noinline))
+char *mod7_put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q;
+
+	if (!n) {
+		*buf++ = '0';
+		return buf;
+	}
+
+	d1  = (n >> 16) & 0xFFFF;
+	d2  = (n >> 32) & 0xFFFF;
+	d3  = (n >> 48) & 0xFFFF;
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + (n & 0xFFFF);
+
+	buf = mod5_put_dec_full(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+	if (d1) {
+		buf = mod5_put_dec_full(buf, d1 % 10000);
+		q   = d1 / 10000;
+
+		d2  = q + 4749 * d3 + 42 * d2;
+		if (d2) {
+			buf = mod5_put_dec_full(buf, d2 % 10000);
+			q   = d2 / 10000;
+
+			d3  = q + 281 * d3;
+			if (d3) {
+				buf = mod5_put_dec_full(buf, d3 % 10000);
+				q   = d3 / 10000;
+
+				if (q)
+					buf = mod5_put_dec_trunc(buf, q);
+			}
+		}
+	}
+
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
+}
+
+
+static __attribute__((noinline))
+char *mod8_put_dec(char *buf, unsigned long long num)
+{
+	while (1) {
+		unsigned rem = do_div(num, 100000000);
+
+		if (!num && rem < 10000)
+			return mod5_put_dec_trunc(buf, rem);
+		buf = mod5_put_dec_full(buf, rem % 10000);
+
+		if (!num)
+			return mod5_put_dec_trunc(buf, rem / 10000);
+		buf = mod5_put_dec_full(buf, rem / 10000);
+	}
+}
+
+
+
+/****************************** Main ******************************/
+
+static uint64_t rand_64(void)
+{
+	uint64_t v = 0, m = 1;
+	for (;;) {
+		uint64_t n;
+		v = m * rand();
+		n = m + m * RAND_MAX;
+		if (n < m)
+			break;
+		m = n;
+	}
+	return v;
+}
+
+
+static char buf1[24];
+
+static int test(const char *name, char *b, char *fmt, ...)
+{
+	static char buf2[24];
+	char *a = buf1;
+	va_list ap;
+
+	*b-- = '\0';
+	while (a < b) {
+		char tmp = *a;
+		*a = *b;
+		*b = tmp;
+		++a;
+		--b;
+	}
+
+	va_start(ap, fmt);
+	vsprintf(buf2, fmt, ap);
+	va_end(ap);
+
+	if (strcmp(buf1, buf2)) {
+		fprintf(stderr, "%-20s: expecting %20s got %20s\n",
+			name, buf2, buf1);
+		return 1;
+	}
+	return 0;
+}
+
+static void stop(const char *name, struct timeval *start, struct timeval *correction)
+{
+	struct timeval stop, *a;
+	gettimeofday(&stop, NULL);
+
+	a = start;
+	do {
+		stop.tv_sec -= a->tv_sec;
+		if (stop.tv_usec < a->tv_usec) {
+			--stop.tv_sec;
+			stop.tv_usec += 1000000;
+		}
+		stop.tv_usec -= a->tv_usec;
+
+		a = correction;
+		correction = NULL;
+	} while (a);
+
+	if (name) {
+		fflush(NULL);
+		printf("%-20s: %3d.%06ds\n", name, stop.tv_sec, stop.tv_usec);
+	} else {
+		*start = stop;
+	}
+}
+
+#define FUNC(func, outer, inner, correction, format, value) do {	\
+		struct timeval start;					\
+		unsigned i, o;						\
+		for (i = (inner); i--; ) {				\
+			typeof(value) v = (value);			\
+			ret |= test(#func, func(buf1, v), format, v);	\
+		}							\
+									\
+		gettimeofday(&start, NULL);				\
+		for (o = (outer); o; --o)				\
+			for (i = (inner); i--; )			\
+				func(buf1, (value));			\
+		stop(#func, &start, correction);			\
+	} while (0)							\
+
+
+int main(int argc, char **argv) {
+	unsigned long iterations = 1000, i;
+	struct timeval correction;
+	int ret = 0;
+
+	srand(time(NULL));
+
+	if (argc > 1)
+		iterations = atoi(argv[1]);
+
+	gettimeofday(&correction, NULL);
+	for (i = 25000 * iterations; i; --i)
+		rand_64();
+	stop(NULL, &correction, NULL);
+
+
+	puts(">> Benchmarks:\n\tput_dec_full()");
+	fflush(NULL);
+
+	FUNC(orig_put_dec_full, iterations, 100000, NULL, "%05u", i);
+	FUNC(mod1_put_dec_full, iterations, 100000, NULL, "%05u", i);
+	FUNC(mod3_put_dec_full, iterations, 100000, NULL, "%05u", i);
+	FUNC(mod5_put_dec_full, iterations * 10, 10000, NULL, "%04u", i);
+
+	puts("\tput_dec_trunc()");
+	fflush(NULL);
+
+	FUNC(orig_put_dec_trunc, iterations, 100000, NULL, "%u", i);
+	FUNC(mod1_put_dec_trunc, iterations, 100000, NULL, "%u", i);
+	FUNC(mod3_put_dec_trunc, iterations, 100000, NULL, "%u", i);
+	FUNC(mod5_put_dec_trunc, iterations * 10, 10000, NULL, "%u", i);
+	FUNC(mod3_put_dec_8bit, iterations * 500, 256, NULL, "%u", i);
+
+	puts("\n\tput_dec()");
+	fflush(NULL);
+
+	FUNC(orig_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod1_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod2_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod3_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod4_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod5_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod6_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod7_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+	FUNC(mod8_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
+
+
+	fflush(NULL);
+	puts("\n>> Size of the functions:");
+	fflush(NULL);
+
+	setenv("APP", *argv, 1);
+	puts("\tobjdump -t \"$APP\" | grep -F _put_dec | cut -f 2-");
+	system("objdump -t \"$APP\" | grep -F _put_dec | cut -f 2-");
+
+
+	return ret;
+}
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()
  2010-08-05 22:38 [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz
  2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
@ 2010-08-06  3:58 ` Denys Vlasenko
  2010-08-06  6:56   ` Michal Nazarewicz
  1 sibling, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2010-08-06  3:58 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
> The put_dec_trunc() and put_dec_full() functions were based on
> a code optimised for processors with 8-bit ALU but even then
> they failed to satisfy the same constraints

"Failed"? Interesting wording. Yes, the code won't map easily
onto 8-bit ALU, for the simple reason Linux kernel
does not support any 8-bit CPUs, and by going to wider register
I was able to process 5 decimal digits at once, not 4.
It was done deliberately. It is not a "failure".

Your code isn't 8-bit ALU optimized either.

Do you think a bit of smear of previous code
would help your to be accepted?

> and in fact 
> required at least 16-bit ALU (because at least one number they
> operate in can take 9 bits).

Yes, as explained above.

> This version of those functions proposed by this patch goes
> further and uses the full capacity of a 32-bit ALU and instead
> of splitting the number into nibbles and operating on them it
> performs the obvious algorithm for base conversion expect it
> uses optimised code for dividing by ten (ie. no division is
> actually performed).

(1) "expect" is a typo
(2) No, _this_ patch does not eliminate division. Next one does.
Move this part of changelong to the next patch, where it belongs.


> + * Decimal conversion is by far the most typical, and is used for
> + * /proc and /sys data. This directly impacts e.g. top performance
> + * with many processes running.
> + *
> + * We optimize it for speed using ideas described at
> + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.

Do you have author's permission to do it?
Document it in the comment please.


> + * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
> + * correct results for num < 81920.  Because of this, we check at the
> + * beginning if we are dealing with a number that may cause trouble
> + * and if so, we make it smaller.

This comment needs to be moved to the code line where the opration
is performed.


> + * (As a minor note, all operands are always 16 bit so this function
> + * should work well on hardware that cannot multiply 32 bit numbers).
> + *
> + * (Previous a code based on

English is a bit broken in the line above.


-- 
vda

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
  2010-08-05 22:38   ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
@ 2010-08-06  5:10     ` Denys Vlasenko
  2010-08-06  8:34       ` Michał Nazarewicz
  0 siblings, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2010-08-06  5:10 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
> This commit adds a test application for the put_dec() and
> family of functions that are used by the previous two commits.
> 
> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>

> +put-dec-test: put-dec-test.c
> +	exec $(CC) $(CFLAGS) -o $@ $<

(1) Why exec?
(2) Add -Wall, you'd be surprised


> +static uint64_t rand_64(void)
> +{
> +	uint64_t v = 0, m = 1;
> +	for (;;) {
> +		uint64_t n;
> +		v = m * rand();
> +		n = m + m * RAND_MAX;
> +		if (n < m)
> +			break;
> +		m = n;
> +	}
> +	return v;
> +}

What this function do? Looks cryptic. In my testing, it picks 0
quite often.


> +static char buf1[24];

Can you size the array safely, without assuming that long long
is no wider than 64 bits?


> +#define FUNC(func, outer, inner, correction, format, value) do {	\
> +		struct timeval start;					\
> +		unsigned i, o;						\
> +		for (i = (inner); i--; ) {				\
> +			typeof(value) v = (value);			\
> +			ret |= test(#func, func(buf1, v), format, v);	\

I'd add memset(buf1, 77, sizeof(buf1)) before test


> +int main(int argc, char **argv) {
> +	unsigned long iterations = 1000, i;
> +	struct timeval correction;
> +	int ret = 0;
> +
> +	srand(time(NULL));
> +
> +	if (argc > 1)
> +		iterations = atoi(argv[1]);
> +
> +	gettimeofday(&correction, NULL);
> +	for (i = 25000 * iterations; i; --i)
> +		rand_64();
> +	stop(NULL, &correction, NULL);

Why is this "correction" thing needed? I looked at the entire machinery
and I see no reason to have it.


> +	puts(">> Benchmarks:\n\tput_dec_full()");
> +	fflush(NULL);
> +
> +	FUNC(orig_put_dec_full, iterations, 100000, NULL, "%05u", i);

You have variable named i, you pass it as macro parameter,
but macro has local variable named i too.
Is it an International Obfuscated C Code Contest entry?


> +	FUNC(mod1_put_dec_full, iterations, 100000, NULL, "%05u", i);
> +	FUNC(mod3_put_dec_full, iterations, 100000, NULL, "%05u", i);
> +	FUNC(mod5_put_dec_full, iterations * 10, 10000, NULL, "%04u", i);

> +	puts("\tput_dec_trunc()");
> +	fflush(NULL);
> +
> +	FUNC(orig_put_dec_trunc, iterations, 100000, NULL, "%u", i);
> +	FUNC(mod1_put_dec_trunc, iterations, 100000, NULL, "%u", i);
> +	FUNC(mod3_put_dec_trunc, iterations, 100000, NULL, "%u", i);
> +	FUNC(mod5_put_dec_trunc, iterations * 10, 10000, NULL, "%u", i);
> +	FUNC(mod3_put_dec_8bit, iterations * 500, 256, NULL, "%u", i);

> +	puts("\n\tput_dec()");
> +	fflush(NULL);
> +
> +	FUNC(orig_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());

"%llu" fmt is for unsigned long long, not uint64_t.


> +	FUNC(mod1_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
> +	FUNC(mod2_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
> +	FUNC(mod3_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
> +	FUNC(mod4_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
> +	FUNC(mod5_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
> +	FUNC(mod6_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
> +	FUNC(mod7_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());
> +	FUNC(mod8_put_dec, iterations / 4, 100000, &correction, "%llu", rand_64());

Here a lot of CPU time is taken by rand() calls. Also, you use different
values for different functions, which is wrong.


-- 
vda

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
  2010-08-05 22:38   ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
@ 2010-08-06  5:18   ` Denys Vlasenko
  2010-08-06  7:08     ` Michal Nazarewicz
  1 sibling, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2010-08-06  5:18 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
> The disadvantage is that the proposed function is 2.5-3.5 bigger.
> Those are not big functions though -- we are talking here about
> proposed function being below 512.

It's a slippery slope. Here's where it ends: glibc
has memcpy() function which is "only" 8k of code or so.
I'm not joking.



> +#if BITS_PER_LONG == 64
> +
...  
> +#else
...
> +/*
> + * Based on code by Douglas W. Jones found at
> + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>.  This
> + * performs no 64-bit division and hence should be faster on 32-bit
> + * machines then the version of the function above.
> + */
> +static noinline_for_stack
> +char *put_dec(char *buf, unsigned long long n)
> +{
> +	uint32_t d3, d2, d1, q;
> +
> +	if (!n) {
> +		*buf++ = '0';
> +		return buf;
> +	}
> +
> +	d1  = (n >> 16) & 0xFFFF;
> +	d2  = (n >> 32) & 0xFFFF;
> +	d3  = (n >> 48) & 0xFFFF;

Are you assuming that sizeof(long long) == 8, always?

-- 
vda

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full()
  2010-08-06  3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
@ 2010-08-06  6:56   ` Michal Nazarewicz
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Nazarewicz @ 2010-08-06  6:56 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

Denys Vlasenko <vda.linux@googlemail.com> writes:

> On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
>> The put_dec_trunc() and put_dec_full() functions were based on
>> a code optimised for processors with 8-bit ALU but even then
>> they failed to satisfy the same constraints
>
> "Failed"? Interesting wording. Yes, the code won't map easily
> onto 8-bit ALU, for the simple reason Linux kernel
> does not support any 8-bit CPUs, and by going to wider register
> I was able to process 5 decimal digits at once, not 4.
> It was done deliberately. It is not a "failure".

You're probably right. ;) I'll change the wording in the next patch.

> Your code isn't 8-bit ALU optimized either.

Yep, that was not my intention.

> Do you think a bit of smear of previous code
> would help your to be accepted?

I'm sorry if I sounded offencive, I just wanted to point out that we
don't have to and we in fact do not limit ourselves to 8-bit ALUs.

>> and in fact 
>> required at least 16-bit ALU (because at least one number they
>> operate in can take 9 bits).
>
> Yes, as explained above.
>
>> This version of those functions proposed by this patch goes
>> further and uses the full capacity of a 32-bit ALU and instead
>> of splitting the number into nibbles and operating on them it
>> performs the obvious algorithm for base conversion expect it
>> uses optimised code for dividing by ten (ie. no division is
>> actually performed).
>
> (1) "expect" is a typo

Fixed.

> (2) No, _this_ patch does not eliminate division. Next one does.
> Move this part of changelong to the next patch, where it belongs.

Probably right, the part in parens can be misleading.  Will fix.

>> + * Decimal conversion is by far the most typical, and is used for
>> + * /proc and /sys data. This directly impacts e.g. top performance
>> + * with many processes running.
>> + *
>> + * We optimize it for speed using ideas described at
>> + * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
>
> Do you have author's permission to do it?
> Document it in the comment please.

OK, I'll send an email to Mr. Jones directly (unless he's actually
reading it, in which case, could you please give permission to use the
code?) In case of this patch I assumed the right to use because: (i) we
had permission for the previous code and (ii) I've used only one line of
code that is from the "Binary to Decimal Conversion in Limited
Precision" website (namely division).  In case of the next patch, only
(i) applies.

>> + * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
>> + * correct results for num < 81920.  Because of this, we check at the
>> + * beginning if we are dealing with a number that may cause trouble
>> + * and if so, we make it smaller.
>
> This comment needs to be moved to the code line where the opration
> is performed.

Fixed.

>> + * (As a minor note, all operands are always 16 bit so this function
>> + * should work well on hardware that cannot multiply 32 bit numbers).
>> + *
>> + * (Previous a code based on
>
> English is a bit broken in the line above.

Fixed.

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-06  5:18   ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
@ 2010-08-06  7:08     ` Michal Nazarewicz
  2010-08-06  7:35       ` Denys Vlasenko
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Nazarewicz @ 2010-08-06  7:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

Denys Vlasenko <vda.linux@googlemail.com> writes:

> On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
>> The disadvantage is that the proposed function is 2.5-3.5 bigger.
>> Those are not big functions though -- we are talking here about
>> proposed function being below 512.

> It's a slippery slope. Here's where it ends: glibc
> has memcpy() function which is "only" 8k of code or so.
> I'm not joking.

I'm aware of that.  I assume that someone more clever then me will
decide whether to accept this patch or not.  (Also we win a few bytes on
put_dec_full() and put_dec_8bit()). :P

>> +#if BITS_PER_LONG == 64
>> +
> ...  
>> +#else
> ...
>> +/*
>> + * Based on code by Douglas W. Jones found at
>> + * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>.  This
>> + * performs no 64-bit division and hence should be faster on 32-bit
>> + * machines then the version of the function above.
>> + */
>> +static noinline_for_stack
>> +char *put_dec(char *buf, unsigned long long n)
>> +{
>> +	uint32_t d3, d2, d1, q;
>> +
>> +	if (!n) {
>> +		*buf++ = '0';
>> +		return buf;
>> +	}
>> +
>> +	d1  = (n >> 16) & 0xFFFF;
>> +	d2  = (n >> 32) & 0xFFFF;
>> +	d3  = (n >> 48) & 0xFFFF;
>
> Are you assuming that sizeof(long long) == 8, always?

Well...  yes.  C requires long long to be at least 64-bit and I don't
see it being larger in any foreseeable feature.  Wouldn't it be enough
to put a static assert here?

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-06  7:08     ` Michal Nazarewicz
@ 2010-08-06  7:35       ` Denys Vlasenko
  2010-08-06  8:54         ` Michał Nazarewicz
  0 siblings, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2010-08-06  7:35 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: linux-kernel, m.nazarewicz, Douglas W. Jones, Andrew Morton

On Friday 06 August 2010 09:08, Michal Nazarewicz wrote:
> >> +static noinline_for_stack
> >> +char *put_dec(char *buf, unsigned long long n)
> >> +{
> >> +	uint32_t d3, d2, d1, q;
> >> +
> >> +	if (!n) {
> >> +		*buf++ = '0';
> >> +		return buf;
> >> +	}
> >> +
> >> +	d1  = (n >> 16) & 0xFFFF;
> >> +	d2  = (n >> 32) & 0xFFFF;
> >> +	d3  = (n >> 48) & 0xFFFF;
> >
> > Are you assuming that sizeof(long long) == 8, always?
> 
> Well...  yes.  C requires long long to be at least 64-bit and I don't
> see it being larger in any foreseeable feature.

"640k will be enough for everybody"?

> Wouldn't it be enough to put a static assert here?

I'd prefer the code which works with arbitrarily wide long long.
If needed, use

if (sizeof(long long) == 8)
	64-bit code
else
	generic code

-- 
vda
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
  2010-08-06  5:10     ` Denys Vlasenko
@ 2010-08-06  8:34       ` Michał Nazarewicz
  2010-08-06 15:57         ` Raja R Harinath
  2010-08-06 19:26         ` Denys Vlasenko
  0 siblings, 2 replies; 14+ messages in thread
From: Michał Nazarewicz @ 2010-08-06  8:34 UTC (permalink / raw)
  To: Michal Nazarewicz, Denys Vlasenko
  Cc: linux-kernel, Douglas W. Jones, Andrew Morton

On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:

> On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
>> This commit adds a test application for the put_dec() and
>> family of functions that are used by the previous two commits.
>>
>> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
>
>> +put-dec-test: put-dec-test.c
>> +	exec $(CC) $(CFLAGS) -o $@ $<
>
> (1) Why exec?

Micro Makefile optimisation -- saves us a fork().

I'll try to fix the benchmark over the weekend and will post updated
version.  Thanks for the comments.

-- 
Best regards,                                        _     _
| Humble Liege of Serenely Enlightened Majesty of  o' \,=./ `o
| Computer Science,  Michał "mina86" Nazarewicz       (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines
  2010-08-06  7:35       ` Denys Vlasenko
@ 2010-08-06  8:54         ` Michał Nazarewicz
  0 siblings, 0 replies; 14+ messages in thread
From: Michał Nazarewicz @ 2010-08-06  8:54 UTC (permalink / raw)
  To: Michal Nazarewicz, Denys Vlasenko
  Cc: linux-kernel, Douglas W. Jones, Andrew Morton

On Fri, 06 Aug 2010 09:35:26 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:

> On Friday 06 August 2010 09:08, Michal Nazarewicz wrote:
>> >> +static noinline_for_stack
>> >> +char *put_dec(char *buf, unsigned long long n)
>> >> +{
>> >> +	uint32_t d3, d2, d1, q;
>> >> +
>> >> +	if (!n) {
>> >> +		*buf++ = '0';
>> >> +		return buf;
>> >> +	}
>> >> +
>> >> +	d1  = (n >> 16) & 0xFFFF;
>> >> +	d2  = (n >> 32) & 0xFFFF;
>> >> +	d3  = (n >> 48) & 0xFFFF;
>> >
>> > Are you assuming that sizeof(long long) == 8, always?
>>
>> Well...  yes.  C requires long long to be at least 64-bit and I don't
>> see it being larger in any foreseeable feature.
>
> "640k will be enough for everybody"?
>
>> Wouldn't it be enough to put a static assert here?
>
> I'd prefer the code which works with arbitrarily wide long long.
> If needed, use
>
> if (sizeof(long long) == 8)
> 	64-bit code
> else
> 	generic code

Thanks.  That seems like a perfect solution.  I rearrange the code
and try to post updated version after the weekend.

-- 
Best regards,                                        _     _
| Humble Liege of Serenely Enlightened Majesty of  o' \,=./ `o
| Computer Science,  Michał "mina86" Nazarewicz       (o o)
+----[mina86*mina86.com]---[mina86*jabber.org]----ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
  2010-08-06  8:34       ` Michał Nazarewicz
@ 2010-08-06 15:57         ` Raja R Harinath
  2010-08-06 19:26         ` Denys Vlasenko
  1 sibling, 0 replies; 14+ messages in thread
From: Raja R Harinath @ 2010-08-06 15:57 UTC (permalink / raw)
  To: linux-kernel

Hi,

Michał Nazarewicz <m.nazarewicz@samsung.com> writes:

> On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>> On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
>>> This commit adds a test application for the put_dec() and
>>> family of functions that are used by the previous two commits.
>>>
>>> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
>>
>>> +put-dec-test: put-dec-test.c
>>> +	exec $(CC) $(CFLAGS) -o $@ $<
>>
>> (1) Why exec?
>
> Micro Makefile optimisation -- saves us a fork().

I don't think it's worth it.  The use of 'exec' will force make to
invoke the shell, while without the exec, make has an opportunity to
optimize out the shell invocation completely [1].  Forcing the use of a
shell to avoid a fork() sounds like a terrible trade-off.

- Hari

[1] See, for instance, construct_command_argv_internal() in

  http://cvs.savannah.gnu.org/viewvc/make/job.c?root=make&view=markup


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
  2010-08-06  8:34       ` Michał Nazarewicz
  2010-08-06 15:57         ` Raja R Harinath
@ 2010-08-06 19:26         ` Denys Vlasenko
  2010-08-06 20:58           ` Michal Nazarewicz
  1 sibling, 1 reply; 14+ messages in thread
From: Denys Vlasenko @ 2010-08-06 19:26 UTC (permalink / raw)
  To: Michał Nazarewicz
  Cc: Michal Nazarewicz, linux-kernel, Douglas W. Jones, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 950 bytes --]

On Friday 06 August 2010 10:34, Michał Nazarewicz wrote:
> On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> 
> > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
> >> This commit adds a test application for the put_dec() and
> >> family of functions that are used by the previous two commits.
> >>
> >> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
> >
> >> +put-dec-test: put-dec-test.c
> >> +	exec $(CC) $(CFLAGS) -o $@ $<
> >
> > (1) Why exec?
> 
> Micro Makefile optimisation -- saves us a fork().
> 
> I'll try to fix the benchmark over the weekend and will post updated
> version.  Thanks for the comments.

You might find some ideas in the attached file:
* removed "correction" code
* added verification of correctness for put_dec()
* different rand64
  (old one was giving same "random" number surprisingly often)
* more robust coding in a few places

-- 
vda

[-- Attachment #2: put-dec-test.c --]
[-- Type: text/x-csrc, Size: 8205 bytes --]

#define _BSD_SOURCE

#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>


#  define do_div(n, base) ({			\
		uint32_t __base = (base);	\
		uint32_t __rem = (n) % __base;	\
		(n) /= __base;			\
		__rem;				\
	})


/****************************** Original versian ******************************/

static char *orig_put_dec_trunc(char *buf, unsigned q)
{
	unsigned d3, d2, d1, d0;
	d1 = (q>>4) & 0xf;
	d2 = (q>>8) & 0xf;
	d3 = (q>>12);

	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
	q = (d0 * 0xcd) >> 11;
	d0 = d0 - 10*q;
	*buf++ = d0 + '0'; /* least significant digit */
	d1 = q + 9*d3 + 5*d2 + d1;
	if (d1 != 0) {
		q = (d1 * 0xcd) >> 11;
		d1 = d1 - 10*q;
		*buf++ = d1 + '0'; /* next digit */

		d2 = q + 2*d2;
		if ((d2 != 0) || (d3 != 0)) {
			q = (d2 * 0xd) >> 7;
			d2 = d2 - 10*q;
			*buf++ = d2 + '0'; /* next digit */

			d3 = q + 4*d3;
			if (d3 != 0) {
				q = (d3 * 0xcd) >> 11;
				d3 = d3 - 10*q;
				*buf++ = d3 + '0';  /* next digit */
				if (q != 0)
					*buf++ = q + '0'; /* most sign. digit */
			}
		}
	}

	return buf;
}
/* Same with if's removed. Always emits five digits */
static char *orig_put_dec_full(char *buf, unsigned q)
{
	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
	/* but anyway, gcc produces better code with full-sized ints */
	unsigned d3, d2, d1, d0;
	d1 = (q>>4) & 0xf;
	d2 = (q>>8) & 0xf;
	d3 = (q>>12);

	/*
	 * Possible ways to approx. divide by 10
	 * gcc -O2 replaces multiply with shifts and adds
	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
	 * (x * 0x67) >> 10:  1100111
	 * (x * 0x34) >> 9:    110100 - same
	 * (x * 0x1a) >> 8:     11010 - same
	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
	 */
	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
	q = (d0 * 0xcd) >> 11;
	d0 = d0 - 10*q;
	*buf++ = d0 + '0';
	d1 = q + 9*d3 + 5*d2 + d1;
		q = (d1 * 0xcd) >> 11;
		d1 = d1 - 10*q;
		*buf++ = d1 + '0';

		d2 = q + 2*d2;
			q = (d2 * 0xd) >> 7;
			d2 = d2 - 10*q;
			*buf++ = d2 + '0';

			d3 = q + 4*d3;
				q = (d3 * 0xcd) >> 11; /* - shorter code */
				/* q = (d3 * 0x67) >> 10; - would also work */
				d3 = d3 - 10*q;
				*buf++ = d3 + '0';
					*buf++ = q + '0';

	return buf;
}

static __attribute__((noinline))
char *orig_put_dec(char *buf, unsigned long long num)
{
	while (1) {
		unsigned rem;
		if (num < 100000)
			return orig_put_dec_trunc(buf, num);
		rem = do_div(num, 100000);
		buf = orig_put_dec_full(buf, rem);
	}
}



/****************************** Modified versions ******************************/

/*
 * Decimal conversion is by far the most typical, and is used for
 * /proc and /sys data. This directly impacts e.g. top performance
 * with many processes running.
 *
 * We optimize it for speed using ideas described at
 * <http://www.cs.uiowa.edu/~jones/bcd/divide.html>.
 *
 * '(num * 0xcccd) >> 19' is an approximation of 'num / 10' that gives
 * correct results for num < 81920.  Because of this, we check at the
 * beginning if we are dealing with a number that may cause trouble
 * and if so, we make it smaller.
 *
 * (As a minor note, all operands are always 16 bit so this function
 * should work well on hardware that cannot multiply 32 bit numbers).
 *
 * (Previous a code based on
 * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html> was used here,
 * with permission from the author, Douglas W. Jones.)
 *
 * Other, possible ways to approx. divide by 10
 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
 * (x * 0x67) >> 10:  1100111
 * (x * 0x34) >> 9:    110100 - same
 * (x * 0x1a) >> 8:     11010 - same
 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
 */
static char *mod3_put_dec_full(char *buf, unsigned q)
{
	unsigned r;
	char a = '0';

	if (q > 0xffff) {
		a = '6';
		q -= 60000;
	}

	r   = (q * 0xcccd) >> 19;
	*buf++ = (q - 10 * r) + '0';

	q   = (r * 0x199a) >> 16;
	*buf++ = (r - 10 * q)  + '0';

	r   = (q * 0xcd) >> 11;
	*buf++ = (q - 10 * r)  + '0';

	q   = (r * 0xd) >> 7;
	*buf++ = (r - 10 * q) + '0';

	*buf++ = q + a;

	return buf;
}

static char *mod3_put_dec_trunc(char *buf, unsigned q)
{
	unsigned r;

	/*
	 * We need to check if num is < 81920 so we might as well
	 * check if we can just call the _full version of this
	 * function.
	 */
	if (q > 9999)
		return mod3_put_dec_full(buf, q);

	r   = (q * 0xcccd) >> 19;
	*buf++ = (q - 10 * r) + '0';

	if (r) {
		q   = (r * 0x199a) >> 16;
		*buf++ = (r - 10 * q)  + '0';

		if (q) {
			r   = (q * 0xcd) >> 11;
			*buf++ = (q - 10 * r)  + '0';

			if (r)
				*buf++ = r + '0';
		}
	}

	return buf;
}

static __attribute__((noinline))
char *mod3_put_dec(char *buf, unsigned long long num)
{
	while (1) {
		unsigned rem;
		if (num < 100000)
			return mod3_put_dec_trunc(buf, num);
		rem = do_div(num, 100000);
		buf = mod3_put_dec_full(buf, rem);
	}
}

/****************************** Main ******************************/

static unsigned long long rand_64(void)
{
	unsigned long long v = 0;
	int i;

	for (i = 0; i < sizeof(v); i++) {
		v += rand();
		v <<= 9;
	}
	return v;
}

static char buf1[sizeof(long long)*3];

static void reverse_buf1(char *b)
{
	char *a = buf1;
	*b-- = '\0';
	while (a < b) {
		char tmp = *a;
		*a = *b;
		*b = tmp;
		++a;
		--b;
	}
}

static int test(const char *name, char *b, char *fmt, ...)
{
	static char buf2[sizeof(long long)*3];
	va_list ap;

	reverse_buf1(b);

	va_start(ap, fmt);
	vsprintf(buf2, fmt, ap);
	va_end(ap);

	if (strcmp(buf1, buf2)) {
		fprintf(stderr, "%-20s: expecting '%s' got '%s'\n",
			name, buf2, buf1);
		return 1;
	}
	return 0;
}

static void stop(const char *name, struct timeval *start)
{
	struct timeval stop;

	gettimeofday(&stop, NULL);

	stop.tv_sec -= start->tv_sec;
	if (stop.tv_usec < start->tv_usec) {
		--stop.tv_sec;
		stop.tv_usec += 1000000;
	}
	stop.tv_usec -= start->tv_usec;

	printf("%-20s: %3d.%06ds\n", name, (int)stop.tv_sec, (int)stop.tv_usec);
	fflush(NULL);
}

#define FUNC(func, outer, inner, format, value) do {	\
		struct timeval start;					\
		unsigned i, o;						\
		for (i = (inner); i--; ) {				\
			typeof(value) v = (value);			\
			memset(buf1, 77, sizeof(buf1));			\
			ret |= test(#func, func(buf1, v), format, v);	\
		}							\
									\
		gettimeofday(&start, NULL);				\
		for (o = (outer); o; --o)				\
			for (i = (inner); i--; )			\
				func(buf1, (value));			\
		stop(#func, &start);					\
	} while (0)							\


int main(int argc, char **argv)
{
	int ret = 0;
	unsigned long iterations;
	unsigned long long ll;
	unsigned long long range;

	srand(time(NULL));

	iterations = 1000;
	if (argc > 1)
		iterations = atoi(argv[1]);

	printf("Benchmark with %ld iterations:\n", iterations*100000);
	printf("\tput_dec_full(99999..0)\n");
	fflush(NULL);
	/* func, bench_mult x test_iter, correction, fmt, val */
	FUNC(orig_put_dec_full, iterations, 100000, "%05u", i);
	FUNC(mod3_put_dec_full, iterations, 100000, "%05u", i);

	printf("\tput_dec_trunc(99999..0)\n");
	fflush(NULL);
	FUNC(orig_put_dec_trunc, iterations, 100000, "%u", i);
	FUNC(mod3_put_dec_trunc, iterations, 100000, "%u", i);

	ll = rand_64();
	printf("\tput_dec(%llu)\n", ll);
	fflush(NULL);
	FUNC(orig_put_dec, iterations, 100000, "%llu", ll);
	FUNC(mod3_put_dec, iterations, 100000, "%llu", ll);

	/* Test wide consecutive ranges at both ends of [0,max] interval */
	range = 10*1000*1000;
	if (argc > 2)
		range = atoi(argv[2]);
	ll = -1LL - range;
	while (ll != range) {
		char buf[sizeof(long long)*3];

		sprintf(buf, "%llu", ll);
		if (!(ll & 0xffff)) {
			printf("Testing correctness: %s%*s\r", buf, (int)(sizeof(long long) * 4), "");
			fflush(NULL);
		}
		memset(buf1, 77, sizeof(buf1));
		reverse_buf1(orig_put_dec(buf1, ll));
		if (strcmp(buf, buf1) == 0) {
			memset(buf1, 77, sizeof(buf1));
			reverse_buf1(mod3_put_dec(buf1, ll));
			if (strcmp(buf, buf1) == 0) {
				ll++;
				continue;
			}
		}
		printf("error at %llu: bad:'%s' expected:'%s'\n", ll, buf1, buf);
		return 1;
	}
	printf("\n");

	printf("\n>> Size of the functions:\n");
	setenv("APP", *argv, 1);
	printf("\tobjdump -t \"$APP\" | grep -F _put_dec | cut -f 2-\n");
	fflush(NULL);
	system("objdump -t \"$APP\" | grep -F _put_dec | cut -f 2-");

	return ret;
}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool
  2010-08-06 19:26         ` Denys Vlasenko
@ 2010-08-06 20:58           ` Michal Nazarewicz
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Nazarewicz @ 2010-08-06 20:58 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Michał Nazarewicz, linux-kernel, Douglas W. Jones, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 1858 bytes --]

Denys Vlasenko <vda.linux@googlemail.com> writes:

> On Friday 06 August 2010 10:34, Michał Nazarewicz wrote:
>> On Fri, 06 Aug 2010 07:10:06 +0200, Denys Vlasenko <vda.linux@googlemail.com> wrote:
>> 
>> > On Friday 06 August 2010 00:38, Michal Nazarewicz wrote:
>> >> This commit adds a test application for the put_dec() and
>> >> family of functions that are used by the previous two commits.
>> >>
>> >> Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
>> >
>> >> +put-dec-test: put-dec-test.c
>> >> +	exec $(CC) $(CFLAGS) -o $@ $<
>> >
>> > (1) Why exec?
>> 
>> Micro Makefile optimisation -- saves us a fork().
>> 
>> I'll try to fix the benchmark over the weekend and will post updated
>> version.  Thanks for the comments.
>
> You might find some ideas in the attached file:
> * removed "correction" code
> * added verification of correctness for put_dec()
> * different rand64
>   (old one was giving same "random" number surprisingly often)
> * more robust coding in a few places

Thanks!  I actually changed the benchmark earlier today and redid all
benchmarks but I'll sure incorporate you're test code.

Also, I've removed rand_64() from my code in favour of reading random
data from /dev/urandom.  In consequence, all functions ale benchmarked
using the same values and it's still random (ie. no the same value all
the time).  This also made "correction" no longer needed.

It's 11pm here, so I'll try to send the new patches tomorrow morning
after getting some sleep.

Once again, thank you for all the comments and suggestiveness!

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-08-06 20:59 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-05 22:38 [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Michal Nazarewicz
2010-08-05 22:38 ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Michal Nazarewicz
2010-08-05 22:38   ` [PATCH 3/3] lib: vsprintf: added a put_dec() test and benchmark tool Michal Nazarewicz
2010-08-06  5:10     ` Denys Vlasenko
2010-08-06  8:34       ` Michał Nazarewicz
2010-08-06 15:57         ` Raja R Harinath
2010-08-06 19:26         ` Denys Vlasenko
2010-08-06 20:58           ` Michal Nazarewicz
2010-08-06  5:18   ` [PATCH 2/3] lib: vsprintf: optimised put_dec() for 32-bit machines Denys Vlasenko
2010-08-06  7:08     ` Michal Nazarewicz
2010-08-06  7:35       ` Denys Vlasenko
2010-08-06  8:54         ` Michał Nazarewicz
2010-08-06  3:58 ` [PATCH 1/3] lib: vsprintf: optimised put_dec_trunc() and put_dec_full() Denys Vlasenko
2010-08-06  6:56   ` Michal Nazarewicz

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.