linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/1] vsprintf: optimize decimal conversion (again)
@ 2012-03-26 18:47 Denys Vlasenko
  2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 18:47 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Douglas W Jones, Michal Nazarewicz

[-- Attachment #1: Type: text/plain, Size: 1842 bytes --]

Hi Andrew,

Can you take this patch into -mm?

Michal, Jones - can you review the code?

Some time ago, Michal Nazarewicz <mina86@mina86.com> optimized our
(already fast) decimal-to-string conversion even further.
Somehow this effort did not reach the kernel.

Here is a new iteration of his code.

The optimizations and the patch follow in the next email.

Please find test programs attached.

32-bit test programs were built using gcc 4.6.2
64-bit test programs were built using gcc 4.2.1
Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c

Sizes:
org32.o: 2850 bytes
new32.o: 2858 bytes
org64.o: 2155 bytes
new64.o: 2283 bytes

Correctness: I tested the first and the last 40 billion values from the [0, 2^64-1]
range; all of them produced the correct result.

Speed:
I measured how many thousands of conversions per second are done for several values
(it takes a different amount of time to convert, say, 123 and 2^64-1 to their
string representations).
Format of the data below: VALUE:THOUSANDS_OF_CONVS_PER_SEC.

Intel Core i7 2.7GHz:
org32: 8:46852 123:39252 123456:23992 12345678:21992 123456789:21048 2^32-1:20424 2^64-1:10216
new32: 8:55300 123:43208 123456:34456 12345678:31272 123456789:23584 2^32-1:23568 2^64-1:16720

AMD Phenom II X4 2.4GHz:
org32: 8:29244 123:23988 123456:13792 12345678:12056 123456789:11368 2^32-1:10804 2^64-1:5224
new32: 8:38040 123:30356 123456:22832 12345678:20676 123456789:13556 2^32-1:13472 2^64-1:9228

org64: 8:38664 123:29256 123456:19188 12345678:16320 123456789:15380 2^32-1:14896 2^64-1:7864
new64: 8:42664 123:31660 123456:21632 12345678:19220 123456789:20880 2^32-1:17580 2^64-1:9596

Summary: in all cases the new code is faster than the old one, in many cases by 30%,
and in a few cases by more than 50% (for example, on x86-32, conversion of num=12345678).
Code growth is ~0 in the 32-bit case and ~130 bytes in the 64-bit case.
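
For readers without the attachment, here is a minimal sketch of the kind of
measurement loop meant by "conversions per second". The real harness is in the
attached test programs; the timing method and the put_dec() prototype below are
assumptions for illustration only.

#include <stdio.h>
#include <time.h>

char *put_dec(char *buf, unsigned long long num); /* routine under test */

static void bench(unsigned long long val)
{
	char buf[32];
	unsigned long count = 0;
	clock_t start = clock();

	/* convert the same value repeatedly for roughly one second of CPU time */
	while (clock() - start < CLOCKS_PER_SEC) {
		int i;
		for (i = 0; i < 1000; i++)
			put_dec(buf, val);
		count += 1000;
	}
	printf("%llu:%lu ", val, count / 1000); /* thousands of conversions/sec */
}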

-- 
vda

[-- Attachment #2: test_dec_conversion.tar.gz --]
[-- Type: application/x-tgz, Size: 5949 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
@ 2012-03-26 18:51 ` Denys Vlasenko
  2012-03-26 19:51   ` Andrew Morton
  2012-03-27 12:08 ` [PATCH 0/1] " roma1390
  2012-03-27 13:49 ` roma1390
  2 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 18:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andrew Morton, Douglas W Jones, Michal Nazarewicz

commit 01a2904d31d2373886f489429ec662c9be64a6ab
Author: Denys Vlasenko <vda.linux@googlemail.com>
Date:   Mon Mar 26 20:40:53 2012 +0200

    vsprintf: optimize decimal conversion (again)
    
    Previous code was using optimizations which were developed
    to work well even on narrow-word CPUs (by today's standards).
    But Linux runs only on 32-bit and wider CPUs. We can use that.
    
    First: using 32x32->64 multiply and trivial 32-bit shift,
    we can correctly divide by 10 much larger numbers, and thus
    we can print groups of 9 digits instead of groups of 5 digits.
    
    Next: there are two algorithms to print larger numbers.
    One is generic: divide by 1000000000 and repeatedly print
    groups of (up to) 9 digits. It's conceptually simple,
    but requires an (unsigned long long) / 1000000000 division.
    
    Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
    manipulates them cleverly and generates groups of 4 decimal digits.
    It so happens that it does NOT require long long division.
    
    If long is > 32 bits, division of 64-bit values is relatively easy,
    and we will use the first algorithm.
    If long long is > 64 bits (strange architecture with VERY large long long),
    second algorithm can't be used, and we again use the first one.
    
    Else (if long is 32 bits and long long is 64 bits) we use second one.
    
    And third: there is a simple optimization which takes fast path
    not only for zero as was done before, but for all one-digit numbers.
    
    In all tested cases new code is faster than old one, in many cases by 30%,
    in few cases by more than 50% (for example, on x86-32, conversion of 12345678).
    Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
    
    Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index abbabec..0a2044c 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -112,106 +112,190 @@ int skip_atoi(const char **s)
 /* Decimal conversion is by far the most typical, and is used
  * for /proc and /sys data. This directly impacts e.g. top performance
  * with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
+ * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+ * (with permission from the author, Douglas W. Jones).
+ */
 
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
+/* Formats correctly any integer in [0, 999999999] */
 static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full9(char *buf, unsigned q)
 {
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
-
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0'; /* least significant digit */
-	d1 = q + 9*d3 + 5*d2 + d1;
-	if (d1 != 0) {
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0'; /* next digit */
-
-		d2 = q + 2*d2;
-		if ((d2 != 0) || (d3 != 0)) {
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0'; /* next digit */
-
-			d3 = q + 4*d3;
-			if (d3 != 0) {
-				q = (d3 * 0xcd) >> 11;
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';  /* next digit */
-				if (q != 0)
-					*buf++ = q + '0'; /* most sign. digit */
-			}
-		}
-	}
-
+	unsigned r;
+
+	/* Possible ways to approx. divide by 10
+	 * (x * 0x1999999a) >> 32 x < 1073741829 (multiply must be 64-bit)
+	 * (x * 0xcccd) >> 19     x <      81920 (x < 262149 when 64-bit mul)
+	 * (x * 0x6667) >> 18     x <      43699
+	 * (x * 0x3334) >> 17     x <      16389
+	 * (x * 0x199a) >> 16     x <      16389
+	 * (x * 0x0ccd) >> 15     x <      16389
+	 * (x * 0x0667) >> 14     x <       2739
+	 * (x * 0x0334) >> 13     x <       1029
+	 * (x * 0x019a) >> 12     x <       1029
+	 * (x * 0x00cd) >> 11     x <       1029 shorter code than * 0x67 (on i386)
+	 * (x * 0x0067) >> 10     x <        179
+	 * (x * 0x0034) >>  9     x <         69 same
+	 * (x * 0x001a) >>  8     x <         69 same
+	 * (x * 0x000d) >>  7     x <         69 same, shortest code (on i386)
+	 * (x * 0x0007) >>  6     x <         19
+	 * See <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+	 */
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 1 */
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 2 */
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 3 */
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 4 */
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 5 */
+	/* Now value is under 10000, can avoid 64-bit multiply */
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0'; /* 6 */
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0'; /* 7 */
+	q      = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q) + '0'; /* 8 */
+	*buf++ = q + '0'; /* 9 */
 	return buf;
 }
-/* Same with if's removed. Always emits five digits */
+#endif
+
+/* Similar to above but do not pad with zeros.
+ * Code can be easily arranged to print 9 digits too, but our callers
+ * always call put_dec_full9() instead when the number has 9 decimal digits.
+ */
 static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_trunc8(char *buf, unsigned r)
 {
-	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
-	/* but anyway, gcc produces better code with full-sized ints */
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
+	unsigned q;
+
+	/* Copy of previous function's body with added early returns */
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 2 */
+	if (q == 0) return buf;
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 3 */
+	if (r == 0) return buf;
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 4 */
+	if (q == 0) return buf;
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 5 */
+	if (r == 0) return buf;
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0'; /* 6 */
+	if (q == 0) return buf;
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0'; /* 7 */
+	if (r == 0) return buf;
+	q      = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q) + '0'; /* 8 */
+	if (q == 0) return buf;
+	*buf++ = q + '0'; /* 9 */
+	return buf;
+}
 
-	/*
-	 * Possible ways to approx. divide by 10
-	 * gcc -O2 replaces multiply with shifts and adds
-	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
-	 * (x * 0x67) >> 10:  1100111
-	 * (x * 0x34) >> 9:    110100 - same
-	 * (x * 0x1a) >> 8:     11010 - same
-	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
-	 */
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0';
-	d1 = q + 9*d3 + 5*d2 + d1;
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0';
-
-		d2 = q + 2*d2;
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0';
-
-			d3 = q + 4*d3;
-				q = (d3 * 0xcd) >> 11; /* - shorter code */
-				/* q = (d3 * 0x67) >> 10; - would also work */
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';
-					*buf++ = q + '0';
+/* There are two algorithms to print larger numbers.
+ * One is generic: divide by 1000000000 and repeatedly print
+ * groups of (up to) 9 digits. It's conceptually simple,
+ * but requires a (unsigned long long) / 1000000000 division.
+ *
+ * Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
+ * manipulates them cleverly and generates groups of 4 decimal digits.
+ * It so happens that it does NOT require long long division.
+ *
+ * If long is > 32 bits, division of 64-bit values is relatively easy,
+ * and we will use the first algorithm.
+ * If long long is > 64 bits (strange architecture with VERY large long long),
+ * second algorithm can't be used, and we again use the first one.
+ *
+ * Else (if long is 32 bits and long long is 64 bits) we use second one.
+ */
 
-	return buf;
+#if BITS_PER_LONG != 32 || ((~0ULL)>>1) != ((1ULL<<63)-1)
+
+/* First algorithm: generic */
+
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+	if (n >= 100*1000*1000) {
+		while (n >= 1000*1000*1000)
+			buf = put_dec_full9(buf, do_div(n, 1000*1000*1000));
+		if (n >= 100*1000*1000)
+			return put_dec_full9(buf, n);
+	}
+	return put_dec_trunc8(buf, n);
 }
-/* No inlining helps gcc to use registers better */
+
+#else
+
+/* Second algorithm: valid only for 32-bit longs, 64-bit long longs */
+
 static noinline_for_stack
-char *put_dec(char *buf, unsigned long long num)
+char *put_dec_full4(char *buf, unsigned q)
 {
-	while (1) {
-		unsigned rem;
-		if (num < 100000)
-			return put_dec_trunc(buf, num);
-		rem = do_div(num, 100000);
-		buf = put_dec_full(buf, rem);
-	}
+	unsigned r;
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+	*buf++ = r + '0';
+	return buf;
+}
+
+/* Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>
+ * (with permission from the author).
+ * Performs no 64-bit division and hence should be fast on 32-bit machines.
+ */
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q, h;
+
+	if (n < 100*1000*1000)
+		return put_dec_trunc8(buf, n);
+
+	d1  = ((uint32_t)n >> 16); /* implicit "& 0xffff" */
+	h   = (n >> 32);
+	d2  = (h      ) & 0xffff;
+	d3  = (h >> 16); /* implicit "& 0xffff" */
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
+
+	buf = put_dec_full4(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+	buf = put_dec_full4(buf, d1 % 10000);
+	q   = d1 / 10000;
+
+	d2  = q + 4749 * d3 + 42 * d2;
+	buf = put_dec_full4(buf, d2 % 10000);
+	q   = d2 / 10000;
+
+	d3  = q + 281 * d3;
+	if (!d3)
+		goto done;
+	buf = put_dec_full4(buf, d3 % 10000);
+	q   = d3 / 10000;
+	if (!q)
+		goto done;
+	buf = put_dec_full4(buf, q);
+ done:
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
 }
 
+#endif
 /*
  * Convert passed number to decimal string.
  * Returns the length of string.  On buffer overflow, returns 0.
@@ -220,7 +304,7 @@ char *put_dec(char *buf, unsigned long long num)
  */
 int num_to_str(char *buf, int size, unsigned long long num)
 {
-	char tmp[21];		/* Enough for 2^64 in decimal */
+	char tmp[sizeof(num) * 3];
 	int idx, len;
 
 	len = put_dec(tmp, num) - tmp;
@@ -229,7 +313,7 @@ int num_to_str(char *buf, int size, unsigned long long num)
 		return 0;
 	for (idx = 0; idx < len; ++idx)
 		buf[idx] = tmp[len - idx - 1];
-	return  len;
+	return len;
 }
 
 #define ZEROPAD	1		/* pad with zero */
@@ -312,8 +396,8 @@ char *number(char *buf, char *end, unsigned long long num,
 
 	/* generate full string in tmp[], in reverse order */
 	i = 0;
-	if (num == 0)
-		tmp[i++] = '0';
+	if (num < spec.base)
+		tmp[i++] = digits[num] | locase;
 	/* Generic code, for any base:
 	else do {
 		tmp[i++] = (digits[do_div(num,base)] | locase);
@@ -607,7 +691,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
 	}
 	for (i = 0; i < 4; i++) {
 		char temp[3];	/* hold each IP quad in reverse order */
-		int digits = put_dec_trunc(temp, addr[index]) - temp;
+		int digits = put_dec_trunc8(temp, addr[index]) - temp;
 		if (leading_zeros) {
 			if (digits < 3)
 				*p++ = '0';

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
@ 2012-03-26 19:51   ` Andrew Morton
  2012-03-26 19:56     ` Denys Vlasenko
  0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2012-03-26 19:51 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz

On Mon, 26 Mar 2012 20:51:24 +0200
Denys Vlasenko <vda.linux@googlemail.com> wrote:

> commit 01a2904d31d2373886f489429ec662c9be64a6ab
> Author: Denys Vlasenko <vda.linux@googlemail.com>
> Date:   Mon Mar 26 20:40:53 2012 +0200
> 
>     vsprintf: optimize decimal conversion (again)
>     
>     Previous code was using optimizations which were developed
>     to work well even on narrow-word CPUs (by today's standards).
>     But Linux runs only on 32-bit and wider CPUs. We can use that.
>     
>     First: using 32x32->64 multiply and trivial 32-bit shift,
>     we can correctly divide by 10 much larger numbers, and thus
>     we can print groups of 9 digits instead of groups of 5 digits.
>     
>     Next: there are two algorithms to print larger numbers.
>     One is generic: divide by 1000000000 and repeatedly print
>     groups of (up to) 9 digits. It's conceptually simple,
>     but requires an (unsigned long long) / 1000000000 division.
>     
>     Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
>     manipulates them cleverly and generates groups of 4 decimal digits.
>     It so happens that it does NOT require long long division.
>     
>     If long is > 32 bits, division of 64-bit values is relatively easy,
>     and we will use the first algorithm.
>     If long long is > 64 bits (strange architecture with VERY large long long),
>     second algorithm can't be used, and we again use the first one.
>     
>     Else (if long is 32 bits and long long is 64 bits) we use second one.
>     
>     And third: there is a simple optimization which takes fast path
>     not only for zero as was done before, but for all one-digit numbers.
>     
>     In all tested cases new code is faster than old one, in many cases by 30%,
>     in few cases by more than 50% (for example, on x86-32, conversion of 12345678).
>     Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
>     

This patch is so nutty that I like it.

> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)

What's this for?


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 19:51   ` Andrew Morton
@ 2012-03-26 19:56     ` Denys Vlasenko
  2012-03-26 20:13       ` Andrew Morton
  0 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 19:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz

On Mon, Mar 26, 2012 at 9:51 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 26 Mar 2012 20:51:24 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>> commit 01a2904d31d2373886f489429ec662c9be64a6ab
>> Author: Denys Vlasenko <vda.linux@googlemail.com>
>> Date:   Mon Mar 26 20:40:53 2012 +0200
>>
>>     vsprintf: optimize decimal conversion (again)
>>
>>     Previous code was using optimizations which were developed
>>     to work well even on narrow-word CPUs (by today's standards).
>>     But Linux runs only on 32-bit and wider CPUs. We can use that.
>>
>>     First: using 32x32->64 multiply and trivial 32-bit shift,
>>     we can correctly divide by 10 much larger numbers, and thus
>>     we can print groups of 9 digits instead of groups of 5 digits.
>>
>>     Next: there are two algorithms to print larger numbers.
>>     One is generic: divide by 1000000000 and repeatedly print
>>     groups of (up to) 9 digits. It's conceptually simple,
>>     but requires an (unsigned long long) / 1000000000 division.
>>
>>     Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
>>     manipulates them cleverly and generates groups of 4 decimal digits.
>>     It so happens that it does NOT require long long division.
>>
>>     If long is > 32 bits, division of 64-bit values is relatively easy,
>>     and we will use the first algorithm.
>>     If long long is > 64 bits (strange architecture with VERY large long long),
>>     second algorithm can't be used, and we again use the first one.
>>
>>     Else (if long is 32 bits and long long is 64 bits) we use second one.
>>
>>     And third: there is a simple optimization which takes fast path
>>     not only for zero as was done before, but for all one-digit numbers.
>>
>>     In all tested cases new code is faster than old one, in many cases by 30%,
>>     in few cases by more than 50% (for example, on x86-32, conversion of 12345678).
>>     Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
>>
>
> This patch is so nutty that I like it.
>
>> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>
> What's this for?

The second check should be just BITS_PER_LONG_LONG != 64,
but we don't have BITS_PER_LONG_LONG.

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 19:56     ` Denys Vlasenko
@ 2012-03-26 20:13       ` Andrew Morton
  2012-03-26 20:18         ` Geert Uytterhoeven
                           ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Andrew Morton @ 2012-03-26 20:13 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz

On Mon, 26 Mar 2012 21:56:38 +0200
Denys Vlasenko <vda.linux@googlemail.com> wrote:

> >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
> >
> > What's this for?
> 
> The second check should be just BITS_PER_LONG_LONG != 64,
> but we don't have BITS_PER_LONG_LONG.

So let's add BITS_PER_LONG_LONG rather than hacking around its absence!

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 20:13       ` Andrew Morton
@ 2012-03-26 20:18         ` Geert Uytterhoeven
  2012-03-26 23:18           ` Denys Vlasenko
  2012-03-26 20:20         ` H. Peter Anvin
  2012-03-27  0:26         ` Denys Vlasenko
  2 siblings, 1 reply; 31+ messages in thread
From: Geert Uytterhoeven @ 2012-03-26 20:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Denys Vlasenko, linux-kernel, Douglas W Jones, Michal Nazarewicz

On Mon, Mar 26, 2012 at 22:13, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 26 Mar 2012 21:56:38 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>> >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>> >
>> > What's this for?
>>
>> The second check should be just BITS_PER_LONG_LONG != 64,
>> but we don't have BITS_PER_LONG_LONG.
>
> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!

I don't think Linux runs on anything with BITS_PER_LONG_LONG != 64...

BTW, what about CPUs with slow 32x32 multiplication and/or slow 64-bit
division?

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 20:13       ` Andrew Morton
  2012-03-26 20:18         ` Geert Uytterhoeven
@ 2012-03-26 20:20         ` H. Peter Anvin
  2012-03-27 17:12           ` Michal Nazarewicz
  2012-03-27  0:26         ` Denys Vlasenko
  2 siblings, 1 reply; 31+ messages in thread
From: H. Peter Anvin @ 2012-03-26 20:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Denys Vlasenko, linux-kernel, Douglas W Jones, Michal Nazarewicz

On 03/26/2012 01:13 PM, Andrew Morton wrote:
> On Mon, 26 Mar 2012 21:56:38 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
> 
>>>> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>>>
>>> What's this for?
>>
>> The second check should be just BITS_PER_LONG_LONG != 64,
>> but we don't have BITS_PER_LONG_LONG.
> 
> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!

First of all, the #if is wrong: the preprocessor doesn't support data
types and does all arithmetic at (u)intmax_t precision.
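
For illustration (not part of the patch): inside an #if the preprocessor
evaluates every integer constant at (u)intmax_t width, so a test like the one
above only probes uintmax_t, never long long:

	/* valid preprocessor-only fragment */
	#if ((~0ULL) >> 1) != ((1ULL << 63) - 1)
	#error this can only fire if uintmax_t is wider than 64 bits
	#endif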

As far as BITS_PER_LONG_LONG, there are tons of places in the kernel
which already require that long long is exactly 64 bits.  That may or
may not be a good thing, but for right now one could simply:

#define BITS_PER_LONG_LONG 64

	-hpa


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 20:18         ` Geert Uytterhoeven
@ 2012-03-26 23:18           ` Denys Vlasenko
  2012-03-27  0:30             ` Denys Vlasenko
  2012-03-27  3:49             ` H. Peter Anvin
  0 siblings, 2 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 23:18 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Andrew Morton, linux-kernel, Douglas W Jones, Michal Nazarewicz

On Monday 26 March 2012 22:18, Geert Uytterhoeven wrote:
> On Mon, Mar 26, 2012 at 22:13, Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Mon, 26 Mar 2012 21:56:38 +0200
> > Denys Vlasenko <vda.linux@googlemail.com> wrote:
> >
> >> >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
> >> >
> >> > What's this for?
> >>
> >> The second check should be just BITS_PER_LONG_LONG != 64,
> >> but we don't have BITS_PER_LONG_LONG.
> >
> > So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
> 
> I don't think Linux runs on anything with BITS_PER_LONG_LONG != 64...
> 
> BTW, what about CPUs with slow 32x32 multiplication and/or slow 64-bit
> division?

Without a 32x32->64 multiply, the best we can generate is 4 decimal digits:
we produce the next digit by approximating x/10 with (x * 0xcccd) >> 19,
and the first x where that gives a wrong result is 81920 if the multiply
result is truncated to 32 bits.
With it, we can generate 9 digits using (x * 0x1999999a) >> 32.
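
A quick userspace check of those bounds (not part of the patch; it scans
32-bit inputs and reports the first value where each approximation breaks):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t x;

	for (x = 0; ; x++) {
		/* 16-bit reciprocal, product truncated to 32 bits */
		if ((uint32_t)(x * 0xcccd) >> 19 != x / 10) {
			printf("0xcccd>>19 first fails at %u\n", x); /* 81920 */
			break;
		}
	}
	for (x = 0; ; x++) {
		/* 32x32->64 reciprocal multiply */
		if (((uint64_t)x * 0x1999999a) >> 32 != x / 10) {
			printf("0x1999999a>>32 first fails at %u\n", x); /* per the table in the patch: 1073741829 */
			break;
		}
	}
	return 0;
}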

Regrading "slow 64-bit division" - after this patch, 32-bit machines
wouldn't use it at all. Only 64-bit machines will perform 64-bit
division, one per 9 decimal digits (thus, at most three divisions
per one long_long->string conversion).

In fact, with small change to #ifdefs, all machines with long long <= 64
bits can use division-less routine. It might be a good thing to try...

Any people with ARM hardware in hand interesting in running the test
program I sent in first email?

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 20:13       ` Andrew Morton
  2012-03-26 20:18         ` Geert Uytterhoeven
  2012-03-26 20:20         ` H. Peter Anvin
@ 2012-03-27  0:26         ` Denys Vlasenko
  2 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27  0:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz

On Monday 26 March 2012 22:13, Andrew Morton wrote:
> On Mon, 26 Mar 2012 21:56:38 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
> 
> > >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
> > >
> > > What's this for?
> > 
> > The second check should be just BITS_PER_LONG_LONG != 64,
> > but we don't have BITS_PER_LONG_LONG.
> 
> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!

Sure!

Here is a new version. I also plugged a hole in num_to_str() -
it assumed it was safe to call put_dec() for num=0.
(We never tripped over it before because the single caller
of num_to_str() takes care of that case.)

Signed-off-by and CC are also updated.
-- 
vda


commit 469bfce8d60d1069424aacc83fef50f9da5701bf
Author: Denys Vlasenko <vda.linux@googlemail.com>
Date:   Tue Mar 27 02:18:46 2012 +0200

    vsprintf: optimize decimal conversion (again)
    
    This patch is based upon an original from Michal Nazarewicz.
    
    Previous code was using optimizations which were developed
    to work well even on narrow-word CPUs (by today's standards).
    But Linux runs only on 32-bit and wider CPUs. We can use that.
    
    First: using 32x32->64 multiply and trivial 32-bit shift,
    we can correctly divide by 10 much larger numbers, and thus
    we can print groups of 9 digits instead of groups of 5 digits.
    
    Next: there are two algorithms to print larger numbers.
    One is generic: divide by 1000000000 and repeatedly print
    groups of (up to) 9 digits. It's conceptually simple,
    but requires an (unsigned long long) / 1000000000 division.
    
    Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
    manipulates them cleverly and generates groups of 4 decimal digits.
    It so happens that it does NOT require long long division.
    
    If long is > 32 bits, division of 64-bit values is relatively easy,
    and we will use the first algorithm.
    If long long is > 64 bits (strange architecture with VERY large long long),
    second algorithm can't be used, and we again use the first one.
    
    Else (if long is 32 bits and long long is 64 bits) we use second one.
    
    And third: there is a simple optimization which takes fast path
    not only for zero as was done before, but for all one-digit numbers.
    
    In all tested cases new code is faster than old one, in many cases by 30%,
    in few cases by more than 50% (for example, on x86-32, conversion of 12345678).
    Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
    
    Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
    Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
    Cc: Douglas W Jones <jones@cs.uiowa.edu>

diff --git a/include/asm-generic/bitsperlong.h b/include/asm-generic/bitsperlong.h
index 4ae54e0..a7b0914 100644
--- a/include/asm-generic/bitsperlong.h
+++ b/include/asm-generic/bitsperlong.h
@@ -28,5 +28,9 @@
 #error Inconsistent word size. Check asm/bitsperlong.h
 #endif
 
+#ifndef BITS_PER_LONG_LONG
+#define BITS_PER_LONG_LONG 64
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_GENERIC_BITS_PER_LONG */
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index abbabec..877a503 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -112,106 +112,191 @@ int skip_atoi(const char **s)
 /* Decimal conversion is by far the most typical, and is used
  * for /proc and /sys data. This directly impacts e.g. top performance
  * with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
+ * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+ * (with permission from the author, Douglas W. Jones).
+ */
 
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
+/* Formats correctly any integer in [0, 999999999] */
 static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full9(char *buf, unsigned q)
 {
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
-
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0'; /* least significant digit */
-	d1 = q + 9*d3 + 5*d2 + d1;
-	if (d1 != 0) {
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0'; /* next digit */
-
-		d2 = q + 2*d2;
-		if ((d2 != 0) || (d3 != 0)) {
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0'; /* next digit */
-
-			d3 = q + 4*d3;
-			if (d3 != 0) {
-				q = (d3 * 0xcd) >> 11;
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';  /* next digit */
-				if (q != 0)
-					*buf++ = q + '0'; /* most sign. digit */
-			}
-		}
-	}
-
+	unsigned r;
+
+	/* Possible ways to approx. divide by 10
+	 * (x * 0x1999999a) >> 32 x < 1073741829 (multiply must be 64-bit)
+	 * (x * 0xcccd) >> 19     x <      81920 (x < 262149 when 64-bit mul)
+	 * (x * 0x6667) >> 18     x <      43699
+	 * (x * 0x3334) >> 17     x <      16389
+	 * (x * 0x199a) >> 16     x <      16389
+	 * (x * 0x0ccd) >> 15     x <      16389
+	 * (x * 0x0667) >> 14     x <       2739
+	 * (x * 0x0334) >> 13     x <       1029
+	 * (x * 0x019a) >> 12     x <       1029
+	 * (x * 0x00cd) >> 11     x <       1029 shorter code than * 0x67 (on i386)
+	 * (x * 0x0067) >> 10     x <        179
+	 * (x * 0x0034) >>  9     x <         69 same
+	 * (x * 0x001a) >>  8     x <         69 same
+	 * (x * 0x000d) >>  7     x <         69 same, shortest code (on i386)
+	 * (x * 0x0007) >>  6     x <         19
+	 * See <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+	 */
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 1 */
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 2 */
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 3 */
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 4 */
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 5 */
+	/* Now value is under 10000, can avoid 64-bit multiply */
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0'; /* 6 */
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0'; /* 7 */
+	q      = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q) + '0'; /* 8 */
+	*buf++ = q + '0'; /* 9 */
 	return buf;
 }
-/* Same with if's removed. Always emits five digits */
+#endif
+
+/* Similar to above but do not pad with zeros.
+ * Code can be easily arranged to print 9 digits too, but our callers
+ * always call put_dec_full9() instead when the number has 9 decimal digits.
+ */
 static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_trunc8(char *buf, unsigned r)
 {
-	/* BTW, if q is in [0,9999], 8-bit ints will be enough, */
-	/* but anyway, gcc produces better code with full-sized ints */
-	unsigned d3, d2, d1, d0;
-	d1 = (q>>4) & 0xf;
-	d2 = (q>>8) & 0xf;
-	d3 = (q>>12);
+	unsigned q;
+
+	/* Copy of previous function's body with added early returns */
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 2 */
+	if (q == 0) return buf;
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 3 */
+	if (r == 0) return buf;
+	q      = (r * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (r - 10 * q) + '0'; /* 4 */
+	if (q == 0) return buf;
+	r      = (q * (uint64_t)0x1999999a) >> 32;
+	*buf++ = (q - 10 * r) + '0'; /* 5 */
+	if (r == 0) return buf;
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0'; /* 6 */
+	if (q == 0) return buf;
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0'; /* 7 */
+	if (r == 0) return buf;
+	q      = (r * 0xcd) >> 11;
+	*buf++ = (r - 10 * q) + '0'; /* 8 */
+	if (q == 0) return buf;
+	*buf++ = q + '0'; /* 9 */
+	return buf;
+}
 
-	/*
-	 * Possible ways to approx. divide by 10
-	 * gcc -O2 replaces multiply with shifts and adds
-	 * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
-	 * (x * 0x67) >> 10:  1100111
-	 * (x * 0x34) >> 9:    110100 - same
-	 * (x * 0x1a) >> 8:     11010 - same
-	 * (x * 0x0d) >> 7:      1101 - same, shortest code (on i386)
-	 */
-	d0 = 6*(d3 + d2 + d1) + (q & 0xf);
-	q = (d0 * 0xcd) >> 11;
-	d0 = d0 - 10*q;
-	*buf++ = d0 + '0';
-	d1 = q + 9*d3 + 5*d2 + d1;
-		q = (d1 * 0xcd) >> 11;
-		d1 = d1 - 10*q;
-		*buf++ = d1 + '0';
-
-		d2 = q + 2*d2;
-			q = (d2 * 0xd) >> 7;
-			d2 = d2 - 10*q;
-			*buf++ = d2 + '0';
-
-			d3 = q + 4*d3;
-				q = (d3 * 0xcd) >> 11; /* - shorter code */
-				/* q = (d3 * 0x67) >> 10; - would also work */
-				d3 = d3 - 10*q;
-				*buf++ = d3 + '0';
-					*buf++ = q + '0';
+/* There are two algorithms to print larger numbers.
+ * One is generic: divide by 1000000000 and repeatedly print
+ * groups of (up to) 9 digits. It's conceptually simple,
+ * but requires a (unsigned long long) / 1000000000 division.
+ *
+ * Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
+ * manipulates them cleverly and generates groups of 4 decimal digits.
+ * It so happens that it does NOT require long long division.
+ *
+ * If long is > 32 bits, division of 64-bit values is relatively easy,
+ * and we will use the first algorithm.
+ * If long long is > 64 bits (strange architecture with VERY large long long),
+ * second algorithm can't be used, and we again use the first one.
+ *
+ * Else (if long is 32 bits and long long is 64 bits) we use second one.
+ */
 
-	return buf;
+#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
+
+/* First algorithm: generic */
+
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+	if (n >= 100*1000*1000) {
+		while (n >= 1000*1000*1000)
+			buf = put_dec_full9(buf, do_div(n, 1000*1000*1000));
+		if (n >= 100*1000*1000)
+			return put_dec_full9(buf, n);
+	}
+	return put_dec_trunc8(buf, n);
 }
-/* No inlining helps gcc to use registers better */
+
+#else
+
+/* Second algorithm: valid only for 64-bit long longs */
+
 static noinline_for_stack
-char *put_dec(char *buf, unsigned long long num)
+char *put_dec_full4(char *buf, unsigned q)
 {
-	while (1) {
-		unsigned rem;
-		if (num < 100000)
-			return put_dec_trunc(buf, num);
-		rem = do_div(num, 100000);
-		buf = put_dec_full(buf, rem);
-	}
+	unsigned r;
+	r      = (q * 0xcccd) >> 19;
+	*buf++ = (q - 10 * r) + '0';
+	q      = (r * 0x199a) >> 16;
+	*buf++ = (r - 10 * q)  + '0';
+	r      = (q * 0xcd) >> 11;
+	*buf++ = (q - 10 * r)  + '0';
+	*buf++ = r + '0';
+	return buf;
 }
 
+/* Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>
+ * (with permission from the author).
+ * Performs no 64-bit division and hence should be fast on 32-bit machines.
+ */
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+	uint32_t d3, d2, d1, q, h;
+
+	if (n < 100*1000*1000)
+		return put_dec_trunc8(buf, n);
+
+	d1  = ((uint32_t)n >> 16); /* implicit "& 0xffff" */
+	h   = (n >> 32);
+	d2  = (h      ) & 0xffff;
+	d3  = (h >> 16); /* implicit "& 0xffff" */
+
+	q   = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
+
+	buf = put_dec_full4(buf, q % 10000);
+	q   = q / 10000;
+
+	d1  = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+	buf = put_dec_full4(buf, d1 % 10000);
+	q   = d1 / 10000;
+
+	d2  = q + 4749 * d3 + 42 * d2;
+	buf = put_dec_full4(buf, d2 % 10000);
+	q   = d2 / 10000;
+
+	d3  = q + 281 * d3;
+	if (!d3)
+		goto done;
+	buf = put_dec_full4(buf, d3 % 10000);
+	q   = d3 / 10000;
+	if (!q)
+		goto done;
+	buf = put_dec_full4(buf, q);
+ done:
+	while (buf[-1] == '0')
+		--buf;
+
+	return buf;
+}
+
+#endif
+
 /*
  * Convert passed number to decimal string.
  * Returns the length of string.  On buffer overflow, returns 0.
@@ -220,16 +305,22 @@ char *put_dec(char *buf, unsigned long long num)
  */
 int num_to_str(char *buf, int size, unsigned long long num)
 {
-	char tmp[21];		/* Enough for 2^64 in decimal */
+	char tmp[sizeof(num) * 3];
 	int idx, len;
 
-	len = put_dec(tmp, num) - tmp;
+	/* put_dec() may work incorrectly for num = 0 (generate "", not "0") */
+	if (num <= 9) {
+		tmp[0] = '0' + num;
+		len = 1;
+	} else {
+		len = put_dec(tmp, num) - tmp;
+	}
 
 	if (len > size)
 		return 0;
 	for (idx = 0; idx < len; ++idx)
 		buf[idx] = tmp[len - idx - 1];
-	return  len;
+	return len;
 }
 
 #define ZEROPAD	1		/* pad with zero */
@@ -312,8 +403,8 @@ char *number(char *buf, char *end, unsigned long long num,
 
 	/* generate full string in tmp[], in reverse order */
 	i = 0;
-	if (num == 0)
-		tmp[i++] = '0';
+	if (num < spec.base)
+		tmp[i++] = digits[num] | locase;
 	/* Generic code, for any base:
 	else do {
 		tmp[i++] = (digits[do_div(num,base)] | locase);
@@ -607,7 +698,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
 	}
 	for (i = 0; i < 4; i++) {
 		char temp[3];	/* hold each IP quad in reverse order */
-		int digits = put_dec_trunc(temp, addr[index]) - temp;
+		int digits = put_dec_trunc8(temp, addr[index]) - temp;
 		if (leading_zeros) {
 			if (digits < 3)
 				*p++ = '0';
 

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 23:18           ` Denys Vlasenko
@ 2012-03-27  0:30             ` Denys Vlasenko
  2012-03-27  3:49             ` H. Peter Anvin
  1 sibling, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27  0:30 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Andrew Morton, linux-kernel, Douglas W Jones, Michal Nazarewicz

On Tuesday 27 March 2012 01:18, Denys Vlasenko wrote:
> > I don't think Linux runs on anything with BITS_PER_LONG_LONG != 64...
> > 
> > BTW, what about CPUs with slow 32x32 multiplication and/or slow 64-bit
> > division?
> 
> Without 32x32->64 multiply, the best we can generate is 4 decimal digits:
> we produce next digit by approximating x/10 with (x * 0xcccd) >> 19,
> and the first x where it gives wrong result is 81920 if multiply result
> is truncated to 32 bits.
> With it, we can generate 9 digits using (x * 0x1999999a) >> 32.
> 
> Regrading "slow 64-bit division" - after this patch, 32-bit machines
> wouldn't use it at all. Only 64-bit machines will perform 64-bit
> division, one per 9 decimal digits (thus, at most three divisions
> per one long_long->string conversion).
> 
> In fact, with small change to #ifdefs, all machines with long long <= 64
> bits can use division-less routine. It might be a good thing to try...

Well, apparently it's not a good idea for my Phenom II in 64-bit mode:
It's bigger:

   text    data     bss     dec     hex filename
   2395     448       0    2843     b1b test_div64.o
   2507     448       0    2955     b8b test_nodiv.o

And slower:
Conversions per second:
div64: 8:42660000 123:31472000 123456:21748000 12345678:19140000 123456789:20980000 2^32:16948000 2^64:9480000
nodiv: 8:40532000 123:30276000 123456:21172000 12345678:18672000 123456789:13440000 2^32:13440000 2^64:8992000

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 23:18           ` Denys Vlasenko
  2012-03-27  0:30             ` Denys Vlasenko
@ 2012-03-27  3:49             ` H. Peter Anvin
  1 sibling, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2012-03-27  3:49 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Geert Uytterhoeven, Andrew Morton, linux-kernel, Douglas W Jones,
	Michal Nazarewicz

On 03/26/2012 04:18 PM, Denys Vlasenko wrote:
> Only 64-bit machines will perform 64-bit
> division, one per 9 decimal digits (thus, at most three divisions
> per one long_long->string conversion).

FWIW, gcc supports 128-bit integers on 64-bit platforms using the
__int128 keyword.
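
For example (a sketch, not from the patch), a 64x64->128 reciprocal multiply
gives an exact divide-by-10 for any 64-bit value:

#include <stdint.h>

/* 0xcccccccccccccccd is ceil(2^67 / 10); the full 128-bit product shifted
 * right by 67 equals x / 10 for every uint64_t x.
 */
static inline uint64_t div10(uint64_t x)
{
	return (uint64_t)(((unsigned __int128)x * 0xcccccccccccccccdULL) >> 67);
}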

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
  2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
@ 2012-03-27 12:08 ` roma1390
  2012-03-27 15:32   ` Denys Vlasenko
  2012-03-27 15:42   ` Denys Vlasenko
  2012-03-27 13:49 ` roma1390
  2 siblings, 2 replies; 31+ messages in thread
From: roma1390 @ 2012-03-27 12:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On 2012.03.26 21:47, Denys Vlasenko wrote:
> Please find test programs attached.
>
> 32-bit test programs were built using gcc 4.6.2
> 64-bit test programs were built using gcc 4.2.1
> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c

Can't compile reference for arm:
$ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
test_org.c: In function ‘put_dec’:
test_org.c:101: error: impossible constraint in ‘asm’
test_org.c:101: error: impossible constraint in ‘asm’
test_org.c:101: error: impossible constraint in ‘asm’


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
  2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
  2012-03-27 12:08 ` [PATCH 0/1] " roma1390
@ 2012-03-27 13:49 ` roma1390
  2012-03-27 15:33   ` Denys Vlasenko
  2 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-27 13:49 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

Hi Denys,

Can't compare speed to base, but I tested this test_new on
2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
  ./test_new
Conversions per second: 8:5528000 123:4568000 123456:3568000 12345678:3392000 
123456789:1168000 2^32:976000 2^64:532000
Conversions per second: 8:5524000 123:4568000 123456:3680000 12345678:3408000 
123456789:1132000 2^32:972000 2^64:532000
Conversions per second: 8:5028000 123:4416000 123456:3688000 12345678:3396000 
123456789:1168000 2^32:976000 2^64:512000
Conversions per second: 8:5524000 123:4572000 123456:3684000 12345678:3288000 
123456789:1168000 2^32:972000 2^64:532000
Tested 900988928                               ^Z

Tested-by: roma1390 <roma1390@gmail.com>

roma1390

On 2012.03.26 21:47, Denys Vlasenko wrote:
> Hi Andrew,
>
> Can you take this patch into -mm?
>
> Michal, Jones - can you review the code?
>
> Sometime ago, Michal Nazarewicz<mina86@mina86.com>  optimized our
> (already fast) decimal-to-string conversion even further.
> Somehow this effort did not reach the kernel.
>
> Here is a new iteration of his code.
>
> Optimizations and patch follow in next email.
>
> Please find test programs attached.
>
> 32-bit test programs were built using gcc 4.6.2
> 64-bit test programs were built using gcc 4.2.1
> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>
> Sizes:
> org32.o: 2850 bytes
> new32.o: 2858 bytes
> org64.o: 2155 bytes
> new64.o: 2283 bytes
>
> Correctness: I tested first and last 40 billion values from [0, 2^64-1] range,
> they all produced correct result.
>
> Speed:
> I measured how many thousands of conversions per second are done, for several values
> (it takes different amount of time to convert, say, 123 and 2^64-1 to their
> string representations).
> Format of data below: VALUE:THOUSANDS_OF_CONVS_PER_SEC.
>
> Intel Core i7 2.7GHz:
> org32: 8:46852 123:39252 123456:23992 12345678:21992 123456789:21048 2^32-1:20424 2^64-1:10216
> new32: 8:55300 123:43208 123456:34456 12345678:31272 123456789:23584 2^32-1:23568 2^64-1:16720
>
> AMD Phenom II X4 2.4GHz:
> org32: 8:29244 123:23988 123456:13792 12345678:12056 123456789:11368 2^32-1:10804 2^64-1:5224
> new32: 8:38040 123:30356 123456:22832 12345678:20676 123456789:13556 2^32-1:13472 2^64-1:9228
>
> org64: 8:38664 123:29256 123456:19188 12345678:16320 123456789:15380 2^32-1:14896 2^64-1:7864
> new64: 8:42664 123:31660 123456:21632 12345678:19220 123456789:20880 2^32-1:17580 2^64-1:9596
>
> Summary: in all cases new code is faster than old one, in many cases by 30%,
> in few cases by more than 50% (for example, on x86-32, conversion of num=12345678).
> Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-27 12:08 ` [PATCH 0/1] " roma1390
@ 2012-03-27 15:32   ` Denys Vlasenko
  2012-03-27 15:42   ` Denys Vlasenko
  1 sibling, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 15:32 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Tue, Mar 27, 2012 at 2:08 PM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.26 21:47, Denys Vlasenko wrote:
>>
>> Please find test programs attached.
>>
>> 32-bit test programs were built using gcc 4.6.2
>> 64-bit test programs were built using gcc 4.2.1
>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>
> Can't compile reference for arm:
> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
> test_org.c: In function ‘put_dec’:
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’

The problem is that test_header.c contains the do_div() macro
copied from the x86 arch.

Can you replace it with the one taken from the arm arch?
You can find it in arch/arm/include/asm/div64.h,
and the __asmeq macro can be found in
arch/arm/include/asm/system.h.

-- 
vda


arch/arm/include/asm/div64.h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-27 13:49 ` roma1390
@ 2012-03-27 15:33   ` Denys Vlasenko
  2012-03-29  5:16     ` roma1390
  0 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 15:33 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Tue, Mar 27, 2012 at 3:49 PM, roma1390 <roma1390@gmail.com> wrote:
> Hi Denys,
>
> Can't compare speed to base, but I tested this test_new on
> 2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
>  ./test_new
> Conversions per second: 8:5528000 123:4568000 123456:3568000
> 12345678:3392000 123456789:1168000 2^32:976000 2^64:532000
> Conversions per second: 8:5524000 123:4568000 123456:3680000
> 12345678:3408000 123456789:1132000 2^32:972000 2^64:532000
> Conversions per second: 8:5028000 123:4416000 123456:3688000
> 12345678:3396000 123456789:1168000 2^32:976000 2^64:512000
> Conversions per second: 8:5524000 123:4572000 123456:3684000
> 12345678:3288000 123456789:1168000 2^32:972000 2^64:532000
> Tested 900988928                               ^Z

Thanks, but without a baseline it's impossible to see whether
we have an improvement...

Can you also let it run longer, to test at least the
first 5 billion numbers (which is > 2^32)?

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-27 12:08 ` [PATCH 0/1] " roma1390
  2012-03-27 15:32   ` Denys Vlasenko
@ 2012-03-27 15:42   ` Denys Vlasenko
  2012-03-28  5:56     ` roma1390
  1 sibling, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 15:42 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

[-- Attachment #1: Type: text/plain, Size: 748 bytes --]

On Tue, Mar 27, 2012 at 2:08 PM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.26 21:47, Denys Vlasenko wrote:
>>
>> Please find test programs attached.
>>
>> 32-bit test programs were built using gcc 4.6.2
>> 64-bit test programs were built using gcc 4.2.1
>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>
> Can't compile reference for arm:
> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
> test_org.c: In function ‘put_dec’:
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’

Please find a modified test_header.c attached.
I tested it and it builds in my ARM emulator.

-- 
vda

[-- Attachment #2: test_header.c --]
[-- Type: text/x-csrc, Size: 8139 bytes --]

#include <features.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdbool.h>
#include <sys/types.h> /* size_t */
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdarg.h>
#include <stddef.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <stdio.h> /* for FILE */
#include <sys/time.h>
#include <time.h>

#define noinline_for_stack __attribute__((noinline))
typedef uint8_t u8;
typedef int16_t s16;

/*
 * The semantics of do_div() are:
 *
 * uint32_t do_div(uint64_t *n, uint32_t base)
 * {
 * 	uint32_t remainder = *n % base;
 * 	*n = *n / base;
 * 	return remainder;
 * }
 *
 * In other words, a 64-bit dividend with a 32-bit divisor producing
 * a 64-bit result and a 32-bit remainder.  To accomplish this optimally
 * we call a special __do_div64 helper with completely non standard
 * calling convention for arguments and results (beware).
 */
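
/*
 * Illustrative use (hypothetical values, for reading the code below):
 *
 *	unsigned long long n = 12345678901234ULL;
 *	unsigned int rem = do_div(n, 1000*1000*1000);
 *	// afterwards: n == 12345, rem == 678901234
 */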

#define __asmeq(x, y)  ".ifnc " x "," y " ; .err ; .endif\n\t"

#define __LINUX_ARM_ARCH__ 4

#ifdef __ARMEB__
#define __xh "r0"
#define __xl "r1"
#else
#define __xl "r0"
#define __xh "r1"
#endif

#define __do_div_asm(n, base)					\
({								\
	register unsigned int __base      asm("r4") = base;	\
	register unsigned long long __n   asm("r0") = n;	\
	register unsigned long long __res asm("r2");		\
	register unsigned int __rem       asm(__xh);		\
	asm(	__asmeq("%0", __xh)				\
		__asmeq("%1", "r2")				\
		__asmeq("%2", "r0")				\
		__asmeq("%3", "r4")				\
		"bl	__do_div64"				\
		: "=r" (__rem), "=r" (__res)			\
		: "r" (__n), "r" (__base)			\
		: "ip", "lr", "cc");				\
	n = __res;						\
	__rem;							\
})

#if __GNUC__ < 4

/*
 * gcc versions earlier than 4.0 are simply too problematic for the
 * optimized implementation below. First there is gcc PR 15089 that
 * tend to trig on more complex constructs, spurious .global __udivsi3
 * are inserted even if none of those symbols are referenced in the
 * generated code, and those gcc versions are not able to do constant
 * propagation on long long values anyway.
 */
#define do_div(n, base) __do_div_asm(n, base)

#elif __GNUC__ >= 4

/*
 * If the divisor happens to be constant, we determine the appropriate
 * inverse at compile time to turn the division into a few inline
 * multiplications instead which is much faster. And yet only if compiling
 * for ARMv4 or higher (we need umull/umlal) and if the gcc version is
 * sufficiently recent to perform proper long long constant propagation.
 * (It is unfortunate that gcc doesn't perform all this internally.)
 */
#define do_div(n, base)							\
({									\
	unsigned int __r, __b = (base);					\
	if (!__builtin_constant_p(__b) || __b == 0 ||			\
	    (__LINUX_ARM_ARCH__ < 4 && (__b & (__b - 1)) != 0)) {	\
		/* non-constant divisor (or zero): slow path */		\
		__r = __do_div_asm(n, __b);				\
	} else if ((__b & (__b - 1)) == 0) {				\
		/* Trivial: __b is constant and a power of 2 */		\
		/* gcc does the right thing with this code.  */		\
		__r = n;						\
		__r &= (__b - 1);					\
		n /= __b;						\
	} else {							\
		/* Multiply by inverse of __b: n/b = n*(p/b)/p       */	\
		/* We rely on the fact that most of this code gets   */	\
		/* optimized away at compile time due to constant    */	\
		/* propagation and only a couple inline assembly     */	\
		/* instructions should remain. Better avoid any      */	\
		/* code construct that might prevent that.           */	\
		unsigned long long __res, __x, __t, __m, __n = n;	\
		unsigned int __c, __p, __z = 0;				\
		/* preserve low part of n for reminder computation */	\
		__r = __n;						\
		/* determine number of bits to represent __b */		\
		__p = 1 << __div64_fls(__b);				\
		/* compute __m = ((__p << 64) + __b - 1) / __b */	\
		__m = (~0ULL / __b) * __p;				\
		__m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b;	\
		/* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */	\
		__x = ~0ULL / __b * __b - 1;				\
		__res = (__m & 0xffffffff) * (__x & 0xffffffff);	\
		__res >>= 32;						\
		__res += (__m & 0xffffffff) * (__x >> 32);		\
		__t = __res;						\
		__res += (__x & 0xffffffff) * (__m >> 32);		\
		__t = (__res < __t) ? (1ULL << 32) : 0;			\
		__res = (__res >> 32) + __t;				\
		__res += (__m >> 32) * (__x >> 32);			\
		__res /= __p;						\
		/* Now sanitize and optimize what we've got. */		\
		if (~0ULL % (__b / (__b & -__b)) == 0) {		\
			/* those cases can be simplified with: */	\
			__n /= (__b & -__b);				\
			__m = ~0ULL / (__b / (__b & -__b));		\
			__p = 1;					\
			__c = 1;					\
		} else if (__res != __x / __b) {			\
			/* We can't get away without a correction    */	\
			/* to compensate for bit truncation errors.  */	\
			/* To avoid it we'd need an additional bit   */	\
			/* to represent __m which would overflow it. */	\
			/* Instead we do m=p/b and n/b=(n*m+m)/p.    */	\
			__c = 1;					\
			/* Compute __m = (__p << 64) / __b */		\
			__m = (~0ULL / __b) * __p;			\
			__m += ((~0ULL % __b + 1) * __p) / __b;		\
		} else {						\
			/* Reduce __m/__p, and try to clear bit 31   */	\
			/* of __m when possible otherwise that'll    */	\
			/* need extra overflow handling later.       */	\
			unsigned int __bits = -(__m & -__m);		\
			__bits |= __m >> 32;				\
			__bits = (~__bits) << 1;			\
			/* If __bits == 0 then setting bit 31 is     */	\
			/* unavoidable.  Simply apply the maximum    */	\
			/* possible reduction in that case.          */	\
			/* Otherwise the MSB of __bits indicates the */	\
			/* best reduction we should apply.           */	\
			if (!__bits) {					\
				__p /= (__m & -__m);			\
				__m /= (__m & -__m);			\
			} else {					\
				__p >>= __div64_fls(__bits);		\
				__m >>= __div64_fls(__bits);		\
			}						\
			/* No correction needed. */			\
			__c = 0;					\
		}							\
		/* Now we have a combination of 2 conditions:        */	\
		/* 1) whether or not we need a correction (__c), and */	\
		/* 2) whether or not there might be an overflow in   */	\
		/*    the cross product (__m & ((1<<63) | (1<<31)))  */	\
		/* Select the best insn combination to perform the   */	\
		/* actual __m * __n / (__p << 64) operation.         */	\
		if (!__c) {						\
			asm (	"umull	%Q0, %R0, %1, %Q2\n\t"		\
				"mov	%Q0, #0"			\
				: "=&r" (__res)				\
				: "r" (__m), "r" (__n)			\
				: "cc" );				\
		} else if (!(__m & ((1ULL << 63) | (1ULL << 31)))) {	\
			__res = __m;					\
			asm (	"umlal	%Q0, %R0, %Q1, %Q2\n\t"		\
				"mov	%Q0, #0"			\
				: "+&r" (__res)				\
				: "r" (__m), "r" (__n)			\
				: "cc" );				\
		} else {						\
			asm (	"umull	%Q0, %R0, %Q1, %Q2\n\t"		\
				"cmn	%Q0, %Q1\n\t"			\
				"adcs	%R0, %R0, %R1\n\t"		\
				"adc	%Q0, %3, #0"			\
				: "=&r" (__res)				\
				: "r" (__m), "r" (__n), "r" (__z)	\
				: "cc" );				\
		}							\
		if (!(__m & ((1ULL << 63) | (1ULL << 31)))) {		\
			asm (	"umlal	%R0, %Q0, %R1, %Q2\n\t"		\
				"umlal	%R0, %Q0, %Q1, %R2\n\t"		\
				"mov	%R0, #0\n\t"			\
				"umlal	%Q0, %R0, %R1, %R2"		\
				: "+&r" (__res)				\
				: "r" (__m), "r" (__n)			\
				: "cc" );				\
		} else {						\
			asm (	"umlal	%R0, %Q0, %R2, %Q3\n\t"		\
				"umlal	%R0, %1, %Q2, %R3\n\t"		\
				"mov	%R0, #0\n\t"			\
				"adds	%Q0, %1, %Q0\n\t"		\
				"adc	%R0, %R0, #0\n\t"		\
				"umlal	%Q0, %R0, %R2, %R3"		\
				: "+&r" (__res), "+&r" (__z)		\
				: "r" (__m), "r" (__n)			\
				: "cc" );				\
		}							\
		__res /= __p;						\
		/* The remainder can be computed with 32-bit regs    */	\
		/* only, and gcc is good at that.                    */	\
		{							\
			unsigned int __res0 = __res;			\
			unsigned int __b0 = __b;			\
			__r -= __res0 * __b0;				\
		}							\
		/* BUG_ON(__r >= __b || __res * __b + __r != n); */	\
		n = __res;						\
	}								\
	__r;								\
})

/* our own fls implementation to make sure constant propagation is fine */
#define __div64_fls(bits)						\
({									\
	unsigned int __left = (bits), __nr = 0;				\
	if (__left & 0xffff0000) __nr += 16, __left >>= 16;		\
	if (__left & 0x0000ff00) __nr +=  8, __left >>=  8;		\
	if (__left & 0x000000f0) __nr +=  4, __left >>=  4;		\
	if (__left & 0x0000000c) __nr +=  2, __left >>=  2;		\
	if (__left & 0x00000002) __nr +=  1;				\
	__nr;								\
})

#endif
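
As an aside, the four-partial-product sequence above (the one that computes
the upper 64 bits of __m * __x) is the textbook decomposition of a
64x64->128-bit multiply into 32x32->64-bit pieces.  A stand-alone C sketch
of the same scheme, with a function name of my own choosing, purely for
illustration:

static unsigned long long mul64_hi(unsigned long long a, unsigned long long b)
{
	/* split both operands into 32-bit halves */
	unsigned long long al = a & 0xffffffffULL, ah = a >> 32;
	unsigned long long bl = b & 0xffffffffULL, bh = b >> 32;
	/* partial products; folding each carry in as we go keeps
	   every intermediate sum below 2^64 */
	unsigned long long lo = al * bl;
	unsigned long long t1 = ah * bl + (lo >> 32);
	unsigned long long t2 = al * bh + (t1 & 0xffffffffULL);
	/* upper 64 bits of a * b */
	return ah * bh + (t1 >> 32) + (t2 >> 32);
}

The umull/umlal asm blocks above do the same kind of wide multiply for the
run-time __m * __n product, with the carries handled explicitly.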

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-26 20:20         ` H. Peter Anvin
@ 2012-03-27 17:12           ` Michal Nazarewicz
  2012-03-27 17:17             ` H. Peter Anvin
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Nazarewicz @ 2012-03-27 17:12 UTC (permalink / raw)
  To: Andrew Morton, H. Peter Anvin
  Cc: Denys Vlasenko, linux-kernel, Douglas W Jones, Michal Nazarewicz

On Mon, 26 Mar 2012 22:20:52 +0200, H. Peter Anvin <hpa@zytor.com> wrote:

> On 03/26/2012 01:13 PM, Andrew Morton wrote:
>> On Mon, 26 Mar 2012 21:56:38 +0200
>> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>>
>>>>> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>>>>
>>>> What's this for?
>>>
>>> The second check should be just BITS_PER_LONG_LONG != 64,
>>> but we don't have BITS_PER_LONG_LONG.
>>
>> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
>
> First of all, the #if is wrong: the preprocessor doesn't support data
> types and does all arithmetic at (u)intmax_t precision.

Maybe a regular “if” instead of a macro would suffice here?  The compiler
should be smart enough to optimise out the dead path.
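
For illustration only (the helper names below are made up, not anything from
the patch), the idea would be something like:

static char *put_dec_either(char *buf, unsigned long long num)
{
	if (sizeof(long) == 4 && sizeof(long long) == 8)
		return put_dec_32bit(buf, num);	/* hypothetical name */
	return put_dec_generic(buf, num);	/* hypothetical name */
}

Both branches still have to compile, of course, but since the condition is a
compile-time constant gcc should simply drop the dead one, and an uncalled
static function disappears from the object file as well.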

> As far as BITS_PER_LONG_LONG, there are tons of places in the kernel
> which already require that long long is exactly 64 bits.  That may or
> may not be a good thing, but for right now one could simply:
>
> #define BITS_PER_LONG_LONG 64

-- 
Best regards,                                         _     _
.o. | Liege of Serenely Enlightened Majesty of      o' \,=./ `o
..o | Computer Science,  Michał “mina86” Nazarewicz    (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
  2012-03-27 17:12           ` Michal Nazarewicz
@ 2012-03-27 17:17             ` H. Peter Anvin
  0 siblings, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2012-03-27 17:17 UTC (permalink / raw)
  To: Michal Nazarewicz
  Cc: Andrew Morton, Denys Vlasenko, linux-kernel, Douglas W Jones,
	Michal Nazarewicz

On 03/27/2012 10:12 AM, Michal Nazarewicz wrote:
> 
> Maybe a regular “if” instead of a macro would suffice here?  The compiler
> should be smart enough to optimise out the dead path.
> 

Yes, a regular if would work, but it's also pointless.

	-hpa

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-27 15:42   ` Denys Vlasenko
@ 2012-03-28  5:56     ` roma1390
  2012-03-28 10:13       ` Denys Vlasenko
  0 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28  5:56 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On 2012.03.27 18:42, Denys Vlasenko wrote:
> On Tue, Mar 27, 2012 at 2:08 PM, roma1390<roma1390@gmail.com>  wrote:
>> On 2012.03.26 21:47, Denys Vlasenko wrote:
>>>
>>> Please find test programs attached.
>>>
>>> 32-bit test programs were built using gcc 4.6.2
>>> 64-bit test programs were built using gcc 4.2.1
>>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>>
>> Can't compile reference for arm:
>> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
>> test_org.c: In function ‘put_dec’:
>> test_org.c:101: error: impossible constraint in ‘asm’
>> test_org.c:101: error: impossible constraint in ‘asm’
>> test_org.c:101: error: impossible constraint in ‘asm’
>
> Please find a modified test_header.c attached.
> I tested and it builds in my arm emulator.


Run on same:
  2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
GCC version:
  gcc version 4.4.5 (Debian 4.4.5-8), Target: arm-linux-gnueabi
Compiled with:
  arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -o test_{org,new}


Run at default priority on an almost idle machine:

  ./test_org
Conversions per second: 8:4716000 123:3896000 123456:2756000 12345678:2412000 123456789:2380000 2^32:2348000 2^64:1400000
Conversions per second: 8:4556000 123:3892000 123456:2760000 12345678:2496000 123456789:2384000 2^32:2264000 2^64:1400000
Conversions per second: 8:4716000 123:3896000 123456:2664000 12345678:2496000 123456789:2388000 2^32:2348000 2^64:1400000
Conversions per second: 8:4716000 123:3892000 123456:2760000 12345678:2496000 123456789:2292000 2^32:2348000 2^64:1400000
 ./test_org
Conversions per second: 8:4716000 123:3748000 123456:2760000 12345678:2496000 123456789:2388000 2^32:2348000 2^64:1348000
Conversions per second: 8:4716000 123:3896000 123456:2756000 12345678:2408000 123456789:2384000 2^32:2348000 2^64:1400000
Conversions per second: 8:4540000 123:3896000 123456:2648000 12345678:2496000 123456789:2388000 2^32:2268000 2^64:1400000
Conversions per second: 8:4716000 123:3888000 123456:2664000 12345678:2496000 123456789:2388000 2^32:2344000 2^64:1400000

./test_new
Conversions per second: 8:5496000 123:4564000 123456:3536000 12345678:3408000 123456789:1168000 2^32:976000 2^64:532000
Conversions per second: 8:5524000 123:4568000 123456:3692000 12345678:3404000 123456789:1132000 2^32:972000 2^64:532000
Conversions per second: 8:5528000 123:4404000 123456:3684000 12345678:3408000 123456789:1168000 2^32:976000 2^64:516000
Conversions per second: 8:5528000 123:4568000 123456:3692000 12345678:3288000 123456789:1168000 2^32:976000 2^64:532000
./test_new
Conversions per second: 8:5524000 123:4416000 123456:3688000 12345678:3400000 123456789:1172000 2^32:972000 2^64:516000
Conversions per second: 8:5520000 123:4568000 123456:3692000 12345678:3276000 123456789:1168000 2^32:976000 2^64:532000
Conversions per second: 8:5344000 123:4552000 123456:3692000 12345678:3404000 123456789:1168000 2^32:944000 2^64:532000
Conversions per second: 8:5528000 123:4568000 123456:3556000 12345678:3404000 123456789:1168000 2^32:976000 2^64:532000

I will run the long-running correctness test and post the results later.
But this can take a while...


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28  5:56     ` roma1390
@ 2012-03-28 10:13       ` Denys Vlasenko
  2012-03-28 10:24         ` roma1390
  2012-03-28 10:31         ` roma1390
  0 siblings, 2 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 10:13 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Wednesday 28 March 2012 07:56, roma1390 wrote:
> On 2012.03.27 18:42, Denys Vlasenko wrote:
> > On Tue, Mar 27, 2012 at 2:08 PM, roma1390<roma1390@gmail.com>  wrote:
> >> On 2012.03.26 21:47, Denys Vlasenko wrote:
> >>>
> >>> Please find test programs attached.
> >>>
> >>> 32-bit test programs were built using gcc 4.6.2
> >>> 64-bit test programs were built using gcc 4.2.1
> >>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
> >>
> >> Can't compile reference for arm:
> >> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
> >> test_org.c: In function ‘put_dec’:
> >> test_org.c:101: error: impossible constraint in ‘asm’
> >> test_org.c:101: error: impossible constraint in ‘asm’
> >> test_org.c:101: error: impossible constraint in ‘asm’
> >
> > Please find a modified test_header.c attached.
> > I tested and it builds in my arm emulator.
> 
> 
> Run on same:
>   2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
> GCC version:
>   gcc version 4.4.5 (Debian 4.4.5-8), Target: arm-linux-gnueabi
> Compiled with:
>   arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -o test_{org,new}
> 
> 
> Run at default priority on an almost idle machine:

Hmm, the results are not good. For conversions of up to 8 digits we were winning,
but when we start converting larger numbers, we lose big time:

123456789:2388000 2^32:2268000 2^64:1400000
123456789:1168000 2^32:976000 2^64:532000

Since ARM is 32-bit, we are using algorithm #2.

It's not like algorithm #2 is intrinsically bad: on i386, it is a win
on both the AMD and the Intel CPUs I tested it on.

Either it's just the difference between ARM and x86, or gcc is
generating suboptimal code for it.

I would like to ask you to do a few things.

First: email me both test_org and test_new binaries.
(Privately, to not spam the list).

Second: run
arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
and email me resulting test_{org,new}.s files.

Third: switch to algorithm #1 and test whether it fares better.
To do that, go to test_new.c
and replace
  #if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
with 
  #if 1   ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
(there are two instances of this line there),
then recompile and rerun the test, and post the result.


When I disassemble ARM code produced by _my_ compiler,
I don't see any obviously bad things.

put_dec_trunc8 is the function which works well for you.

put_dec_full4 and put_dec are used for printing 9+ digit numbers,
and your testing says they are slow. I don't see why -
see disassembly below.

I need to look at _your_ gcc's output...

00000000 <put_dec_trunc8>:
   0:   e92d40f0        stmdb   sp!, {r4, r5, r6, r7, lr}
   4:   e59f6188        ldr     r6, [pc, #392]  ; 194 <.text+0x194>
   8:   e0843691        umull   r3, r4, r1, r6
   c:   e1a03004        mov     r3, r4
  10:   e1a0c003        mov     ip, r3
  14:   e1a02083        mov     r2, r3, lsl #1
  18:   e1a03183        mov     r3, r3, lsl #3
  1c:   e3a04000        mov     r4, #0  ; 0x0
  20:   e0822003        add     r2, r2, r3
  24:   e2811030        add     r1, r1, #48     ; 0x30
  28:   e0621001        rsb     r1, r2, r1
  2c:   e15c0004        cmp     ip, r4
  30:   e1a05000        mov     r5, r0
  34:   e3a07000        mov     r7, #0  ; 0x0
  38:   e4c01001        strb    r1, [r0], #1
  3c:   0a000052        beq     18c <put_dec_trunc8+0x18c>
  40:   e084369c        umull   r3, r4, ip, r6
  44:   e1a03004        mov     r3, r4
  48:   e1a02083        mov     r2, r3, lsl #1
  4c:   e1a01183        mov     r1, r3, lsl #3
  50:   e1a0e003        mov     lr, r3
  54:   e3a04000        mov     r4, #0  ; 0x0
  58:   e0822001        add     r2, r2, r1
  5c:   e28c3030        add     r3, ip, #48     ; 0x30
  60:   e0623003        rsb     r3, r2, r3
  64:   e15e0004        cmp     lr, r4
  68:   e5c53001        strb    r3, [r5, #1]
  6c:   e2800001        add     r0, r0, #1      ; 0x1
  70:   0a000045        beq     18c <put_dec_trunc8+0x18c>
  74:   e084369e        umull   r3, r4, lr, r6
  78:   e1a03004        mov     r3, r4
  7c:   e1a02083        mov     r2, r3, lsl #1
  80:   e1a01183        mov     r1, r3, lsl #3
  84:   e1a05003        mov     r5, r3
  88:   e3a04000        mov     r4, #0  ; 0x0
  8c:   e0822001        add     r2, r2, r1
  90:   e28e3030        add     r3, lr, #48     ; 0x30
  94:   e0623003        rsb     r3, r2, r3
  98:   e1550004        cmp     r5, r4
  9c:   e4c03001        strb    r3, [r0], #1
  a0:   0a000039        beq     18c <put_dec_trunc8+0x18c>
  a4:   e0843695        umull   r3, r4, r5, r6
  a8:   e1a03004        mov     r3, r4
  ac:   e1a02083        mov     r2, r3, lsl #1
  b0:   e1a01183        mov     r1, r3, lsl #3
  b4:   e1a0c003        mov     ip, r3
  b8:   e3a04000        mov     r4, #0  ; 0x0
  bc:   e0822001        add     r2, r2, r1
  c0:   e2853030        add     r3, r5, #48     ; 0x30
  c4:   e0623003        rsb     r3, r2, r3
  c8:   e15c0004        cmp     ip, r4
  cc:   e4c03001        strb    r3, [r0], #1
  d0:   0a00002d        beq     18c <put_dec_trunc8+0x18c>
  d4:   e1a0220c        mov     r2, ip, lsl #4
  d8:   e042210c        sub     r2, r2, ip, lsl #2
  dc:   e082200c        add     r2, r2, ip
  e0:   e1a03302        mov     r3, r2, lsl #6
  e4:   e0623003        rsb     r3, r2, r3
  e8:   e1a03103        mov     r3, r3, lsl #2
  ec:   e083300c        add     r3, r3, ip
  f0:   e1a03083        mov     r3, r3, lsl #1
  f4:   e1a0e823        mov     lr, r3, lsr #16
  f8:   e1a0208e        mov     r2, lr, lsl #1
  fc:   e1a0118e        mov     r1, lr, lsl #3
 100:   e0822001        add     r2, r2, r1
 104:   e28c3030        add     r3, ip, #48     ; 0x30
 108:   e0623003        rsb     r3, r2, r3
 10c:   e15e0004        cmp     lr, r4
 110:   e4c03001        strb    r3, [r0], #1
 114:   0a00001c        beq     18c <put_dec_trunc8+0x18c>
 118:   e1a0320e        mov     r3, lr, lsl #4
 11c:   e043310e        sub     r3, r3, lr, lsl #2
 120:   e1a02203        mov     r2, r3, lsl #4
 124:   e0833002        add     r3, r3, r2
 128:   e083300e        add     r3, r3, lr
 12c:   e1a0c5a3        mov     ip, r3, lsr #11
 130:   e1a0208c        mov     r2, ip, lsl #1
 134:   e1a0118c        mov     r1, ip, lsl #3
 138:   e0822001        add     r2, r2, r1
 13c:   e28e3030        add     r3, lr, #48     ; 0x30
 140:   e0623003        rsb     r3, r2, r3
 144:   e15c0004        cmp     ip, r4
 148:   e4c03001        strb    r3, [r0], #1
 14c:   0a00000e        beq     18c <put_dec_trunc8+0x18c>
 150:   e1a0320c        mov     r3, ip, lsl #4
 154:   e043310c        sub     r3, r3, ip, lsl #2
 158:   e1a02203        mov     r2, r3, lsl #4
 15c:   e0833002        add     r3, r3, r2
 160:   e083300c        add     r3, r3, ip
 164:   e1a0e5a3        mov     lr, r3, lsr #11
 168:   e1a0208e        mov     r2, lr, lsl #1
 16c:   e1a0118e        mov     r1, lr, lsl #3
 170:   e0822001        add     r2, r2, r1
 174:   e28c3030        add     r3, ip, #48     ; 0x30
 178:   e0623003        rsb     r3, r2, r3
 17c:   e15e0004        cmp     lr, r4
 180:   e4c03001        strb    r3, [r0], #1
 184:   128e3030        addne   r3, lr, #48     ; 0x30
 188:   14c03001        strneb  r3, [r0], #1
 18c:   e8bd40f0        ldmia   sp!, {r4, r5, r6, r7, lr}
 190:   e12fff1e        bx      lr
 194:   1999999a        ldmneib r9, {r1, r3, r4, r7, r8, fp, ip, pc}

00000198 <put_dec_full4>:
 198:   e1a0c201        mov     ip, r1, lsl #4
 19c:   e04cc101        sub     ip, ip, r1, lsl #2
 1a0:   e1a0320c        mov     r3, ip, lsl #4
 1a4:   e08cc003        add     ip, ip, r3
 1a8:   e1a0240c        mov     r2, ip, lsl #8
 1ac:   e08cc002        add     ip, ip, r2
 1b0:   e08cc001        add     ip, ip, r1
 1b4:   e1a0c9ac        mov     ip, ip, lsr #19
 1b8:   e1a0320c        mov     r3, ip, lsl #4
 1bc:   e043310c        sub     r3, r3, ip, lsl #2
 1c0:   e083300c        add     r3, r3, ip
 1c4:   e1a02303        mov     r2, r3, lsl #6
 1c8:   e0632002        rsb     r2, r3, r2
 1cc:   e1a02102        mov     r2, r2, lsl #2
 1d0:   e082200c        add     r2, r2, ip
 1d4:   e1a02082        mov     r2, r2, lsl #1
 1d8:   e1a02822        mov     r2, r2, lsr #16
 1dc:   e92d4070        stmdb   sp!, {r4, r5, r6, lr}
 1e0:   e1a0e202        mov     lr, r2, lsl #4
 1e4:   e04ee102        sub     lr, lr, r2, lsl #2
 1e8:   e1a0320e        mov     r3, lr, lsl #4
 1ec:   e08ee003        add     lr, lr, r3
 1f0:   e1a0408c        mov     r4, ip, lsl #1
 1f4:   e1a0318c        mov     r3, ip, lsl #3
 1f8:   e0844003        add     r4, r4, r3
 1fc:   e08ee002        add     lr, lr, r2
 200:   e2811030        add     r1, r1, #48     ; 0x30
 204:   e0641001        rsb     r1, r4, r1
 208:   e1a0e5ae        mov     lr, lr, lsr #11
 20c:   e1a04000        mov     r4, r0
 210:   e4c41001        strb    r1, [r4], #1
 214:   e1a06000        mov     r6, r0
 218:   e1a05182        mov     r5, r2, lsl #3
 21c:   e1a0018e        mov     r0, lr, lsl #3
 220:   e1a01082        mov     r1, r2, lsl #1
 224:   e1a0308e        mov     r3, lr, lsl #1
 228:   e0833000        add     r3, r3, r0
 22c:   e0811005        add     r1, r1, r5
 230:   e28cc030        add     ip, ip, #48     ; 0x30
 234:   e2822030        add     r2, r2, #48     ; 0x30
 238:   e2840001        add     r0, r4, #1      ; 0x1
 23c:   e061c00c        rsb     ip, r1, ip
 240:   e0632002        rsb     r2, r3, r2
 244:   e28ee030        add     lr, lr, #48     ; 0x30
 248:   e5c6c001        strb    ip, [r6, #1]
 24c:   e5c42001        strb    r2, [r4, #1]
 250:   e5c0e001        strb    lr, [r0, #1]
 254:   e2800002        add     r0, r0, #2      ; 0x2
 258:   e8bd4070        ldmia   sp!, {r4, r5, r6, lr}
 25c:   e12fff1e        bx      lr

00000260 <put_dec>:
 260:   e92d4ff0        stmdb   sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
 264:   e3530000        cmp     r3, #0  ; 0x0
 268:   e24dd00c        sub     sp, sp, #12     ; 0xc
 26c:   e1a08002        mov     r8, r2
 270:   e1a09003        mov     r9, r3
 274:   e1a0e000        mov     lr, r0
 278:   8a000009        bhi     2a4 <put_dec+0x44>
 27c:   0a000003        beq     290 <put_dec+0x30>
 280:   e1a01008        mov     r1, r8
 284:   e28dd00c        add     sp, sp, #12     ; 0xc
 288:   e8bd4ff0        ldmia   sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
 28c:   eaffff5b        b       0 <put_dec_trunc8>
 290:   e3e034fa        mvn     r3, #-100663296 ; 0xfa000000
 294:   e2433aa1        sub     r3, r3, #659456 ; 0xa1000
 298:   e2433c0f        sub     r3, r3, #3840   ; 0xf00
 29c:   e1520003        cmp     r2, r3
 2a0:   9afffff6        bls     280 <put_dec+0x20>
 2a4:   e1a07828        mov     r7, r8, lsr #16
 2a8:   e1a0c187        mov     ip, r7, lsl #3
 2ac:   e1a02287        mov     r2, r7, lsl #5
 2b0:   e08c2002        add     r2, ip, r2
 2b4:   e0822007        add     r2, r2, r7
 2b8:   e1a03182        mov     r3, r2, lsl #3
 2bc:   e1a0a829        mov     sl, r9, lsr #16
 2c0:   e0822003        add     r2, r2, r3
 2c4:   e1a04809        mov     r4, r9, lsl #16
 2c8:   e1a04824        mov     r4, r4, lsr #16
 2cc:   e1a00202        mov     r0, r2, lsl #4
 2d0:   e1a0118a        mov     r1, sl, lsl #3
 2d4:   e1a0328a        mov     r3, sl, lsl #5
 2d8:   e0815003        add     r5, r1, r3
 2dc:   e1a09184        mov     r9, r4, lsl #3
 2e0:   e0620000        rsb     r0, r2, r0
 2e4:   e1a01808        mov     r1, r8, lsl #16
 2e8:   e1a03304        mov     r3, r4, lsl #6
 2ec:   e0800007        add     r0, r0, r7
 2f0:   e1a01821        mov     r1, r1, lsr #16
 2f4:   e0693003        rsb     r3, r9, r3
 2f8:   e085200a        add     r2, r5, sl
 2fc:   e1a02202        mov     r2, r2, lsl #4
 300:   e0800001        add     r0, r0, r1
 304:   e0833004        add     r3, r3, r4
 308:   e0800002        add     r0, r0, r2
 30c:   e59fb164        ldr     fp, [pc, #356]  ; 478 <.text+0x478>
 310:   e1a03383        mov     r3, r3, lsl #7
 314:   e0800003        add     r0, r0, r3
 318:   e088209b        umull   r2, r8, fp, r0
 31c:   e1a086a8        mov     r8, r8, lsr #13
 320:   e1a01388        mov     r1, r8, lsl #7
 324:   e0411108        sub     r1, r1, r8, lsl #2
 328:   e0811008        add     r1, r1, r8
 32c:   e1a03101        mov     r3, r1, lsl #2
 330:   e0811003        add     r1, r1, r3
 334:   e0401201        sub     r1, r0, r1, lsl #4
 338:   e1a0000e        mov     r0, lr
 33c:   e58dc004        str     ip, [sp, #4]
 340:   ebffff94        bl      198 <put_dec_full4>
 344:   e1a03284        mov     r3, r4, lsl #5
 348:   e1a02104        mov     r2, r4, lsl #2
 34c:   e0822003        add     r2, r2, r3
 350:   e1a0630a        mov     r6, sl, lsl #6
 354:   e1a0350a        mov     r3, sl, lsl #10
 358:   e0663003        rsb     r3, r6, r3
 35c:   e1a01282        mov     r1, r2, lsl #5
 360:   e0822001        add     r2, r2, r1
 364:   e06a3003        rsb     r3, sl, r3
 368:   e0642002        rsb     r2, r4, r2
 36c:   e59dc004        ldr     ip, [sp, #4]
 370:   e1a03183        mov     r3, r3, lsl #3
 374:   e1a02182        mov     r2, r2, lsl #3
 378:   e06a3003        rsb     r3, sl, r3
 37c:   e04cc087        sub     ip, ip, r7, lsl #1
 380:   e0833002        add     r3, r3, r2
 384:   e083300c        add     r3, r3, ip
 388:   e0833008        add     r3, r3, r8
 38c:   e087239b        umull   r2, r7, fp, r3
 390:   e1a076a7        mov     r7, r7, lsr #13
 394:   e1a01387        mov     r1, r7, lsl #7
 398:   e0411107        sub     r1, r1, r7, lsl #2
 39c:   e0811007        add     r1, r1, r7
 3a0:   e1a02101        mov     r2, r1, lsl #2
 3a4:   e0811002        add     r1, r1, r2
 3a8:   e0431201        sub     r1, r3, r1, lsl #4
 3ac:   e046620a        sub     r6, r6, sl, lsl #4
 3b0:   ebffff78        bl      198 <put_dec_full4>
 3b4:   e1a03286        mov     r3, r6, lsl #5
 3b8:   e0866003        add     r6, r6, r3
 3bc:   e0499084        sub     r9, r9, r4, lsl #1
 3c0:   e06a6006        rsb     r6, sl, r6
 3c4:   e1a02189        mov     r2, r9, lsl #3
 3c8:   e1a03106        mov     r3, r6, lsl #2
 3cc:   e0663003        rsb     r3, r6, r3
 3d0:   e0692002        rsb     r2, r9, r2
 3d4:   e0822003        add     r2, r2, r3
 3d8:   e0822007        add     r2, r2, r7
 3dc:   e084329b        umull   r3, r4, fp, r2
 3e0:   e1a046a4        mov     r4, r4, lsr #13
 3e4:   e1a01384        mov     r1, r4, lsl #7
 3e8:   e0411104        sub     r1, r1, r4, lsl #2
 3ec:   e0811004        add     r1, r1, r4
 3f0:   e1a03101        mov     r3, r1, lsl #2
 3f4:   e0811003        add     r1, r1, r3
 3f8:   e0421201        sub     r1, r2, r1, lsl #4
 3fc:   ebffff65        bl      198 <put_dec_full4>
 400:   e1a03185        mov     r3, r5, lsl #3
 404:   e0653003        rsb     r3, r5, r3
 408:   e083300a        add     r3, r3, sl
 40c:   e0944003        adds    r4, r4, r3
 410:   e1a01000        mov     r1, r0
 414:   0a000010        beq     45c <put_dec+0x1fc>
 418:   e083249b        umull   r2, r3, fp, r4
 41c:   e1a066a3        mov     r6, r3, lsr #13
 420:   e1a01386        mov     r1, r6, lsl #7
 424:   e0411106        sub     r1, r1, r6, lsl #2
 428:   e0811006        add     r1, r1, r6
 42c:   e1a03101        mov     r3, r1, lsl #2
 430:   e0811003        add     r1, r1, r3
 434:   e0441201        sub     r1, r4, r1, lsl #4
 438:   ebffff56        bl      198 <put_dec_full4>
 43c:   e3560000        cmp     r6, #0  ; 0x0
 440:   e1a01000        mov     r1, r0
 444:   0a000004        beq     45c <put_dec+0x1fc>
 448:   e1a01006        mov     r1, r6
 44c:   ebffff51        bl      198 <put_dec_full4>
 450:   e1a01000        mov     r1, r0
 454:   ea000000        b       45c <put_dec+0x1fc>
 458:   e2411001        sub     r1, r1, #1      ; 0x1
 45c:   e5513001        ldrb    r3, [r1, #-1]
 460:   e3530030        cmp     r3, #48 ; 0x30
 464:   0afffffb        beq     458 <put_dec+0x1f8>
 468:   e1a00001        mov     r0, r1
 46c:   e28dd00c        add     sp, sp, #12     ; 0xc
 470:   e8bd4ff0        ldmia   sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
 474:   e12fff1e        bx      lr
 478:   d1b71759        movles  r1, r9, asr r7

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 10:13       ` Denys Vlasenko
@ 2012-03-28 10:24         ` roma1390
  2012-03-28 10:33           ` Denys Vlasenko
  2012-03-28 10:31         ` roma1390
  1 sibling, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28 10:24 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

[-- Attachment #1: Type: text/plain, Size: 189 bytes --]

On 2012.03.28 13:13, Denys Vlasenko wrote:
> Second: run
> arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
> and email me resulting test_{org,new}.s files.

test_{org,new}.s attached.


[-- Attachment #2: test_new.s --]
[-- Type: text/plain, Size: 13521 bytes --]

	.cpu arm9tdmi
	.fpu softvfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 2
	.eabi_attribute 18, 4
	.file	"test_new.c"
	.text
	.align	2
	.type	put_dec_trunc8, %function
put_dec_trunc8:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L5
	stmfd	sp!, {r4, r5, fp}
	umull	fp, ip, r1, r3
	mov	r2, ip
	add	r1, r1, #48
	mov	ip, r0
	add	r0, r2, r2, asl #2
	sub	r1, r1, r0, asl #1
	cmp	r2, #0
	mov	r0, ip
	strb	r1, [r0], #1
	beq	.L2
	umull	r4, r5, r2, r3
	add	r2, r2, #48
	add	fp, r5, r5, asl #2
	sub	r2, r2, fp, asl #1
	cmp	r5, #0
	mov	r1, r5
	strb	r2, [ip, #1]
	add	r0, r0, #1
	beq	.L2
	umull	r4, r5, r1, r3
	add	r1, r1, #48
	add	ip, r5, r5, asl #2
	sub	r1, r1, ip, asl #1
	cmp	r5, #0
	mov	r2, r5
	strb	r1, [r0], #1
	beq	.L2
	umull	r4, r5, r2, r3
	add	r2, r2, #48
	add	r1, r5, r5, asl #2
	sub	r2, r2, r1, asl #1
	cmp	r5, #0
	strb	r2, [r0], #1
	beq	.L2
	add	r2, r5, r5, asl #1
	add	r2, r5, r2, asl #2
	rsb	r2, r2, r2, asl #6
	add	r2, r5, r2, asl #2
	mov	r2, r2, asl #1
	mov	r2, r2, lsr #16
	add	r3, r5, #48
	add	r1, r2, r2, asl #2
	sub	r3, r3, r1, asl #1
	cmp	r2, #0
	strb	r3, [r0], #1
	beq	.L2
	add	r1, r2, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r3, r1, lsr #11
	add	r2, r2, #48
	add	r1, r3, r3, asl #2
	sub	r2, r2, r1, asl #1
	cmp	r3, #0
	strb	r2, [r0], #1
	beq	.L2
	add	r1, r3, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r2, r1, lsr #11
	add	r1, r2, r2, asl #2
	add	r3, r3, #48
	cmp	r2, #0
	sub	r3, r3, r1, asl #1
	strb	r3, [r0], #1
	addne	r2, r2, #48
	strneb	r2, [r0], #1
.L2:
	ldmfd	sp!, {r4, r5, fp}
	bx	lr
.L6:
	.align	2
.L5:
	.word	429496730
	.size	put_dec_trunc8, .-put_dec_trunc8
	.align	2
	.type	put_dec_full4, %function
put_dec_full4:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	add	r3, r1, r1, asl #1
	add	r3, r3, r3, asl #4
	add	r3, r3, r3, asl #8
	add	r3, r1, r3, asl #2
	mov	r3, r3, lsr #19
	add	r2, r3, r3, asl #1
	add	r2, r3, r2, asl #2
	rsb	r2, r2, r2, asl #6
	add	r2, r3, r2, asl #2
	mov	r2, r2, asl #1
	mov	r2, r2, lsr #16
	add	ip, r2, r2, asl #2
	stmfd	sp!, {r4, r5}
	add	r4, r2, ip, asl #3
	add	r5, r3, r3, asl #2
	add	r1, r1, #48
	add	r4, r4, r4, asl #2
	sub	r1, r1, r5, asl #1
	mov	r4, r4, lsr #11
	mov	r5, r0
	strb	r1, [r5], #1
	add	r3, r3, #48
	add	r1, r4, r4, asl #2
	add	r2, r2, #48
	sub	r2, r2, r1, asl #1
	sub	ip, r3, ip, asl #1
	add	r1, r5, #1
	add	r4, r4, #48
	strb	ip, [r0, #1]
	strb	r2, [r5, #1]
	add	r0, r1, #2
	strb	r4, [r1, #1]
	ldmfd	sp!, {r4, r5}
	bx	lr
	.size	put_dec_full4, .-put_dec_full4
	.global	__aeabi_uidivmod
	.global	__aeabi_uidiv
	.align	2
	.type	number, %function
number:
	@ Function supports interworking.
	@ args = 8, pretend = 0, frame = 120
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	sub	sp, sp, #124
	ldrb	ip, [sp, #161]	@ zero_extendqisi2
	mov	r4, r1
	ldrh	r5, [sp, #166]
	ldrb	r1, [sp, #162]	@ zero_extendqisi2
	str	r0, [sp, #8]
	ands	r0, ip, #64
	str	r1, [sp, #16]
	str	r5, [sp, #32]
	ldrh	r1, [sp, #164]
	str	ip, [sp, #12]
	beq	.L69
	ldr	r0, [sp, #16]
	subs	r0, r0, #10
	movne	r0, #1
.L69:
	ands	r5, ip, #16
	str	r0, [sp, #24]
	and	r0, ip, #32
	andne	ip, ip, #254
	strne	ip, [sp, #12]
	str	r5, [sp, #36]
	ldr	r5, [sp, #12]
	str	r0, [sp, #4]
	tst	r5, #2
	beq	.L14
	cmp	r3, #0
	blt	.L58
	tst	r5, #4
	bne	.L71
	ldr	r5, [sp, #12]
	tst	r5, #8
	beq	.L14
	sub	r1, r1, #1
	mov	r1, r1, asl #16
	mov	r1, r1, lsr #16
	mov	r0, #32
	str	r1, [sp, #20]
	str	r0, [sp, #28]
	b	.L17
.L14:
	str	r1, [sp, #20]
	mov	r1, #0
	str	r1, [sp, #28]
.L17:
	ldr	r5, [sp, #24]
	cmp	r5, #0
	beq	.L19
	ldr	r0, [sp, #20]
	ldr	r5, [sp, #16]
	sub	r1, r0, #1
	mov	r1, r1, asl #16
	mov	r1, r1, lsr #16
	cmp	r5, #16
	str	r1, [sp, #20]
	subeq	r1, r1, #1
	moveq	r1, r1, asl #16
	moveq	r1, r1, lsr #16
	streq	r1, [sp, #20]
.L19:
	cmp	r3, #0
	bne	.L20
	cmp	r2, #7
	bls	.L72
.L20:
	ldr	r0, [sp, #16]
	cmp	r0, #10
	beq	.L23
	cmp	r0, #16
	movne	r6, #3
	moveq	r6, #4
	add	r1, sp, #52
	ldr	r9, .L76
	sub	sl, r0, #1
	mov	r5, #0
	str	r1, [sp, #40]
	sub	r7, r6, #32
	rsb	fp, r6, #32
	str	r4, [sp, #44]
	mov	r8, r1
.L26:
	mov	ip, r2, lsr r6
	cmp	r7, #0
	orr	ip, ip, r3, asl fp
	movge	ip, r3, lsr r7
	mov	r4, r3, lsr r6
	and	r2, r2, #255
	mov	r0, ip
	and	r2, r2, sl
	ldrb	ip, [r9, r2]	@ zero_extendqisi2
	mov	r2, r0
	ldr	r0, [sp, #4]
	orrs	r1, r2, r4
	orr	ip, r0, ip
	strb	ip, [r8, r5]
	mov	r3, r4
	add	r5, r5, #1
	bne	.L26
	ldr	r4, [sp, #44]
	sub	ip, r5, #1
.L22:
	ldr	r2, [sp, #32]
	ldr	r3, [sp, #20]
	mov	r0, r2, asl #16
	cmp	r5, r0, asr #16
	movgt	r0, r5, asl #16
	mov	r0, r0, lsr #16
	ldr	r1, [sp, #12]
	rsb	r7, r0, r3
	mov	r7, r7, asl #16
	mov	r7, r7, lsr #16
	tst	r1, #17
	mov	r1, r7
	bne	.L34
	sub	r1, r7, #1
	mov	r1, r1, asl #16
	cmp	r1, #0
	mov	r1, r1, lsr #16
	blt	.L34
	ldr	r3, [sp, #8]
	mov	r8, r1
	add	r2, r3, #1
	add	r2, r2, r1
	mov	r6, #32
.L36:
	cmp	r4, r3
	strhib	r6, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L36
	rsb	r7, r7, #1
	ldr	r2, [sp, #8]
	add	r1, r1, r7
	mov	r1, r1, asl #16
	add	r8, r8, #1
	sub	r1, r1, #65536
	add	r2, r2, r8
	str	r2, [sp, #8]
	mov	r1, r1, lsr #16
.L34:
	ldr	r3, [sp, #28]
	cmp	r3, #0
	beq	.L37
	ldr	r2, [sp, #8]
	cmp	r2, r4
	strccb	r3, [r2, #0]
	ldr	r3, [sp, #8]
	add	r3, r3, #1
	str	r3, [sp, #8]
.L37:
	ldr	r2, [sp, #24]
	cmp	r2, #0
	beq	.L39
	ldr	r3, [sp, #8]
	cmp	r3, r4
	ldrcc	r2, [sp, #8]
	movcc	r3, #48
	strccb	r3, [r2, #0]
	ldr	r2, [sp, #8]
	ldr	r3, [sp, #16]
	add	r2, r2, #1
	cmp	r3, #16
	str	r2, [sp, #8]
	beq	.L73
.L39:
	ldr	r2, [sp, #36]
	cmp	r2, #0
	movne	r6, r1
	movne	r7, r6, asl #16
	bne	.L43
	sub	r6, r1, #1
	ldr	r3, [sp, #12]
	mov	r6, r6, asl #16
	tst	r3, #1
	mov	r6, r6, lsr #16
	movne	r8, #48
	moveq	r8, #32
	movs	r7, r6, asl #16
	bmi	.L43
	sub	r2, r1, #1
	ldr	r3, [sp, #8]
	mov	r2, r2, asl #16
	add	r2, r3, r2, lsr #16
	add	r2, r2, #1
.L47:
	cmp	r4, r3
	strhib	r8, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L47
	rsb	r6, r1, r6
	mov	r6, r6, asl #16
	mov	r6, r6, lsr #16
	str	r3, [sp, #8]
	mov	r7, r6, asl #16
.L43:
	sub	r3, r0, #1
	mov	r3, r3, asl #16
	cmp	r5, r3, asr #16
	bgt	.L48
	sub	r1, r0, #2
	ldr	r3, [sp, #8]
	mov	r1, r1, asl #16
	add	r1, r3, r1, asr #16
	add	r1, r1, #1
	mov	r0, #48
.L50:
	cmp	r4, r3
	strhib	r0, [r3, #0]
	add	r3, r3, #1
	rsb	r2, r3, r1
	cmp	r5, r2
	ble	.L50
	str	r3, [sp, #8]
.L48:
	cmp	ip, #0
	blt	.L51
	add	r2, sp, #52
	ldr	r3, [sp, #8]
	sub	r1, r2, #1
	add	r2, r2, ip
.L53:
	cmp	r4, r3
	ldrhib	r0, [r2, #0]	@ zero_extendqisi2
	sub	r2, r2, #1
	strhib	r0, [r3, #0]
	cmp	r2, r1
	add	r3, r3, #1
	bne	.L53
	ldr	r5, [sp, #8]
	add	ip, ip, #1
	add	r5, r5, ip
	str	r5, [sp, #8]
.L51:
	cmp	r7, #0
	ble	.L54
	ldr	r0, [sp, #8]
	sub	r2, r6, #1
	mov	r2, r2, asl #16
	add	r2, r0, r2, lsr #16
	add	r2, r2, #1
	mov	r3, r0
	mov	r1, #32
.L56:
	cmp	r4, r3
	strhib	r1, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L56
	str	r3, [sp, #8]
.L54:
	ldr	r0, [sp, #8]
	add	sp, sp, #124
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L71:
	sub	r1, r1, #1
	mov	r1, r1, asl #16
	mov	r1, r1, lsr #16
	str	r1, [sp, #20]
	mov	r1, #43
	str	r1, [sp, #28]
	b	.L17
.L58:
	sub	r1, r1, #1
	mov	r1, r1, asl #16
	mov	r1, r1, lsr #16
	mov	r0, #45
	rsbs	r2, r2, #0
	rsc	r3, r3, #0
	str	r1, [sp, #20]
	str	r0, [sp, #28]
	b	.L17
.L72:
	add	r1, r2, #48
	strb	r1, [sp, #52]
	mov	ip, r3
	mov	r5, #1
	b	.L22
.L23:
	cmp	r3, #0
	beq	.L74
.L27:
	mov	r1, r2, asl #16
	ldr	r0, .L76+4
	mov	r8, r2, lsr #16
	mov	r1, r1, lsr #16
	mla	ip, r0, r8, r1
	mov	r7, r3, lsr #16
	mov	r6, r3
	mov	r3, #656
	mla	r2, r3, r7, ip
	mov	r6, r6, asl #16
	mov	r6, r6, lsr #16
	mov	r3, #7296
	mla	r5, r3, r6, r2
	add	r0, sp, #52
	str	r0, [sp, #40]
	ldr	r1, .L76+8
	mov	r0, r5
	bl	__aeabi_uidivmod
	ldr	r0, [sp, #40]
	bl	put_dec_full4
	ldr	r3, .L76+12
	mov	sl, r0
	mul	r2, r3, r6
	sub	r3, r3, #1824
	sub	r3, r3, #1
	mla	ip, r3, r7, r2
	mov	r0, r5
	mov	r3, #6
	ldr	r1, .L76+8
	mla	r5, r3, r8, ip
	bl	__aeabi_uidiv
	add	r5, r5, r0
	mov	r0, r5
	ldr	r1, .L76+8
	bl	__aeabi_uidivmod
	mov	r0, sl
	bl	put_dec_full4
	ldr	r3, .L76+16
	mov	r8, r0
	mul	r2, r3, r7
	mov	r0, r5
	mov	r3, #42
	ldr	r1, .L76+8
	mla	r5, r3, r6, r2
	bl	__aeabi_uidiv
	add	r5, r5, r0
	mov	r0, r5
	ldr	r1, .L76+8
	bl	__aeabi_uidivmod
	mov	r0, r8
	bl	put_dec_full4
	ldr	r3, .L76+20
	mov	r6, r0
	ldr	r1, .L76+8
	mov	r0, r5
	mul	r5, r3, r7
	bl	__aeabi_uidiv
	adds	r5, r0, r5
	bne	.L75
.L30:
	mov	r3, r6
.L31:
	mov	r0, r3
	ldrb	r2, [r3, #-1]!	@ zero_extendqisi2
	cmp	r2, #48
	beq	.L31
.L29:
	ldr	r1, [sp, #40]
	rsb	r5, r1, r0
	sub	ip, r5, #1
	b	.L22
.L73:
	cmp	r4, r2
	ldrhi	r2, [sp, #4]
	orrhi	r3, r2, #88
	ldrhi	r2, [sp, #8]
	strhib	r3, [r2, #0]
	ldr	r3, [sp, #8]
	add	r3, r3, #1
	str	r3, [sp, #8]
	b	.L39
.L74:
	ldr	r1, .L76+24
	cmp	r2, r1
	bhi	.L27
	add	r5, sp, #52
	mov	r1, r2
	mov	r0, r5
	str	r5, [sp, #40]
	bl	put_dec_trunc8
	b	.L29
.L75:
	mov	r0, r5
	ldr	r1, .L76+8
	bl	__aeabi_uidivmod
	mov	r0, r6
	bl	put_dec_full4
	ldr	r1, .L76+8
	mov	r6, r0
	mov	r0, r5
	bl	__aeabi_uidiv
	subs	r1, r0, #0
	beq	.L30
	mov	r0, r6
	bl	put_dec_full4
	mov	r6, r0
	b	.L30
.L77:
	.align	2
.L76:
	.word	.LANCHOR0
	.word	5536
	.word	10000
	.word	9496
	.word	4749
	.word	281
	.word	99999999
	.size	number, .-number
	.align	2
	.type	measure_number, %function
measure_number:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 72
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	mov	fp, r3
	sub	sp, sp, #84
	ldr	r3, [r0, #0]
	str	r0, [sp, #12]
	mov	r0, #0
	str	r3, [sp, #8]
	mov	sl, r2
	bl	time
	ldr	r3, [sp, #8]
	mov	r9, #0
	add	r6, sp, #16
	cmp	r3, r0
	mov	r7, r9
	ldr	r5, .L85
	add	r8, r6, #63
	bne	.L84
.L81:
	ldr	r4, .L85+4
.L80:
	mov	ip, sp
	ldmia	r5, {r0, r1}
	mov	r2, sl
	stmia	ip, {r0, r1}
	mov	r3, fp
	mov	r0, r6
	mov	r1, r8
	bl	number
	sub	r4, r4, #1
	cmn	r4, #1
	strb	r7, [r0, #0]
	bne	.L80
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #8]
	add	r9, r9, #4000
	cmp	r3, r0
	beq	.L81
.L84:
	ldr	ip, [sp, #12]
	str	r0, [ip, #0]
	mov	r0, r9
	add	sp, sp, #84
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L86:
	.align	2
.L85:
	.word	.LANCHOR0+16
	.word	3999
	.size	measure_number, .-measure_number
	.align	2
	.type	measure, %function
measure:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	mov	r0, #0
	sub	sp, sp, #24
	bl	time
	str	r0, [sp, #20]
.L88:
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #20]
	cmp	r0, r3
	beq	.L88
	add	r8, sp, #24
	str	r0, [r8, #-4]!
	mov	r2, #8
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r2, #123
	mov	sl, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L91
	mov	r7, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L91+4
	mov	r6, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L91+8
	mov	r5, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r4, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r9, r0
	mvn	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r1, sl
	str	r0, [sp, #12]
	mov	r2, r7
	mov	r3, r6
	ldr	r0, .L91+12
	str	r5, [sp, #0]
	stmib	sp, {r4, r9}	@ phole stm
	bl	printf
	add	sp, sp, #24
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	bx	lr
.L92:
	.align	2
.L91:
	.word	123456
	.word	12345678
	.word	123456789
	.word	.LC0
	.size	measure, .-measure
	.align	2
	.type	check, %function
check:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 128
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, lr}
	ldr	r3, .L97
	sub	sp, sp, #140
	mov	r5, r0
	mov	r6, r1
	add	r4, sp, #72
	ldmia	r3, {r0, r1}
	mov	r3, sp
	stmia	r3, {r0, r1}
	mov	r2, r5
	mov	r3, r6
	add	r1, r4, #63
	mov	r0, r4
	bl	number
	add	r7, sp, #8
	mov	r3, #0
	strb	r3, [r0, #0]
	mov	r2, r5
	mov	r3, r6
	ldr	r1, .L97+4
	mov	r0, r7
	bl	sprintf
	mov	r0, r4
	mov	r1, r7
	bl	strcmp
	cmp	r0, #0
	bne	.L96
	add	sp, sp, #140
	ldmfd	sp!, {r4, r5, r6, r7, lr}
	bx	lr
.L96:
	mov	r2, r5
	mov	r3, r6
	ldr	r0, .L97+8
	str	r4, [sp, #0]
	bl	printf
	mov	r0, #1
	bl	exit
.L98:
	.align	2
.L97:
	.word	.LANCHOR0+16
	.word	.LC1
	.word	.LC2
	.size	check, .-check
	.align	2
	.global	main
	.type	main, %function
main:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bl	measure
	mov	r4, #0
	bl	measure
	mov	r5, #0
	bl	measure
	ldr	r6, .L103
	bl	measure
	mov	r7, #0
	mov	sl, #1
	mov	fp, #0
	b	.L101
.L100:
	adds	r4, r4, sl
	adc	r5, r5, fp
.L101:
	mov	r0, r4
	mov	r1, r5
	bl	check
	and	r8, r4, r6
	rsbs	r0, r4, #0
	rsc	r1, r5, #0
	and	r9, r5, r7
	bl	check
	orrs	r8, r8, r9
	bne	.L100
	mov	r2, r4
	mov	r3, r5
	ldr	r0, .L103+4
	bl	printf
	mov	r0, r8
	bl	fflush
	b	.L100
.L104:
	.align	2
.L103:
	.word	262143
	.word	.LC3
	.size	main, .-main
	.section	.rodata
	.align	2
.LANCHOR0 = . + 0
	.type	digits.3938, %object
	.size	digits.3938, 16
digits.3938:
	.ascii	"0123456789ABCDEF"
	.type	dummy_spec, %object
	.size	dummy_spec, 8
dummy_spec:
	.byte	8
	.byte	0
	.byte	10
	.byte	0
	.short	0
	.short	0
	.section	.rodata.str1.4,"aMS",%progbits,1
	.align	2
.LC0:
	.ascii	"Conversions per second: 8:%d 123:%d 123456:%d 12345"
	.ascii	"678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
	.ascii	"%llu\000"
	.space	3
.LC2:
	.ascii	"Error in formatting %llu:'%s'\012\000"
	.space	1
.LC3:
	.ascii	"\015Tested %llu                               \000"
	.ident	"GCC: (Debian 4.4.5-8) 4.4.5"
	.section	.note.GNU-stack,"",%progbits

[-- Attachment #3: test_org.s --]
[-- Type: text/plain, Size: 13375 bytes --]

	.cpu arm9tdmi
	.fpu softvfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 2
	.eabi_attribute 18, 4
	.file	"test_org.c"
	.text
	.align	2
	.type	put_dec_trunc, %function
put_dec_trunc:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	stmfd	sp!, {r4, r5, r6}
	mov	ip, r1, lsr #8
	mov	r4, r1, lsr #4
	and	r4, r4, #15
	and	ip, ip, #15
	mov	r3, r1, lsr #12
	add	r2, r4, ip
	add	r2, r2, r3
	add	r2, r2, r2, asl #1
	and	r1, r1, #15
	add	r1, r1, r2, asl #1
	add	r2, r1, r1, asl #2
	add	r2, r1, r2, asl #3
	add	r2, r2, r2, asl #2
	mov	r2, r2, lsr #11
	add	r5, r3, r3, asl #3
	add	r4, r5, r4
	add	r6, r2, r2, asl #2
	add	r5, ip, ip, asl #2
	add	r4, r4, r5
	sub	r1, r1, r6, asl #1
	add	r1, r1, #48
	adds	r2, r4, r2
	mov	r5, r0
	strb	r1, [r0], #1
	beq	.L2
	add	r1, r2, r2, asl #2
	add	r1, r2, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r1, r1, lsr #11
	add	r4, r1, r1, asl #2
	sub	r2, r2, r4, asl #1
	add	ip, r1, ip, asl #1
	add	r2, r2, #48
	orrs	r1, ip, r3
	strb	r2, [r5, #1]
	add	r0, r0, #1
	beq	.L2
	add	r2, ip, ip, asl #1
	add	r2, ip, r2, asl #2
	mov	r2, r2, lsr #7
	add	r1, r2, r2, asl #2
	sub	ip, ip, r1, asl #1
	add	ip, ip, #48
	adds	r3, r2, r3, asl #2
	strb	ip, [r0], #1
	beq	.L2
	add	r1, r3, r3, asl #2
	add	r1, r3, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r1, r1, lsr #11
	add	r2, r1, r1, asl #2
	sub	r3, r3, r2, asl #1
	cmp	r1, #0
	add	r3, r3, #48
	strb	r3, [r0], #1
	addne	r1, r1, #48
	strneb	r1, [r0], #1
.L2:
	ldmfd	sp!, {r4, r5, r6}
	bx	lr
	.size	put_dec_trunc, .-put_dec_trunc
	.align	2
	.type	put_dec_full, %function
put_dec_full:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	mov	ip, r1, lsr #4
	mov	r2, r1, lsr #8
	and	r2, r2, #15
	and	ip, ip, #15
	stmfd	sp!, {r4, r5, r6, r7, r8}
	mov	r3, r1, lsr #12
	add	r4, ip, r2
	add	r4, r4, r3
	add	r4, r4, r4, asl #1
	and	r1, r1, #15
	add	r1, r1, r4, asl #1
	add	r6, r1, r1, asl #2
	add	r4, r3, r3, asl #3
	add	r6, r1, r6, asl #3
	add	ip, r4, ip
	add	r6, r6, r6, asl #2
	add	r4, r2, r2, asl #2
	add	ip, ip, r4
	mov	r6, r6, lsr #11
	add	ip, ip, r6
	add	r5, ip, ip, asl #2
	add	r5, ip, r5, asl #3
	add	r5, r5, r5, asl #2
	mov	r5, r5, lsr #11
	add	r2, r5, r2, asl #1
	add	r4, r2, r2, asl #1
	add	r4, r2, r4, asl #2
	mov	r4, r4, lsr #7
	add	r3, r4, r3, asl #2
	add	r7, r3, r3, asl #2
	add	r7, r3, r7, asl #3
	add	r6, r6, r6, asl #2
	sub	r1, r1, r6, asl #1
	add	r7, r7, r7, asl #2
	mov	r6, r0
	mov	r7, r7, lsr #11
	add	r1, r1, #48
	strb	r1, [r6], #1
	add	r5, r5, r5, asl #2
	add	r1, r7, r7, asl #2
	add	r4, r4, r4, asl #2
	add	r8, r6, #1
	sub	ip, ip, r5, asl #1
	sub	r2, r2, r4, asl #1
	sub	r3, r3, r1, asl #1
	add	ip, ip, #48
	add	r1, r8, #1
	add	r2, r2, #48
	add	r3, r3, #48
	add	r7, r7, #48
	strb	ip, [r0, #1]
	strb	r2, [r6, #1]
	add	r0, r1, #2
	strb	r3, [r8, #1]
	strb	r7, [r1, #1]
	ldmfd	sp!, {r4, r5, r6, r7, r8}
	bx	lr
	.size	put_dec_full, .-put_dec_full
	.align	2
	.type	put_dec, %function
put_dec:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	cmp	r3, #0
	stmfd	sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
	beq	.L18
.L15:
	ldr	sl, .L19
	ldr	r5, .L19+4
	ldr	r6, .L19+8
	mov	r8, #0
.L17:
#APP
@ 101 "test_org.c" 1
	umull	fp, ip, r5, r2
	cmn	fp, r5
	adcs	ip, ip, r6
	adc	fp, r8, #0
@ 0 "" 2
	mov	r1, r8
#APP
@ 101 "test_org.c" 1
	umlal	ip, fp, r6, r2
	umlal	ip, r1, r5, r3
	mov	ip, #0
	adds	fp, r1, fp
	adc	ip, ip, #0
	umlal	fp, ip, r6, r3
@ 0 "" 2
	mov	r4, fp, lsr #16
	orr	r4, r4, ip, asl #16
	add	r1, r4, r4, asl #1
	add	r1, r1, r1, asl #6
	add	r1, r4, r1, asl #2
	add	r1, r4, r1, asl #2
	mov	r7, ip, lsr #16
	sub	r1, r2, r1, asl #5
	bl	put_dec_full
	cmp	r7, #0
	mov	r2, r4
	mov	r3, r7
	bne	.L17
	cmp	r4, sl
	bhi	.L17
.L10:
	mov	r1, r4
	ldmfd	sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
	b	put_dec_trunc
.L18:
	ldr	r1, .L19
	cmp	r2, r1
	bhi	.L15
	mov	r4, r2
	b	.L10
.L20:
	.align	2
.L19:
	.word	99999
	.word	457671715
	.word	-1480217529
	.size	put_dec, .-put_dec
	.align	2
	.type	number, %function
number:
	@ Function supports interworking.
	@ args = 8, pretend = 0, frame = 120
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	sub	sp, sp, #124
	ldrb	ip, [sp, #161]	@ zero_extendqisi2
	ldrb	r4, [sp, #162]	@ zero_extendqisi2
	ldrh	r5, [sp, #166]
	str	r0, [sp, #12]
	ands	r0, ip, #64
	str	r4, [sp, #20]
	str	r5, [sp, #36]
	ldrh	r4, [sp, #164]
	str	ip, [sp, #16]
	beq	.L72
	ldr	r0, [sp, #20]
	subs	r0, r0, #10
	movne	r0, #1
.L72:
	ands	r5, ip, #16
	str	r0, [sp, #28]
	and	r0, ip, #32
	andne	ip, ip, #254
	strne	ip, [sp, #16]
	str	r5, [sp, #40]
	ldr	r5, [sp, #16]
	str	r0, [sp, #8]
	tst	r5, #2
	beq	.L26
	cmp	r3, #0
	blt	.L64
	tst	r5, #4
	bne	.L74
	ldr	r5, [sp, #16]
	tst	r5, #8
	beq	.L26
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	mov	ip, #32
	str	r0, [sp, #24]
	str	ip, [sp, #32]
	b	.L29
.L26:
	mov	r0, #0
	str	r4, [sp, #24]
	str	r0, [sp, #32]
.L29:
	ldr	r4, [sp, #28]
	cmp	r4, #0
	beq	.L31
	ldr	r5, [sp, #24]
	ldr	ip, [sp, #20]
	sub	r0, r5, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	cmp	ip, #16
	str	r0, [sp, #24]
	subeq	r0, r0, #1
	moveq	r0, r0, asl #16
	moveq	r0, r0, lsr #16
	streq	r0, [sp, #24]
.L31:
	orrs	ip, r2, r3
	moveq	r3, #48
	streqb	r3, [sp, #52]
	moveq	r4, #1
	beq	.L33
	ldr	r0, [sp, #20]
	cmp	r0, #10
	beq	.L34
	cmp	r0, #16
	movne	r5, #3
	moveq	r5, #4
	sub	r8, r0, #1
	ldr	sl, .L76
	rsb	r0, r5, #32
	mov	r4, #0
	add	r9, sp, #52
	sub	r6, r5, #32
	str	r1, [sp, #44]
	mov	fp, r0
.L37:
	mov	ip, r2, lsr r5
	cmp	r6, #0
	orr	ip, ip, r3, asl fp
	movge	ip, r3, lsr r6
	mov	r7, r3, lsr r5
	and	r2, r2, #255
	and	r2, r2, r8
	mov	r0, ip
	ldr	r1, [sp, #8]
	ldrb	ip, [sl, r2]	@ zero_extendqisi2
	mov	r2, r0
	orr	ip, r1, ip
	orrs	r0, r2, r7
	strb	ip, [r9, r4]
	mov	r3, r7
	add	r4, r4, #1
	bne	.L37
	ldr	r1, [sp, #44]
	sub	ip, r4, #1
.L33:
	ldr	r2, [sp, #36]
	ldr	r3, [sp, #24]
	mov	r5, r2, asl #16
	cmp	r4, r5, asr #16
	movgt	r5, r4, asl #16
	mov	r5, r5, lsr #16
	ldr	r0, [sp, #16]
	rsb	r7, r5, r3
	mov	r7, r7, asl #16
	mov	r7, r7, lsr #16
	tst	r0, #17
	mov	r0, r7
	bne	.L40
	sub	r0, r7, #1
	mov	r0, r0, asl #16
	cmp	r0, #0
	mov	r0, r0, lsr #16
	blt	.L40
	ldr	r3, [sp, #12]
	mov	r8, r0
	add	r2, r3, #1
	add	r2, r2, r0
	mov	r6, #32
.L42:
	cmp	r1, r3
	strhib	r6, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L42
	rsb	r7, r7, #1
	ldr	r2, [sp, #12]
	add	r0, r0, r7
	mov	r0, r0, asl #16
	add	r3, r8, #1
	sub	r0, r0, #65536
	add	r2, r2, r3
	str	r2, [sp, #12]
	mov	r0, r0, lsr #16
.L40:
	ldr	r3, [sp, #32]
	cmp	r3, #0
	beq	.L43
	ldr	r2, [sp, #12]
	cmp	r2, r1
	strccb	r3, [r2, #0]
	ldr	r3, [sp, #12]
	add	r3, r3, #1
	str	r3, [sp, #12]
.L43:
	ldr	r2, [sp, #28]
	cmp	r2, #0
	beq	.L45
	ldr	r3, [sp, #12]
	cmp	r3, r1
	ldrcc	r2, [sp, #12]
	movcc	r3, #48
	strccb	r3, [r2, #0]
	ldr	r2, [sp, #12]
	ldr	r3, [sp, #20]
	add	r2, r2, #1
	cmp	r3, #16
	str	r2, [sp, #12]
	beq	.L75
.L45:
	ldr	r2, [sp, #40]
	cmp	r2, #0
	movne	r6, r0
	movne	r7, r6, asl #16
	bne	.L49
	sub	r6, r0, #1
	ldr	r3, [sp, #16]
	mov	r6, r6, asl #16
	tst	r3, #1
	mov	r6, r6, lsr #16
	movne	r8, #48
	moveq	r8, #32
	movs	r7, r6, asl #16
	bmi	.L49
	sub	r2, r0, #1
	ldr	r3, [sp, #12]
	mov	r2, r2, asl #16
	add	r2, r3, r2, lsr #16
	add	r2, r2, #1
.L53:
	cmp	r1, r3
	strhib	r8, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L53
	rsb	r6, r0, r6
	mov	r6, r6, asl #16
	mov	r6, r6, lsr #16
	str	r3, [sp, #12]
	mov	r7, r6, asl #16
.L49:
	sub	r3, r5, #1
	mov	r3, r3, asl #16
	cmp	r4, r3, asr #16
	bgt	.L54
	sub	r0, r5, #2
	ldr	r3, [sp, #12]
	mov	r0, r0, asl #16
	add	r0, r3, r0, asr #16
	add	r0, r0, #1
	mov	r5, #48
.L56:
	cmp	r1, r3
	strhib	r5, [r3, #0]
	add	r3, r3, #1
	rsb	r2, r3, r0
	cmp	r4, r2
	ble	.L56
	str	r3, [sp, #12]
.L54:
	cmp	ip, #0
	blt	.L57
	add	r2, sp, #52
	ldr	r3, [sp, #12]
	sub	r0, r2, #1
	add	r2, r2, ip
.L59:
	cmp	r1, r3
	ldrhib	r4, [r2, #0]	@ zero_extendqisi2
	sub	r2, r2, #1
	strhib	r4, [r3, #0]
	cmp	r2, r0
	add	r3, r3, #1
	bne	.L59
	ldr	r4, [sp, #12]
	add	ip, ip, #1
	add	r4, r4, ip
	str	r4, [sp, #12]
.L57:
	cmp	r7, #0
	ble	.L60
	ldr	r5, [sp, #12]
	sub	r2, r6, #1
	mov	r2, r2, asl #16
	add	r2, r5, r2, lsr #16
	add	r2, r2, #1
	mov	r3, r5
	mov	r0, #32
.L62:
	cmp	r1, r3
	strhib	r0, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L62
	str	r3, [sp, #12]
.L60:
	ldr	r0, [sp, #12]
	add	sp, sp, #124
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L74:
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	str	r0, [sp, #24]
	mov	r0, #43
	str	r0, [sp, #32]
	b	.L29
.L75:
	cmp	r1, r2
	ldrhi	r2, [sp, #8]
	orrhi	r3, r2, #88
	ldrhi	r2, [sp, #12]
	strhib	r3, [r2, #0]
	ldr	r3, [sp, #12]
	add	r3, r3, #1
	str	r3, [sp, #12]
	b	.L45
.L64:
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	mov	ip, #45
	rsbs	r2, r2, #0
	rsc	r3, r3, #0
	str	r0, [sp, #24]
	str	ip, [sp, #32]
	b	.L29
.L34:
	add	r4, sp, #52
	mov	r0, r4
	str	r1, [sp, #4]
	bl	put_dec
	rsb	r4, r4, r0
	sub	ip, r4, #1
	ldr	r1, [sp, #4]
	b	.L33
.L77:
	.align	2
.L76:
	.word	.LANCHOR0
	.size	number, .-number
	.align	2
	.type	measure_number, %function
measure_number:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 72
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	mov	fp, r3
	sub	sp, sp, #84
	ldr	r3, [r0, #0]
	str	r0, [sp, #12]
	mov	r0, #0
	str	r3, [sp, #8]
	mov	sl, r2
	bl	time
	ldr	r3, [sp, #8]
	mov	r9, #0
	add	r6, sp, #16
	cmp	r3, r0
	mov	r7, r9
	ldr	r5, .L85
	add	r8, r6, #63
	bne	.L84
.L81:
	ldr	r4, .L85+4
.L80:
	mov	ip, sp
	ldmia	r5, {r0, r1}
	mov	r2, sl
	stmia	ip, {r0, r1}
	mov	r3, fp
	mov	r0, r6
	mov	r1, r8
	bl	number
	sub	r4, r4, #1
	cmn	r4, #1
	strb	r7, [r0, #0]
	bne	.L80
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #8]
	add	r9, r9, #4000
	cmp	r3, r0
	beq	.L81
.L84:
	ldr	ip, [sp, #12]
	str	r0, [ip, #0]
	mov	r0, r9
	add	sp, sp, #84
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L86:
	.align	2
.L85:
	.word	.LANCHOR0+16
	.word	3999
	.size	measure_number, .-measure_number
	.align	2
	.type	measure, %function
measure:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	mov	r0, #0
	sub	sp, sp, #24
	bl	time
	str	r0, [sp, #20]
.L88:
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #20]
	cmp	r0, r3
	beq	.L88
	add	r8, sp, #24
	str	r0, [r8, #-4]!
	mov	r2, #8
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r2, #123
	mov	sl, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L91
	mov	r7, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L91+4
	mov	r6, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L91+8
	mov	r5, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r4, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r9, r0
	mvn	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r1, sl
	str	r0, [sp, #12]
	mov	r2, r7
	mov	r3, r6
	ldr	r0, .L91+12
	str	r5, [sp, #0]
	stmib	sp, {r4, r9}	@ phole stm
	bl	printf
	add	sp, sp, #24
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	bx	lr
.L92:
	.align	2
.L91:
	.word	123456
	.word	12345678
	.word	123456789
	.word	.LC0
	.size	measure, .-measure
	.align	2
	.type	check, %function
check:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 128
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, lr}
	ldr	r3, .L97
	sub	sp, sp, #140
	mov	r5, r0
	mov	r6, r1
	add	r4, sp, #72
	ldmia	r3, {r0, r1}
	mov	r3, sp
	stmia	r3, {r0, r1}
	mov	r2, r5
	mov	r3, r6
	add	r1, r4, #63
	mov	r0, r4
	bl	number
	add	r7, sp, #8
	mov	r3, #0
	strb	r3, [r0, #0]
	mov	r2, r5
	mov	r3, r6
	ldr	r1, .L97+4
	mov	r0, r7
	bl	sprintf
	mov	r0, r4
	mov	r1, r7
	bl	strcmp
	cmp	r0, #0
	bne	.L96
	add	sp, sp, #140
	ldmfd	sp!, {r4, r5, r6, r7, lr}
	bx	lr
.L96:
	mov	r2, r5
	mov	r3, r6
	ldr	r0, .L97+8
	str	r4, [sp, #0]
	bl	printf
	mov	r0, #1
	bl	exit
.L98:
	.align	2
.L97:
	.word	.LANCHOR0+16
	.word	.LC1
	.word	.LC2
	.size	check, .-check
	.align	2
	.global	main
	.type	main, %function
main:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bl	measure
	mov	r4, #0
	bl	measure
	mov	r5, #0
	bl	measure
	ldr	r6, .L103
	bl	measure
	mov	r7, #0
	mov	sl, #1
	mov	fp, #0
	b	.L101
.L100:
	adds	r4, r4, sl
	adc	r5, r5, fp
.L101:
	mov	r0, r4
	mov	r1, r5
	bl	check
	and	r8, r4, r6
	rsbs	r0, r4, #0
	rsc	r1, r5, #0
	and	r9, r5, r7
	bl	check
	orrs	r8, r8, r9
	bne	.L100
	mov	r2, r4
	mov	r3, r5
	ldr	r0, .L103+4
	bl	printf
	mov	r0, r8
	bl	fflush
	b	.L100
.L104:
	.align	2
.L103:
	.word	262143
	.word	.LC3
	.size	main, .-main
	.section	.rodata
	.align	2
.LANCHOR0 = . + 0
	.type	digits.4070, %object
	.size	digits.4070, 16
digits.4070:
	.ascii	"0123456789ABCDEF"
	.type	dummy_spec, %object
	.size	dummy_spec, 8
dummy_spec:
	.byte	8
	.byte	0
	.byte	10
	.byte	0
	.short	0
	.short	0
	.section	.rodata.str1.4,"aMS",%progbits,1
	.align	2
.LC0:
	.ascii	"Conversions per second: 8:%d 123:%d 123456:%d 12345"
	.ascii	"678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
	.ascii	"%llu\000"
	.space	3
.LC2:
	.ascii	"Error in formatting %llu:'%s'\012\000"
	.space	1
.LC3:
	.ascii	"\015Tested %llu                               \000"
	.ident	"GCC: (Debian 4.4.5-8) 4.4.5"
	.section	.note.GNU-stack,"",%progbits

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 10:13       ` Denys Vlasenko
  2012-03-28 10:24         ` roma1390
@ 2012-03-28 10:31         ` roma1390
  2012-03-28 11:23           ` Denys Vlasenko
  1 sibling, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28 10:31 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On 2012.03.28 13:13, Denys Vlasenko wrote:
> Third: switch to algorithm #1 and test whether it fares better.
> To do that, go to test_new.c
> and replace
>    #if LONG_MAX>  ((1UL<<31)-1) || LLONG_MAX>  ((1ULL<<63)-1)
> with
>    #if 1   ////LONG_MAX>  ((1UL<<31)-1) || LLONG_MAX>  ((1ULL<<63)-1)
> (there are two instances of this line there),
> then recompile and rerun the test, and post the result.

After applying the following diff:


--- test_new.org.c      2012-03-28 13:25:14.000000000 +0300
+++ test_new.c  2012-03-28 13:25:30.000000000 +0300
@@ -7,7 +7,7 @@
   * (with permission from the author, Douglas W. Jones).
   */

-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
  /* Formats correctly any integer in [0, 999999999] */
  static noinline_for_stack
  char *put_dec_full9(char *buf, unsigned q)
@@ -106,7 +106,7 @@
   * Else (if long is 32 bits and long long is 64 bits) we use second one.
   */

-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ///LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)

  /* First algorithm: generic */


I get the following compile output:
$ arm-linux-gnueabi-gcc -O2 -Wall test_new.c -o test_new
test_new.c: In function ‘number’:
test_new.c:118: error: impossible constraint in ‘asm’
test_new.c:118: error: impossible constraint in ‘asm’
test_new.c:118: error: impossible constraint in ‘asm’

but after this diff:
--- test_new-with-ifdefs.c      2012-03-28 13:29:04.000000000 +0300
+++ test_new.c  2012-03-28 13:29:12.000000000 +0300
@@ -1,4 +1,4 @@
-#include "test_header.c"
+#include "test_header_arm.c"

  /* Decimal conversion is by far the most typical, and is used
   * for /proc and /sys data. This directly impacts e.g. top performance

it compiles fine.

the results are:
./test_new2
Conversions per second: 8:5420000 123:4452000 123456:3556000 12345678:3368000 123456789:2904000 2^32:2412000 2^64:1556000
Conversions per second: 8:5428000 123:4500000 123456:3644000 12345678:3328000 123456789:2832000 2^32:2408000 2^64:1572000
Conversions per second: 8:5372000 123:4396000 123456:3644000 12345678:3368000 123456789:2900000 2^32:2392000 2^64:1532000
Conversions per second: 8:5424000 123:4500000 123456:3608000 12345678:3284000 123456789:2896000 2^32:2416000 2^64:1572000

btw: the previous version (test_new) is still running, now at:
Tested 2446852096

roma1390


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 10:24         ` roma1390
@ 2012-03-28 10:33           ` Denys Vlasenko
  2012-03-28 10:39             ` roma1390
  2012-03-29 10:35             ` Denys Vlasenko
  0 siblings, 2 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 10:33 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Wednesday 28 March 2012 12:24, roma1390 wrote:
> On 2012.03.28 13:13, Denys Vlasenko wrote:
> > Second: run
> > arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
> > and email me resulting test_{org,new}.s files.
> 
> test_{org,new}.s attached.


Bingo. 

        bl      __aeabi_uidivmod

Not good. Your gcc did not optimize the division by a constant.
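
For a 32-bit divide by 10000 the strength-reduced form a compiler should emit
is a multiply by a precomputed reciprocal, roughly like this sketch (not
kernel code; the constant is the usual ceil(2^45/10000) and the function name
is mine):

static unsigned int div10000(unsigned int n)
{
	/* 0xd1b71759 == ceil(2^45 / 10000); the 64-bit product shifted
	   right by 45 gives the exact quotient for any 32-bit n */
	return (unsigned int)((n * 0xd1b71759ULL) >> 45);
}

That 0xd1b71759 literal is exactly what you can see at the end of the put_dec
disassembly I posted (loaded into fp, then umull plus a 13-bit shift of the
high half).  Falling back to __aeabi_uidivmod instead means a slow library
call each time.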

Can you add  "noinline_for_stack":

static noinline_for_stack <=== HERE
char *put_dec(char *buf, unsigned long long n) 

and regenerate and resend the test_new.s?

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 10:33           ` Denys Vlasenko
@ 2012-03-28 10:39             ` roma1390
  2012-03-28 11:20               ` Denys Vlasenko
  2012-03-29 10:35             ` Denys Vlasenko
  1 sibling, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28 10:39 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

[-- Attachment #1: Type: text/plain, Size: 637 bytes --]

On 2012.03.28 13:33, Denys Vlasenko wrote:
> On Wednesday 28 March 2012 12:24, roma1390 wrote:
>> On 2012.03.28 13:13, Denys Vlasenko wrote:
>>> Second: run
>>> arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
>>> and email me resulting test_{org,new}.s files.
>>
>> test_{org,new}.s attached.
>
>
> Bingo.
>
>          bl      __aeabi_uidivmod
>
> Not good. Your gcc did not optimize division by constant.
>
> Can you add  "noinline_for_stack":
>
> static noinline_for_stack <=== HERE
> char *put_dec(char *buf, unsigned long long n)
>
> and regenerate and resend the test_new.s?
>

Hello,

Your requested asm are attached.

roma1390

[-- Attachment #2: test_new-org-with-noinline_for_stack.diff --]
[-- Type: text/plain, Size: 560 bytes --]

--- test_new.org.c	2012-03-28 13:25:14.000000000 +0300
+++ test_new-org-with-noinline_for_stack.c	2012-03-28 13:37:26.000000000 +0300
@@ -110,7 +110,7 @@
 
 /* First algorithm: generic */
 
-static
+static noinline_for_stack
 char *put_dec(char *buf, unsigned long long n)
 {
 	if (n >= 100*1000*1000) {
@@ -145,7 +145,7 @@
  * (with permission from the author).
  * Performs no 64-bit division and hence should be fast on 32-bit machines.
  */
-static
+static noinline_for_stack
 char *put_dec(char *buf, unsigned long long n)
 {
 	uint32_t d3, d2, d1, q, h;

[-- Attachment #3: test_new-org-with-noinline_for_stack.s --]
[-- Type: text/plain, Size: 14374 bytes --]

	.cpu arm9tdmi
	.fpu softvfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 2
	.eabi_attribute 18, 4
	.file	"test_new-org-with-noinline_for_stack.c"
	.text
	.align	2
	.type	put_dec_trunc8, %function
put_dec_trunc8:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L5
	stmfd	sp!, {r4, r5, fp}
	umull	fp, ip, r1, r3
	mov	r2, ip
	add	r1, r1, #48
	mov	ip, r0
	add	r0, r2, r2, asl #2
	sub	r1, r1, r0, asl #1
	cmp	r2, #0
	mov	r0, ip
	strb	r1, [r0], #1
	beq	.L2
	umull	r4, r5, r2, r3
	add	r2, r2, #48
	add	fp, r5, r5, asl #2
	sub	r2, r2, fp, asl #1
	cmp	r5, #0
	mov	r1, r5
	strb	r2, [ip, #1]
	add	r0, r0, #1
	beq	.L2
	umull	r4, r5, r1, r3
	add	r1, r1, #48
	add	ip, r5, r5, asl #2
	sub	r1, r1, ip, asl #1
	cmp	r5, #0
	mov	r2, r5
	strb	r1, [r0], #1
	beq	.L2
	umull	r4, r5, r2, r3
	add	r2, r2, #48
	add	r1, r5, r5, asl #2
	sub	r2, r2, r1, asl #1
	cmp	r5, #0
	strb	r2, [r0], #1
	beq	.L2
	add	r2, r5, r5, asl #1
	add	r2, r5, r2, asl #2
	rsb	r2, r2, r2, asl #6
	add	r2, r5, r2, asl #2
	mov	r2, r2, asl #1
	mov	r2, r2, lsr #16
	add	r3, r5, #48
	add	r1, r2, r2, asl #2
	sub	r3, r3, r1, asl #1
	cmp	r2, #0
	strb	r3, [r0], #1
	beq	.L2
	add	r1, r2, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r3, r1, lsr #11
	add	r2, r2, #48
	add	r1, r3, r3, asl #2
	sub	r2, r2, r1, asl #1
	cmp	r3, #0
	strb	r2, [r0], #1
	beq	.L2
	add	r1, r3, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r2, r1, lsr #11
	add	r1, r2, r2, asl #2
	add	r3, r3, #48
	cmp	r2, #0
	sub	r3, r3, r1, asl #1
	strb	r3, [r0], #1
	addne	r2, r2, #48
	strneb	r2, [r0], #1
.L2:
	ldmfd	sp!, {r4, r5, fp}
	bx	lr
.L6:
	.align	2
.L5:
	.word	429496730
	.size	put_dec_trunc8, .-put_dec_trunc8
	.align	2
	.type	put_dec_full4, %function
put_dec_full4:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	add	r3, r1, r1, asl #1
	add	r3, r3, r3, asl #4
	add	r3, r3, r3, asl #8
	add	r3, r1, r3, asl #2
	mov	r3, r3, lsr #19
	add	r2, r3, r3, asl #1
	add	r2, r3, r2, asl #2
	rsb	r2, r2, r2, asl #6
	add	r2, r3, r2, asl #2
	mov	r2, r2, asl #1
	mov	r2, r2, lsr #16
	add	ip, r2, r2, asl #2
	stmfd	sp!, {r4, r5}
	add	r4, r2, ip, asl #3
	add	r5, r3, r3, asl #2
	add	r1, r1, #48
	add	r4, r4, r4, asl #2
	sub	r1, r1, r5, asl #1
	mov	r4, r4, lsr #11
	mov	r5, r0
	strb	r1, [r5], #1
	add	r3, r3, #48
	add	r1, r4, r4, asl #2
	add	r2, r2, #48
	sub	r2, r2, r1, asl #1
	sub	ip, r3, ip, asl #1
	add	r1, r5, #1
	add	r4, r4, #48
	strb	ip, [r0, #1]
	strb	r2, [r5, #1]
	add	r0, r1, #2
	strb	r4, [r1, #1]
	ldmfd	sp!, {r4, r5}
	bx	lr
	.size	put_dec_full4, .-put_dec_full4
	.align	2
	.type	put_dec, %function
put_dec:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	cmp	r3, #0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	mov	ip, r0
	beq	.L17
.L10:
	mov	r6, r2, lsr #16
	add	r1, r6, r6, asl #2
	add	r1, r6, r1, asl #2
	mov	r5, r3, asl #16
	mov	r4, r3, lsr #16
	mov	r5, r5, lsr #16
	add	r3, r6, r1, asl #1
	mov	r2, r2, asl #16
	add	r3, r6, r3, asl #2
	mov	sl, r5, asl #3
	add	r1, r4, r4, asl #2
	mov	r2, r2, lsr #16
	add	r1, r4, r1, asl #3
	add	r2, r2, r3, asl #5
	rsb	r3, r5, sl
	add	r3, r5, r3, asl #3
	ldr	r7, .L19
	add	r2, r2, r1, asl #4
	add	r2, r2, r3, asl #7
	umull	r3, r8, r7, r2
	mov	r8, r8, lsr #13
	add	r3, r8, r8, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	sub	r1, r2, r3, asl #4
	mov	r0, ip
	bl	put_dec_full4
	mov	r3, r4, asl #10
	sub	r3, r3, r4, asl #6
	add	sl, sl, r5
	rsb	r3, r4, r3
	add	sl, sl, sl, asl #5
	rsb	r3, r4, r3, asl #3
	rsb	sl, r5, sl, asl #2
	add	sl, r3, sl, asl #3
	add	r6, r6, r6, asl #1
	add	r6, sl, r6, asl #1
	add	r8, r6, r8
	umull	r3, r6, r7, r8
	mov	r6, r6, lsr #13
	add	r3, r6, r6, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	sub	r1, r8, r3, asl #4
	bl	put_dec_full4
	add	r3, r4, r4, asl #3
	add	r3, r3, r3, asl #5
	add	r2, r5, r5, asl #2
	rsb	r3, r4, r3, asl #2
	add	r3, r4, r3, asl #2
	add	r5, r5, r2, asl #2
	add	r5, r3, r5, asl #1
	add	r5, r5, r6
	umull	r3, r6, r7, r5
	mov	r6, r6, lsr #13
	add	r3, r6, r6, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	sub	r1, r5, r3, asl #4
	bl	put_dec_full4
	add	r2, r4, r4, asl #4
	add	r2, r4, r2, asl #1
	add	r4, r4, r2, asl #3
	adds	r4, r6, r4
	mov	r3, r0
	bne	.L18
.L13:
	mov	r0, r3
	ldrb	r2, [r3, #-1]!	@ zero_extendqisi2
	cmp	r2, #48
	beq	.L13
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	bx	lr
.L17:
	ldr	r1, .L19+4
	cmp	r2, r1
	bhi	.L10
	mov	r1, r2
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	b	put_dec_trunc8
.L18:
	umull	r3, r5, r7, r4
	mov	r5, r5, lsr #13
	add	r3, r5, r5, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	add	r3, r3, r3, asl #2
	sub	r1, r4, r3, asl #4
	bl	put_dec_full4
	cmp	r5, #0
	mov	r3, r0
	beq	.L13
	mov	r1, r5
	bl	put_dec_full4
	mov	r3, r0
	b	.L13
.L20:
	.align	2
.L19:
	.word	-776530087
	.word	99999999
	.size	put_dec, .-put_dec
	.align	2
	.type	number, %function
number:
	@ Function supports interworking.
	@ args = 8, pretend = 0, frame = 120
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	sub	sp, sp, #124
	ldrb	ip, [sp, #161]	@ zero_extendqisi2
	ldrb	r4, [sp, #162]	@ zero_extendqisi2
	ldrh	r5, [sp, #166]
	str	r0, [sp, #12]
	ands	r0, ip, #64
	str	r4, [sp, #20]
	str	r5, [sp, #36]
	ldrh	r4, [sp, #164]
	str	ip, [sp, #16]
	beq	.L74
	ldr	r0, [sp, #20]
	subs	r0, r0, #10
	movne	r0, #1
.L74:
	ands	r5, ip, #16
	str	r0, [sp, #28]
	and	r0, ip, #32
	andne	ip, ip, #254
	strne	ip, [sp, #16]
	str	r5, [sp, #40]
	ldr	r5, [sp, #16]
	str	r0, [sp, #8]
	tst	r5, #2
	beq	.L26
	cmp	r3, #0
	blt	.L65
	tst	r5, #4
	bne	.L76
	ldr	r5, [sp, #16]
	tst	r5, #8
	beq	.L26
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	mov	ip, #32
	str	r0, [sp, #24]
	str	ip, [sp, #32]
	b	.L29
.L26:
	mov	r0, #0
	str	r4, [sp, #24]
	str	r0, [sp, #32]
.L29:
	ldr	r4, [sp, #28]
	cmp	r4, #0
	beq	.L31
	ldr	r5, [sp, #24]
	ldr	ip, [sp, #20]
	sub	r0, r5, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	cmp	ip, #16
	str	r0, [sp, #24]
	subeq	r0, r0, #1
	moveq	r0, r0, asl #16
	moveq	r0, r0, lsr #16
	streq	r0, [sp, #24]
.L31:
	cmp	r3, #0
	bne	.L32
	cmp	r2, #7
	bls	.L77
.L32:
	ldr	r0, [sp, #20]
	cmp	r0, #10
	beq	.L35
	cmp	r0, #16
	movne	r5, #3
	moveq	r5, #4
	sub	r8, r0, #1
	ldr	sl, .L79
	rsb	r0, r5, #32
	mov	r4, #0
	add	r9, sp, #52
	sub	r6, r5, #32
	str	r1, [sp, #44]
	mov	fp, r0
.L38:
	mov	ip, r2, lsr r5
	cmp	r6, #0
	orr	ip, ip, r3, asl fp
	movge	ip, r3, lsr r6
	mov	r7, r3, lsr r5
	and	r2, r2, #255
	and	r2, r2, r8
	mov	r0, ip
	ldr	r1, [sp, #8]
	ldrb	ip, [sl, r2]	@ zero_extendqisi2
	mov	r2, r0
	orr	ip, r1, ip
	orrs	r0, r2, r7
	strb	ip, [r9, r4]
	mov	r3, r7
	add	r4, r4, #1
	bne	.L38
	ldr	r1, [sp, #44]
	sub	r5, r4, #1
.L34:
	ldr	r2, [sp, #36]
	ldr	r3, [sp, #24]
	mov	ip, r2, asl #16
	cmp	r4, ip, asr #16
	movgt	ip, r4, asl #16
	mov	ip, ip, lsr #16
	ldr	r0, [sp, #16]
	rsb	r7, ip, r3
	mov	r7, r7, asl #16
	mov	r7, r7, lsr #16
	tst	r0, #17
	mov	r0, r7
	bne	.L41
	sub	r0, r7, #1
	mov	r0, r0, asl #16
	cmp	r0, #0
	mov	r0, r0, lsr #16
	blt	.L41
	ldr	r3, [sp, #12]
	mov	r8, r0
	add	r2, r3, #1
	add	r2, r2, r0
	mov	r6, #32
.L43:
	cmp	r1, r3
	strhib	r6, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L43
	rsb	r7, r7, #1
	ldr	r2, [sp, #12]
	add	r0, r0, r7
	mov	r0, r0, asl #16
	add	r3, r8, #1
	sub	r0, r0, #65536
	add	r2, r2, r3
	str	r2, [sp, #12]
	mov	r0, r0, lsr #16
.L41:
	ldr	r3, [sp, #32]
	cmp	r3, #0
	beq	.L44
	ldr	r2, [sp, #12]
	cmp	r2, r1
	strccb	r3, [r2, #0]
	ldr	r3, [sp, #12]
	add	r3, r3, #1
	str	r3, [sp, #12]
.L44:
	ldr	r2, [sp, #28]
	cmp	r2, #0
	beq	.L46
	ldr	r3, [sp, #12]
	cmp	r3, r1
	ldrcc	r2, [sp, #12]
	movcc	r3, #48
	strccb	r3, [r2, #0]
	ldr	r2, [sp, #12]
	ldr	r3, [sp, #20]
	add	r2, r2, #1
	cmp	r3, #16
	str	r2, [sp, #12]
	beq	.L78
.L46:
	ldr	r2, [sp, #40]
	cmp	r2, #0
	movne	r6, r0
	movne	r7, r6, asl #16
	bne	.L50
	sub	r6, r0, #1
	ldr	r3, [sp, #16]
	mov	r6, r6, asl #16
	tst	r3, #1
	mov	r6, r6, lsr #16
	movne	r8, #48
	moveq	r8, #32
	movs	r7, r6, asl #16
	bmi	.L50
	sub	r2, r0, #1
	ldr	r3, [sp, #12]
	mov	r2, r2, asl #16
	add	r2, r3, r2, lsr #16
	add	r2, r2, #1
.L54:
	cmp	r1, r3
	strhib	r8, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L54
	rsb	r6, r0, r6
	mov	r6, r6, asl #16
	mov	r6, r6, lsr #16
	str	r3, [sp, #12]
	mov	r7, r6, asl #16
.L50:
	sub	r3, ip, #1
	mov	r3, r3, asl #16
	cmp	r4, r3, asr #16
	bgt	.L55
	sub	r0, ip, #2
	ldr	r3, [sp, #12]
	mov	r0, r0, asl #16
	add	r0, r3, r0, asr #16
	add	r0, r0, #1
	mov	ip, #48
.L57:
	cmp	r1, r3
	strhib	ip, [r3, #0]
	add	r3, r3, #1
	rsb	r2, r3, r0
	cmp	r4, r2
	ble	.L57
	str	r3, [sp, #12]
.L55:
	cmp	r5, #0
	blt	.L58
	add	r2, sp, #52
	ldr	r3, [sp, #12]
	sub	r0, r2, #1
	add	r2, r2, r5
.L60:
	cmp	r1, r3
	ldrhib	ip, [r2, #0]	@ zero_extendqisi2
	sub	r2, r2, #1
	strhib	ip, [r3, #0]
	cmp	r2, r0
	add	r3, r3, #1
	bne	.L60
	ldr	r4, [sp, #12]
	add	r5, r5, #1
	add	r4, r4, r5
	str	r4, [sp, #12]
.L58:
	cmp	r7, #0
	ble	.L61
	ldr	r5, [sp, #12]
	sub	r2, r6, #1
	mov	r2, r2, asl #16
	add	r2, r5, r2, lsr #16
	add	r2, r2, #1
	mov	r3, r5
	mov	r0, #32
.L63:
	cmp	r1, r3
	strhib	r0, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L63
	str	r3, [sp, #12]
.L61:
	ldr	r0, [sp, #12]
	add	sp, sp, #124
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L76:
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	str	r0, [sp, #24]
	mov	r0, #43
	str	r0, [sp, #32]
	b	.L29
.L65:
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	mov	ip, #45
	rsbs	r2, r2, #0
	rsc	r3, r3, #0
	str	r0, [sp, #24]
	str	ip, [sp, #32]
	b	.L29
.L77:
	add	r0, r2, #48
	strb	r0, [sp, #52]
	mov	r5, r3
	mov	r4, #1
	b	.L34
.L35:
	add	r4, sp, #52
	mov	r0, r4
	str	r1, [sp, #4]
	bl	put_dec
	rsb	r4, r4, r0
	sub	r5, r4, #1
	ldr	r1, [sp, #4]
	b	.L34
.L78:
	cmp	r1, r2
	ldrhi	r2, [sp, #8]
	orrhi	r3, r2, #88
	ldrhi	r2, [sp, #12]
	strhib	r3, [r2, #0]
	ldr	r3, [sp, #12]
	add	r3, r3, #1
	str	r3, [sp, #12]
	b	.L46
.L80:
	.align	2
.L79:
	.word	.LANCHOR0
	.size	number, .-number
	.align	2
	.type	measure_number, %function
measure_number:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 72
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	mov	fp, r3
	sub	sp, sp, #84
	ldr	r3, [r0, #0]
	str	r0, [sp, #12]
	mov	r0, #0
	str	r3, [sp, #8]
	mov	sl, r2
	bl	time
	ldr	r3, [sp, #8]
	mov	r9, #0
	add	r6, sp, #16
	cmp	r3, r0
	mov	r7, r9
	ldr	r5, .L88
	add	r8, r6, #63
	bne	.L87
.L84:
	ldr	r4, .L88+4
.L83:
	mov	ip, sp
	ldmia	r5, {r0, r1}
	mov	r2, sl
	stmia	ip, {r0, r1}
	mov	r3, fp
	mov	r0, r6
	mov	r1, r8
	bl	number
	sub	r4, r4, #1
	cmn	r4, #1
	strb	r7, [r0, #0]
	bne	.L83
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #8]
	add	r9, r9, #4000
	cmp	r3, r0
	beq	.L84
.L87:
	ldr	ip, [sp, #12]
	str	r0, [ip, #0]
	mov	r0, r9
	add	sp, sp, #84
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L89:
	.align	2
.L88:
	.word	.LANCHOR0+16
	.word	3999
	.size	measure_number, .-measure_number
	.align	2
	.type	measure, %function
measure:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	mov	r0, #0
	sub	sp, sp, #24
	bl	time
	str	r0, [sp, #20]
.L91:
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #20]
	cmp	r0, r3
	beq	.L91
	add	r8, sp, #24
	str	r0, [r8, #-4]!
	mov	r2, #8
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r2, #123
	mov	sl, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L94
	mov	r7, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L94+4
	mov	r6, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L94+8
	mov	r5, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r4, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r9, r0
	mvn	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r1, sl
	str	r0, [sp, #12]
	mov	r2, r7
	mov	r3, r6
	ldr	r0, .L94+12
	str	r5, [sp, #0]
	stmib	sp, {r4, r9}	@ phole stm
	bl	printf
	add	sp, sp, #24
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	bx	lr
.L95:
	.align	2
.L94:
	.word	123456
	.word	12345678
	.word	123456789
	.word	.LC0
	.size	measure, .-measure
	.align	2
	.type	check, %function
check:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 128
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, lr}
	ldr	r3, .L100
	sub	sp, sp, #140
	mov	r5, r0
	mov	r6, r1
	add	r4, sp, #72
	ldmia	r3, {r0, r1}
	mov	r3, sp
	stmia	r3, {r0, r1}
	mov	r2, r5
	mov	r3, r6
	add	r1, r4, #63
	mov	r0, r4
	bl	number
	add	r7, sp, #8
	mov	r3, #0
	strb	r3, [r0, #0]
	mov	r2, r5
	mov	r3, r6
	ldr	r1, .L100+4
	mov	r0, r7
	bl	sprintf
	mov	r0, r4
	mov	r1, r7
	bl	strcmp
	cmp	r0, #0
	bne	.L99
	add	sp, sp, #140
	ldmfd	sp!, {r4, r5, r6, r7, lr}
	bx	lr
.L99:
	mov	r2, r5
	mov	r3, r6
	ldr	r0, .L100+8
	str	r4, [sp, #0]
	bl	printf
	mov	r0, #1
	bl	exit
.L101:
	.align	2
.L100:
	.word	.LANCHOR0+16
	.word	.LC1
	.word	.LC2
	.size	check, .-check
	.align	2
	.global	main
	.type	main, %function
main:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bl	measure
	mov	r4, #0
	bl	measure
	mov	r5, #0
	bl	measure
	ldr	r6, .L106
	bl	measure
	mov	r7, #0
	mov	sl, #1
	mov	fp, #0
	b	.L104
.L103:
	adds	r4, r4, sl
	adc	r5, r5, fp
.L104:
	mov	r0, r4
	mov	r1, r5
	bl	check
	and	r8, r4, r6
	rsbs	r0, r4, #0
	rsc	r1, r5, #0
	and	r9, r5, r7
	bl	check
	orrs	r8, r8, r9
	bne	.L103
	mov	r2, r4
	mov	r3, r5
	ldr	r0, .L106+4
	bl	printf
	mov	r0, r8
	bl	fflush
	b	.L103
.L107:
	.align	2
.L106:
	.word	262143
	.word	.LC3
	.size	main, .-main
	.section	.rodata
	.align	2
.LANCHOR0 = . + 0
	.type	digits.3938, %object
	.size	digits.3938, 16
digits.3938:
	.ascii	"0123456789ABCDEF"
	.type	dummy_spec, %object
	.size	dummy_spec, 8
dummy_spec:
	.byte	8
	.byte	0
	.byte	10
	.byte	0
	.short	0
	.short	0
	.section	.rodata.str1.4,"aMS",%progbits,1
	.align	2
.LC0:
	.ascii	"Conversions per second: 8:%d 123:%d 123456:%d 12345"
	.ascii	"678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
	.ascii	"%llu\000"
	.space	3
.LC2:
	.ascii	"Error in formatting %llu:'%s'\012\000"
	.space	1
.LC3:
	.ascii	"\015Tested %llu                               \000"
	.ident	"GCC: (Debian 4.4.5-8) 4.4.5"
	.section	.note.GNU-stack,"",%progbits

[-- Attachment #4: test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.diff --]
[-- Type: text/plain, Size: 1332 bytes --]

--- test_new.org.c	2012-03-28 13:25:14.000000000 +0300
+++ test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c	2012-03-28 13:37:00.000000000 +0300
@@ -1,4 +1,4 @@
-#include "test_header.c"
+#include "test_header_arm.c"
 
 /* Decimal conversion is by far the most typical, and is used
  * for /proc and /sys data. This directly impacts e.g. top performance
@@ -7,7 +7,7 @@
  * (with permission from the author, Douglas W. Jones).
  */
 
-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
 /* Formats correctly any integer in [0, 999999999] */
 static noinline_for_stack
 char *put_dec_full9(char *buf, unsigned q)
@@ -106,11 +106,11 @@
  * Else (if long is 32 bits and long long is 64 bits) we use second one.
  */
 
-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ///LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
 
 /* First algorithm: generic */
 
-static
+static noinline_for_stack
 char *put_dec(char *buf, unsigned long long n)
 {
 	if (n >= 100*1000*1000) {
@@ -145,7 +145,7 @@
  * (with permission from the author).
  * Performs no 64-bit division and hence should be fast on 32-bit machines.
  */
-static
+static noinline_for_stack
 char *put_dec(char *buf, unsigned long long n)
 {
 	uint32_t d3, d2, d1, q, h;

[-- Attachment #5: test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.s --]
[-- Type: text/plain, Size: 14681 bytes --]

	.cpu arm9tdmi
	.fpu softvfp
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 2
	.eabi_attribute 18, 4
	.file	"test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c"
	.text
	.align	2
	.type	put_dec_full9, %function
put_dec_full9:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 40
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L3
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
	umull	r4, r5, r1, r3
	mov	r6, r5
	umull	r4, r5, r6, r3
	umull	r7, r8, r5, r3
	mov	ip, r8
	umull	r7, r8, ip, r3
	mov	r2, r8
	umull	r7, r8, r2, r3
	mov	r4, r5
	add	r5, r8, r8, asl #1
	add	r5, r8, r5, asl #2
	rsb	r5, r5, r5, asl #6
	add	r5, r8, r5, asl #2
	mov	r5, r5, asl #1
	mov	r3, r8
	mov	r5, r5, lsr #16
	add	r1, r1, #48
	add	r8, r6, r6, asl #2
	sub	sp, sp, #40
	add	r7, r5, r5, asl #2
	sub	r8, r1, r8, asl #1
	add	r6, r6, #48
	mov	r1, r0
	str	r7, [sp, #12]
	strb	r8, [r1], #1
	add	r7, r5, r7, asl #3
	str	r6, [sp, #0]
	add	r5, r5, #48
	str	r5, [sp, #32]
	ldr	r5, [sp, #0]
	add	r6, r4, #48
	add	r4, r4, r4, asl #2
	str	r6, [sp, #16]
	sub	r4, r5, r4, asl #1
	str	r4, [sp, #0]
	add	r6, ip, #48
	ldr	r4, [sp, #16]
	add	r7, r7, r7, asl #2
	str	r6, [sp, #20]
	mov	r7, r7, lsr #11
	add	r6, r2, #48
	add	ip, ip, ip, asl #2
	str	r6, [sp, #24]
	add	r8, r1, #1
	add	r9, r7, r7, asl #2
	sub	ip, r4, ip, asl #1
	str	r8, [sp, #8]
	str	ip, [sp, #16]
	ldr	r5, [sp, #20]
	ldr	ip, [sp, #24]
	add	r8, r7, r9, asl #3
	add	r6, r3, #48
	add	r7, r7, #48
	str	r6, [sp, #28]
	str	r7, [sp, #36]
	add	r8, r8, r8, asl #2
	add	r2, r2, r2, asl #2
	add	r3, r3, r3, asl #2
	ldr	r4, [sp, #36]
	sub	r2, r5, r2, asl #1
	sub	r3, ip, r3, asl #1
	ldr	r5, [sp, #28]
	ldr	ip, [sp, #12]
	mov	r8, r8, lsr #11
	add	r7, r8, r8, asl #2
	sub	r7, r4, r7, asl #1
	sub	r4, r5, ip, asl #1
	ldr	r5, [sp, #32]
	ldr	sl, [sp, #8]
	sub	r9, r5, r9, asl #1
	ldr	r5, [sp, #0]
	add	sl, sl, #1
	str	sl, [sp, #4]
	strb	r5, [r0, #1]
	ldr	r0, [sp, #16]
	add	sl, sl, #1
	strb	r0, [r1, #1]
	ldr	r1, [sp, #8]
	add	fp, sl, #1
	strb	r2, [r1, #1]
	ldr	r2, [sp, #4]
	add	r6, fp, #1
	add	ip, r6, #1
	add	r8, r8, #48
	strb	r3, [r2, #1]
	add	r0, ip, #2
	strb	r4, [sl, #1]
	strb	r9, [fp, #1]
	strb	r7, [r6, #1]
	strb	r8, [ip, #1]
	add	sp, sp, #40
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
	bx	lr
.L4:
	.align	2
.L3:
	.word	429496730
	.size	put_dec_full9, .-put_dec_full9
	.align	2
	.type	put_dec_trunc8, %function
put_dec_trunc8:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r3, .L9
	stmfd	sp!, {r4, r5, fp}
	umull	fp, ip, r1, r3
	mov	r2, ip
	add	r1, r1, #48
	mov	ip, r0
	add	r0, r2, r2, asl #2
	sub	r1, r1, r0, asl #1
	cmp	r2, #0
	mov	r0, ip
	strb	r1, [r0], #1
	beq	.L6
	umull	r4, r5, r2, r3
	add	r2, r2, #48
	add	fp, r5, r5, asl #2
	sub	r2, r2, fp, asl #1
	cmp	r5, #0
	mov	r1, r5
	strb	r2, [ip, #1]
	add	r0, r0, #1
	beq	.L6
	umull	r4, r5, r1, r3
	add	r1, r1, #48
	add	ip, r5, r5, asl #2
	sub	r1, r1, ip, asl #1
	cmp	r5, #0
	mov	r2, r5
	strb	r1, [r0], #1
	beq	.L6
	umull	r4, r5, r2, r3
	add	r2, r2, #48
	add	r1, r5, r5, asl #2
	sub	r2, r2, r1, asl #1
	cmp	r5, #0
	strb	r2, [r0], #1
	beq	.L6
	add	r2, r5, r5, asl #1
	add	r2, r5, r2, asl #2
	rsb	r2, r2, r2, asl #6
	add	r2, r5, r2, asl #2
	mov	r2, r2, asl #1
	mov	r2, r2, lsr #16
	add	r3, r5, #48
	add	r1, r2, r2, asl #2
	sub	r3, r3, r1, asl #1
	cmp	r2, #0
	strb	r3, [r0], #1
	beq	.L6
	add	r1, r2, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r3, r1, lsr #11
	add	r2, r2, #48
	add	r1, r3, r3, asl #2
	sub	r2, r2, r1, asl #1
	cmp	r3, #0
	strb	r2, [r0], #1
	beq	.L6
	add	r1, r3, r1, asl #3
	add	r1, r1, r1, asl #2
	mov	r2, r1, lsr #11
	add	r1, r2, r2, asl #2
	add	r3, r3, #48
	cmp	r2, #0
	sub	r3, r3, r1, asl #1
	strb	r3, [r0], #1
	addne	r2, r2, #48
	strneb	r2, [r0], #1
.L6:
	ldmfd	sp!, {r4, r5, fp}
	bx	lr
.L10:
	.align	2
.L9:
	.word	429496730
	.size	put_dec_trunc8, .-put_dec_trunc8
	.align	2
	.type	put_dec, %function
put_dec:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	cmp	r3, #0
	stmfd	sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
	beq	.L28
.L25:
	ldr	sl, .L29
	ldr	r5, .L29+4
	ldr	r6, .L29+8
	mov	r8, #0
.L27:
#APP
@ 118 "test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c" 1
	umull	fp, ip, r5, r2
	cmn	fp, r5
	adcs	ip, ip, r6
	adc	fp, r8, #0
@ 0 "" 2
	mov	r1, r8
#APP
@ 118 "test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c" 1
	umlal	ip, fp, r6, r2
	umlal	ip, r1, r5, r3
	mov	ip, #0
	adds	fp, r1, fp
	adc	ip, ip, #0
	umlal	fp, ip, r6, r3
@ 0 "" 2
	mov	r4, fp, lsr #29
	orr	r4, r4, ip, asl #3
	rsb	r1, r4, r4, asl #5
	rsb	r1, r1, r1, asl #6
	add	r1, r4, r1, asl #3
	add	r1, r1, r1, asl #2
	add	r1, r1, r1, asl #2
	add	r1, r1, r1, asl #2
	mov	r7, ip, lsr #29
	sub	r1, r2, r1, asl #9
	bl	put_dec_full9
	cmp	r7, #0
	mov	r2, r4
	mov	r3, r7
	bne	.L27
	cmp	r4, sl
	bhi	.L27
	ldr	r3, .L29+12
	cmp	r4, r3
	bhi	.L16
.L18:
	mov	r1, r4
	ldmfd	sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
	b	put_dec_trunc8
.L28:
	ldr	r1, .L29+12
	cmp	r2, r1
	movls	r4, r2
	bls	.L18
	ldr	r1, .L29
	cmp	r2, r1
	movls	r4, r2
	bhi	.L25
.L16:
	mov	r1, r4
	ldmfd	sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
	b	put_dec_full9
.L30:
	.align	2
.L29:
	.word	999999999
	.word	917808535
	.word	-1989124287
	.word	99999999
	.size	put_dec, .-put_dec
	.align	2
	.type	number, %function
number:
	@ Function supports interworking.
	@ args = 8, pretend = 0, frame = 120
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	sub	sp, sp, #124
	ldrb	ip, [sp, #161]	@ zero_extendqisi2
	ldrb	r4, [sp, #162]	@ zero_extendqisi2
	ldrh	r5, [sp, #166]
	str	r0, [sp, #12]
	ands	r0, ip, #64
	str	r4, [sp, #20]
	str	r5, [sp, #36]
	ldrh	r4, [sp, #164]
	str	ip, [sp, #16]
	beq	.L84
	ldr	r0, [sp, #20]
	subs	r0, r0, #10
	movne	r0, #1
.L84:
	ands	r5, ip, #16
	str	r0, [sp, #28]
	and	r0, ip, #32
	andne	ip, ip, #254
	strne	ip, [sp, #16]
	str	r5, [sp, #40]
	ldr	r5, [sp, #16]
	str	r0, [sp, #8]
	tst	r5, #2
	beq	.L36
	cmp	r3, #0
	blt	.L75
	tst	r5, #4
	bne	.L86
	ldr	r5, [sp, #16]
	tst	r5, #8
	beq	.L36
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	mov	ip, #32
	str	r0, [sp, #24]
	str	ip, [sp, #32]
	b	.L39
.L36:
	mov	r0, #0
	str	r4, [sp, #24]
	str	r0, [sp, #32]
.L39:
	ldr	r4, [sp, #28]
	cmp	r4, #0
	beq	.L41
	ldr	r5, [sp, #24]
	ldr	ip, [sp, #20]
	sub	r0, r5, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	cmp	ip, #16
	str	r0, [sp, #24]
	subeq	r0, r0, #1
	moveq	r0, r0, asl #16
	moveq	r0, r0, lsr #16
	streq	r0, [sp, #24]
.L41:
	cmp	r3, #0
	bne	.L42
	cmp	r2, #7
	bls	.L87
.L42:
	ldr	r0, [sp, #20]
	cmp	r0, #10
	beq	.L45
	cmp	r0, #16
	movne	r5, #3
	moveq	r5, #4
	sub	r8, r0, #1
	ldr	sl, .L89
	rsb	r0, r5, #32
	mov	r4, #0
	add	r9, sp, #52
	sub	r6, r5, #32
	str	r1, [sp, #44]
	mov	fp, r0
.L48:
	mov	ip, r2, lsr r5
	cmp	r6, #0
	orr	ip, ip, r3, asl fp
	movge	ip, r3, lsr r6
	mov	r7, r3, lsr r5
	and	r2, r2, #255
	and	r2, r2, r8
	mov	r0, ip
	ldr	r1, [sp, #8]
	ldrb	ip, [sl, r2]	@ zero_extendqisi2
	mov	r2, r0
	orr	ip, r1, ip
	orrs	r0, r2, r7
	strb	ip, [r9, r4]
	mov	r3, r7
	add	r4, r4, #1
	bne	.L48
	ldr	r1, [sp, #44]
	sub	r5, r4, #1
.L44:
	ldr	r2, [sp, #36]
	ldr	r3, [sp, #24]
	mov	ip, r2, asl #16
	cmp	r4, ip, asr #16
	movgt	ip, r4, asl #16
	mov	ip, ip, lsr #16
	ldr	r0, [sp, #16]
	rsb	r7, ip, r3
	mov	r7, r7, asl #16
	mov	r7, r7, lsr #16
	tst	r0, #17
	mov	r0, r7
	bne	.L51
	sub	r0, r7, #1
	mov	r0, r0, asl #16
	cmp	r0, #0
	mov	r0, r0, lsr #16
	blt	.L51
	ldr	r3, [sp, #12]
	mov	r8, r0
	add	r2, r3, #1
	add	r2, r2, r0
	mov	r6, #32
.L53:
	cmp	r1, r3
	strhib	r6, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L53
	rsb	r7, r7, #1
	ldr	r2, [sp, #12]
	add	r0, r0, r7
	mov	r0, r0, asl #16
	add	r3, r8, #1
	sub	r0, r0, #65536
	add	r2, r2, r3
	str	r2, [sp, #12]
	mov	r0, r0, lsr #16
.L51:
	ldr	r3, [sp, #32]
	cmp	r3, #0
	beq	.L54
	ldr	r2, [sp, #12]
	cmp	r2, r1
	strccb	r3, [r2, #0]
	ldr	r3, [sp, #12]
	add	r3, r3, #1
	str	r3, [sp, #12]
.L54:
	ldr	r2, [sp, #28]
	cmp	r2, #0
	beq	.L56
	ldr	r3, [sp, #12]
	cmp	r3, r1
	ldrcc	r2, [sp, #12]
	movcc	r3, #48
	strccb	r3, [r2, #0]
	ldr	r2, [sp, #12]
	ldr	r3, [sp, #20]
	add	r2, r2, #1
	cmp	r3, #16
	str	r2, [sp, #12]
	beq	.L88
.L56:
	ldr	r2, [sp, #40]
	cmp	r2, #0
	movne	r6, r0
	movne	r7, r6, asl #16
	bne	.L60
	sub	r6, r0, #1
	ldr	r3, [sp, #16]
	mov	r6, r6, asl #16
	tst	r3, #1
	mov	r6, r6, lsr #16
	movne	r8, #48
	moveq	r8, #32
	movs	r7, r6, asl #16
	bmi	.L60
	sub	r2, r0, #1
	ldr	r3, [sp, #12]
	mov	r2, r2, asl #16
	add	r2, r3, r2, lsr #16
	add	r2, r2, #1
.L64:
	cmp	r1, r3
	strhib	r8, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L64
	rsb	r6, r0, r6
	mov	r6, r6, asl #16
	mov	r6, r6, lsr #16
	str	r3, [sp, #12]
	mov	r7, r6, asl #16
.L60:
	sub	r3, ip, #1
	mov	r3, r3, asl #16
	cmp	r4, r3, asr #16
	bgt	.L65
	sub	r0, ip, #2
	ldr	r3, [sp, #12]
	mov	r0, r0, asl #16
	add	r0, r3, r0, asr #16
	add	r0, r0, #1
	mov	ip, #48
.L67:
	cmp	r1, r3
	strhib	ip, [r3, #0]
	add	r3, r3, #1
	rsb	r2, r3, r0
	cmp	r4, r2
	ble	.L67
	str	r3, [sp, #12]
.L65:
	cmp	r5, #0
	blt	.L68
	add	r2, sp, #52
	ldr	r3, [sp, #12]
	sub	r0, r2, #1
	add	r2, r2, r5
.L70:
	cmp	r1, r3
	ldrhib	ip, [r2, #0]	@ zero_extendqisi2
	sub	r2, r2, #1
	strhib	ip, [r3, #0]
	cmp	r2, r0
	add	r3, r3, #1
	bne	.L70
	ldr	r4, [sp, #12]
	add	r5, r5, #1
	add	r4, r4, r5
	str	r4, [sp, #12]
.L68:
	cmp	r7, #0
	ble	.L71
	ldr	r5, [sp, #12]
	sub	r2, r6, #1
	mov	r2, r2, asl #16
	add	r2, r5, r2, lsr #16
	add	r2, r2, #1
	mov	r3, r5
	mov	r0, #32
.L73:
	cmp	r1, r3
	strhib	r0, [r3, #0]
	add	r3, r3, #1
	cmp	r3, r2
	bne	.L73
	str	r3, [sp, #12]
.L71:
	ldr	r0, [sp, #12]
	add	sp, sp, #124
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L86:
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	str	r0, [sp, #24]
	mov	r0, #43
	str	r0, [sp, #32]
	b	.L39
.L75:
	sub	r0, r4, #1
	mov	r0, r0, asl #16
	mov	r0, r0, lsr #16
	mov	ip, #45
	rsbs	r2, r2, #0
	rsc	r3, r3, #0
	str	r0, [sp, #24]
	str	ip, [sp, #32]
	b	.L39
.L87:
	add	r0, r2, #48
	strb	r0, [sp, #52]
	mov	r5, r3
	mov	r4, #1
	b	.L44
.L45:
	add	r4, sp, #52
	mov	r0, r4
	str	r1, [sp, #4]
	bl	put_dec
	rsb	r4, r4, r0
	sub	r5, r4, #1
	ldr	r1, [sp, #4]
	b	.L44
.L88:
	cmp	r1, r2
	ldrhi	r2, [sp, #8]
	orrhi	r3, r2, #88
	ldrhi	r2, [sp, #12]
	strhib	r3, [r2, #0]
	ldr	r3, [sp, #12]
	add	r3, r3, #1
	str	r3, [sp, #12]
	b	.L56
.L90:
	.align	2
.L89:
	.word	.LANCHOR0
	.size	number, .-number
	.align	2
	.type	measure_number, %function
measure_number:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 72
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	mov	fp, r3
	sub	sp, sp, #84
	ldr	r3, [r0, #0]
	str	r0, [sp, #12]
	mov	r0, #0
	str	r3, [sp, #8]
	mov	sl, r2
	bl	time
	ldr	r3, [sp, #8]
	mov	r9, #0
	add	r6, sp, #16
	cmp	r3, r0
	mov	r7, r9
	ldr	r5, .L98
	add	r8, r6, #63
	bne	.L97
.L94:
	ldr	r4, .L98+4
.L93:
	mov	ip, sp
	ldmia	r5, {r0, r1}
	mov	r2, sl
	stmia	ip, {r0, r1}
	mov	r3, fp
	mov	r0, r6
	mov	r1, r8
	bl	number
	sub	r4, r4, #1
	cmn	r4, #1
	strb	r7, [r0, #0]
	bne	.L93
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #8]
	add	r9, r9, #4000
	cmp	r3, r0
	beq	.L94
.L97:
	ldr	ip, [sp, #12]
	str	r0, [ip, #0]
	mov	r0, r9
	add	sp, sp, #84
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bx	lr
.L99:
	.align	2
.L98:
	.word	.LANCHOR0+16
	.word	3999
	.size	measure_number, .-measure_number
	.align	2
	.type	measure, %function
measure:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 8
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	mov	r0, #0
	sub	sp, sp, #24
	bl	time
	str	r0, [sp, #20]
.L101:
	mov	r0, #0
	bl	time
	ldr	r3, [sp, #20]
	cmp	r0, r3
	beq	.L101
	add	r8, sp, #24
	str	r0, [r8, #-4]!
	mov	r2, #8
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r2, #123
	mov	sl, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L104
	mov	r7, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L104+4
	mov	r6, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	ldr	r2, .L104+8
	mov	r5, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r4, r0
	mov	r3, #0
	mov	r0, r8
	bl	measure_number
	mvn	r2, #0
	mov	r9, r0
	mvn	r3, #0
	mov	r0, r8
	bl	measure_number
	mov	r1, sl
	str	r0, [sp, #12]
	mov	r2, r7
	mov	r3, r6
	ldr	r0, .L104+12
	str	r5, [sp, #0]
	stmib	sp, {r4, r9}	@ phole stm
	bl	printf
	add	sp, sp, #24
	ldmfd	sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
	bx	lr
.L105:
	.align	2
.L104:
	.word	123456
	.word	12345678
	.word	123456789
	.word	.LC0
	.size	measure, .-measure
	.align	2
	.type	check, %function
check:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 128
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r4, r5, r6, r7, lr}
	ldr	r3, .L110
	sub	sp, sp, #140
	mov	r5, r0
	mov	r6, r1
	add	r4, sp, #72
	ldmia	r3, {r0, r1}
	mov	r3, sp
	stmia	r3, {r0, r1}
	mov	r2, r5
	mov	r3, r6
	add	r1, r4, #63
	mov	r0, r4
	bl	number
	add	r7, sp, #8
	mov	r3, #0
	strb	r3, [r0, #0]
	mov	r2, r5
	mov	r3, r6
	ldr	r1, .L110+4
	mov	r0, r7
	bl	sprintf
	mov	r0, r4
	mov	r1, r7
	bl	strcmp
	cmp	r0, #0
	bne	.L109
	add	sp, sp, #140
	ldmfd	sp!, {r4, r5, r6, r7, lr}
	bx	lr
.L109:
	mov	r2, r5
	mov	r3, r6
	ldr	r0, .L110+8
	str	r4, [sp, #0]
	bl	printf
	mov	r0, #1
	bl	exit
.L111:
	.align	2
.L110:
	.word	.LANCHOR0+16
	.word	.LC1
	.word	.LC2
	.size	check, .-check
	.align	2
	.global	main
	.type	main, %function
main:
	@ Function supports interworking.
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	stmfd	sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
	bl	measure
	mov	r4, #0
	bl	measure
	mov	r5, #0
	bl	measure
	ldr	r6, .L116
	bl	measure
	mov	r7, #0
	mov	sl, #1
	mov	fp, #0
	b	.L114
.L113:
	adds	r4, r4, sl
	adc	r5, r5, fp
.L114:
	mov	r0, r4
	mov	r1, r5
	bl	check
	and	r8, r4, r6
	rsbs	r0, r4, #0
	rsc	r1, r5, #0
	and	r9, r5, r7
	bl	check
	orrs	r8, r8, r9
	bne	.L113
	mov	r2, r4
	mov	r3, r5
	ldr	r0, .L116+4
	bl	printf
	mov	r0, r8
	bl	fflush
	b	.L113
.L117:
	.align	2
.L116:
	.word	262143
	.word	.LC3
	.size	main, .-main
	.section	.rodata
	.align	2
.LANCHOR0 = . + 0
	.type	digits.4059, %object
	.size	digits.4059, 16
digits.4059:
	.ascii	"0123456789ABCDEF"
	.type	dummy_spec, %object
	.size	dummy_spec, 8
dummy_spec:
	.byte	8
	.byte	0
	.byte	10
	.byte	0
	.short	0
	.short	0
	.section	.rodata.str1.4,"aMS",%progbits,1
	.align	2
.LC0:
	.ascii	"Conversions per second: 8:%d 123:%d 123456:%d 12345"
	.ascii	"678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
	.ascii	"%llu\000"
	.space	3
.LC2:
	.ascii	"Error in formatting %llu:'%s'\012\000"
	.space	1
.LC3:
	.ascii	"\015Tested %llu                               \000"
	.ident	"GCC: (Debian 4.4.5-8) 4.4.5"
	.section	.note.GNU-stack,"",%progbits

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 10:39             ` roma1390
@ 2012-03-28 11:20               ` Denys Vlasenko
  0 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 11:20 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Wednesday 28 March 2012 12:39, roma1390 wrote:
> On 2012.03.28 13:33, Denys Vlasenko wrote:
> > On Wednesday 28 March 2012 12:24, roma1390 wrote:
> >> On 2012.03.28 13:13, Denys Vlasenko wrote:
> >>> Second: run
> >>> arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
> >>> and email me resulting test_{org,new}.s files.
> >>
> >> test_{org,new}.s attached.
> >
> >
> > Bingo.
> >
> >          bl      __aeabi_uidivmod
> >
> > Not good. Your gcc did not optimize division by constant.
> >
> > Can you add  "noinline_for_stack":
> >
> > static noinline_for_stack <=== HERE
> > char *put_dec(char *buf, unsigned long long n)
> >
> > and regenerate and resend the test_new.s?
> >
> 
> Hello,
> 
> Your requested asm are attached.

Asm looks good: there are only three long multiplies in the
put_dec + put_dec_full4, and no divisions at all.
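
(To make the "no divisions" point concrete, here is a rough illustration,
not the posted code, of how a helper in the spirit of put_dec_full4 can
emit a group of digits with multiplies and shifts only; the magic constant
0xcccd/2^19 is close enough to 1/10 to be exact in this range.)

	/* illustration only: the four decimal digits of 0 <= q <= 9999,
	 * least significant first, with no divide instruction */
	static char *put4_sketch(char *buf, unsigned q)
	{
		unsigned d;
		d = (q * 0xcccd) >> 19;  *buf++ = '0' + (q - 10*d);  q = d;
		d = (q * 0xcccd) >> 19;  *buf++ = '0' + (q - 10*d);  q = d;
		d = (q * 0xcccd) >> 19;  *buf++ = '0' + (q - 10*d);
		*buf++ = '0' + d;
		return buf;
	}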

Please speed-test this version.

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 10:31         ` roma1390
@ 2012-03-28 11:23           ` Denys Vlasenko
  2012-03-29  5:23             ` roma1390
  0 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 11:23 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Wednesday 28 March 2012 12:31, roma1390 wrote:
> On 2012.03.28 13:13, Denys Vlasenko wrote:
> > Third: switch to algorithm #1 and test whether it fares better.
> > To do that, go to test_new.c
> > and replace
> >    #if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
> > with
> >    #if 1   ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
> > (there are two instances of this line there),
> > then recompile and rerun the test, and post the result.

> but after this diff:
> --- test_new-with-ifdefs.c      2012-03-28 13:29:04.000000000 +0300
> +++ test_new.c  2012-03-28 13:29:12.000000000 +0300
> @@ -1,4 +1,4 @@
> -#include "test_header.c"
> +#include "test_header_arm.c"
> 
>   /* Decimal conversion is by far the most typical, and is used
>    * for /proc and /sys data. This directly impacts e.g. top performance
> 
> compiles fine.
> 
> the results are:
> ./test_new2
> Conversions per second: 8:5420000 123:4452000 123456:3556000 12345678:3368000 
> 123456789:2904000 2^32:2412000 2^64:1556000
> Conversions per second: 8:5428000 123:4500000 123456:3644000 12345678:3328000 
> 123456789:2832000 2^32:2408000 2^64:1572000
> Conversions per second: 8:5372000 123:4396000 123456:3644000 12345678:3368000 
> 123456789:2900000 2^32:2392000 2^64:1532000
> Conversions per second: 8:5424000 123:4500000 123456:3608000 12345678:3284000 
> 123456789:2896000 2^32:2416000 2^64:1572000

It looks ok - it is a bit faster than the original code;
but I expect algorithm #2 will do better.
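
(A rough sketch of what "algorithm #1" boils down to, pieced together from
the hunks quoted earlier in the thread; helper names are the ones from the
patch, and the exact control flow in the posted test file may differ.  The
point is that it needs a 64-bit division by 10^9 for each 9-digit group,
which is why algorithm #2, which performs no 64-bit division at all, should
do better on 32-bit ARM.)

	/* sketch only, assuming put_dec_full9()/put_dec_trunc8() as in
	 * the patch; digits are emitted least-significant first */
	static char *put_dec_generic(char *buf, unsigned long long n)
	{
		while (n >= 1000*1000*1000) {
			unsigned r = (unsigned)(n % (1000*1000*1000));
			n /= 1000*1000*1000;		/* 64-bit division */
			buf = put_dec_full9(buf, r);
		}
		if (n >= 100*1000*1000)
			return put_dec_full9(buf, (unsigned)n);
		return put_dec_trunc8(buf, (unsigned)n);
	}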

BTW, please always run the test of the original code too when you measure speed,
and post both results.

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-27 15:33   ` Denys Vlasenko
@ 2012-03-29  5:16     ` roma1390
  2012-03-29 10:33       ` Denys Vlasenko
  0 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-29  5:16 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On 2012.03.27 18:33, Denys Vlasenko wrote:
> Can you also let it run longer, to test at least up to
> first 5 billion numbers (which is > 2^32)?

First test, after a ~20-hour run:
Tested 12514754560

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 11:23           ` Denys Vlasenko
@ 2012-03-29  5:23             ` roma1390
  2012-03-29 10:33               ` Denys Vlasenko
  0 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-29  5:23 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On 2012.03.28 14:23, Denys Vlasenko wrote:
> It looks ok - it is a bit faster than the original code;
> but I expect algorithm #2 will do better.
>
> BTW, please always run the test of the original code too when you measure speed,
> and post both results.

And the test results are:

./test_new-org-with-noinline_for_stack
Conversions per second: 8:5188000 123:4340000 123456:3536000 12345678:3164000 
123456789:2036000 2^32:2032000 2^64:1544000
Conversions per second: 8:5016000 123:4320000 123456:3536000 12345678:3272000 
123456789:2032000 2^32:1964000 2^64:1540000
Conversions per second: 8:5192000 123:4336000 123456:3420000 12345678:3268000 
123456789:2036000 2^32:2032000 2^64:1544000
Conversions per second: 8:5164000 123:4336000 123456:3536000 12345678:3272000 
123456789:1968000 2^32:2032000 2^64:1544000
Tested 524288                               ^C

# ./test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm
Conversions per second: 8:5020000 123:4332000 123456:3536000 12345678:3272000 
123456789:2828000 2^32:2260000 2^64:1536000
Conversions per second: 8:5192000 123:4332000 123456:3416000 12345678:3252000 
123456789:2828000 2^32:2336000 2^64:1540000
Conversions per second: 8:5172000 123:4336000 123456:3540000 12345678:3272000 
123456789:2736000 2^32:2336000 2^64:1536000
Conversions per second: 8:5192000 123:4192000 123456:3520000 12345678:3276000 
123456789:2828000 2^32:2340000 2^64:1484000
Tested 524288                               ^C


roma1390

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-29  5:16     ` roma1390
@ 2012-03-29 10:33       ` Denys Vlasenko
  0 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-29 10:33 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Thu, Mar 29, 2012 at 7:16 AM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.27 18:33, Denys Vlasenko wrote:
>>
>> Can you also let it run longer, to test at least up to
>> first 5 billion numbers (which is > 2^32)?
>
> First test after ~20 hour run:
> Tested 12514754560

Looks like it's ok :)

In case you expect that it will finish: well, it won't.
It will roll over 2^64-1 after a few years, and start from 0 :) :)
So please feel free to ^C

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-29  5:23             ` roma1390
@ 2012-03-29 10:33               ` Denys Vlasenko
  0 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-29 10:33 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Thu, Mar 29, 2012 at 7:23 AM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.28 14:23, Denys Vlasenko wrote:
>>
>> It looks ok - it is a bit faster than the original code;
>> but I expect algorithm #2 will do better.
>>
>> BTW, please always run the test of the original code too when you measure speed,
>> and post both results.
>
>
> And the test results are:
>
> ./test_new-org-with-noinline_for_stack
> Conversions per second: 8:5188000 123:4340000 123456:3536000
> 12345678:3164000 123456789:2036000 2^32:2032000 2^64:1544000
> Conversions per second: 8:5016000 123:4320000 123456:3536000
> 12345678:3272000 123456789:2032000 2^32:1964000 2^64:1540000
> Conversions per second: 8:5192000 123:4336000 123456:3420000
> 12345678:3268000 123456789:2036000 2^32:2032000 2^64:1544000
> Conversions per second: 8:5164000 123:4336000 123456:3536000
> 12345678:3272000 123456789:1968000 2^32:2032000 2^64:1544000
> Tested 524288                               ^C
>
> # ./test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm
> Conversions per second: 8:5020000 123:4332000 123456:3536000
> 12345678:3272000 123456789:2828000 2^32:2260000 2^64:1536000
> Conversions per second: 8:5192000 123:4332000 123456:3416000
> 12345678:3252000 123456789:2828000 2^32:2336000 2^64:1540000
> Conversions per second: 8:5172000 123:4336000 123456:3540000
> 12345678:3272000 123456789:2736000 2^32:2336000 2^64:1536000
> Conversions per second: 8:5192000 123:4192000 123456:3520000
> 12345678:3276000 123456789:2828000 2^32:2340000 2^64:1484000
> Tested 524288                               ^C

Thanks! They look good now.

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
  2012-03-28 10:33           ` Denys Vlasenko
  2012-03-28 10:39             ` roma1390
@ 2012-03-29 10:35             ` Denys Vlasenko
  1 sibling, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-29 10:35 UTC (permalink / raw)
  To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz

On Wed, Mar 28, 2012 at 12:33 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Wednesday 28 March 2012 12:24, roma1390 wrote:
>> On 2012.03.28 13:13, Denys Vlasenko wrote:
>> > Second: run
>> > arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
>> > and email me resulting test_{org,new}.s files.
>>
>> test_{org,new}.s attached.
>
> Bingo.
>
>        bl      __aeabi_uidivmod
>
> Not good. Your gcc did not optimize division by constant.
>
> Can you add  "noinline_for_stack":
>
> static noinline_for_stack <=== HERE
> char *put_dec(char *buf, unsigned long long n)
>
> and regenerate and resend the test_new.s?

Since you confirmed that noinlining helps,
I recommend filing a gcc bug report:
your gcc misses an optimization when it inlines
that function.
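
(Purely as an illustration of the shape such a reduced testcase usually
takes; the function below is made up for the example and may or may not
trip the same missed optimization on this particular gcc.)

	/* hypothetical testcase sketch: build it twice,
	 *   arm-linux-gnueabi-gcc -O2 -S testcase.c
	 *   arm-linux-gnueabi-gcc -O2 -S -DFORCE_NOINLINE testcase.c
	 * and check whether __aeabi_uidivmod shows up only in the
	 * build without FORCE_NOINLINE */
	#ifdef FORCE_NOINLINE
	#define MAYBE_NOINLINE __attribute__((noinline))
	#else
	#define MAYBE_NOINLINE
	#endif

	static MAYBE_NOINLINE char *put_digits(char *buf, unsigned n)
	{
		do {
			*buf++ = '0' + n % 10;	/* division by a constant */
			n /= 10;
		} while (n);
		return buf;
	}

	char *entry(char *buf, unsigned n)
	{
		return put_digits(buf, n);
	}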

-- 
vda

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2012-03-29 10:35 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
2012-03-26 19:51   ` Andrew Morton
2012-03-26 19:56     ` Denys Vlasenko
2012-03-26 20:13       ` Andrew Morton
2012-03-26 20:18         ` Geert Uytterhoeven
2012-03-26 23:18           ` Denys Vlasenko
2012-03-27  0:30             ` Denys Vlasenko
2012-03-27  3:49             ` H. Peter Anvin
2012-03-26 20:20         ` H. Peter Anvin
2012-03-27 17:12           ` Michal Nazarewicz
2012-03-27 17:17             ` H. Peter Anvin
2012-03-27  0:26         ` Denys Vlasenko
2012-03-27 12:08 ` [PATCH 0/1] " roma1390
2012-03-27 15:32   ` Denys Vlasenko
2012-03-27 15:42   ` Denys Vlasenko
2012-03-28  5:56     ` roma1390
2012-03-28 10:13       ` Denys Vlasenko
2012-03-28 10:24         ` roma1390
2012-03-28 10:33           ` Denys Vlasenko
2012-03-28 10:39             ` roma1390
2012-03-28 11:20               ` Denys Vlasenko
2012-03-29 10:35             ` Denys Vlasenko
2012-03-28 10:31         ` roma1390
2012-03-28 11:23           ` Denys Vlasenko
2012-03-29  5:23             ` roma1390
2012-03-29 10:33               ` Denys Vlasenko
2012-03-27 13:49 ` roma1390
2012-03-27 15:33   ` Denys Vlasenko
2012-03-29  5:16     ` roma1390
2012-03-29 10:33       ` Denys Vlasenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).