* [PATCH 0/1] vsprintf: optimize decimal conversion (again)
@ 2012-03-26 18:47 Denys Vlasenko
2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 18:47 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Douglas W Jones, Michal Nazarewicz
[-- Attachment #1: Type: text/plain, Size: 1842 bytes --]
Hi Andrew,
Can you take this patch into -mm?
Michal, Jones - can you review the code?
Some time ago, Michal Nazarewicz <mina86@mina86.com> optimized our
(already fast) decimal-to-string conversion even further.
Somehow this effort did not reach the kernel.
Here is a new iteration of his code.
Optimizations and patch follow in next email.
Please find test programs attached.
32-bit test programs were built using gcc 4.6.2
64-bit test programs were built using gcc 4.2.1
Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
Sizes:
org32.o: 2850 bytes
new32.o: 2858 bytes
org64.o: 2155 bytes
new64.o: 2283 bytes
Correctness: I tested the first and last 40 billion values in the [0, 2^64-1] range;
they all produced correct results.
Speed:
I measured how many thousands of conversions per second are done for several values
(it takes different amounts of time to convert, say, 123 and 2^64-1 to their
string representations).
Format of data below: VALUE:THOUSANDS_OF_CONVS_PER_SEC.
Intel Core i7 2.7GHz:
org32: 8:46852 123:39252 123456:23992 12345678:21992 123456789:21048 2^32-1:20424 2^64-1:10216
new32: 8:55300 123:43208 123456:34456 12345678:31272 123456789:23584 2^32-1:23568 2^64-1:16720
AMD Phenom II X4 2.4GHz:
org32: 8:29244 123:23988 123456:13792 12345678:12056 123456789:11368 2^32-1:10804 2^64-1:5224
new32: 8:38040 123:30356 123456:22832 12345678:20676 123456789:13556 2^32-1:13472 2^64-1:9228
org64: 8:38664 123:29256 123456:19188 12345678:16320 123456789:15380 2^32-1:14896 2^64-1:7864
new64: 8:42664 123:31660 123456:21632 12345678:19220 123456789:20880 2^32-1:17580 2^64-1:9596
Summary: in all cases the new code is faster than the old one, in many cases by 30%,
in a few cases by more than 50% (for example, on x86-32, conversion of num=12345678).
Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
--
vda
[-- Attachment #2: test_dec_conversion.tar.gz --]
[-- Type: application/x-tgz, Size: 5949 bytes --]
^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
@ 2012-03-26 18:51 ` Denys Vlasenko
2012-03-26 19:51 ` Andrew Morton
2012-03-27 12:08 ` [PATCH 0/1] " roma1390
2012-03-27 13:49 ` roma1390
2 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 18:51 UTC (permalink / raw)
To: linux-kernel; +Cc: Andrew Morton, Douglas W Jones, Michal Nazarewicz
commit 01a2904d31d2373886f489429ec662c9be64a6ab
Author: Denys Vlasenko <vda.linux@googlemail.com>
Date: Mon Mar 26 20:40:53 2012 +0200
vsprintf: optimize decimal conversion (again)
Previous code was using optimizations which were developed
to work well even on narrow-word CPUs (by today's standards).
But Linux runs only on 32-bit and wider CPUs. We can use that.
First: using 32x32->64 multiply and trivial 32-bit shift,
we can correctly divide by 10 much larger numbers, and thus
we can print groups of 9 digits instead of groups of 5 digits.
Next: there are two algorithms to print larger numbers.
One is generic: divide by 1000000000 and repeatedly print
groups of (up to) 9 digits. It's conceptually simple,
but requires an (unsigned long long) / 1000000000 division.
Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
manipulates them cleverly and generates groups of 4 decimal digits.
It so happens that it does NOT require long long division.
If long is > 32 bits, division of 64-bit values is relatively easy,
and we will use the first algorithm.
If long long is > 64 bits (strange architecture with VERY large long long),
second algorithm can't be used, and we again use the first one.
Else (if long is 32 bits and long long is 64 bits) we use second one.
And third: there is a simple optimization which takes fast path
not only for zero as was done before, but for all one-digit numbers.
In all tested cases the new code is faster than the old one, in many cases by 30%,
in a few cases by more than 50% (for example, on x86-32, conversion of 12345678).
Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index abbabec..0a2044c 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -112,106 +112,190 @@ int skip_atoi(const char **s)
/* Decimal conversion is by far the most typical, and is used
* for /proc and /sys data. This directly impacts e.g. top performance
* with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
+ * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+ * (with permission from the author, Douglas W. Jones).
+ */
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
+/* Formats correctly any integer in [0, 999999999] */
static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full9(char *buf, unsigned q)
{
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
-
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0'; /* least significant digit */
- d1 = q + 9*d3 + 5*d2 + d1;
- if (d1 != 0) {
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0'; /* next digit */
-
- d2 = q + 2*d2;
- if ((d2 != 0) || (d3 != 0)) {
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0'; /* next digit */
-
- d3 = q + 4*d3;
- if (d3 != 0) {
- q = (d3 * 0xcd) >> 11;
- d3 = d3 - 10*q;
- *buf++ = d3 + '0'; /* next digit */
- if (q != 0)
- *buf++ = q + '0'; /* most sign. digit */
- }
- }
- }
-
+ unsigned r;
+
+ /* Possible ways to approx. divide by 10
+ * (x * 0x1999999a) >> 32 x < 1073741829 (multiply must be 64-bit)
+ * (x * 0xcccd) >> 19 x < 81920 (x < 262149 when 64-bit mul)
+ * (x * 0x6667) >> 18 x < 43699
+ * (x * 0x3334) >> 17 x < 16389
+ * (x * 0x199a) >> 16 x < 16389
+ * (x * 0x0ccd) >> 15 x < 16389
+ * (x * 0x0667) >> 14 x < 2739
+ * (x * 0x0334) >> 13 x < 1029
+ * (x * 0x019a) >> 12 x < 1029
+ * (x * 0x00cd) >> 11 x < 1029 shorter code than * 0x67 (on i386)
+ * (x * 0x0067) >> 10 x < 179
+ * (x * 0x0034) >> 9 x < 69 same
+ * (x * 0x001a) >> 8 x < 69 same
+ * (x * 0x000d) >> 7 x < 69 same, shortest code (on i386)
+ * (x * 0x0007) >> 6 x < 19
+ * See <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+ */
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 1 */
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 2 */
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 3 */
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 4 */
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 5 */
+ /* Now value is under 10000, can avoid 64-bit multiply */
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0'; /* 6 */
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0'; /* 7 */
+ q = (r * 0xcd) >> 11;
+ *buf++ = (r - 10 * q) + '0'; /* 8 */
+ *buf++ = q + '0'; /* 9 */
return buf;
}
-/* Same with if's removed. Always emits five digits */
+#endif
+
+/* Similar to above but do not pad with zeros.
+ * Code can be easily arranged to print 9 digits too, but our callers
+ * always call put_dec_full9() instead when the number has 9 decimal digits.
+ */
static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_trunc8(char *buf, unsigned r)
{
- /* BTW, if q is in [0,9999], 8-bit ints will be enough, */
- /* but anyway, gcc produces better code with full-sized ints */
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
+ unsigned q;
+
+ /* Copy of previous function's body with added early returns */
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 2 */
+ if (q == 0) return buf;
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 3 */
+ if (r == 0) return buf;
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 4 */
+ if (q == 0) return buf;
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 5 */
+ if (r == 0) return buf;
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0'; /* 6 */
+ if (q == 0) return buf;
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0'; /* 7 */
+ if (r == 0) return buf;
+ q = (r * 0xcd) >> 11;
+ *buf++ = (r - 10 * q) + '0'; /* 8 */
+ if (q == 0) return buf;
+ *buf++ = q + '0'; /* 9 */
+ return buf;
+}
- /*
- * Possible ways to approx. divide by 10
- * gcc -O2 replaces multiply with shifts and adds
- * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
- * (x * 0x67) >> 10: 1100111
- * (x * 0x34) >> 9: 110100 - same
- * (x * 0x1a) >> 8: 11010 - same
- * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386)
- */
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0';
- d1 = q + 9*d3 + 5*d2 + d1;
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0';
-
- d2 = q + 2*d2;
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0';
-
- d3 = q + 4*d3;
- q = (d3 * 0xcd) >> 11; /* - shorter code */
- /* q = (d3 * 0x67) >> 10; - would also work */
- d3 = d3 - 10*q;
- *buf++ = d3 + '0';
- *buf++ = q + '0';
+/* There are two algorithms to print larger numbers.
+ * One is generic: divide by 1000000000 and repeatedly print
+ * groups of (up to) 9 digits. It's conceptually simple,
+ * but requires a (unsigned long long) / 1000000000 division.
+ *
+ * Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
+ * manipulates them cleverly and generates groups of 4 decimal digits.
+ * It so happens that it does NOT require long long division.
+ *
+ * If long is > 32 bits, division of 64-bit values is relatively easy,
+ * and we will use the first algorithm.
+ * If long long is > 64 bits (strange architecture with VERY large long long),
+ * second algorithm can't be used, and we again use the first one.
+ *
+ * Else (if long is 32 bits and long long is 64 bits) we use second one.
+ */
- return buf;
+#if BITS_PER_LONG != 32 || ((~0ULL)>>1) != ((1ULL<<63)-1)
+
+/* First algorithm: generic */
+
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+ if (n >= 100*1000*1000) {
+ while (n >= 1000*1000*1000)
+ buf = put_dec_full9(buf, do_div(n, 1000*1000*1000));
+ if (n >= 100*1000*1000)
+ return put_dec_full9(buf, n);
+ }
+ return put_dec_trunc8(buf, n);
}
-/* No inlining helps gcc to use registers better */
+
+#else
+
+/* Second algorithm: valid only for 32-bit longs, 64-bit long longs */
+
static noinline_for_stack
-char *put_dec(char *buf, unsigned long long num)
+char *put_dec_full4(char *buf, unsigned q)
{
- while (1) {
- unsigned rem;
- if (num < 100000)
- return put_dec_trunc(buf, num);
- rem = do_div(num, 100000);
- buf = put_dec_full(buf, rem);
- }
+ unsigned r;
+ r = (q * 0xcccd) >> 19;
+ *buf++ = (q - 10 * r) + '0';
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0';
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0';
+ *buf++ = r + '0';
+ return buf;
+}
+
+/* Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>
+ * (with permission from the author).
+ * Performs no 64-bit division and hence should be fast on 32-bit machines.
+ */
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+ uint32_t d3, d2, d1, q, h;
+
+ if (n < 100*1000*1000)
+ return put_dec_trunc8(buf, n);
+
+ d1 = ((uint32_t)n >> 16); /* implicit "& 0xffff" */
+ h = (n >> 32);
+ d2 = (h ) & 0xffff;
+ d3 = (h >> 16); /* implicit "& 0xffff" */
+
+ q = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
+
+ buf = put_dec_full4(buf, q % 10000);
+ q = q / 10000;
+
+ d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+ buf = put_dec_full4(buf, d1 % 10000);
+ q = d1 / 10000;
+
+ d2 = q + 4749 * d3 + 42 * d2;
+ buf = put_dec_full4(buf, d2 % 10000);
+ q = d2 / 10000;
+
+ d3 = q + 281 * d3;
+ if (!d3)
+ goto done;
+ buf = put_dec_full4(buf, d3 % 10000);
+ q = d3 / 10000;
+ if (!q)
+ goto done;
+ buf = put_dec_full4(buf, q);
+ done:
+ while (buf[-1] == '0')
+ --buf;
+
+ return buf;
}
+#endif
/*
* Convert passed number to decimal string.
* Returns the length of string. On buffer overflow, returns 0.
@@ -220,7 +304,7 @@ char *put_dec(char *buf, unsigned long long num)
*/
int num_to_str(char *buf, int size, unsigned long long num)
{
- char tmp[21]; /* Enough for 2^64 in decimal */
+ char tmp[sizeof(num) * 3];
int idx, len;
len = put_dec(tmp, num) - tmp;
@@ -229,7 +313,7 @@ int num_to_str(char *buf, int size, unsigned long long num)
return 0;
for (idx = 0; idx < len; ++idx)
buf[idx] = tmp[len - idx - 1];
- return len;
+ return len;
}
#define ZEROPAD 1 /* pad with zero */
@@ -312,8 +396,8 @@ char *number(char *buf, char *end, unsigned long long num,
/* generate full string in tmp[], in reverse order */
i = 0;
- if (num == 0)
- tmp[i++] = '0';
+ if (num < spec.base)
+ tmp[i++] = digits[num] | locase;
/* Generic code, for any base:
else do {
tmp[i++] = (digits[do_div(num,base)] | locase);
@@ -607,7 +691,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
}
for (i = 0; i < 4; i++) {
char temp[3]; /* hold each IP quad in reverse order */
- int digits = put_dec_trunc(temp, addr[index]) - temp;
+ int digits = put_dec_trunc8(temp, addr[index]) - temp;
if (leading_zeros) {
if (digits < 3)
*p++ = '0';
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
@ 2012-03-26 19:51 ` Andrew Morton
2012-03-26 19:56 ` Denys Vlasenko
0 siblings, 1 reply; 31+ messages in thread
From: Andrew Morton @ 2012-03-26 19:51 UTC (permalink / raw)
To: Denys Vlasenko; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz
On Mon, 26 Mar 2012 20:51:24 +0200
Denys Vlasenko <vda.linux@googlemail.com> wrote:
> commit 01a2904d31d2373886f489429ec662c9be64a6ab
> Author: Denys Vlasenko <vda.linux@googlemail.com>
> Date: Mon Mar 26 20:40:53 2012 +0200
>
> vsprintf: optimize decimal conversion (again)
>
> Previous code was using optimizations which were developed
> to work well even on narrow-word CPUs (by today's standards).
> But Linux runs only on 32-bit and wider CPUs. We can use that.
>
> First: using 32x32->64 multiply and trivial 32-bit shift,
> we can correctly divide by 10 much larger numbers, and thus
> we can print groups of 9 digits instead of groups of 5 digits.
>
> Next: there are two algorithms to print larger numbers.
> One is generic: divide by 1000000000 and repeatedly print
> groups of (up to) 9 digits. It's conceptually simple,
> but requires an (unsigned long long) / 1000000000 division.
>
> Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
> manipulates them cleverly and generates groups of 4 decimal digits.
> It so happens that it does NOT require long long division.
>
> If long is > 32 bits, division of 64-bit values is relatively easy,
> and we will use the first algorithm.
> If long long is > 64 bits (strange architecture with VERY large long long),
> second algorithm can't be used, and we again use the first one.
>
> Else (if long is 32 bits and long long is 64 bits) we use second one.
>
> And third: there is a simple optimization which takes fast path
> not only for zero as was done before, but for all one-digit numbers.
>
> In all tested cases new code is faster than old one, in many cases by 30%,
> in few cases by more than 50% (for example, on x86-32, conversion of 12345678).
> Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
>
This patch is so nutty that I like it.
> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
What's this for?
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 19:51 ` Andrew Morton
@ 2012-03-26 19:56 ` Denys Vlasenko
2012-03-26 20:13 ` Andrew Morton
0 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 19:56 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz
On Mon, Mar 26, 2012 at 9:51 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Mon, 26 Mar 2012 20:51:24 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>> commit 01a2904d31d2373886f489429ec662c9be64a6ab
>> Author: Denys Vlasenko <vda.linux@googlemail.com>
>> Date: Mon Mar 26 20:40:53 2012 +0200
>>
>> vsprintf: optimize decimal conversion (again)
>>
>> Previous code was using optimizations which were developed
>> to work well even on narrow-word CPUs (by today's standards).
>> But Linux runs only on 32-bit and wider CPUs. We can use that.
>>
>> First: using 32x32->64 multiply and trivial 32-bit shift,
>> we can correctly divide by 10 much larger numbers, and thus
>> we can print groups of 9 digits instead of groups of 5 digits.
>>
>> Next: there are two algorithms to print larger numbers.
>> One is generic: divide by 1000000000 and repeatedly print
>> groups of (up to) 9 digits. It's conceptually simple,
>> but requires an (unsigned long long) / 1000000000 division.
>>
>> Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
>> manipulates them cleverly and generates groups of 4 decimal digits.
>> It so happens that it does NOT require long long division.
>>
>> If long is > 32 bits, division of 64-bit values is relatively easy,
>> and we will use the first algorithm.
>> If long long is > 64 bits (strange architecture with VERY large long long),
>> second algorithm can't be used, and we again use the first one.
>>
>> Else (if long is 32 bits and long long is 64 bits) we use second one.
>>
>> And third: there is a simple optimization which takes fast path
>> not only for zero as was done before, but for all one-digit numbers.
>>
>> In all tested cases new code is faster than old one, in many cases by 30%,
>> in few cases by more than 50% (for example, on x86-32, conversion of 12345678).
>> Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
>>
>
> This patch is so nutty that I like it.
>
>> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>
> What's this for?
The second check should be just BITS_PER_LONG_LONG != 64,
but we don't have BITS_PER_LONG_LONG.
--
vda
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 19:56 ` Denys Vlasenko
@ 2012-03-26 20:13 ` Andrew Morton
2012-03-26 20:18 ` Geert Uytterhoeven
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Andrew Morton @ 2012-03-26 20:13 UTC (permalink / raw)
To: Denys Vlasenko; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz
On Mon, 26 Mar 2012 21:56:38 +0200
Denys Vlasenko <vda.linux@googlemail.com> wrote:
> >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
> >
> > What's this for?
>
> The second check should be just BITS_PER_LONG_LONG != 64,
> but we don't have BITS_PER_LONG_LONG.
So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 20:13 ` Andrew Morton
@ 2012-03-26 20:18 ` Geert Uytterhoeven
2012-03-26 23:18 ` Denys Vlasenko
2012-03-26 20:20 ` H. Peter Anvin
2012-03-27 0:26 ` Denys Vlasenko
2 siblings, 1 reply; 31+ messages in thread
From: Geert Uytterhoeven @ 2012-03-26 20:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Denys Vlasenko, linux-kernel, Douglas W Jones, Michal Nazarewicz
On Mon, Mar 26, 2012 at 22:13, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Mon, 26 Mar 2012 21:56:38 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>> >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>> >
>> > What's this for?
>>
>> The second check should be just BITS_PER_LONG_LONG != 64,
>> but we don't have BITS_PER_LONG_LONG.
>
> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
I don't think Linux runs on anything with BITS_PER_LONG_LONG != 64...
BTW, what about CPUs with slow 32x32 multiplication and/or slow 64-bit
division?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 20:13 ` Andrew Morton
2012-03-26 20:18 ` Geert Uytterhoeven
@ 2012-03-26 20:20 ` H. Peter Anvin
2012-03-27 17:12 ` Michal Nazarewicz
2012-03-27 0:26 ` Denys Vlasenko
2 siblings, 1 reply; 31+ messages in thread
From: H. Peter Anvin @ 2012-03-26 20:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Denys Vlasenko, linux-kernel, Douglas W Jones, Michal Nazarewicz
On 03/26/2012 01:13 PM, Andrew Morton wrote:
> On Mon, 26 Mar 2012 21:56:38 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>>>> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>>>
>>> What's this for?
>>
>> The second check should be just BITS_PER_LONG_LONG != 64,
>> but we don't have BITS_PER_LONG_LONG.
>
> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
First of all, the #if is wrong: the preprocessor doesn't support data
types and does all arithmetic at (u)intmax_t precision.
As far as BITS_PER_LONG_LONG, there are tons of places in the kernel
which already require that long long is exactly 64 bits. That may or
may not be a good thing, but for right now one could simply:
#define BITS_PER_LONG_LONG 64
-hpa
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 20:18 ` Geert Uytterhoeven
@ 2012-03-26 23:18 ` Denys Vlasenko
2012-03-27 0:30 ` Denys Vlasenko
2012-03-27 3:49 ` H. Peter Anvin
0 siblings, 2 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-26 23:18 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Andrew Morton, linux-kernel, Douglas W Jones, Michal Nazarewicz
On Monday 26 March 2012 22:18, Geert Uytterhoeven wrote:
> On Mon, Mar 26, 2012 at 22:13, Andrew Morton <akpm@linux-foundation.org> wrote:
> > On Mon, 26 Mar 2012 21:56:38 +0200
> > Denys Vlasenko <vda.linux@googlemail.com> wrote:
> >
> >> >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
> >> >
> >> > What's this for?
> >>
> >> The second check should be just BITS_PER_LONG_LONG != 64,
> >> but we don't have BITS_PER_LONG_LONG.
> >
> > So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
>
> I don't think Linux runs on anything with BITS_PER_LONG_LONG != 64...
>
> BTW, what about CPUs with slow 32x32 multiplication and/or slow 64-bit
> division?
Without a 32x32->64 multiply, the best we can generate is 4 decimal digits:
we produce the next digit by approximating x/10 with (x * 0xcccd) >> 19,
and the first x where it gives a wrong result is 81920 if the multiply result
is truncated to 32 bits.
With it, we can generate 9 digits using (x * 0x1999999a) >> 32.
Regarding "slow 64-bit division" - after this patch, 32-bit machines
wouldn't use it at all. Only 64-bit machines will perform 64-bit
division, one per 9 decimal digits (thus, at most three divisions
per one long_long->string conversion).
In fact, with a small change to the #ifdefs, all machines with long long <= 64
bits can use the division-less routine. It might be a good thing to try...
Anyone with ARM hardware in hand interested in running the test
program I sent in the first email?
--
vda
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 20:13 ` Andrew Morton
2012-03-26 20:18 ` Geert Uytterhoeven
2012-03-26 20:20 ` H. Peter Anvin
@ 2012-03-27 0:26 ` Denys Vlasenko
2 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 0:26 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-kernel, Douglas W Jones, Michal Nazarewicz
On Monday 26 March 2012 22:13, Andrew Morton wrote:
> On Mon, 26 Mar 2012 21:56:38 +0200
> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
> > >> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
> > >
> > > What's this for?
> >
> > The second check should be just BITS_PER_LONG_LONG != 64,
> > but we don't have BITS_PER_LONG_LONG.
>
> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
Sure!
Here is a new version. I also plugged a hole in num_to_str() -
it assumed it was safe to call put_dec() for num=0.
(We never tripped over it before because the single caller
of num_to_str() takes care of that case).
Signed-off-by and CC are also updated.
--
vda
commit 469bfce8d60d1069424aacc83fef50f9da5701bf
Author: Denys Vlasenko <vda.linux@googlemail.com>
Date: Tue Mar 27 02:18:46 2012 +0200
vsprintf: optimize decimal conversion (again)
This patch is based upon an original from Michal Nazarewicz.
Previous code was using optimizations which were developed
to work well even on narrow-word CPUs (by today's standards).
But Linux runs only on 32-bit and wider CPUs. We can use that.
First: using 32x32->64 multiply and trivial 32-bit shift,
we can correctly divide by 10 much larger numbers, and thus
we can print groups of 9 digits instead of groups of 5 digits.
Next: there are two algorithms to print larger numbers.
One is generic: divide by 1000000000 and repeatedly print
groups of (up to) 9 digits. It's conceptually simple,
but requires an (unsigned long long) / 1000000000 division.
Second algorithm splits 64-bit unsigned long long into 16-bit chunks,
manipulates them cleverly and generates groups of 4 decimal digits.
It so happens that it does NOT require long long division.
If long is > 32 bits, division of 64-bit values is relatively easy,
and we will use the first algorithm.
If long long is > 64 bits (strange architecture with VERY large long long),
second algorithm can't be used, and we again use the first one.
Else (if long is 32 bits and long long is 64 bits) we use second one.
And third: there is a simple optimization which takes fast path
not only for zero as was done before, but for all one-digit numbers.
In all tested cases the new code is faster than the old one, in many cases by 30%,
in a few cases by more than 50% (for example, on x86-32, conversion of 12345678).
Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Douglas W Jones <jones@cs.uiowa.edu>
diff --git a/include/asm-generic/bitsperlong.h b/include/asm-generic/bitsperlong.h
index 4ae54e0..a7b0914 100644
--- a/include/asm-generic/bitsperlong.h
+++ b/include/asm-generic/bitsperlong.h
@@ -28,5 +28,9 @@
#error Inconsistent word size. Check asm/bitsperlong.h
#endif
+#ifndef BITS_PER_LONG_LONG
+#define BITS_PER_LONG_LONG 64
+#endif
+
#endif /* __KERNEL__ */
#endif /* __ASM_GENERIC_BITS_PER_LONG */
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index abbabec..877a503 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -112,106 +112,191 @@ int skip_atoi(const char **s)
/* Decimal conversion is by far the most typical, and is used
* for /proc and /sys data. This directly impacts e.g. top performance
* with many processes running. We optimize it for speed
- * using code from
- * http://www.cs.uiowa.edu/~jones/bcd/decimal.html
- * (with permission from the author, Douglas W. Jones). */
+ * using ideas described at <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+ * (with permission from the author, Douglas W. Jones).
+ */
-/* Formats correctly any integer in [0,99999].
- * Outputs from one to five digits depending on input.
- * On i386 gcc 4.1.2 -O2: ~250 bytes of code. */
+#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
+/* Formats correctly any integer in [0, 999999999] */
static noinline_for_stack
-char *put_dec_trunc(char *buf, unsigned q)
+char *put_dec_full9(char *buf, unsigned q)
{
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
-
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0'; /* least significant digit */
- d1 = q + 9*d3 + 5*d2 + d1;
- if (d1 != 0) {
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0'; /* next digit */
-
- d2 = q + 2*d2;
- if ((d2 != 0) || (d3 != 0)) {
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0'; /* next digit */
-
- d3 = q + 4*d3;
- if (d3 != 0) {
- q = (d3 * 0xcd) >> 11;
- d3 = d3 - 10*q;
- *buf++ = d3 + '0'; /* next digit */
- if (q != 0)
- *buf++ = q + '0'; /* most sign. digit */
- }
- }
- }
-
+ unsigned r;
+
+ /* Possible ways to approx. divide by 10
+ * (x * 0x1999999a) >> 32 x < 1073741829 (multiply must be 64-bit)
+ * (x * 0xcccd) >> 19 x < 81920 (x < 262149 when 64-bit mul)
+ * (x * 0x6667) >> 18 x < 43699
+ * (x * 0x3334) >> 17 x < 16389
+ * (x * 0x199a) >> 16 x < 16389
+ * (x * 0x0ccd) >> 15 x < 16389
+ * (x * 0x0667) >> 14 x < 2739
+ * (x * 0x0334) >> 13 x < 1029
+ * (x * 0x019a) >> 12 x < 1029
+ * (x * 0x00cd) >> 11 x < 1029 shorter code than * 0x67 (on i386)
+ * (x * 0x0067) >> 10 x < 179
+ * (x * 0x0034) >> 9 x < 69 same
+ * (x * 0x001a) >> 8 x < 69 same
+ * (x * 0x000d) >> 7 x < 69 same, shortest code (on i386)
+ * (x * 0x0007) >> 6 x < 19
+ * See <http://www.cs.uiowa.edu/~jones/bcd/divide.html>
+ */
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 1 */
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 2 */
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 3 */
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 4 */
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 5 */
+ /* Now value is under 10000, can avoid 64-bit multiply */
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0'; /* 6 */
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0'; /* 7 */
+ q = (r * 0xcd) >> 11;
+ *buf++ = (r - 10 * q) + '0'; /* 8 */
+ *buf++ = q + '0'; /* 9 */
return buf;
}
-/* Same with if's removed. Always emits five digits */
+#endif
+
+/* Similar to above but do not pad with zeros.
+ * Code can be easily arranged to print 9 digits too, but our callers
+ * always call put_dec_full9() instead when the number has 9 decimal digits.
+ */
static noinline_for_stack
-char *put_dec_full(char *buf, unsigned q)
+char *put_dec_trunc8(char *buf, unsigned r)
{
- /* BTW, if q is in [0,9999], 8-bit ints will be enough, */
- /* but anyway, gcc produces better code with full-sized ints */
- unsigned d3, d2, d1, d0;
- d1 = (q>>4) & 0xf;
- d2 = (q>>8) & 0xf;
- d3 = (q>>12);
+ unsigned q;
+
+ /* Copy of previous function's body with added early returns */
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 2 */
+ if (q == 0) return buf;
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 3 */
+ if (r == 0) return buf;
+ q = (r * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (r - 10 * q) + '0'; /* 4 */
+ if (q == 0) return buf;
+ r = (q * (uint64_t)0x1999999a) >> 32;
+ *buf++ = (q - 10 * r) + '0'; /* 5 */
+ if (r == 0) return buf;
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0'; /* 6 */
+ if (q == 0) return buf;
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0'; /* 7 */
+ if (r == 0) return buf;
+ q = (r * 0xcd) >> 11;
+ *buf++ = (r - 10 * q) + '0'; /* 8 */
+ if (q == 0) return buf;
+ *buf++ = q + '0'; /* 9 */
+ return buf;
+}
- /*
- * Possible ways to approx. divide by 10
- * gcc -O2 replaces multiply with shifts and adds
- * (x * 0xcd) >> 11: 11001101 - shorter code than * 0x67 (on i386)
- * (x * 0x67) >> 10: 1100111
- * (x * 0x34) >> 9: 110100 - same
- * (x * 0x1a) >> 8: 11010 - same
- * (x * 0x0d) >> 7: 1101 - same, shortest code (on i386)
- */
- d0 = 6*(d3 + d2 + d1) + (q & 0xf);
- q = (d0 * 0xcd) >> 11;
- d0 = d0 - 10*q;
- *buf++ = d0 + '0';
- d1 = q + 9*d3 + 5*d2 + d1;
- q = (d1 * 0xcd) >> 11;
- d1 = d1 - 10*q;
- *buf++ = d1 + '0';
-
- d2 = q + 2*d2;
- q = (d2 * 0xd) >> 7;
- d2 = d2 - 10*q;
- *buf++ = d2 + '0';
-
- d3 = q + 4*d3;
- q = (d3 * 0xcd) >> 11; /* - shorter code */
- /* q = (d3 * 0x67) >> 10; - would also work */
- d3 = d3 - 10*q;
- *buf++ = d3 + '0';
- *buf++ = q + '0';
+/* There are two algorithms for printing larger numbers.
+ * The first is generic: divide by 1000000000 and repeatedly print
+ * groups of (up to) 9 digits. It's conceptually simple,
+ * but requires an (unsigned long long) / 1000000000 division.
+ *
+ * The second algorithm splits the 64-bit unsigned long long into 16-bit chunks,
+ * manipulates them cleverly, and generates groups of 4 decimal digits.
+ * It so happens that it does NOT require a long long division.
+ *
+ * If long is > 32 bits, division of 64-bit values is relatively easy,
+ * and we use the first algorithm.
+ * If long long is > 64 bits (a strange architecture with a VERY large
+ * long long), the second algorithm can't be used, and we again use the first.
+ *
+ * Otherwise (long is 32 bits and long long is 64 bits) we use the second one.
+ */
- return buf;
+#if BITS_PER_LONG != 32 || BITS_PER_LONG_LONG != 64
+
+/* First algorithm: generic */
+
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+ if (n >= 100*1000*1000) {
+ while (n >= 1000*1000*1000)
+ buf = put_dec_full9(buf, do_div(n, 1000*1000*1000));
+ if (n >= 100*1000*1000)
+ return put_dec_full9(buf, n);
+ }
+ return put_dec_trunc8(buf, n);
}
-/* No inlining helps gcc to use registers better */
+
+#else
+
+/* Second algorithm: valid only for 64-bit long longs */
+
static noinline_for_stack
-char *put_dec(char *buf, unsigned long long num)
+char *put_dec_full4(char *buf, unsigned q)
{
- while (1) {
- unsigned rem;
- if (num < 100000)
- return put_dec_trunc(buf, num);
- rem = do_div(num, 100000);
- buf = put_dec_full(buf, rem);
- }
+ unsigned r;
+ r = (q * 0xcccd) >> 19;
+ *buf++ = (q - 10 * r) + '0';
+ q = (r * 0x199a) >> 16;
+ *buf++ = (r - 10 * q) + '0';
+ r = (q * 0xcd) >> 11;
+ *buf++ = (q - 10 * r) + '0';
+ *buf++ = r + '0';
+ return buf;
}
+/* Based on code by Douglas W. Jones found at
+ * <http://www.cs.uiowa.edu/~jones/bcd/decimal.html#sixtyfour>
+ * (with permission from the author).
+ * Performs no 64-bit division and hence should be fast on 32-bit machines.
+ */
+static
+char *put_dec(char *buf, unsigned long long n)
+{
+ uint32_t d3, d2, d1, q, h;
+
+ if (n < 100*1000*1000)
+ return put_dec_trunc8(buf, n);
+
+ d1 = ((uint32_t)n >> 16); /* implicit "& 0xffff" */
+ h = (n >> 32);
+ d2 = (h ) & 0xffff;
+ d3 = (h >> 16); /* implicit "& 0xffff" */
+
+ q = 656 * d3 + 7296 * d2 + 5536 * d1 + ((uint32_t)n & 0xffff);
+
+ buf = put_dec_full4(buf, q % 10000);
+ q = q / 10000;
+
+ d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1;
+ buf = put_dec_full4(buf, d1 % 10000);
+ q = d1 / 10000;
+
+ d2 = q + 4749 * d3 + 42 * d2;
+ buf = put_dec_full4(buf, d2 % 10000);
+ q = d2 / 10000;
+
+ d3 = q + 281 * d3;
+ if (!d3)
+ goto done;
+ buf = put_dec_full4(buf, d3 % 10000);
+ q = d3 / 10000;
+ if (!q)
+ goto done;
+ buf = put_dec_full4(buf, q);
+ done:
+ while (buf[-1] == '0')
+ --buf;
+
+ return buf;
+}
+
+#endif
+
/*
* Convert passed number to decimal string.
* Returns the length of string. On buffer overflow, returns 0.
@@ -220,16 +305,22 @@ char *put_dec(char *buf, unsigned long long num)
*/
int num_to_str(char *buf, int size, unsigned long long num)
{
- char tmp[21]; /* Enough for 2^64 in decimal */
+ char tmp[sizeof(num) * 3];
int idx, len;
- len = put_dec(tmp, num) - tmp;
+ /* put_dec() may work incorrectly for num = 0 (generate "", not "0") */
+ if (num <= 9) {
+ tmp[0] = '0' + num;
+ len = 1;
+ } else {
+ len = put_dec(tmp, num) - tmp;
+ }
if (len > size)
return 0;
for (idx = 0; idx < len; ++idx)
buf[idx] = tmp[len - idx - 1];
- return len;
+ return len;
}
#define ZEROPAD 1 /* pad with zero */
@@ -312,8 +403,8 @@ char *number(char *buf, char *end, unsigned long long num,
/* generate full string in tmp[], in reverse order */
i = 0;
- if (num == 0)
- tmp[i++] = '0';
+ if (num < spec.base)
+ tmp[i++] = digits[num] | locase;
/* Generic code, for any base:
else do {
tmp[i++] = (digits[do_div(num,base)] | locase);
@@ -607,7 +698,7 @@ char *ip4_string(char *p, const u8 *addr, const char *fmt)
}
for (i = 0; i < 4; i++) {
char temp[3]; /* hold each IP quad in reverse order */
- int digits = put_dec_trunc(temp, addr[index]) - temp;
+ int digits = put_dec_trunc8(temp, addr[index]) - temp;
if (leading_zeros) {
if (digits < 3)
*p++ = '0';
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 23:18 ` Denys Vlasenko
@ 2012-03-27 0:30 ` Denys Vlasenko
2012-03-27 3:49 ` H. Peter Anvin
1 sibling, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 0:30 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: Andrew Morton, linux-kernel, Douglas W Jones, Michal Nazarewicz
On Tuesday 27 March 2012 01:18, Denys Vlasenko wrote:
> > I don't think Linux runs on anything with BITS_PER_LONG_LONG != 64...
> >
> > BTW, what about CPUs with slow 32x32 multiplication and/or slow 64-bit
> > division?
>
> Without 32x32->64 multiply, the best we can generate is 4 decimal digits:
> we produce the next digit by approximating x/10 with (x * 0xcccd) >> 19,
> and the first x where this gives a wrong result is 81920 if the multiply
> result is truncated to 32 bits.
> With it, we can generate 9 digits using (x * 0x1999999a) >> 32.
>
> Regarding "slow 64-bit division" - after this patch, 32-bit machines
> wouldn't use it at all. Only 64-bit machines will perform 64-bit
> division, one per 9 decimal digits (thus, at most three divisions
> per one long_long->string conversion).
>
> In fact, with small change to #ifdefs, all machines with long long <= 64
> bits can use division-less routine. It might be a good thing to try...
Well, apparently it's not a good idea for my Phenom II in 64-bit mode:
It's bigger:
text data bss dec hex filename
2395 448 0 2843 b1b test_div64.o
2507 448 0 2955 b8b test_nodiv.o
And slower:
Conversions per second:
div64: 8:42660000 123:31472000 123456:21748000 12345678:19140000 123456789:20980000 2^32:16948000 2^64:9480000
nodiv: 8:40532000 123:30276000 123456:21172000 12345678:18672000 123456789:13440000 2^32:13440000 2^64:8992000
--
vda
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 23:18 ` Denys Vlasenko
2012-03-27 0:30 ` Denys Vlasenko
@ 2012-03-27 3:49 ` H. Peter Anvin
1 sibling, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2012-03-27 3:49 UTC (permalink / raw)
To: Denys Vlasenko
Cc: Geert Uytterhoeven, Andrew Morton, linux-kernel, Douglas W Jones,
Michal Nazarewicz
On 03/26/2012 04:18 PM, Denys Vlasenko wrote:
> Only 64-bit machines will perform 64-bit
> division, one per 9 decimal digits (thus, at most three divisions
> per one long_long->string conversion).
FWIW, gcc supports 128-bit integers on 64-bit platforms using the
__int128 keyword.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
@ 2012-03-27 12:08 ` roma1390
2012-03-27 15:32 ` Denys Vlasenko
2012-03-27 15:42 ` Denys Vlasenko
2012-03-27 13:49 ` roma1390
2 siblings, 2 replies; 31+ messages in thread
From: roma1390 @ 2012-03-27 12:08 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On 2012.03.26 21:47, Denys Vlasenko wrote:
> Please find test programs attached.
>
> 32-bit test programs were built using gcc 4.6.2
> 64-bit test programs were built using gcc 4.2.1
> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
Can't compile reference for arm:
$ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
test_org.c: In function ‘put_dec’:
test_org.c:101: error: impossible constraint in ‘asm’
test_org.c:101: error: impossible constraint in ‘asm’
test_org.c:101: error: impossible constraint in ‘asm’
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
2012-03-27 12:08 ` [PATCH 0/1] " roma1390
@ 2012-03-27 13:49 ` roma1390
2012-03-27 15:33 ` Denys Vlasenko
2 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-27 13:49 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
Hi Denys,
Can't compare speed to base, but I tested this test_new on
2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
./test_new
Conversions per second: 8:5528000 123:4568000 123456:3568000 12345678:3392000
123456789:1168000 2^32:976000 2^64:532000
Conversions per second: 8:5524000 123:4568000 123456:3680000 12345678:3408000
123456789:1132000 2^32:972000 2^64:532000
Conversions per second: 8:5028000 123:4416000 123456:3688000 12345678:3396000
123456789:1168000 2^32:976000 2^64:512000
Conversions per second: 8:5524000 123:4572000 123456:3684000 12345678:3288000
123456789:1168000 2^32:972000 2^64:532000
Tested 900988928 ^Z
Tested-by: roma1390 <roma1390@gmail.com>
roma1390
On 2012.03.26 21:47, Denys Vlasenko wrote:
> Hi Andrew,
>
> Can you take this patch into -mm?
>
> Michal, Jones - can you review the code?
>
> Sometime ago, Michal Nazarewicz<mina86@mina86.com> optimized our
> (already fast) decimal-to-string conversion even further.
> Somehow this effort did not reach the kernel.
>
> Here is a new iteration of his code.
>
> Optimizations and patch follow in next email.
>
> Please find test programs attached.
>
> 32-bit test programs were built using gcc 4.6.2
> 64-bit test programs were built using gcc 4.2.1
> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>
> Sizes:
> org32.o: 2850 bytes
> new32.o: 2858 bytes
> org64.o: 2155 bytes
> new64.o: 2283 bytes
>
> Correctness: I tested first and last 40 billion values from [0, 2^64-1] range,
> they all produced correct result.
>
> Speed:
> I measured how many thousands of conversions per second are done, for several values
> (it takes different amount of time to convert, say, 123 and 2^64-1 to their
> string representations).
> Format of data below: VALUE:THOUSANDS_OF_CONVS_PER_SEC.
>
> Intel Core i7 2.7GHz:
> org32: 8:46852 123:39252 123456:23992 12345678:21992 123456789:21048 2^32-1:20424 2^64-1:10216
> new32: 8:55300 123:43208 123456:34456 12345678:31272 123456789:23584 2^32-1:23568 2^64-1:16720
>
> AMD Phenom II X4 2.4GHz:
> org32: 8:29244 123:23988 123456:13792 12345678:12056 123456789:11368 2^32-1:10804 2^64-1:5224
> new32: 8:38040 123:30356 123456:22832 12345678:20676 123456789:13556 2^32-1:13472 2^64-1:9228
>
> org64: 8:38664 123:29256 123456:19188 12345678:16320 123456789:15380 2^32-1:14896 2^64-1:7864
> new64: 8:42664 123:31660 123456:21632 12345678:19220 123456789:20880 2^32-1:17580 2^64-1:9596
>
> Summary: in all cases new code is faster than old one, in many cases by 30%,
> in few cases by more than 50% (for example, on x86-32, conversion of num=12345678).
> Code growth is ~0 in 32-bit case and ~130 bytes in 64-bit case.
>
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-27 12:08 ` [PATCH 0/1] " roma1390
@ 2012-03-27 15:32 ` Denys Vlasenko
2012-03-27 15:42 ` Denys Vlasenko
1 sibling, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 15:32 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Tue, Mar 27, 2012 at 2:08 PM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.26 21:47, Denys Vlasenko wrote:
>>
>> Please find test programs attached.
>>
>> 32-bit test programs were built using gcc 4.6.2
>> 64-bit test programs were built using gcc 4.2.1
>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>
> Can't compile reference for arm:
> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
> test_org.c: In function ‘put_dec’:
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’
The problem is that test_header.c contains the do_div() macro
copied from the x86 arch.
Can you replace it with the one from the arm arch?
You can find it in arch/arm/include/asm/div64.h;
the __asmeq macro is in arch/arm/include/asm/system.h.
--
vda
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-27 13:49 ` roma1390
@ 2012-03-27 15:33 ` Denys Vlasenko
2012-03-29 5:16 ` roma1390
0 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 15:33 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Tue, Mar 27, 2012 at 3:49 PM, roma1390 <roma1390@gmail.com> wrote:
> Hi Denys,
>
> Can't compare speed to base, but I tested this test_new on
> 2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
> ./test_new
> Conversions per second: 8:5528000 123:4568000 123456:3568000
> 12345678:3392000 123456789:1168000 2^32:976000 2^64:532000
> Conversions per second: 8:5524000 123:4568000 123456:3680000
> 12345678:3408000 123456789:1132000 2^32:972000 2^64:532000
> Conversions per second: 8:5028000 123:4416000 123456:3688000
> 12345678:3396000 123456789:1168000 2^32:976000 2^64:512000
> Conversions per second: 8:5524000 123:4572000 123456:3684000
> 12345678:3288000 123456789:1168000 2^32:972000 2^64:532000
> Tested 900988928 ^Z
Thanks, but without a baseline it's impossible to see whether
we have an improvement...
Can you also let it run longer, to test at least the first
5 billion numbers (which is > 2^32)?
--
vda
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-27 12:08 ` [PATCH 0/1] " roma1390
2012-03-27 15:32 ` Denys Vlasenko
@ 2012-03-27 15:42 ` Denys Vlasenko
2012-03-28 5:56 ` roma1390
1 sibling, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-27 15:42 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Tue, Mar 27, 2012 at 2:08 PM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.26 21:47, Denys Vlasenko wrote:
>>
>> Please find test programs attached.
>>
>> 32-bit test programs were built using gcc 4.6.2
>> 64-bit test programs were built using gcc 4.2.1
>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>
> Can't compile reference for arm:
> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
> test_org.c: In function ‘put_dec’:
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’
> test_org.c:101: error: impossible constraint in ‘asm’
Please find a modified test_header.c attached.
I tested and it builds in my arm emulator.
--
vda
[-- Attachment #2: test_header.c --]
[-- Type: text/x-csrc, Size: 8139 bytes --]
#include <features.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdbool.h>
#include <sys/types.h> /* size_t */
#include <limits.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdarg.h>
#include <stddef.h>
#include <ctype.h>
#include <string.h>
#include <errno.h>
#include <stdio.h> /* for FILE */
#include <sys/time.h>
#include <time.h>
#define noinline_for_stack __attribute__((noinline))
typedef uint8_t u8;
typedef int16_t s16;
/*
* The semantics of do_div() are:
*
* uint32_t do_div(uint64_t *n, uint32_t base)
* {
* uint32_t remainder = *n % base;
* *n = *n / base;
* return remainder;
* }
*
* In other words, a 64-bit dividend with a 32-bit divisor producing
* a 64-bit result and a 32-bit remainder. To accomplish this optimally
* we call a special __do_div64 helper with a completely non-standard
* calling convention for arguments and results (beware).
*/
#define __asmeq(x, y) ".ifnc " x "," y " ; .err ; .endif\n\t"
#define __LINUX_ARM_ARCH__ 4
#ifdef __ARMEB__
#define __xh "r0"
#define __xl "r1"
#else
#define __xl "r0"
#define __xh "r1"
#endif
#define __do_div_asm(n, base) \
({ \
register unsigned int __base asm("r4") = base; \
register unsigned long long __n asm("r0") = n; \
register unsigned long long __res asm("r2"); \
register unsigned int __rem asm(__xh); \
asm( __asmeq("%0", __xh) \
__asmeq("%1", "r2") \
__asmeq("%2", "r0") \
__asmeq("%3", "r4") \
"bl __do_div64" \
: "=r" (__rem), "=r" (__res) \
: "r" (__n), "r" (__base) \
: "ip", "lr", "cc"); \
n = __res; \
__rem; \
})
#if __GNUC__ < 4
/*
* gcc versions earlier than 4.0 are simply too problematic for the
* optimized implementation below. First, there is gcc PR 15089, which
* tends to trigger on more complex constructs: spurious .global __udivsi3
* directives are inserted even if none of those symbols are referenced in the
* generated code. And those gcc versions are not able to do constant
* propagation on long long values anyway.
*/
#define do_div(n, base) __do_div_asm(n, base)
#elif __GNUC__ >= 4
/*
* If the divisor happens to be constant, we determine the appropriate
* inverse at compile time to turn the division into a few inline
* multiplications instead which is much faster. And yet only if compiling
* for ARMv4 or higher (we need umull/umlal) and if the gcc version is
* sufficiently recent to perform proper long long constant propagation.
* (It is unfortunate that gcc doesn't perform all this internally.)
*/
#define do_div(n, base) \
({ \
unsigned int __r, __b = (base); \
if (!__builtin_constant_p(__b) || __b == 0 || \
(__LINUX_ARM_ARCH__ < 4 && (__b & (__b - 1)) != 0)) { \
/* non-constant divisor (or zero): slow path */ \
__r = __do_div_asm(n, __b); \
} else if ((__b & (__b - 1)) == 0) { \
/* Trivial: __b is constant and a power of 2 */ \
/* gcc does the right thing with this code. */ \
__r = n; \
__r &= (__b - 1); \
n /= __b; \
} else { \
/* Multiply by inverse of __b: n/b = n*(p/b)/p */ \
/* We rely on the fact that most of this code gets */ \
/* optimized away at compile time due to constant */ \
/* propagation and only a couple inline assembly */ \
/* instructions should remain. Better avoid any */ \
/* code construct that might prevent that. */ \
unsigned long long __res, __x, __t, __m, __n = n; \
unsigned int __c, __p, __z = 0; \
/* preserve low part of n for remainder computation */ \
__r = __n; \
/* determine number of bits to represent __b */ \
__p = 1 << __div64_fls(__b); \
/* compute __m = ((__p << 64) + __b - 1) / __b */ \
__m = (~0ULL / __b) * __p; \
__m += (((~0ULL % __b + 1) * __p) + __b - 1) / __b; \
/* compute __res = __m*(~0ULL/__b*__b-1)/(__p << 64) */ \
__x = ~0ULL / __b * __b - 1; \
__res = (__m & 0xffffffff) * (__x & 0xffffffff); \
__res >>= 32; \
__res += (__m & 0xffffffff) * (__x >> 32); \
__t = __res; \
__res += (__x & 0xffffffff) * (__m >> 32); \
__t = (__res < __t) ? (1ULL << 32) : 0; \
__res = (__res >> 32) + __t; \
__res += (__m >> 32) * (__x >> 32); \
__res /= __p; \
/* Now sanitize and optimize what we've got. */ \
if (~0ULL % (__b / (__b & -__b)) == 0) { \
/* those cases can be simplified with: */ \
__n /= (__b & -__b); \
__m = ~0ULL / (__b / (__b & -__b)); \
__p = 1; \
__c = 1; \
} else if (__res != __x / __b) { \
/* We can't get away without a correction */ \
/* to compensate for bit truncation errors. */ \
/* To avoid it we'd need an additional bit */ \
/* to represent __m which would overflow it. */ \
/* Instead we do m=p/b and n/b=(n*m+m)/p. */ \
__c = 1; \
/* Compute __m = (__p << 64) / __b */ \
__m = (~0ULL / __b) * __p; \
__m += ((~0ULL % __b + 1) * __p) / __b; \
} else { \
/* Reduce __m/__p, and try to clear bit 31 */ \
/* of __m when possible otherwise that'll */ \
/* need extra overflow handling later. */ \
unsigned int __bits = -(__m & -__m); \
__bits |= __m >> 32; \
__bits = (~__bits) << 1; \
/* If __bits == 0 then setting bit 31 is */ \
/* unavoidable. Simply apply the maximum */ \
/* possible reduction in that case. */ \
/* Otherwise the MSB of __bits indicates the */ \
/* best reduction we should apply. */ \
if (!__bits) { \
__p /= (__m & -__m); \
__m /= (__m & -__m); \
} else { \
__p >>= __div64_fls(__bits); \
__m >>= __div64_fls(__bits); \
} \
/* No correction needed. */ \
__c = 0; \
} \
/* Now we have a combination of 2 conditions: */ \
/* 1) whether or not we need a correction (__c), and */ \
/* 2) whether or not there might be an overflow in */ \
/* the cross product (__m & ((1<<63) | (1<<31))) */ \
/* Select the best insn combination to perform the */ \
/* actual __m * __n / (__p << 64) operation. */ \
if (!__c) { \
asm ( "umull %Q0, %R0, %1, %Q2\n\t" \
"mov %Q0, #0" \
: "=&r" (__res) \
: "r" (__m), "r" (__n) \
: "cc" ); \
} else if (!(__m & ((1ULL << 63) | (1ULL << 31)))) { \
__res = __m; \
asm ( "umlal %Q0, %R0, %Q1, %Q2\n\t" \
"mov %Q0, #0" \
: "+&r" (__res) \
: "r" (__m), "r" (__n) \
: "cc" ); \
} else { \
asm ( "umull %Q0, %R0, %Q1, %Q2\n\t" \
"cmn %Q0, %Q1\n\t" \
"adcs %R0, %R0, %R1\n\t" \
"adc %Q0, %3, #0" \
: "=&r" (__res) \
: "r" (__m), "r" (__n), "r" (__z) \
: "cc" ); \
} \
if (!(__m & ((1ULL << 63) | (1ULL << 31)))) { \
asm ( "umlal %R0, %Q0, %R1, %Q2\n\t" \
"umlal %R0, %Q0, %Q1, %R2\n\t" \
"mov %R0, #0\n\t" \
"umlal %Q0, %R0, %R1, %R2" \
: "+&r" (__res) \
: "r" (__m), "r" (__n) \
: "cc" ); \
} else { \
asm ( "umlal %R0, %Q0, %R2, %Q3\n\t" \
"umlal %R0, %1, %Q2, %R3\n\t" \
"mov %R0, #0\n\t" \
"adds %Q0, %1, %Q0\n\t" \
"adc %R0, %R0, #0\n\t" \
"umlal %Q0, %R0, %R2, %R3" \
: "+&r" (__res), "+&r" (__z) \
: "r" (__m), "r" (__n) \
: "cc" ); \
} \
__res /= __p; \
/* The remainder can be computed with 32-bit regs */ \
/* only, and gcc is good at that. */ \
{ \
unsigned int __res0 = __res; \
unsigned int __b0 = __b; \
__r -= __res0 * __b0; \
} \
/* BUG_ON(__r >= __b || __res * __b + __r != n); */ \
n = __res; \
} \
__r; \
})
/* our own fls implementation to make sure constant propagation is fine */
#define __div64_fls(bits) \
({ \
unsigned int __left = (bits), __nr = 0; \
if (__left & 0xffff0000) __nr += 16, __left >>= 16; \
if (__left & 0x0000ff00) __nr += 8, __left >>= 8; \
if (__left & 0x000000f0) __nr += 4, __left >>= 4; \
if (__left & 0x0000000c) __nr += 2, __left >>= 2; \
if (__left & 0x00000002) __nr += 1; \
__nr; \
})
#endif
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-26 20:20 ` H. Peter Anvin
@ 2012-03-27 17:12 ` Michal Nazarewicz
2012-03-27 17:17 ` H. Peter Anvin
0 siblings, 1 reply; 31+ messages in thread
From: Michal Nazarewicz @ 2012-03-27 17:12 UTC (permalink / raw)
To: Andrew Morton, H. Peter Anvin
Cc: Denys Vlasenko, linux-kernel, Douglas W Jones, Michal Nazarewicz
On Mon, 26 Mar 2012 22:20:52 +0200, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/26/2012 01:13 PM, Andrew Morton wrote:
>> On Mon, 26 Mar 2012 21:56:38 +0200
>> Denys Vlasenko <vda.linux@googlemail.com> wrote:
>>
>>>>> +#if BITS_PER_LONG != 32 || (~(0ULL)>>1) != ((1ULL<<63)-1)
>>>>
>>>> What's this for?
>>>
>>> The second check should be just BITS_PER_LONG_LONG != 64,
>>> but we don't have BITS_PER_LONG_LONG.
>>
>> So let's add BITS_PER_LONG_LONG rather than hacking around its absence!
>
> First of all, the #if is wrong: the preprocessor doesn't support data
> types and does all arithmetic at (u)intmax_t precision.
Maybe a regular “if” instead of the macro would suffice here? The compiler
should be smart enough to optimise out the dead path.
> As far as BITS_PER_LONG_LONG, there are tons of places in the kernel
> which already require that long long is exactly 64 bits. That may or
> may not be a good thing, but for right now one could simply:
>
> #define BITS_PER_LONG_LONG 64
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michał “mina86” Nazarewicz (o o)
ooo +----<email/xmpp: mpn@google.com>--------------ooO--(_)--Ooo--
* Re: [PATCH 1/1] vsprintf: optimize decimal conversion (again)
2012-03-27 17:12 ` Michal Nazarewicz
@ 2012-03-27 17:17 ` H. Peter Anvin
0 siblings, 0 replies; 31+ messages in thread
From: H. Peter Anvin @ 2012-03-27 17:17 UTC (permalink / raw)
To: Michal Nazarewicz
Cc: Andrew Morton, Denys Vlasenko, linux-kernel, Douglas W Jones,
Michal Nazarewicz
On 03/27/2012 10:12 AM, Michal Nazarewicz wrote:
>
> Maybe a regular “if” instead of macro would suffice here? Compiler should
> be smart enough to optimise out the dead path.
>
Yes, a regular if would work, but it's also pointless.
-hpa
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-27 15:42 ` Denys Vlasenko
@ 2012-03-28 5:56 ` roma1390
2012-03-28 10:13 ` Denys Vlasenko
0 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28 5:56 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On 2012.03.27 18:42, Denys Vlasenko wrote:
> On Tue, Mar 27, 2012 at 2:08 PM, roma1390<roma1390@gmail.com> wrote:
>> On 2012.03.26 21:47, Denys Vlasenko wrote:
>>>
>>> Please find test programs attached.
>>>
>>> 32-bit test programs were built using gcc 4.6.2
>>> 64-bit test programs were built using gcc 4.2.1
>>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
>>
>> Can't compile reference for arm:
>> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
>> test_org.c: In function ‘put_dec’:
>> test_org.c:101: error: impossible constraint in ‘asm’
>> test_org.c:101: error: impossible constraint in ‘asm’
>> test_org.c:101: error: impossible constraint in ‘asm’
>
> Please find a modified test_header.c attached.
> I tested and it builds in my arm emulator.
Run on same:
2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
GCC version:
gcc version 4.4.5 (Debian 4.4.5-8), Target: arm-linux-gnueabi
Compiled with:
arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -o test_{org,new}
run default priority on almost idle machine:
./test_org
Conversions per second: 8:4716000 123:3896000 123456:2756000 12345678:2412000
123456789:2380000 2^32:2348000 2^64:1400000
Conversions per second: 8:4556000 123:3892000 123456:2760000 12345678:2496000
123456789:2384000 2^32:2264000 2^64:1400000
Conversions per second: 8:4716000 123:3896000 123456:2664000 12345678:2496000
123456789:2388000 2^32:2348000 2^64:1400000
Conversions per second: 8:4716000 123:3892000 123456:2760000 12345678:2496000
123456789:2292000 2^32:2348000 2^64:1400000
./test_org
Conversions per second: 8:4716000 123:3748000 123456:2760000 12345678:2496000
123456789:2388000 2^32:2348000 2^64:1348000
Conversions per second: 8:4716000 123:3896000 123456:2756000 12345678:2408000
123456789:2384000 2^32:2348000 2^64:1400000
Conversions per second: 8:4540000 123:3896000 123456:2648000 12345678:2496000
123456789:2388000 2^32:2268000 2^64:1400000
Conversions per second: 8:4716000 123:3888000 123456:2664000 12345678:2496000
123456789:2388000 2^32:2344000 2^64:1400000
./test_new
Conversions per second: 8:5496000 123:4564000 123456:3536000 12345678:3408000
123456789:1168000 2^32:976000 2^64:532000
Conversions per second: 8:5524000 123:4568000 123456:3692000 12345678:3404000
123456789:1132000 2^32:972000 2^64:532000
Conversions per second: 8:5528000 123:4404000 123456:3684000 12345678:3408000
123456789:1168000 2^32:976000 2^64:516000
Conversions per second: 8:5528000 123:4568000 123456:3692000 12345678:3288000
123456789:1168000 2^32:976000 2^64:532000
./test_new
Conversions per second: 8:5524000 123:4416000 123456:3688000 12345678:3400000
123456789:1172000 2^32:972000 2^64:516000
Conversions per second: 8:5520000 123:4568000 123456:3692000 12345678:3276000
123456789:1168000 2^32:976000 2^64:532000
Conversions per second: 8:5344000 123:4552000 123456:3692000 12345678:3404000
123456789:1168000 2^32:944000 2^64:532000
Conversions per second: 8:5528000 123:4568000 123456:3556000 12345678:3404000
123456789:1168000 2^32:976000 2^64:532000
I will run the long running correctness test, and post results later. But this
can take a while...
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 5:56 ` roma1390
@ 2012-03-28 10:13 ` Denys Vlasenko
2012-03-28 10:24 ` roma1390
2012-03-28 10:31 ` roma1390
0 siblings, 2 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 10:13 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Wednesday 28 March 2012 07:56, roma1390 wrote:
> On 2012.03.27 18:42, Denys Vlasenko wrote:
> > On Tue, Mar 27, 2012 at 2:08 PM, roma1390<roma1390@gmail.com> wrote:
> >> On 2012.03.26 21:47, Denys Vlasenko wrote:
> >>>
> >>> Please find test programs attached.
> >>>
> >>> 32-bit test programs were built using gcc 4.6.2
> >>> 64-bit test programs were built using gcc 4.2.1
> >>> Command line: gcc --static [-m32] -O2 -Wall test_{org,new}.c
> >>
> >> Can't compile reference for arm:
> >> $ arm-linux-gnueabi-gcc -O2 -Wall test_org.c -o test_org
> >> test_org.c: In function ‘put_dec’:
> >> test_org.c:101: error: impossible constraint in ‘asm’
> >> test_org.c:101: error: impossible constraint in ‘asm’
> >> test_org.c:101: error: impossible constraint in ‘asm’
> >
> > Please find a modified test_header.c attached.
> > I tested and it builds in my arm emulator.
>
>
> Run on same:
> 2.6.32-5-kirkwood #1 Tue Jan 17 05:11:52 UTC 2012 armv5tel GNU/Linux
> GCC version:
> gcc version 4.4.5 (Debian 4.4.5-8), Target: arm-linux-gnueabi
> Compiled with:
> arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -o test_{org,new}
>
>
> run default priority on almost idle machine:
Hmm, the results are not good. Up to 8-digit conversions we were winning,
but when we start converting larger numbers, we lose big time:
123456789:2388000 2^32:2268000 2^64:1400000
123456789:1168000 2^32:976000 2^64:532000
Since ARM is 32-bit, we are using algorithm #2.
It's not that algorithm #2 is intrinsically bad: on i386, it is a win
on both the AMD and Intel CPUs I tested it on.
Either it's just the difference between ARM and x86, or gcc is
generating suboptimal code for it.
I would like to ask you to do a few things.
First: email me both test_org and test_new binaries.
(Privately, to not spam the list).
Second: run
arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
and email me resulting test_{org,new}.s files.
Third: switch to algorithm #1 and test whether it fares better.
To do that, go to test_new.c
and replace
#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
with
#if 1 ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
(there are two instances of this line there),
then recompile and rerun the test, and post the result.
When I disassemble the ARM code produced by _my_ compiler,
I don't see anything obviously bad.
put_dec_trunc8 is the function which works well for you.
put_dec_full4 and put_dec are used for printing 9+ digit numbers,
and your testing says they are slow. I don't see why -
see the disassembly below.
I need to look at _your_ gcc's output...
00000000 <put_dec_trunc8>:
0: e92d40f0 stmdb sp!, {r4, r5, r6, r7, lr}
4: e59f6188 ldr r6, [pc, #392] ; 194 <.text+0x194>
8: e0843691 umull r3, r4, r1, r6
c: e1a03004 mov r3, r4
10: e1a0c003 mov ip, r3
14: e1a02083 mov r2, r3, lsl #1
18: e1a03183 mov r3, r3, lsl #3
1c: e3a04000 mov r4, #0 ; 0x0
20: e0822003 add r2, r2, r3
24: e2811030 add r1, r1, #48 ; 0x30
28: e0621001 rsb r1, r2, r1
2c: e15c0004 cmp ip, r4
30: e1a05000 mov r5, r0
34: e3a07000 mov r7, #0 ; 0x0
38: e4c01001 strb r1, [r0], #1
3c: 0a000052 beq 18c <put_dec_trunc8+0x18c>
40: e084369c umull r3, r4, ip, r6
44: e1a03004 mov r3, r4
48: e1a02083 mov r2, r3, lsl #1
4c: e1a01183 mov r1, r3, lsl #3
50: e1a0e003 mov lr, r3
54: e3a04000 mov r4, #0 ; 0x0
58: e0822001 add r2, r2, r1
5c: e28c3030 add r3, ip, #48 ; 0x30
60: e0623003 rsb r3, r2, r3
64: e15e0004 cmp lr, r4
68: e5c53001 strb r3, [r5, #1]
6c: e2800001 add r0, r0, #1 ; 0x1
70: 0a000045 beq 18c <put_dec_trunc8+0x18c>
74: e084369e umull r3, r4, lr, r6
78: e1a03004 mov r3, r4
7c: e1a02083 mov r2, r3, lsl #1
80: e1a01183 mov r1, r3, lsl #3
84: e1a05003 mov r5, r3
88: e3a04000 mov r4, #0 ; 0x0
8c: e0822001 add r2, r2, r1
90: e28e3030 add r3, lr, #48 ; 0x30
94: e0623003 rsb r3, r2, r3
98: e1550004 cmp r5, r4
9c: e4c03001 strb r3, [r0], #1
a0: 0a000039 beq 18c <put_dec_trunc8+0x18c>
a4: e0843695 umull r3, r4, r5, r6
a8: e1a03004 mov r3, r4
ac: e1a02083 mov r2, r3, lsl #1
b0: e1a01183 mov r1, r3, lsl #3
b4: e1a0c003 mov ip, r3
b8: e3a04000 mov r4, #0 ; 0x0
bc: e0822001 add r2, r2, r1
c0: e2853030 add r3, r5, #48 ; 0x30
c4: e0623003 rsb r3, r2, r3
c8: e15c0004 cmp ip, r4
cc: e4c03001 strb r3, [r0], #1
d0: 0a00002d beq 18c <put_dec_trunc8+0x18c>
d4: e1a0220c mov r2, ip, lsl #4
d8: e042210c sub r2, r2, ip, lsl #2
dc: e082200c add r2, r2, ip
e0: e1a03302 mov r3, r2, lsl #6
e4: e0623003 rsb r3, r2, r3
e8: e1a03103 mov r3, r3, lsl #2
ec: e083300c add r3, r3, ip
f0: e1a03083 mov r3, r3, lsl #1
f4: e1a0e823 mov lr, r3, lsr #16
f8: e1a0208e mov r2, lr, lsl #1
fc: e1a0118e mov r1, lr, lsl #3
100: e0822001 add r2, r2, r1
104: e28c3030 add r3, ip, #48 ; 0x30
108: e0623003 rsb r3, r2, r3
10c: e15e0004 cmp lr, r4
110: e4c03001 strb r3, [r0], #1
114: 0a00001c beq 18c <put_dec_trunc8+0x18c>
118: e1a0320e mov r3, lr, lsl #4
11c: e043310e sub r3, r3, lr, lsl #2
120: e1a02203 mov r2, r3, lsl #4
124: e0833002 add r3, r3, r2
128: e083300e add r3, r3, lr
12c: e1a0c5a3 mov ip, r3, lsr #11
130: e1a0208c mov r2, ip, lsl #1
134: e1a0118c mov r1, ip, lsl #3
138: e0822001 add r2, r2, r1
13c: e28e3030 add r3, lr, #48 ; 0x30
140: e0623003 rsb r3, r2, r3
144: e15c0004 cmp ip, r4
148: e4c03001 strb r3, [r0], #1
14c: 0a00000e beq 18c <put_dec_trunc8+0x18c>
150: e1a0320c mov r3, ip, lsl #4
154: e043310c sub r3, r3, ip, lsl #2
158: e1a02203 mov r2, r3, lsl #4
15c: e0833002 add r3, r3, r2
160: e083300c add r3, r3, ip
164: e1a0e5a3 mov lr, r3, lsr #11
168: e1a0208e mov r2, lr, lsl #1
16c: e1a0118e mov r1, lr, lsl #3
170: e0822001 add r2, r2, r1
174: e28c3030 add r3, ip, #48 ; 0x30
178: e0623003 rsb r3, r2, r3
17c: e15e0004 cmp lr, r4
180: e4c03001 strb r3, [r0], #1
184: 128e3030 addne r3, lr, #48 ; 0x30
188: 14c03001 strneb r3, [r0], #1
18c: e8bd40f0 ldmia sp!, {r4, r5, r6, r7, lr}
190: e12fff1e bx lr
194: 1999999a ldmneib r9, {r1, r3, r4, r7, r8, fp, ip, pc}
00000198 <put_dec_full4>:
198: e1a0c201 mov ip, r1, lsl #4
19c: e04cc101 sub ip, ip, r1, lsl #2
1a0: e1a0320c mov r3, ip, lsl #4
1a4: e08cc003 add ip, ip, r3
1a8: e1a0240c mov r2, ip, lsl #8
1ac: e08cc002 add ip, ip, r2
1b0: e08cc001 add ip, ip, r1
1b4: e1a0c9ac mov ip, ip, lsr #19
1b8: e1a0320c mov r3, ip, lsl #4
1bc: e043310c sub r3, r3, ip, lsl #2
1c0: e083300c add r3, r3, ip
1c4: e1a02303 mov r2, r3, lsl #6
1c8: e0632002 rsb r2, r3, r2
1cc: e1a02102 mov r2, r2, lsl #2
1d0: e082200c add r2, r2, ip
1d4: e1a02082 mov r2, r2, lsl #1
1d8: e1a02822 mov r2, r2, lsr #16
1dc: e92d4070 stmdb sp!, {r4, r5, r6, lr}
1e0: e1a0e202 mov lr, r2, lsl #4
1e4: e04ee102 sub lr, lr, r2, lsl #2
1e8: e1a0320e mov r3, lr, lsl #4
1ec: e08ee003 add lr, lr, r3
1f0: e1a0408c mov r4, ip, lsl #1
1f4: e1a0318c mov r3, ip, lsl #3
1f8: e0844003 add r4, r4, r3
1fc: e08ee002 add lr, lr, r2
200: e2811030 add r1, r1, #48 ; 0x30
204: e0641001 rsb r1, r4, r1
208: e1a0e5ae mov lr, lr, lsr #11
20c: e1a04000 mov r4, r0
210: e4c41001 strb r1, [r4], #1
214: e1a06000 mov r6, r0
218: e1a05182 mov r5, r2, lsl #3
21c: e1a0018e mov r0, lr, lsl #3
220: e1a01082 mov r1, r2, lsl #1
224: e1a0308e mov r3, lr, lsl #1
228: e0833000 add r3, r3, r0
22c: e0811005 add r1, r1, r5
230: e28cc030 add ip, ip, #48 ; 0x30
234: e2822030 add r2, r2, #48 ; 0x30
238: e2840001 add r0, r4, #1 ; 0x1
23c: e061c00c rsb ip, r1, ip
240: e0632002 rsb r2, r3, r2
244: e28ee030 add lr, lr, #48 ; 0x30
248: e5c6c001 strb ip, [r6, #1]
24c: e5c42001 strb r2, [r4, #1]
250: e5c0e001 strb lr, [r0, #1]
254: e2800002 add r0, r0, #2 ; 0x2
258: e8bd4070 ldmia sp!, {r4, r5, r6, lr}
25c: e12fff1e bx lr
00000260 <put_dec>:
260: e92d4ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
264: e3530000 cmp r3, #0 ; 0x0
268: e24dd00c sub sp, sp, #12 ; 0xc
26c: e1a08002 mov r8, r2
270: e1a09003 mov r9, r3
274: e1a0e000 mov lr, r0
278: 8a000009 bhi 2a4 <put_dec+0x44>
27c: 0a000003 beq 290 <put_dec+0x30>
280: e1a01008 mov r1, r8
284: e28dd00c add sp, sp, #12 ; 0xc
288: e8bd4ff0 ldmia sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
28c: eaffff5b b 0 <put_dec_trunc8>
290: e3e034fa mvn r3, #-100663296 ; 0xfa000000
294: e2433aa1 sub r3, r3, #659456 ; 0xa1000
298: e2433c0f sub r3, r3, #3840 ; 0xf00
29c: e1520003 cmp r2, r3
2a0: 9afffff6 bls 280 <put_dec+0x20>
2a4: e1a07828 mov r7, r8, lsr #16
2a8: e1a0c187 mov ip, r7, lsl #3
2ac: e1a02287 mov r2, r7, lsl #5
2b0: e08c2002 add r2, ip, r2
2b4: e0822007 add r2, r2, r7
2b8: e1a03182 mov r3, r2, lsl #3
2bc: e1a0a829 mov sl, r9, lsr #16
2c0: e0822003 add r2, r2, r3
2c4: e1a04809 mov r4, r9, lsl #16
2c8: e1a04824 mov r4, r4, lsr #16
2cc: e1a00202 mov r0, r2, lsl #4
2d0: e1a0118a mov r1, sl, lsl #3
2d4: e1a0328a mov r3, sl, lsl #5
2d8: e0815003 add r5, r1, r3
2dc: e1a09184 mov r9, r4, lsl #3
2e0: e0620000 rsb r0, r2, r0
2e4: e1a01808 mov r1, r8, lsl #16
2e8: e1a03304 mov r3, r4, lsl #6
2ec: e0800007 add r0, r0, r7
2f0: e1a01821 mov r1, r1, lsr #16
2f4: e0693003 rsb r3, r9, r3
2f8: e085200a add r2, r5, sl
2fc: e1a02202 mov r2, r2, lsl #4
300: e0800001 add r0, r0, r1
304: e0833004 add r3, r3, r4
308: e0800002 add r0, r0, r2
30c: e59fb164 ldr fp, [pc, #356] ; 478 <.text+0x478>
310: e1a03383 mov r3, r3, lsl #7
314: e0800003 add r0, r0, r3
318: e088209b umull r2, r8, fp, r0
31c: e1a086a8 mov r8, r8, lsr #13
320: e1a01388 mov r1, r8, lsl #7
324: e0411108 sub r1, r1, r8, lsl #2
328: e0811008 add r1, r1, r8
32c: e1a03101 mov r3, r1, lsl #2
330: e0811003 add r1, r1, r3
334: e0401201 sub r1, r0, r1, lsl #4
338: e1a0000e mov r0, lr
33c: e58dc004 str ip, [sp, #4]
340: ebffff94 bl 198 <put_dec_full4>
344: e1a03284 mov r3, r4, lsl #5
348: e1a02104 mov r2, r4, lsl #2
34c: e0822003 add r2, r2, r3
350: e1a0630a mov r6, sl, lsl #6
354: e1a0350a mov r3, sl, lsl #10
358: e0663003 rsb r3, r6, r3
35c: e1a01282 mov r1, r2, lsl #5
360: e0822001 add r2, r2, r1
364: e06a3003 rsb r3, sl, r3
368: e0642002 rsb r2, r4, r2
36c: e59dc004 ldr ip, [sp, #4]
370: e1a03183 mov r3, r3, lsl #3
374: e1a02182 mov r2, r2, lsl #3
378: e06a3003 rsb r3, sl, r3
37c: e04cc087 sub ip, ip, r7, lsl #1
380: e0833002 add r3, r3, r2
384: e083300c add r3, r3, ip
388: e0833008 add r3, r3, r8
38c: e087239b umull r2, r7, fp, r3
390: e1a076a7 mov r7, r7, lsr #13
394: e1a01387 mov r1, r7, lsl #7
398: e0411107 sub r1, r1, r7, lsl #2
39c: e0811007 add r1, r1, r7
3a0: e1a02101 mov r2, r1, lsl #2
3a4: e0811002 add r1, r1, r2
3a8: e0431201 sub r1, r3, r1, lsl #4
3ac: e046620a sub r6, r6, sl, lsl #4
3b0: ebffff78 bl 198 <put_dec_full4>
3b4: e1a03286 mov r3, r6, lsl #5
3b8: e0866003 add r6, r6, r3
3bc: e0499084 sub r9, r9, r4, lsl #1
3c0: e06a6006 rsb r6, sl, r6
3c4: e1a02189 mov r2, r9, lsl #3
3c8: e1a03106 mov r3, r6, lsl #2
3cc: e0663003 rsb r3, r6, r3
3d0: e0692002 rsb r2, r9, r2
3d4: e0822003 add r2, r2, r3
3d8: e0822007 add r2, r2, r7
3dc: e084329b umull r3, r4, fp, r2
3e0: e1a046a4 mov r4, r4, lsr #13
3e4: e1a01384 mov r1, r4, lsl #7
3e8: e0411104 sub r1, r1, r4, lsl #2
3ec: e0811004 add r1, r1, r4
3f0: e1a03101 mov r3, r1, lsl #2
3f4: e0811003 add r1, r1, r3
3f8: e0421201 sub r1, r2, r1, lsl #4
3fc: ebffff65 bl 198 <put_dec_full4>
400: e1a03185 mov r3, r5, lsl #3
404: e0653003 rsb r3, r5, r3
408: e083300a add r3, r3, sl
40c: e0944003 adds r4, r4, r3
410: e1a01000 mov r1, r0
414: 0a000010 beq 45c <put_dec+0x1fc>
418: e083249b umull r2, r3, fp, r4
41c: e1a066a3 mov r6, r3, lsr #13
420: e1a01386 mov r1, r6, lsl #7
424: e0411106 sub r1, r1, r6, lsl #2
428: e0811006 add r1, r1, r6
42c: e1a03101 mov r3, r1, lsl #2
430: e0811003 add r1, r1, r3
434: e0441201 sub r1, r4, r1, lsl #4
438: ebffff56 bl 198 <put_dec_full4>
43c: e3560000 cmp r6, #0 ; 0x0
440: e1a01000 mov r1, r0
444: 0a000004 beq 45c <put_dec+0x1fc>
448: e1a01006 mov r1, r6
44c: ebffff51 bl 198 <put_dec_full4>
450: e1a01000 mov r1, r0
454: ea000000 b 45c <put_dec+0x1fc>
458: e2411001 sub r1, r1, #1 ; 0x1
45c: e5513001 ldrb r3, [r1, #-1]
460: e3530030 cmp r3, #48 ; 0x30
464: 0afffffb beq 458 <put_dec+0x1f8>
468: e1a00001 mov r0, r1
46c: e28dd00c add sp, sp, #12 ; 0xc
470: e8bd4ff0 ldmia sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
474: e12fff1e bx lr
478: d1b71759 movles r1, r9, asr r7
--
vda
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 10:13 ` Denys Vlasenko
@ 2012-03-28 10:24 ` roma1390
2012-03-28 10:33 ` Denys Vlasenko
2012-03-28 10:31 ` roma1390
1 sibling, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28 10:24 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
[-- Attachment #1: Type: text/plain, Size: 189 bytes --]
On 2012.03.28 13:13, Denys Vlasenko wrote:
> Second: run
> arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
> and email me resulting test_{org,new}.s files.
test_{org,new}.s attached.
[-- Attachment #2: test_new.s --]
[-- Type: text/plain, Size: 13521 bytes --]
.cpu arm9tdmi
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 18, 4
.file "test_new.c"
.text
.align 2
.type put_dec_trunc8, %function
put_dec_trunc8:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L5
stmfd sp!, {r4, r5, fp}
umull fp, ip, r1, r3
mov r2, ip
add r1, r1, #48
mov ip, r0
add r0, r2, r2, asl #2
sub r1, r1, r0, asl #1
cmp r2, #0
mov r0, ip
strb r1, [r0], #1
beq .L2
umull r4, r5, r2, r3
add r2, r2, #48
add fp, r5, r5, asl #2
sub r2, r2, fp, asl #1
cmp r5, #0
mov r1, r5
strb r2, [ip, #1]
add r0, r0, #1
beq .L2
umull r4, r5, r1, r3
add r1, r1, #48
add ip, r5, r5, asl #2
sub r1, r1, ip, asl #1
cmp r5, #0
mov r2, r5
strb r1, [r0], #1
beq .L2
umull r4, r5, r2, r3
add r2, r2, #48
add r1, r5, r5, asl #2
sub r2, r2, r1, asl #1
cmp r5, #0
strb r2, [r0], #1
beq .L2
add r2, r5, r5, asl #1
add r2, r5, r2, asl #2
rsb r2, r2, r2, asl #6
add r2, r5, r2, asl #2
mov r2, r2, asl #1
mov r2, r2, lsr #16
add r3, r5, #48
add r1, r2, r2, asl #2
sub r3, r3, r1, asl #1
cmp r2, #0
strb r3, [r0], #1
beq .L2
add r1, r2, r1, asl #3
add r1, r1, r1, asl #2
mov r3, r1, lsr #11
add r2, r2, #48
add r1, r3, r3, asl #2
sub r2, r2, r1, asl #1
cmp r3, #0
strb r2, [r0], #1
beq .L2
add r1, r3, r1, asl #3
add r1, r1, r1, asl #2
mov r2, r1, lsr #11
add r1, r2, r2, asl #2
add r3, r3, #48
cmp r2, #0
sub r3, r3, r1, asl #1
strb r3, [r0], #1
addne r2, r2, #48
strneb r2, [r0], #1
.L2:
ldmfd sp!, {r4, r5, fp}
bx lr
.L6:
.align 2
.L5:
.word 429496730
.size put_dec_trunc8, .-put_dec_trunc8
.align 2
.type put_dec_full4, %function
put_dec_full4:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add r3, r1, r1, asl #1
add r3, r3, r3, asl #4
add r3, r3, r3, asl #8
add r3, r1, r3, asl #2
mov r3, r3, lsr #19
add r2, r3, r3, asl #1
add r2, r3, r2, asl #2
rsb r2, r2, r2, asl #6
add r2, r3, r2, asl #2
mov r2, r2, asl #1
mov r2, r2, lsr #16
add ip, r2, r2, asl #2
stmfd sp!, {r4, r5}
add r4, r2, ip, asl #3
add r5, r3, r3, asl #2
add r1, r1, #48
add r4, r4, r4, asl #2
sub r1, r1, r5, asl #1
mov r4, r4, lsr #11
mov r5, r0
strb r1, [r5], #1
add r3, r3, #48
add r1, r4, r4, asl #2
add r2, r2, #48
sub r2, r2, r1, asl #1
sub ip, r3, ip, asl #1
add r1, r5, #1
add r4, r4, #48
strb ip, [r0, #1]
strb r2, [r5, #1]
add r0, r1, #2
strb r4, [r1, #1]
ldmfd sp!, {r4, r5}
bx lr
.size put_dec_full4, .-put_dec_full4
.global __aeabi_uidivmod
.global __aeabi_uidiv
.align 2
.type number, %function
number:
@ Function supports interworking.
@ args = 8, pretend = 0, frame = 120
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
sub sp, sp, #124
ldrb ip, [sp, #161] @ zero_extendqisi2
mov r4, r1
ldrh r5, [sp, #166]
ldrb r1, [sp, #162] @ zero_extendqisi2
str r0, [sp, #8]
ands r0, ip, #64
str r1, [sp, #16]
str r5, [sp, #32]
ldrh r1, [sp, #164]
str ip, [sp, #12]
beq .L69
ldr r0, [sp, #16]
subs r0, r0, #10
movne r0, #1
.L69:
ands r5, ip, #16
str r0, [sp, #24]
and r0, ip, #32
andne ip, ip, #254
strne ip, [sp, #12]
str r5, [sp, #36]
ldr r5, [sp, #12]
str r0, [sp, #4]
tst r5, #2
beq .L14
cmp r3, #0
blt .L58
tst r5, #4
bne .L71
ldr r5, [sp, #12]
tst r5, #8
beq .L14
sub r1, r1, #1
mov r1, r1, asl #16
mov r1, r1, lsr #16
mov r0, #32
str r1, [sp, #20]
str r0, [sp, #28]
b .L17
.L14:
str r1, [sp, #20]
mov r1, #0
str r1, [sp, #28]
.L17:
ldr r5, [sp, #24]
cmp r5, #0
beq .L19
ldr r0, [sp, #20]
ldr r5, [sp, #16]
sub r1, r0, #1
mov r1, r1, asl #16
mov r1, r1, lsr #16
cmp r5, #16
str r1, [sp, #20]
subeq r1, r1, #1
moveq r1, r1, asl #16
moveq r1, r1, lsr #16
streq r1, [sp, #20]
.L19:
cmp r3, #0
bne .L20
cmp r2, #7
bls .L72
.L20:
ldr r0, [sp, #16]
cmp r0, #10
beq .L23
cmp r0, #16
movne r6, #3
moveq r6, #4
add r1, sp, #52
ldr r9, .L76
sub sl, r0, #1
mov r5, #0
str r1, [sp, #40]
sub r7, r6, #32
rsb fp, r6, #32
str r4, [sp, #44]
mov r8, r1
.L26:
mov ip, r2, lsr r6
cmp r7, #0
orr ip, ip, r3, asl fp
movge ip, r3, lsr r7
mov r4, r3, lsr r6
and r2, r2, #255
mov r0, ip
and r2, r2, sl
ldrb ip, [r9, r2] @ zero_extendqisi2
mov r2, r0
ldr r0, [sp, #4]
orrs r1, r2, r4
orr ip, r0, ip
strb ip, [r8, r5]
mov r3, r4
add r5, r5, #1
bne .L26
ldr r4, [sp, #44]
sub ip, r5, #1
.L22:
ldr r2, [sp, #32]
ldr r3, [sp, #20]
mov r0, r2, asl #16
cmp r5, r0, asr #16
movgt r0, r5, asl #16
mov r0, r0, lsr #16
ldr r1, [sp, #12]
rsb r7, r0, r3
mov r7, r7, asl #16
mov r7, r7, lsr #16
tst r1, #17
mov r1, r7
bne .L34
sub r1, r7, #1
mov r1, r1, asl #16
cmp r1, #0
mov r1, r1, lsr #16
blt .L34
ldr r3, [sp, #8]
mov r8, r1
add r2, r3, #1
add r2, r2, r1
mov r6, #32
.L36:
cmp r4, r3
strhib r6, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L36
rsb r7, r7, #1
ldr r2, [sp, #8]
add r1, r1, r7
mov r1, r1, asl #16
add r8, r8, #1
sub r1, r1, #65536
add r2, r2, r8
str r2, [sp, #8]
mov r1, r1, lsr #16
.L34:
ldr r3, [sp, #28]
cmp r3, #0
beq .L37
ldr r2, [sp, #8]
cmp r2, r4
strccb r3, [r2, #0]
ldr r3, [sp, #8]
add r3, r3, #1
str r3, [sp, #8]
.L37:
ldr r2, [sp, #24]
cmp r2, #0
beq .L39
ldr r3, [sp, #8]
cmp r3, r4
ldrcc r2, [sp, #8]
movcc r3, #48
strccb r3, [r2, #0]
ldr r2, [sp, #8]
ldr r3, [sp, #16]
add r2, r2, #1
cmp r3, #16
str r2, [sp, #8]
beq .L73
.L39:
ldr r2, [sp, #36]
cmp r2, #0
movne r6, r1
movne r7, r6, asl #16
bne .L43
sub r6, r1, #1
ldr r3, [sp, #12]
mov r6, r6, asl #16
tst r3, #1
mov r6, r6, lsr #16
movne r8, #48
moveq r8, #32
movs r7, r6, asl #16
bmi .L43
sub r2, r1, #1
ldr r3, [sp, #8]
mov r2, r2, asl #16
add r2, r3, r2, lsr #16
add r2, r2, #1
.L47:
cmp r4, r3
strhib r8, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L47
rsb r6, r1, r6
mov r6, r6, asl #16
mov r6, r6, lsr #16
str r3, [sp, #8]
mov r7, r6, asl #16
.L43:
sub r3, r0, #1
mov r3, r3, asl #16
cmp r5, r3, asr #16
bgt .L48
sub r1, r0, #2
ldr r3, [sp, #8]
mov r1, r1, asl #16
add r1, r3, r1, asr #16
add r1, r1, #1
mov r0, #48
.L50:
cmp r4, r3
strhib r0, [r3, #0]
add r3, r3, #1
rsb r2, r3, r1
cmp r5, r2
ble .L50
str r3, [sp, #8]
.L48:
cmp ip, #0
blt .L51
add r2, sp, #52
ldr r3, [sp, #8]
sub r1, r2, #1
add r2, r2, ip
.L53:
cmp r4, r3
ldrhib r0, [r2, #0] @ zero_extendqisi2
sub r2, r2, #1
strhib r0, [r3, #0]
cmp r2, r1
add r3, r3, #1
bne .L53
ldr r5, [sp, #8]
add ip, ip, #1
add r5, r5, ip
str r5, [sp, #8]
.L51:
cmp r7, #0
ble .L54
ldr r0, [sp, #8]
sub r2, r6, #1
mov r2, r2, asl #16
add r2, r0, r2, lsr #16
add r2, r2, #1
mov r3, r0
mov r1, #32
.L56:
cmp r4, r3
strhib r1, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L56
str r3, [sp, #8]
.L54:
ldr r0, [sp, #8]
add sp, sp, #124
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L71:
sub r1, r1, #1
mov r1, r1, asl #16
mov r1, r1, lsr #16
str r1, [sp, #20]
mov r1, #43
str r1, [sp, #28]
b .L17
.L58:
sub r1, r1, #1
mov r1, r1, asl #16
mov r1, r1, lsr #16
mov r0, #45
rsbs r2, r2, #0
rsc r3, r3, #0
str r1, [sp, #20]
str r0, [sp, #28]
b .L17
.L72:
add r1, r2, #48
strb r1, [sp, #52]
mov ip, r3
mov r5, #1
b .L22
.L23:
cmp r3, #0
beq .L74
.L27:
mov r1, r2, asl #16
ldr r0, .L76+4
mov r8, r2, lsr #16
mov r1, r1, lsr #16
mla ip, r0, r8, r1
mov r7, r3, lsr #16
mov r6, r3
mov r3, #656
mla r2, r3, r7, ip
mov r6, r6, asl #16
mov r6, r6, lsr #16
mov r3, #7296
mla r5, r3, r6, r2
add r0, sp, #52
str r0, [sp, #40]
ldr r1, .L76+8
mov r0, r5
bl __aeabi_uidivmod
ldr r0, [sp, #40]
bl put_dec_full4
ldr r3, .L76+12
mov sl, r0
mul r2, r3, r6
sub r3, r3, #1824
sub r3, r3, #1
mla ip, r3, r7, r2
mov r0, r5
mov r3, #6
ldr r1, .L76+8
mla r5, r3, r8, ip
bl __aeabi_uidiv
add r5, r5, r0
mov r0, r5
ldr r1, .L76+8
bl __aeabi_uidivmod
mov r0, sl
bl put_dec_full4
ldr r3, .L76+16
mov r8, r0
mul r2, r3, r7
mov r0, r5
mov r3, #42
ldr r1, .L76+8
mla r5, r3, r6, r2
bl __aeabi_uidiv
add r5, r5, r0
mov r0, r5
ldr r1, .L76+8
bl __aeabi_uidivmod
mov r0, r8
bl put_dec_full4
ldr r3, .L76+20
mov r6, r0
ldr r1, .L76+8
mov r0, r5
mul r5, r3, r7
bl __aeabi_uidiv
adds r5, r0, r5
bne .L75
.L30:
mov r3, r6
.L31:
mov r0, r3
ldrb r2, [r3, #-1]! @ zero_extendqisi2
cmp r2, #48
beq .L31
.L29:
ldr r1, [sp, #40]
rsb r5, r1, r0
sub ip, r5, #1
b .L22
.L73:
cmp r4, r2
ldrhi r2, [sp, #4]
orrhi r3, r2, #88
ldrhi r2, [sp, #8]
strhib r3, [r2, #0]
ldr r3, [sp, #8]
add r3, r3, #1
str r3, [sp, #8]
b .L39
.L74:
ldr r1, .L76+24
cmp r2, r1
bhi .L27
add r5, sp, #52
mov r1, r2
mov r0, r5
str r5, [sp, #40]
bl put_dec_trunc8
b .L29
.L75:
mov r0, r5
ldr r1, .L76+8
bl __aeabi_uidivmod
mov r0, r6
bl put_dec_full4
ldr r1, .L76+8
mov r6, r0
mov r0, r5
bl __aeabi_uidiv
subs r1, r0, #0
beq .L30
mov r0, r6
bl put_dec_full4
mov r6, r0
b .L30
.L77:
.align 2
.L76:
.word .LANCHOR0
.word 5536
.word 10000
.word 9496
.word 4749
.word 281
.word 99999999
.size number, .-number
.align 2
.type measure_number, %function
measure_number:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 72
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
mov fp, r3
sub sp, sp, #84
ldr r3, [r0, #0]
str r0, [sp, #12]
mov r0, #0
str r3, [sp, #8]
mov sl, r2
bl time
ldr r3, [sp, #8]
mov r9, #0
add r6, sp, #16
cmp r3, r0
mov r7, r9
ldr r5, .L85
add r8, r6, #63
bne .L84
.L81:
ldr r4, .L85+4
.L80:
mov ip, sp
ldmia r5, {r0, r1}
mov r2, sl
stmia ip, {r0, r1}
mov r3, fp
mov r0, r6
mov r1, r8
bl number
sub r4, r4, #1
cmn r4, #1
strb r7, [r0, #0]
bne .L80
mov r0, #0
bl time
ldr r3, [sp, #8]
add r9, r9, #4000
cmp r3, r0
beq .L81
.L84:
ldr ip, [sp, #12]
str r0, [ip, #0]
mov r0, r9
add sp, sp, #84
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L86:
.align 2
.L85:
.word .LANCHOR0+16
.word 3999
.size measure_number, .-measure_number
.align 2
.type measure, %function
measure:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
mov r0, #0
sub sp, sp, #24
bl time
str r0, [sp, #20]
.L88:
mov r0, #0
bl time
ldr r3, [sp, #20]
cmp r0, r3
beq .L88
add r8, sp, #24
str r0, [r8, #-4]!
mov r2, #8
mov r3, #0
mov r0, r8
bl measure_number
mov r2, #123
mov sl, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L91
mov r7, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L91+4
mov r6, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L91+8
mov r5, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r4, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r9, r0
mvn r3, #0
mov r0, r8
bl measure_number
mov r1, sl
str r0, [sp, #12]
mov r2, r7
mov r3, r6
ldr r0, .L91+12
str r5, [sp, #0]
stmib sp, {r4, r9} @ phole stm
bl printf
add sp, sp, #24
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
bx lr
.L92:
.align 2
.L91:
.word 123456
.word 12345678
.word 123456789
.word .LC0
.size measure, .-measure
.align 2
.type check, %function
check:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 128
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, lr}
ldr r3, .L97
sub sp, sp, #140
mov r5, r0
mov r6, r1
add r4, sp, #72
ldmia r3, {r0, r1}
mov r3, sp
stmia r3, {r0, r1}
mov r2, r5
mov r3, r6
add r1, r4, #63
mov r0, r4
bl number
add r7, sp, #8
mov r3, #0
strb r3, [r0, #0]
mov r2, r5
mov r3, r6
ldr r1, .L97+4
mov r0, r7
bl sprintf
mov r0, r4
mov r1, r7
bl strcmp
cmp r0, #0
bne .L96
add sp, sp, #140
ldmfd sp!, {r4, r5, r6, r7, lr}
bx lr
.L96:
mov r2, r5
mov r3, r6
ldr r0, .L97+8
str r4, [sp, #0]
bl printf
mov r0, #1
bl exit
.L98:
.align 2
.L97:
.word .LANCHOR0+16
.word .LC1
.word .LC2
.size check, .-check
.align 2
.global main
.type main, %function
main:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
bl measure
mov r4, #0
bl measure
mov r5, #0
bl measure
ldr r6, .L103
bl measure
mov r7, #0
mov sl, #1
mov fp, #0
b .L101
.L100:
adds r4, r4, sl
adc r5, r5, fp
.L101:
mov r0, r4
mov r1, r5
bl check
and r8, r4, r6
rsbs r0, r4, #0
rsc r1, r5, #0
and r9, r5, r7
bl check
orrs r8, r8, r9
bne .L100
mov r2, r4
mov r3, r5
ldr r0, .L103+4
bl printf
mov r0, r8
bl fflush
b .L100
.L104:
.align 2
.L103:
.word 262143
.word .LC3
.size main, .-main
.section .rodata
.align 2
.LANCHOR0 = . + 0
.type digits.3938, %object
.size digits.3938, 16
digits.3938:
.ascii "0123456789ABCDEF"
.type dummy_spec, %object
.size dummy_spec, 8
dummy_spec:
.byte 8
.byte 0
.byte 10
.byte 0
.short 0
.short 0
.section .rodata.str1.4,"aMS",%progbits,1
.align 2
.LC0:
.ascii "Conversions per second: 8:%d 123:%d 123456:%d 12345"
.ascii "678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
.ascii "%llu\000"
.space 3
.LC2:
.ascii "Error in formatting %llu:'%s'\012\000"
.space 1
.LC3:
.ascii "\015Tested %llu \000"
.ident "GCC: (Debian 4.4.5-8) 4.4.5"
.section .note.GNU-stack,"",%progbits
[-- Attachment #3: test_org.s --]
[-- Type: text/plain, Size: 13375 bytes --]
.cpu arm9tdmi
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 18, 4
.file "test_org.c"
.text
.align 2
.type put_dec_trunc, %function
put_dec_trunc:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
stmfd sp!, {r4, r5, r6}
mov ip, r1, lsr #8
mov r4, r1, lsr #4
and r4, r4, #15
and ip, ip, #15
mov r3, r1, lsr #12
add r2, r4, ip
add r2, r2, r3
add r2, r2, r2, asl #1
and r1, r1, #15
add r1, r1, r2, asl #1
add r2, r1, r1, asl #2
add r2, r1, r2, asl #3
add r2, r2, r2, asl #2
mov r2, r2, lsr #11
add r5, r3, r3, asl #3
add r4, r5, r4
add r6, r2, r2, asl #2
add r5, ip, ip, asl #2
add r4, r4, r5
sub r1, r1, r6, asl #1
add r1, r1, #48
adds r2, r4, r2
mov r5, r0
strb r1, [r0], #1
beq .L2
add r1, r2, r2, asl #2
add r1, r2, r1, asl #3
add r1, r1, r1, asl #2
mov r1, r1, lsr #11
add r4, r1, r1, asl #2
sub r2, r2, r4, asl #1
add ip, r1, ip, asl #1
add r2, r2, #48
orrs r1, ip, r3
strb r2, [r5, #1]
add r0, r0, #1
beq .L2
add r2, ip, ip, asl #1
add r2, ip, r2, asl #2
mov r2, r2, lsr #7
add r1, r2, r2, asl #2
sub ip, ip, r1, asl #1
add ip, ip, #48
adds r3, r2, r3, asl #2
strb ip, [r0], #1
beq .L2
add r1, r3, r3, asl #2
add r1, r3, r1, asl #3
add r1, r1, r1, asl #2
mov r1, r1, lsr #11
add r2, r1, r1, asl #2
sub r3, r3, r2, asl #1
cmp r1, #0
add r3, r3, #48
strb r3, [r0], #1
addne r1, r1, #48
strneb r1, [r0], #1
.L2:
ldmfd sp!, {r4, r5, r6}
bx lr
.size put_dec_trunc, .-put_dec_trunc
.align 2
.type put_dec_full, %function
put_dec_full:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
mov ip, r1, lsr #4
mov r2, r1, lsr #8
and r2, r2, #15
and ip, ip, #15
stmfd sp!, {r4, r5, r6, r7, r8}
mov r3, r1, lsr #12
add r4, ip, r2
add r4, r4, r3
add r4, r4, r4, asl #1
and r1, r1, #15
add r1, r1, r4, asl #1
add r6, r1, r1, asl #2
add r4, r3, r3, asl #3
add r6, r1, r6, asl #3
add ip, r4, ip
add r6, r6, r6, asl #2
add r4, r2, r2, asl #2
add ip, ip, r4
mov r6, r6, lsr #11
add ip, ip, r6
add r5, ip, ip, asl #2
add r5, ip, r5, asl #3
add r5, r5, r5, asl #2
mov r5, r5, lsr #11
add r2, r5, r2, asl #1
add r4, r2, r2, asl #1
add r4, r2, r4, asl #2
mov r4, r4, lsr #7
add r3, r4, r3, asl #2
add r7, r3, r3, asl #2
add r7, r3, r7, asl #3
add r6, r6, r6, asl #2
sub r1, r1, r6, asl #1
add r7, r7, r7, asl #2
mov r6, r0
mov r7, r7, lsr #11
add r1, r1, #48
strb r1, [r6], #1
add r5, r5, r5, asl #2
add r1, r7, r7, asl #2
add r4, r4, r4, asl #2
add r8, r6, #1
sub ip, ip, r5, asl #1
sub r2, r2, r4, asl #1
sub r3, r3, r1, asl #1
add ip, ip, #48
add r1, r8, #1
add r2, r2, #48
add r3, r3, #48
add r7, r7, #48
strb ip, [r0, #1]
strb r2, [r6, #1]
add r0, r1, #2
strb r3, [r8, #1]
strb r7, [r1, #1]
ldmfd sp!, {r4, r5, r6, r7, r8}
bx lr
.size put_dec_full, .-put_dec_full
.align 2
.type put_dec, %function
put_dec:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
cmp r3, #0
stmfd sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
beq .L18
.L15:
ldr sl, .L19
ldr r5, .L19+4
ldr r6, .L19+8
mov r8, #0
.L17:
#APP
@ 101 "test_org.c" 1
umull fp, ip, r5, r2
cmn fp, r5
adcs ip, ip, r6
adc fp, r8, #0
@ 0 "" 2
mov r1, r8
#APP
@ 101 "test_org.c" 1
umlal ip, fp, r6, r2
umlal ip, r1, r5, r3
mov ip, #0
adds fp, r1, fp
adc ip, ip, #0
umlal fp, ip, r6, r3
@ 0 "" 2
mov r4, fp, lsr #16
orr r4, r4, ip, asl #16
add r1, r4, r4, asl #1
add r1, r1, r1, asl #6
add r1, r4, r1, asl #2
add r1, r4, r1, asl #2
mov r7, ip, lsr #16
sub r1, r2, r1, asl #5
bl put_dec_full
cmp r7, #0
mov r2, r4
mov r3, r7
bne .L17
cmp r4, sl
bhi .L17
.L10:
mov r1, r4
ldmfd sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
b put_dec_trunc
.L18:
ldr r1, .L19
cmp r2, r1
bhi .L15
mov r4, r2
b .L10
.L20:
.align 2
.L19:
.word 99999
.word 457671715
.word -1480217529
.size put_dec, .-put_dec
.align 2
.type number, %function
number:
@ Function supports interworking.
@ args = 8, pretend = 0, frame = 120
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
sub sp, sp, #124
ldrb ip, [sp, #161] @ zero_extendqisi2
ldrb r4, [sp, #162] @ zero_extendqisi2
ldrh r5, [sp, #166]
str r0, [sp, #12]
ands r0, ip, #64
str r4, [sp, #20]
str r5, [sp, #36]
ldrh r4, [sp, #164]
str ip, [sp, #16]
beq .L72
ldr r0, [sp, #20]
subs r0, r0, #10
movne r0, #1
.L72:
ands r5, ip, #16
str r0, [sp, #28]
and r0, ip, #32
andne ip, ip, #254
strne ip, [sp, #16]
str r5, [sp, #40]
ldr r5, [sp, #16]
str r0, [sp, #8]
tst r5, #2
beq .L26
cmp r3, #0
blt .L64
tst r5, #4
bne .L74
ldr r5, [sp, #16]
tst r5, #8
beq .L26
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
mov ip, #32
str r0, [sp, #24]
str ip, [sp, #32]
b .L29
.L26:
mov r0, #0
str r4, [sp, #24]
str r0, [sp, #32]
.L29:
ldr r4, [sp, #28]
cmp r4, #0
beq .L31
ldr r5, [sp, #24]
ldr ip, [sp, #20]
sub r0, r5, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
cmp ip, #16
str r0, [sp, #24]
subeq r0, r0, #1
moveq r0, r0, asl #16
moveq r0, r0, lsr #16
streq r0, [sp, #24]
.L31:
orrs ip, r2, r3
moveq r3, #48
streqb r3, [sp, #52]
moveq r4, #1
beq .L33
ldr r0, [sp, #20]
cmp r0, #10
beq .L34
cmp r0, #16
movne r5, #3
moveq r5, #4
sub r8, r0, #1
ldr sl, .L76
rsb r0, r5, #32
mov r4, #0
add r9, sp, #52
sub r6, r5, #32
str r1, [sp, #44]
mov fp, r0
.L37:
mov ip, r2, lsr r5
cmp r6, #0
orr ip, ip, r3, asl fp
movge ip, r3, lsr r6
mov r7, r3, lsr r5
and r2, r2, #255
and r2, r2, r8
mov r0, ip
ldr r1, [sp, #8]
ldrb ip, [sl, r2] @ zero_extendqisi2
mov r2, r0
orr ip, r1, ip
orrs r0, r2, r7
strb ip, [r9, r4]
mov r3, r7
add r4, r4, #1
bne .L37
ldr r1, [sp, #44]
sub ip, r4, #1
.L33:
ldr r2, [sp, #36]
ldr r3, [sp, #24]
mov r5, r2, asl #16
cmp r4, r5, asr #16
movgt r5, r4, asl #16
mov r5, r5, lsr #16
ldr r0, [sp, #16]
rsb r7, r5, r3
mov r7, r7, asl #16
mov r7, r7, lsr #16
tst r0, #17
mov r0, r7
bne .L40
sub r0, r7, #1
mov r0, r0, asl #16
cmp r0, #0
mov r0, r0, lsr #16
blt .L40
ldr r3, [sp, #12]
mov r8, r0
add r2, r3, #1
add r2, r2, r0
mov r6, #32
.L42:
cmp r1, r3
strhib r6, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L42
rsb r7, r7, #1
ldr r2, [sp, #12]
add r0, r0, r7
mov r0, r0, asl #16
add r3, r8, #1
sub r0, r0, #65536
add r2, r2, r3
str r2, [sp, #12]
mov r0, r0, lsr #16
.L40:
ldr r3, [sp, #32]
cmp r3, #0
beq .L43
ldr r2, [sp, #12]
cmp r2, r1
strccb r3, [r2, #0]
ldr r3, [sp, #12]
add r3, r3, #1
str r3, [sp, #12]
.L43:
ldr r2, [sp, #28]
cmp r2, #0
beq .L45
ldr r3, [sp, #12]
cmp r3, r1
ldrcc r2, [sp, #12]
movcc r3, #48
strccb r3, [r2, #0]
ldr r2, [sp, #12]
ldr r3, [sp, #20]
add r2, r2, #1
cmp r3, #16
str r2, [sp, #12]
beq .L75
.L45:
ldr r2, [sp, #40]
cmp r2, #0
movne r6, r0
movne r7, r6, asl #16
bne .L49
sub r6, r0, #1
ldr r3, [sp, #16]
mov r6, r6, asl #16
tst r3, #1
mov r6, r6, lsr #16
movne r8, #48
moveq r8, #32
movs r7, r6, asl #16
bmi .L49
sub r2, r0, #1
ldr r3, [sp, #12]
mov r2, r2, asl #16
add r2, r3, r2, lsr #16
add r2, r2, #1
.L53:
cmp r1, r3
strhib r8, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L53
rsb r6, r0, r6
mov r6, r6, asl #16
mov r6, r6, lsr #16
str r3, [sp, #12]
mov r7, r6, asl #16
.L49:
sub r3, r5, #1
mov r3, r3, asl #16
cmp r4, r3, asr #16
bgt .L54
sub r0, r5, #2
ldr r3, [sp, #12]
mov r0, r0, asl #16
add r0, r3, r0, asr #16
add r0, r0, #1
mov r5, #48
.L56:
cmp r1, r3
strhib r5, [r3, #0]
add r3, r3, #1
rsb r2, r3, r0
cmp r4, r2
ble .L56
str r3, [sp, #12]
.L54:
cmp ip, #0
blt .L57
add r2, sp, #52
ldr r3, [sp, #12]
sub r0, r2, #1
add r2, r2, ip
.L59:
cmp r1, r3
ldrhib r4, [r2, #0] @ zero_extendqisi2
sub r2, r2, #1
strhib r4, [r3, #0]
cmp r2, r0
add r3, r3, #1
bne .L59
ldr r4, [sp, #12]
add ip, ip, #1
add r4, r4, ip
str r4, [sp, #12]
.L57:
cmp r7, #0
ble .L60
ldr r5, [sp, #12]
sub r2, r6, #1
mov r2, r2, asl #16
add r2, r5, r2, lsr #16
add r2, r2, #1
mov r3, r5
mov r0, #32
.L62:
cmp r1, r3
strhib r0, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L62
str r3, [sp, #12]
.L60:
ldr r0, [sp, #12]
add sp, sp, #124
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L74:
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
str r0, [sp, #24]
mov r0, #43
str r0, [sp, #32]
b .L29
.L75:
cmp r1, r2
ldrhi r2, [sp, #8]
orrhi r3, r2, #88
ldrhi r2, [sp, #12]
strhib r3, [r2, #0]
ldr r3, [sp, #12]
add r3, r3, #1
str r3, [sp, #12]
b .L45
.L64:
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
mov ip, #45
rsbs r2, r2, #0
rsc r3, r3, #0
str r0, [sp, #24]
str ip, [sp, #32]
b .L29
.L34:
add r4, sp, #52
mov r0, r4
str r1, [sp, #4]
bl put_dec
rsb r4, r4, r0
sub ip, r4, #1
ldr r1, [sp, #4]
b .L33
.L77:
.align 2
.L76:
.word .LANCHOR0
.size number, .-number
.align 2
.type measure_number, %function
measure_number:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 72
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
mov fp, r3
sub sp, sp, #84
ldr r3, [r0, #0]
str r0, [sp, #12]
mov r0, #0
str r3, [sp, #8]
mov sl, r2
bl time
ldr r3, [sp, #8]
mov r9, #0
add r6, sp, #16
cmp r3, r0
mov r7, r9
ldr r5, .L85
add r8, r6, #63
bne .L84
.L81:
ldr r4, .L85+4
.L80:
mov ip, sp
ldmia r5, {r0, r1}
mov r2, sl
stmia ip, {r0, r1}
mov r3, fp
mov r0, r6
mov r1, r8
bl number
sub r4, r4, #1
cmn r4, #1
strb r7, [r0, #0]
bne .L80
mov r0, #0
bl time
ldr r3, [sp, #8]
add r9, r9, #4000
cmp r3, r0
beq .L81
.L84:
ldr ip, [sp, #12]
str r0, [ip, #0]
mov r0, r9
add sp, sp, #84
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L86:
.align 2
.L85:
.word .LANCHOR0+16
.word 3999
.size measure_number, .-measure_number
.align 2
.type measure, %function
measure:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
mov r0, #0
sub sp, sp, #24
bl time
str r0, [sp, #20]
.L88:
mov r0, #0
bl time
ldr r3, [sp, #20]
cmp r0, r3
beq .L88
add r8, sp, #24
str r0, [r8, #-4]!
mov r2, #8
mov r3, #0
mov r0, r8
bl measure_number
mov r2, #123
mov sl, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L91
mov r7, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L91+4
mov r6, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L91+8
mov r5, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r4, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r9, r0
mvn r3, #0
mov r0, r8
bl measure_number
mov r1, sl
str r0, [sp, #12]
mov r2, r7
mov r3, r6
ldr r0, .L91+12
str r5, [sp, #0]
stmib sp, {r4, r9} @ phole stm
bl printf
add sp, sp, #24
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
bx lr
.L92:
.align 2
.L91:
.word 123456
.word 12345678
.word 123456789
.word .LC0
.size measure, .-measure
.align 2
.type check, %function
check:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 128
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, lr}
ldr r3, .L97
sub sp, sp, #140
mov r5, r0
mov r6, r1
add r4, sp, #72
ldmia r3, {r0, r1}
mov r3, sp
stmia r3, {r0, r1}
mov r2, r5
mov r3, r6
add r1, r4, #63
mov r0, r4
bl number
add r7, sp, #8
mov r3, #0
strb r3, [r0, #0]
mov r2, r5
mov r3, r6
ldr r1, .L97+4
mov r0, r7
bl sprintf
mov r0, r4
mov r1, r7
bl strcmp
cmp r0, #0
bne .L96
add sp, sp, #140
ldmfd sp!, {r4, r5, r6, r7, lr}
bx lr
.L96:
mov r2, r5
mov r3, r6
ldr r0, .L97+8
str r4, [sp, #0]
bl printf
mov r0, #1
bl exit
.L98:
.align 2
.L97:
.word .LANCHOR0+16
.word .LC1
.word .LC2
.size check, .-check
.align 2
.global main
.type main, %function
main:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
bl measure
mov r4, #0
bl measure
mov r5, #0
bl measure
ldr r6, .L103
bl measure
mov r7, #0
mov sl, #1
mov fp, #0
b .L101
.L100:
adds r4, r4, sl
adc r5, r5, fp
.L101:
mov r0, r4
mov r1, r5
bl check
and r8, r4, r6
rsbs r0, r4, #0
rsc r1, r5, #0
and r9, r5, r7
bl check
orrs r8, r8, r9
bne .L100
mov r2, r4
mov r3, r5
ldr r0, .L103+4
bl printf
mov r0, r8
bl fflush
b .L100
.L104:
.align 2
.L103:
.word 262143
.word .LC3
.size main, .-main
.section .rodata
.align 2
.LANCHOR0 = . + 0
.type digits.4070, %object
.size digits.4070, 16
digits.4070:
.ascii "0123456789ABCDEF"
.type dummy_spec, %object
.size dummy_spec, 8
dummy_spec:
.byte 8
.byte 0
.byte 10
.byte 0
.short 0
.short 0
.section .rodata.str1.4,"aMS",%progbits,1
.align 2
.LC0:
.ascii "Conversions per second: 8:%d 123:%d 123456:%d 12345"
.ascii "678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
.ascii "%llu\000"
.space 3
.LC2:
.ascii "Error in formatting %llu:'%s'\012\000"
.space 1
.LC3:
.ascii "\015Tested %llu \000"
.ident "GCC: (Debian 4.4.5-8) 4.4.5"
.section .note.GNU-stack,"",%progbits
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 10:13 ` Denys Vlasenko
2012-03-28 10:24 ` roma1390
@ 2012-03-28 10:31 ` roma1390
2012-03-28 11:23 ` Denys Vlasenko
1 sibling, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28 10:31 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On 2012.03.28 13:13, Denys Vlasenko wrote:
> Third: switch to algorithm #1 and test whether it fares better.
> To do that, go to test_new.c
> and replace
> #if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
> with
> #if 1 ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
> (there are two instances of this line there),
> then recompile and rerun the test, and post the result.
After applying the following diff:
--- test_new.org.c 2012-03-28 13:25:14.000000000 +0300
+++ test_new.c 2012-03-28 13:25:30.000000000 +0300
@@ -7,7 +7,7 @@
* (with permission from the author, Douglas W. Jones).
*/
-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
/* Formats correctly any integer in [0, 999999999] */
static noinline_for_stack
char *put_dec_full9(char *buf, unsigned q)
@@ -106,7 +106,7 @@
* Else (if long is 32 bits and long long is 64 bits) we use second one.
*/
-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ///LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
/* First algorithm: generic */
I get the following compile output:
$ arm-linux-gnueabi-gcc -O2 -Wall test_new.c -o test_new
test_new.c: In function ‘number’:
test_new.c:118: error: impossible constraint in ‘asm’
test_new.c:118: error: impossible constraint in ‘asm’
test_new.c:118: error: impossible constraint in ‘asm’
but after this diff:
--- test_new-with-ifdefs.c 2012-03-28 13:29:04.000000000 +0300
+++ test_new.c 2012-03-28 13:29:12.000000000 +0300
@@ -1,4 +1,4 @@
-#include "test_header.c"
+#include "test_header_arm.c"
/* Decimal conversion is by far the most typical, and is used
* for /proc and /sys data. This directly impacts e.g. top performance
it compiles fine.
Results are:
./test_new2
Conversions per second: 8:5420000 123:4452000 123456:3556000 12345678:3368000
123456789:2904000 2^32:2412000 2^64:1556000
Conversions per second: 8:5428000 123:4500000 123456:3644000 12345678:3328000
123456789:2832000 2^32:2408000 2^64:1572000
Conversions per second: 8:5372000 123:4396000 123456:3644000 12345678:3368000
123456789:2900000 2^32:2392000 2^64:1532000
Conversions per second: 8:5424000 123:4500000 123456:3608000 12345678:3284000
123456789:2896000 2^32:2416000 2^64:1572000
btw: the previous version (test_new) is still running, now at:
Tested 2446852096
roma1390
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 10:24 ` roma1390
@ 2012-03-28 10:33 ` Denys Vlasenko
2012-03-28 10:39 ` roma1390
2012-03-29 10:35 ` Denys Vlasenko
0 siblings, 2 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 10:33 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Wednesday 28 March 2012 12:24, roma1390 wrote:
> On 2012.03.28 13:13, Denys Vlasenko wrote:
> > Second: run
> > arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
> > and email me resulting test_{org,new}.s files.
>
> test_{org,new}.s attached.
Bingo.
bl __aeabi_uidivmod
Not good. Your gcc did not optimize division by constant.
Can you add "noinline_for_stack":
static noinline_for_stack <=== HERE
char *put_dec(char *buf, unsigned long long n)
and regenerate and resend the test_new.s?
--
vda
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 10:33 ` Denys Vlasenko
@ 2012-03-28 10:39 ` roma1390
2012-03-28 11:20 ` Denys Vlasenko
2012-03-29 10:35 ` Denys Vlasenko
1 sibling, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-28 10:39 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
[-- Attachment #1: Type: text/plain, Size: 637 bytes --]
On 2012.03.28 13:33, Denys Vlasenko wrote:
> On Wednesday 28 March 2012 12:24, roma1390 wrote:
>> On 2012.03.28 13:13, Denys Vlasenko wrote:
>>> Second: run
>>> arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
>>> and email me resulting test_{org,new}.s files.
>>
>> test_{org,new}.s attached.
>
>
> Bingo.
>
> bl __aeabi_uidivmod
>
> Not good. Your gcc did not optimize division by constant.
>
> Can you add "noinline_for_stack":
>
> static noinline_for_stack <=== HERE
> char *put_dec(char *buf, unsigned long long n)
>
> amd regenerate and resend the test_new.s?
>
Hello,
Your requested asm are attached.
roma1390
[-- Attachment #2: test_new-org-with-noinline_for_stack.diff --]
[-- Type: text/plain, Size: 560 bytes --]
--- test_new.org.c 2012-03-28 13:25:14.000000000 +0300
+++ test_new-org-with-noinline_for_stack.c 2012-03-28 13:37:26.000000000 +0300
@@ -110,7 +110,7 @@
/* First algorithm: generic */
-static
+static noinline_for_stack
char *put_dec(char *buf, unsigned long long n)
{
if (n >= 100*1000*1000) {
@@ -145,7 +145,7 @@
* (with permission from the author).
* Performs no 64-bit division and hence should be fast on 32-bit machines.
*/
-static
+static noinline_for_stack
char *put_dec(char *buf, unsigned long long n)
{
uint32_t d3, d2, d1, q, h;
[-- Attachment #3: test_new-org-with-noinline_for_stack.s --]
[-- Type: text/plain, Size: 14374 bytes --]
.cpu arm9tdmi
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 18, 4
.file "test_new-org-with-noinline_for_stack.c"
.text
.align 2
.type put_dec_trunc8, %function
put_dec_trunc8:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L5
stmfd sp!, {r4, r5, fp}
umull fp, ip, r1, r3
mov r2, ip
add r1, r1, #48
mov ip, r0
add r0, r2, r2, asl #2
sub r1, r1, r0, asl #1
cmp r2, #0
mov r0, ip
strb r1, [r0], #1
beq .L2
umull r4, r5, r2, r3
add r2, r2, #48
add fp, r5, r5, asl #2
sub r2, r2, fp, asl #1
cmp r5, #0
mov r1, r5
strb r2, [ip, #1]
add r0, r0, #1
beq .L2
umull r4, r5, r1, r3
add r1, r1, #48
add ip, r5, r5, asl #2
sub r1, r1, ip, asl #1
cmp r5, #0
mov r2, r5
strb r1, [r0], #1
beq .L2
umull r4, r5, r2, r3
add r2, r2, #48
add r1, r5, r5, asl #2
sub r2, r2, r1, asl #1
cmp r5, #0
strb r2, [r0], #1
beq .L2
add r2, r5, r5, asl #1
add r2, r5, r2, asl #2
rsb r2, r2, r2, asl #6
add r2, r5, r2, asl #2
mov r2, r2, asl #1
mov r2, r2, lsr #16
add r3, r5, #48
add r1, r2, r2, asl #2
sub r3, r3, r1, asl #1
cmp r2, #0
strb r3, [r0], #1
beq .L2
add r1, r2, r1, asl #3
add r1, r1, r1, asl #2
mov r3, r1, lsr #11
add r2, r2, #48
add r1, r3, r3, asl #2
sub r2, r2, r1, asl #1
cmp r3, #0
strb r2, [r0], #1
beq .L2
add r1, r3, r1, asl #3
add r1, r1, r1, asl #2
mov r2, r1, lsr #11
add r1, r2, r2, asl #2
add r3, r3, #48
cmp r2, #0
sub r3, r3, r1, asl #1
strb r3, [r0], #1
addne r2, r2, #48
strneb r2, [r0], #1
.L2:
ldmfd sp!, {r4, r5, fp}
bx lr
.L6:
.align 2
.L5:
.word 429496730
.size put_dec_trunc8, .-put_dec_trunc8
.align 2
.type put_dec_full4, %function
put_dec_full4:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add r3, r1, r1, asl #1
add r3, r3, r3, asl #4
add r3, r3, r3, asl #8
add r3, r1, r3, asl #2
mov r3, r3, lsr #19
add r2, r3, r3, asl #1
add r2, r3, r2, asl #2
rsb r2, r2, r2, asl #6
add r2, r3, r2, asl #2
mov r2, r2, asl #1
mov r2, r2, lsr #16
add ip, r2, r2, asl #2
stmfd sp!, {r4, r5}
add r4, r2, ip, asl #3
add r5, r3, r3, asl #2
add r1, r1, #48
add r4, r4, r4, asl #2
sub r1, r1, r5, asl #1
mov r4, r4, lsr #11
mov r5, r0
strb r1, [r5], #1
add r3, r3, #48
add r1, r4, r4, asl #2
add r2, r2, #48
sub r2, r2, r1, asl #1
sub ip, r3, ip, asl #1
add r1, r5, #1
add r4, r4, #48
strb ip, [r0, #1]
strb r2, [r5, #1]
add r0, r1, #2
strb r4, [r1, #1]
ldmfd sp!, {r4, r5}
bx lr
.size put_dec_full4, .-put_dec_full4
.align 2
.type put_dec, %function
put_dec:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
cmp r3, #0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
mov ip, r0
beq .L17
.L10:
mov r6, r2, lsr #16
add r1, r6, r6, asl #2
add r1, r6, r1, asl #2
mov r5, r3, asl #16
mov r4, r3, lsr #16
mov r5, r5, lsr #16
add r3, r6, r1, asl #1
mov r2, r2, asl #16
add r3, r6, r3, asl #2
mov sl, r5, asl #3
add r1, r4, r4, asl #2
mov r2, r2, lsr #16
add r1, r4, r1, asl #3
add r2, r2, r3, asl #5
rsb r3, r5, sl
add r3, r5, r3, asl #3
ldr r7, .L19
add r2, r2, r1, asl #4
add r2, r2, r3, asl #7
umull r3, r8, r7, r2
mov r8, r8, lsr #13
add r3, r8, r8, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
sub r1, r2, r3, asl #4
mov r0, ip
bl put_dec_full4
mov r3, r4, asl #10
sub r3, r3, r4, asl #6
add sl, sl, r5
rsb r3, r4, r3
add sl, sl, sl, asl #5
rsb r3, r4, r3, asl #3
rsb sl, r5, sl, asl #2
add sl, r3, sl, asl #3
add r6, r6, r6, asl #1
add r6, sl, r6, asl #1
add r8, r6, r8
umull r3, r6, r7, r8
mov r6, r6, lsr #13
add r3, r6, r6, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
sub r1, r8, r3, asl #4
bl put_dec_full4
add r3, r4, r4, asl #3
add r3, r3, r3, asl #5
add r2, r5, r5, asl #2
rsb r3, r4, r3, asl #2
add r3, r4, r3, asl #2
add r5, r5, r2, asl #2
add r5, r3, r5, asl #1
add r5, r5, r6
umull r3, r6, r7, r5
mov r6, r6, lsr #13
add r3, r6, r6, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
sub r1, r5, r3, asl #4
bl put_dec_full4
add r2, r4, r4, asl #4
add r2, r4, r2, asl #1
add r4, r4, r2, asl #3
adds r4, r6, r4
mov r3, r0
bne .L18
.L13:
mov r0, r3
ldrb r2, [r3, #-1]! @ zero_extendqisi2
cmp r2, #48
beq .L13
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
bx lr
.L17:
ldr r1, .L19+4
cmp r2, r1
bhi .L10
mov r1, r2
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
b put_dec_trunc8
.L18:
umull r3, r5, r7, r4
mov r5, r5, lsr #13
add r3, r5, r5, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
add r3, r3, r3, asl #2
sub r1, r4, r3, asl #4
bl put_dec_full4
cmp r5, #0
mov r3, r0
beq .L13
mov r1, r5
bl put_dec_full4
mov r3, r0
b .L13
.L20:
.align 2
.L19:
.word -776530087
.word 99999999
.size put_dec, .-put_dec
.align 2
.type number, %function
number:
@ Function supports interworking.
@ args = 8, pretend = 0, frame = 120
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
sub sp, sp, #124
ldrb ip, [sp, #161] @ zero_extendqisi2
ldrb r4, [sp, #162] @ zero_extendqisi2
ldrh r5, [sp, #166]
str r0, [sp, #12]
ands r0, ip, #64
str r4, [sp, #20]
str r5, [sp, #36]
ldrh r4, [sp, #164]
str ip, [sp, #16]
beq .L74
ldr r0, [sp, #20]
subs r0, r0, #10
movne r0, #1
.L74:
ands r5, ip, #16
str r0, [sp, #28]
and r0, ip, #32
andne ip, ip, #254
strne ip, [sp, #16]
str r5, [sp, #40]
ldr r5, [sp, #16]
str r0, [sp, #8]
tst r5, #2
beq .L26
cmp r3, #0
blt .L65
tst r5, #4
bne .L76
ldr r5, [sp, #16]
tst r5, #8
beq .L26
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
mov ip, #32
str r0, [sp, #24]
str ip, [sp, #32]
b .L29
.L26:
mov r0, #0
str r4, [sp, #24]
str r0, [sp, #32]
.L29:
ldr r4, [sp, #28]
cmp r4, #0
beq .L31
ldr r5, [sp, #24]
ldr ip, [sp, #20]
sub r0, r5, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
cmp ip, #16
str r0, [sp, #24]
subeq r0, r0, #1
moveq r0, r0, asl #16
moveq r0, r0, lsr #16
streq r0, [sp, #24]
.L31:
cmp r3, #0
bne .L32
cmp r2, #7
bls .L77
.L32:
ldr r0, [sp, #20]
cmp r0, #10
beq .L35
cmp r0, #16
movne r5, #3
moveq r5, #4
sub r8, r0, #1
ldr sl, .L79
rsb r0, r5, #32
mov r4, #0
add r9, sp, #52
sub r6, r5, #32
str r1, [sp, #44]
mov fp, r0
.L38:
mov ip, r2, lsr r5
cmp r6, #0
orr ip, ip, r3, asl fp
movge ip, r3, lsr r6
mov r7, r3, lsr r5
and r2, r2, #255
and r2, r2, r8
mov r0, ip
ldr r1, [sp, #8]
ldrb ip, [sl, r2] @ zero_extendqisi2
mov r2, r0
orr ip, r1, ip
orrs r0, r2, r7
strb ip, [r9, r4]
mov r3, r7
add r4, r4, #1
bne .L38
ldr r1, [sp, #44]
sub r5, r4, #1
.L34:
ldr r2, [sp, #36]
ldr r3, [sp, #24]
mov ip, r2, asl #16
cmp r4, ip, asr #16
movgt ip, r4, asl #16
mov ip, ip, lsr #16
ldr r0, [sp, #16]
rsb r7, ip, r3
mov r7, r7, asl #16
mov r7, r7, lsr #16
tst r0, #17
mov r0, r7
bne .L41
sub r0, r7, #1
mov r0, r0, asl #16
cmp r0, #0
mov r0, r0, lsr #16
blt .L41
ldr r3, [sp, #12]
mov r8, r0
add r2, r3, #1
add r2, r2, r0
mov r6, #32
.L43:
cmp r1, r3
strhib r6, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L43
rsb r7, r7, #1
ldr r2, [sp, #12]
add r0, r0, r7
mov r0, r0, asl #16
add r3, r8, #1
sub r0, r0, #65536
add r2, r2, r3
str r2, [sp, #12]
mov r0, r0, lsr #16
.L41:
ldr r3, [sp, #32]
cmp r3, #0
beq .L44
ldr r2, [sp, #12]
cmp r2, r1
strccb r3, [r2, #0]
ldr r3, [sp, #12]
add r3, r3, #1
str r3, [sp, #12]
.L44:
ldr r2, [sp, #28]
cmp r2, #0
beq .L46
ldr r3, [sp, #12]
cmp r3, r1
ldrcc r2, [sp, #12]
movcc r3, #48
strccb r3, [r2, #0]
ldr r2, [sp, #12]
ldr r3, [sp, #20]
add r2, r2, #1
cmp r3, #16
str r2, [sp, #12]
beq .L78
.L46:
ldr r2, [sp, #40]
cmp r2, #0
movne r6, r0
movne r7, r6, asl #16
bne .L50
sub r6, r0, #1
ldr r3, [sp, #16]
mov r6, r6, asl #16
tst r3, #1
mov r6, r6, lsr #16
movne r8, #48
moveq r8, #32
movs r7, r6, asl #16
bmi .L50
sub r2, r0, #1
ldr r3, [sp, #12]
mov r2, r2, asl #16
add r2, r3, r2, lsr #16
add r2, r2, #1
.L54:
cmp r1, r3
strhib r8, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L54
rsb r6, r0, r6
mov r6, r6, asl #16
mov r6, r6, lsr #16
str r3, [sp, #12]
mov r7, r6, asl #16
.L50:
sub r3, ip, #1
mov r3, r3, asl #16
cmp r4, r3, asr #16
bgt .L55
sub r0, ip, #2
ldr r3, [sp, #12]
mov r0, r0, asl #16
add r0, r3, r0, asr #16
add r0, r0, #1
mov ip, #48
.L57:
cmp r1, r3
strhib ip, [r3, #0]
add r3, r3, #1
rsb r2, r3, r0
cmp r4, r2
ble .L57
str r3, [sp, #12]
.L55:
cmp r5, #0
blt .L58
add r2, sp, #52
ldr r3, [sp, #12]
sub r0, r2, #1
add r2, r2, r5
.L60:
cmp r1, r3
ldrhib ip, [r2, #0] @ zero_extendqisi2
sub r2, r2, #1
strhib ip, [r3, #0]
cmp r2, r0
add r3, r3, #1
bne .L60
ldr r4, [sp, #12]
add r5, r5, #1
add r4, r4, r5
str r4, [sp, #12]
.L58:
cmp r7, #0
ble .L61
ldr r5, [sp, #12]
sub r2, r6, #1
mov r2, r2, asl #16
add r2, r5, r2, lsr #16
add r2, r2, #1
mov r3, r5
mov r0, #32
.L63:
cmp r1, r3
strhib r0, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L63
str r3, [sp, #12]
.L61:
ldr r0, [sp, #12]
add sp, sp, #124
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L76:
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
str r0, [sp, #24]
mov r0, #43
str r0, [sp, #32]
b .L29
.L65:
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
mov ip, #45
rsbs r2, r2, #0
rsc r3, r3, #0
str r0, [sp, #24]
str ip, [sp, #32]
b .L29
.L77:
add r0, r2, #48
strb r0, [sp, #52]
mov r5, r3
mov r4, #1
b .L34
.L35:
add r4, sp, #52
mov r0, r4
str r1, [sp, #4]
bl put_dec
rsb r4, r4, r0
sub r5, r4, #1
ldr r1, [sp, #4]
b .L34
.L78:
cmp r1, r2
ldrhi r2, [sp, #8]
orrhi r3, r2, #88
ldrhi r2, [sp, #12]
strhib r3, [r2, #0]
ldr r3, [sp, #12]
add r3, r3, #1
str r3, [sp, #12]
b .L46
.L80:
.align 2
.L79:
.word .LANCHOR0
.size number, .-number
.align 2
.type measure_number, %function
measure_number:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 72
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
mov fp, r3
sub sp, sp, #84
ldr r3, [r0, #0]
str r0, [sp, #12]
mov r0, #0
str r3, [sp, #8]
mov sl, r2
bl time
ldr r3, [sp, #8]
mov r9, #0
add r6, sp, #16
cmp r3, r0
mov r7, r9
ldr r5, .L88
add r8, r6, #63
bne .L87
.L84:
ldr r4, .L88+4
.L83:
mov ip, sp
ldmia r5, {r0, r1}
mov r2, sl
stmia ip, {r0, r1}
mov r3, fp
mov r0, r6
mov r1, r8
bl number
sub r4, r4, #1
cmn r4, #1
strb r7, [r0, #0]
bne .L83
mov r0, #0
bl time
ldr r3, [sp, #8]
add r9, r9, #4000
cmp r3, r0
beq .L84
.L87:
ldr ip, [sp, #12]
str r0, [ip, #0]
mov r0, r9
add sp, sp, #84
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L89:
.align 2
.L88:
.word .LANCHOR0+16
.word 3999
.size measure_number, .-measure_number
.align 2
.type measure, %function
measure:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
mov r0, #0
sub sp, sp, #24
bl time
str r0, [sp, #20]
.L91:
mov r0, #0
bl time
ldr r3, [sp, #20]
cmp r0, r3
beq .L91
add r8, sp, #24
str r0, [r8, #-4]!
mov r2, #8
mov r3, #0
mov r0, r8
bl measure_number
mov r2, #123
mov sl, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L94
mov r7, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L94+4
mov r6, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L94+8
mov r5, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r4, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r9, r0
mvn r3, #0
mov r0, r8
bl measure_number
mov r1, sl
str r0, [sp, #12]
mov r2, r7
mov r3, r6
ldr r0, .L94+12
str r5, [sp, #0]
stmib sp, {r4, r9} @ phole stm
bl printf
add sp, sp, #24
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
bx lr
.L95:
.align 2
.L94:
.word 123456
.word 12345678
.word 123456789
.word .LC0
.size measure, .-measure
.align 2
.type check, %function
check:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 128
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, lr}
ldr r3, .L100
sub sp, sp, #140
mov r5, r0
mov r6, r1
add r4, sp, #72
ldmia r3, {r0, r1}
mov r3, sp
stmia r3, {r0, r1}
mov r2, r5
mov r3, r6
add r1, r4, #63
mov r0, r4
bl number
add r7, sp, #8
mov r3, #0
strb r3, [r0, #0]
mov r2, r5
mov r3, r6
ldr r1, .L100+4
mov r0, r7
bl sprintf
mov r0, r4
mov r1, r7
bl strcmp
cmp r0, #0
bne .L99
add sp, sp, #140
ldmfd sp!, {r4, r5, r6, r7, lr}
bx lr
.L99:
mov r2, r5
mov r3, r6
ldr r0, .L100+8
str r4, [sp, #0]
bl printf
mov r0, #1
bl exit
.L101:
.align 2
.L100:
.word .LANCHOR0+16
.word .LC1
.word .LC2
.size check, .-check
.align 2
.global main
.type main, %function
main:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
bl measure
mov r4, #0
bl measure
mov r5, #0
bl measure
ldr r6, .L106
bl measure
mov r7, #0
mov sl, #1
mov fp, #0
b .L104
.L103:
adds r4, r4, sl
adc r5, r5, fp
.L104:
mov r0, r4
mov r1, r5
bl check
and r8, r4, r6
rsbs r0, r4, #0
rsc r1, r5, #0
and r9, r5, r7
bl check
orrs r8, r8, r9
bne .L103
mov r2, r4
mov r3, r5
ldr r0, .L106+4
bl printf
mov r0, r8
bl fflush
b .L103
.L107:
.align 2
.L106:
.word 262143
.word .LC3
.size main, .-main
.section .rodata
.align 2
.LANCHOR0 = . + 0
.type digits.3938, %object
.size digits.3938, 16
digits.3938:
.ascii "0123456789ABCDEF"
.type dummy_spec, %object
.size dummy_spec, 8
dummy_spec:
.byte 8
.byte 0
.byte 10
.byte 0
.short 0
.short 0
.section .rodata.str1.4,"aMS",%progbits,1
.align 2
.LC0:
.ascii "Conversions per second: 8:%d 123:%d 123456:%d 12345"
.ascii "678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
.ascii "%llu\000"
.space 3
.LC2:
.ascii "Error in formatting %llu:'%s'\012\000"
.space 1
.LC3:
.ascii "\015Tested %llu \000"
.ident "GCC: (Debian 4.4.5-8) 4.4.5"
.section .note.GNU-stack,"",%progbits
[-- Attachment #4: test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.diff --]
[-- Type: text/plain, Size: 1332 bytes --]
--- test_new.org.c 2012-03-28 13:25:14.000000000 +0300
+++ test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c 2012-03-28 13:37:00.000000000 +0300
@@ -1,4 +1,4 @@
-#include "test_header.c"
+#include "test_header_arm.c"
/* Decimal conversion is by far the most typical, and is used
* for /proc and /sys data. This directly impacts e.g. top performance
@@ -7,7 +7,7 @@
* (with permission from the author, Douglas W. Jones).
*/
-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ////LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
/* Formats correctly any integer in [0, 999999999] */
static noinline_for_stack
char *put_dec_full9(char *buf, unsigned q)
@@ -106,11 +106,11 @@
* Else (if long is 32 bits and long long is 64 bits) we use second one.
*/
-#if LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
+#if 1 ///LONG_MAX > ((1UL<<31)-1) || LLONG_MAX > ((1ULL<<63)-1)
/* First algorithm: generic */
-static
+static noinline_for_stack
char *put_dec(char *buf, unsigned long long n)
{
if (n >= 100*1000*1000) {
@@ -145,7 +145,7 @@
* (with permission from the author).
* Performs no 64-bit division and hence should be fast on 32-bit machines.
*/
-static
+static noinline_for_stack
char *put_dec(char *buf, unsigned long long n)
{
uint32_t d3, d2, d1, q, h;
[-- Attachment #5: test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.s --]
[-- Type: text/plain, Size: 14681 bytes --]
.cpu arm9tdmi
.fpu softvfp
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 2
.eabi_attribute 30, 2
.eabi_attribute 18, 4
.file "test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c"
.text
.align 2
.type put_dec_full9, %function
put_dec_full9:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 40
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L3
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
umull r4, r5, r1, r3
mov r6, r5
umull r4, r5, r6, r3
umull r7, r8, r5, r3
mov ip, r8
umull r7, r8, ip, r3
mov r2, r8
umull r7, r8, r2, r3
mov r4, r5
add r5, r8, r8, asl #1
add r5, r8, r5, asl #2
rsb r5, r5, r5, asl #6
add r5, r8, r5, asl #2
mov r5, r5, asl #1
mov r3, r8
mov r5, r5, lsr #16
add r1, r1, #48
add r8, r6, r6, asl #2
sub sp, sp, #40
add r7, r5, r5, asl #2
sub r8, r1, r8, asl #1
add r6, r6, #48
mov r1, r0
str r7, [sp, #12]
strb r8, [r1], #1
add r7, r5, r7, asl #3
str r6, [sp, #0]
add r5, r5, #48
str r5, [sp, #32]
ldr r5, [sp, #0]
add r6, r4, #48
add r4, r4, r4, asl #2
str r6, [sp, #16]
sub r4, r5, r4, asl #1
str r4, [sp, #0]
add r6, ip, #48
ldr r4, [sp, #16]
add r7, r7, r7, asl #2
str r6, [sp, #20]
mov r7, r7, lsr #11
add r6, r2, #48
add ip, ip, ip, asl #2
str r6, [sp, #24]
add r8, r1, #1
add r9, r7, r7, asl #2
sub ip, r4, ip, asl #1
str r8, [sp, #8]
str ip, [sp, #16]
ldr r5, [sp, #20]
ldr ip, [sp, #24]
add r8, r7, r9, asl #3
add r6, r3, #48
add r7, r7, #48
str r6, [sp, #28]
str r7, [sp, #36]
add r8, r8, r8, asl #2
add r2, r2, r2, asl #2
add r3, r3, r3, asl #2
ldr r4, [sp, #36]
sub r2, r5, r2, asl #1
sub r3, ip, r3, asl #1
ldr r5, [sp, #28]
ldr ip, [sp, #12]
mov r8, r8, lsr #11
add r7, r8, r8, asl #2
sub r7, r4, r7, asl #1
sub r4, r5, ip, asl #1
ldr r5, [sp, #32]
ldr sl, [sp, #8]
sub r9, r5, r9, asl #1
ldr r5, [sp, #0]
add sl, sl, #1
str sl, [sp, #4]
strb r5, [r0, #1]
ldr r0, [sp, #16]
add sl, sl, #1
strb r0, [r1, #1]
ldr r1, [sp, #8]
add fp, sl, #1
strb r2, [r1, #1]
ldr r2, [sp, #4]
add r6, fp, #1
add ip, r6, #1
add r8, r8, #48
strb r3, [r2, #1]
add r0, ip, #2
strb r4, [sl, #1]
strb r9, [fp, #1]
strb r7, [r6, #1]
strb r8, [ip, #1]
add sp, sp, #40
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp}
bx lr
.L4:
.align 2
.L3:
.word 429496730
.size put_dec_full9, .-put_dec_full9
.align 2
.type put_dec_trunc8, %function
put_dec_trunc8:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L9
stmfd sp!, {r4, r5, fp}
umull fp, ip, r1, r3
mov r2, ip
add r1, r1, #48
mov ip, r0
add r0, r2, r2, asl #2
sub r1, r1, r0, asl #1
cmp r2, #0
mov r0, ip
strb r1, [r0], #1
beq .L6
umull r4, r5, r2, r3
add r2, r2, #48
add fp, r5, r5, asl #2
sub r2, r2, fp, asl #1
cmp r5, #0
mov r1, r5
strb r2, [ip, #1]
add r0, r0, #1
beq .L6
umull r4, r5, r1, r3
add r1, r1, #48
add ip, r5, r5, asl #2
sub r1, r1, ip, asl #1
cmp r5, #0
mov r2, r5
strb r1, [r0], #1
beq .L6
umull r4, r5, r2, r3
add r2, r2, #48
add r1, r5, r5, asl #2
sub r2, r2, r1, asl #1
cmp r5, #0
strb r2, [r0], #1
beq .L6
add r2, r5, r5, asl #1
add r2, r5, r2, asl #2
rsb r2, r2, r2, asl #6
add r2, r5, r2, asl #2
mov r2, r2, asl #1
mov r2, r2, lsr #16
add r3, r5, #48
add r1, r2, r2, asl #2
sub r3, r3, r1, asl #1
cmp r2, #0
strb r3, [r0], #1
beq .L6
add r1, r2, r1, asl #3
add r1, r1, r1, asl #2
mov r3, r1, lsr #11
add r2, r2, #48
add r1, r3, r3, asl #2
sub r2, r2, r1, asl #1
cmp r3, #0
strb r2, [r0], #1
beq .L6
add r1, r3, r1, asl #3
add r1, r1, r1, asl #2
mov r2, r1, lsr #11
add r1, r2, r2, asl #2
add r3, r3, #48
cmp r2, #0
sub r3, r3, r1, asl #1
strb r3, [r0], #1
addne r2, r2, #48
strneb r2, [r0], #1
.L6:
ldmfd sp!, {r4, r5, fp}
bx lr
.L10:
.align 2
.L9:
.word 429496730
.size put_dec_trunc8, .-put_dec_trunc8
.align 2
.type put_dec, %function
put_dec:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
cmp r3, #0
stmfd sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
beq .L28
.L25:
ldr sl, .L29
ldr r5, .L29+4
ldr r6, .L29+8
mov r8, #0
.L27:
#APP
@ 118 "test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c" 1
umull fp, ip, r5, r2
cmn fp, r5
adcs ip, ip, r6
adc fp, r8, #0
@ 0 "" 2
mov r1, r8
#APP
@ 118 "test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm.c" 1
umlal ip, fp, r6, r2
umlal ip, r1, r5, r3
mov ip, #0
adds fp, r1, fp
adc ip, ip, #0
umlal fp, ip, r6, r3
@ 0 "" 2
mov r4, fp, lsr #29
orr r4, r4, ip, asl #3
rsb r1, r4, r4, asl #5
rsb r1, r1, r1, asl #6
add r1, r4, r1, asl #3
add r1, r1, r1, asl #2
add r1, r1, r1, asl #2
add r1, r1, r1, asl #2
mov r7, ip, lsr #29
sub r1, r2, r1, asl #9
bl put_dec_full9
cmp r7, #0
mov r2, r4
mov r3, r7
bne .L27
cmp r4, sl
bhi .L27
ldr r3, .L29+12
cmp r4, r3
bhi .L16
.L18:
mov r1, r4
ldmfd sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
b put_dec_trunc8
.L28:
ldr r1, .L29+12
cmp r2, r1
movls r4, r2
bls .L18
ldr r1, .L29
cmp r2, r1
movls r4, r2
bhi .L25
.L16:
mov r1, r4
ldmfd sp!, {r4, r5, r6, r7, r8, sl, fp, lr}
b put_dec_full9
.L30:
.align 2
.L29:
.word 999999999
.word 917808535
.word -1989124287
.word 99999999
.size put_dec, .-put_dec
.align 2
.type number, %function
number:
@ Function supports interworking.
@ args = 8, pretend = 0, frame = 120
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
sub sp, sp, #124
ldrb ip, [sp, #161] @ zero_extendqisi2
ldrb r4, [sp, #162] @ zero_extendqisi2
ldrh r5, [sp, #166]
str r0, [sp, #12]
ands r0, ip, #64
str r4, [sp, #20]
str r5, [sp, #36]
ldrh r4, [sp, #164]
str ip, [sp, #16]
beq .L84
ldr r0, [sp, #20]
subs r0, r0, #10
movne r0, #1
.L84:
ands r5, ip, #16
str r0, [sp, #28]
and r0, ip, #32
andne ip, ip, #254
strne ip, [sp, #16]
str r5, [sp, #40]
ldr r5, [sp, #16]
str r0, [sp, #8]
tst r5, #2
beq .L36
cmp r3, #0
blt .L75
tst r5, #4
bne .L86
ldr r5, [sp, #16]
tst r5, #8
beq .L36
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
mov ip, #32
str r0, [sp, #24]
str ip, [sp, #32]
b .L39
.L36:
mov r0, #0
str r4, [sp, #24]
str r0, [sp, #32]
.L39:
ldr r4, [sp, #28]
cmp r4, #0
beq .L41
ldr r5, [sp, #24]
ldr ip, [sp, #20]
sub r0, r5, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
cmp ip, #16
str r0, [sp, #24]
subeq r0, r0, #1
moveq r0, r0, asl #16
moveq r0, r0, lsr #16
streq r0, [sp, #24]
.L41:
cmp r3, #0
bne .L42
cmp r2, #7
bls .L87
.L42:
ldr r0, [sp, #20]
cmp r0, #10
beq .L45
cmp r0, #16
movne r5, #3
moveq r5, #4
sub r8, r0, #1
ldr sl, .L89
rsb r0, r5, #32
mov r4, #0
add r9, sp, #52
sub r6, r5, #32
str r1, [sp, #44]
mov fp, r0
.L48:
mov ip, r2, lsr r5
cmp r6, #0
orr ip, ip, r3, asl fp
movge ip, r3, lsr r6
mov r7, r3, lsr r5
and r2, r2, #255
and r2, r2, r8
mov r0, ip
ldr r1, [sp, #8]
ldrb ip, [sl, r2] @ zero_extendqisi2
mov r2, r0
orr ip, r1, ip
orrs r0, r2, r7
strb ip, [r9, r4]
mov r3, r7
add r4, r4, #1
bne .L48
ldr r1, [sp, #44]
sub r5, r4, #1
.L44:
ldr r2, [sp, #36]
ldr r3, [sp, #24]
mov ip, r2, asl #16
cmp r4, ip, asr #16
movgt ip, r4, asl #16
mov ip, ip, lsr #16
ldr r0, [sp, #16]
rsb r7, ip, r3
mov r7, r7, asl #16
mov r7, r7, lsr #16
tst r0, #17
mov r0, r7
bne .L51
sub r0, r7, #1
mov r0, r0, asl #16
cmp r0, #0
mov r0, r0, lsr #16
blt .L51
ldr r3, [sp, #12]
mov r8, r0
add r2, r3, #1
add r2, r2, r0
mov r6, #32
.L53:
cmp r1, r3
strhib r6, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L53
rsb r7, r7, #1
ldr r2, [sp, #12]
add r0, r0, r7
mov r0, r0, asl #16
add r3, r8, #1
sub r0, r0, #65536
add r2, r2, r3
str r2, [sp, #12]
mov r0, r0, lsr #16
.L51:
ldr r3, [sp, #32]
cmp r3, #0
beq .L54
ldr r2, [sp, #12]
cmp r2, r1
strccb r3, [r2, #0]
ldr r3, [sp, #12]
add r3, r3, #1
str r3, [sp, #12]
.L54:
ldr r2, [sp, #28]
cmp r2, #0
beq .L56
ldr r3, [sp, #12]
cmp r3, r1
ldrcc r2, [sp, #12]
movcc r3, #48
strccb r3, [r2, #0]
ldr r2, [sp, #12]
ldr r3, [sp, #20]
add r2, r2, #1
cmp r3, #16
str r2, [sp, #12]
beq .L88
.L56:
ldr r2, [sp, #40]
cmp r2, #0
movne r6, r0
movne r7, r6, asl #16
bne .L60
sub r6, r0, #1
ldr r3, [sp, #16]
mov r6, r6, asl #16
tst r3, #1
mov r6, r6, lsr #16
movne r8, #48
moveq r8, #32
movs r7, r6, asl #16
bmi .L60
sub r2, r0, #1
ldr r3, [sp, #12]
mov r2, r2, asl #16
add r2, r3, r2, lsr #16
add r2, r2, #1
.L64:
cmp r1, r3
strhib r8, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L64
rsb r6, r0, r6
mov r6, r6, asl #16
mov r6, r6, lsr #16
str r3, [sp, #12]
mov r7, r6, asl #16
.L60:
sub r3, ip, #1
mov r3, r3, asl #16
cmp r4, r3, asr #16
bgt .L65
sub r0, ip, #2
ldr r3, [sp, #12]
mov r0, r0, asl #16
add r0, r3, r0, asr #16
add r0, r0, #1
mov ip, #48
.L67:
cmp r1, r3
strhib ip, [r3, #0]
add r3, r3, #1
rsb r2, r3, r0
cmp r4, r2
ble .L67
str r3, [sp, #12]
.L65:
cmp r5, #0
blt .L68
add r2, sp, #52
ldr r3, [sp, #12]
sub r0, r2, #1
add r2, r2, r5
.L70:
cmp r1, r3
ldrhib ip, [r2, #0] @ zero_extendqisi2
sub r2, r2, #1
strhib ip, [r3, #0]
cmp r2, r0
add r3, r3, #1
bne .L70
ldr r4, [sp, #12]
add r5, r5, #1
add r4, r4, r5
str r4, [sp, #12]
.L68:
cmp r7, #0
ble .L71
ldr r5, [sp, #12]
sub r2, r6, #1
mov r2, r2, asl #16
add r2, r5, r2, lsr #16
add r2, r2, #1
mov r3, r5
mov r0, #32
.L73:
cmp r1, r3
strhib r0, [r3, #0]
add r3, r3, #1
cmp r3, r2
bne .L73
str r3, [sp, #12]
.L71:
ldr r0, [sp, #12]
add sp, sp, #124
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L86:
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
str r0, [sp, #24]
mov r0, #43
str r0, [sp, #32]
b .L39
.L75:
sub r0, r4, #1
mov r0, r0, asl #16
mov r0, r0, lsr #16
mov ip, #45
rsbs r2, r2, #0
rsc r3, r3, #0
str r0, [sp, #24]
str ip, [sp, #32]
b .L39
.L87:
add r0, r2, #48
strb r0, [sp, #52]
mov r5, r3
mov r4, #1
b .L44
.L45:
add r4, sp, #52
mov r0, r4
str r1, [sp, #4]
bl put_dec
rsb r4, r4, r0
sub r5, r4, #1
ldr r1, [sp, #4]
b .L44
.L88:
cmp r1, r2
ldrhi r2, [sp, #8]
orrhi r3, r2, #88
ldrhi r2, [sp, #12]
strhib r3, [r2, #0]
ldr r3, [sp, #12]
add r3, r3, #1
str r3, [sp, #12]
b .L56
.L90:
.align 2
.L89:
.word .LANCHOR0
.size number, .-number
.align 2
.type measure_number, %function
measure_number:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 72
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
mov fp, r3
sub sp, sp, #84
ldr r3, [r0, #0]
str r0, [sp, #12]
mov r0, #0
str r3, [sp, #8]
mov sl, r2
bl time
ldr r3, [sp, #8]
mov r9, #0
add r6, sp, #16
cmp r3, r0
mov r7, r9
ldr r5, .L98
add r8, r6, #63
bne .L97
.L94:
ldr r4, .L98+4
.L93:
mov ip, sp
ldmia r5, {r0, r1}
mov r2, sl
stmia ip, {r0, r1}
mov r3, fp
mov r0, r6
mov r1, r8
bl number
sub r4, r4, #1
cmn r4, #1
strb r7, [r0, #0]
bne .L93
mov r0, #0
bl time
ldr r3, [sp, #8]
add r9, r9, #4000
cmp r3, r0
beq .L94
.L97:
ldr ip, [sp, #12]
str r0, [ip, #0]
mov r0, r9
add sp, sp, #84
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
bx lr
.L99:
.align 2
.L98:
.word .LANCHOR0+16
.word 3999
.size measure_number, .-measure_number
.align 2
.type measure, %function
measure:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 8
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
mov r0, #0
sub sp, sp, #24
bl time
str r0, [sp, #20]
.L101:
mov r0, #0
bl time
ldr r3, [sp, #20]
cmp r0, r3
beq .L101
add r8, sp, #24
str r0, [r8, #-4]!
mov r2, #8
mov r3, #0
mov r0, r8
bl measure_number
mov r2, #123
mov sl, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L104
mov r7, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L104+4
mov r6, r0
mov r3, #0
mov r0, r8
bl measure_number
ldr r2, .L104+8
mov r5, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r4, r0
mov r3, #0
mov r0, r8
bl measure_number
mvn r2, #0
mov r9, r0
mvn r3, #0
mov r0, r8
bl measure_number
mov r1, sl
str r0, [sp, #12]
mov r2, r7
mov r3, r6
ldr r0, .L104+12
str r5, [sp, #0]
stmib sp, {r4, r9} @ phole stm
bl printf
add sp, sp, #24
ldmfd sp!, {r4, r5, r6, r7, r8, r9, sl, lr}
bx lr
.L105:
.align 2
.L104:
.word 123456
.word 12345678
.word 123456789
.word .LC0
.size measure, .-measure
.align 2
.type check, %function
check:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 128
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r4, r5, r6, r7, lr}
ldr r3, .L110
sub sp, sp, #140
mov r5, r0
mov r6, r1
add r4, sp, #72
ldmia r3, {r0, r1}
mov r3, sp
stmia r3, {r0, r1}
mov r2, r5
mov r3, r6
add r1, r4, #63
mov r0, r4
bl number
add r7, sp, #8
mov r3, #0
strb r3, [r0, #0]
mov r2, r5
mov r3, r6
ldr r1, .L110+4
mov r0, r7
bl sprintf
mov r0, r4
mov r1, r7
bl strcmp
cmp r0, #0
bne .L109
add sp, sp, #140
ldmfd sp!, {r4, r5, r6, r7, lr}
bx lr
.L109:
mov r2, r5
mov r3, r6
ldr r0, .L110+8
str r4, [sp, #0]
bl printf
mov r0, #1
bl exit
.L111:
.align 2
.L110:
.word .LANCHOR0+16
.word .LC1
.word .LC2
.size check, .-check
.align 2
.global main
.type main, %function
main:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
stmfd sp!, {r3, r4, r5, r6, r7, r8, r9, sl, fp, lr}
bl measure
mov r4, #0
bl measure
mov r5, #0
bl measure
ldr r6, .L116
bl measure
mov r7, #0
mov sl, #1
mov fp, #0
b .L114
.L113:
adds r4, r4, sl
adc r5, r5, fp
.L114:
mov r0, r4
mov r1, r5
bl check
and r8, r4, r6
rsbs r0, r4, #0
rsc r1, r5, #0
and r9, r5, r7
bl check
orrs r8, r8, r9
bne .L113
mov r2, r4
mov r3, r5
ldr r0, .L116+4
bl printf
mov r0, r8
bl fflush
b .L113
.L117:
.align 2
.L116:
.word 262143
.word .LC3
.size main, .-main
.section .rodata
.align 2
.LANCHOR0 = . + 0
.type digits.4059, %object
.size digits.4059, 16
digits.4059:
.ascii "0123456789ABCDEF"
.type dummy_spec, %object
.size dummy_spec, 8
dummy_spec:
.byte 8
.byte 0
.byte 10
.byte 0
.short 0
.short 0
.section .rodata.str1.4,"aMS",%progbits,1
.align 2
.LC0:
.ascii "Conversions per second: 8:%d 123:%d 123456:%d 12345"
.ascii "678:%d 123456789:%d 2^32:%d 2^64:%d\012\000"
.LC1:
.ascii "%llu\000"
.space 3
.LC2:
.ascii "Error in formatting %llu:'%s'\012\000"
.space 1
.LC3:
.ascii "\015Tested %llu \000"
.ident "GCC: (Debian 4.4.5-8) 4.4.5"
.section .note.GNU-stack,"",%progbits
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 10:39 ` roma1390
@ 2012-03-28 11:20 ` Denys Vlasenko
0 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 11:20 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Wednesday 28 March 2012 12:39, roma1390 wrote:
> On 2012.03.28 13:33, Denys Vlasenko wrote:
> > On Wednesday 28 March 2012 12:24, roma1390 wrote:
> >> On 2012.03.28 13:13, Denys Vlasenko wrote:
> >>> Second: run
> >>> arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
> >>> and email me resulting test_{org,new}.s files.
> >>
> >> test_{org,new}.s attached.
> >
> >
> > Bingo.
> >
> > bl __aeabi_uidivmod
> >
> > Not good. Your gcc did not optimize division by constant.
> >
> > Can you add "noinline_for_stack":
> >
> > static noinline_for_stack    <=== HERE
> > char *put_dec(char *buf, unsigned long long n)
> >
> > and regenerate and resend the test_new.s?
> >
>
> Hello,
>
> Your requested asm are attached.
Asm looks good: there are only three long multiplies in the
put_dec + put_dec_full4, and no divisions at all.
Please speed-test this version.
--
vda
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 10:31 ` roma1390
@ 2012-03-28 11:23 ` Denys Vlasenko
2012-03-29 5:23 ` roma1390
0 siblings, 1 reply; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-28 11:23 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Wednesday 28 March 2012 12:31, roma1390 wrote:
> On 2012.03.28 13:13, Denys Vlasenko wrote:
> > Third: switch to algorithm #1 and test whether it fares better.
> > To do that, go to test_new.c
> > and replace
> > #if LONG_MAX> ((1UL<<31)-1) || LLONG_MAX> ((1ULL<<63)-1)
> > with
> > #if 1 ////LONG_MAX> ((1UL<<31)-1) || LLONG_MAX> ((1ULL<<63)-1)
> > (there are two instances of this line there),
> > then recompile and rerun the test, and post the result.
> but after this diff:
> --- test_new-with-ifdefs.c 2012-03-28 13:29:04.000000000 +0300
> +++ test_new.c 2012-03-28 13:29:12.000000000 +0300
> @@ -1,4 +1,4 @@
> -#include "test_header.c"
> +#include "test_header_arm.c"
>
> /* Decimal conversion is by far the most typical, and is used
> * for /proc and /sys data. This directly impacts e.g. top performance
>
> compiles fine.
>
> results are:
> ./test_new2
> Conversions per second: 8:5420000 123:4452000 123456:3556000 12345678:3368000
> 123456789:2904000 2^32:2412000 2^64:1556000
> Conversions per second: 8:5428000 123:4500000 123456:3644000 12345678:3328000
> 123456789:2832000 2^32:2408000 2^64:1572000
> Conversions per second: 8:5372000 123:4396000 123456:3644000 12345678:3368000
> 123456789:2900000 2^32:2392000 2^64:1532000
> Conversions per second: 8:5424000 123:4500000 123456:3608000 12345678:3284000
> 123456789:2896000 2^32:2416000 2^64:1572000
It looks OK - it is a bit faster than the original code;
but I expect algorithm #2 will do better.
BTW, please always also run the test of the original code when you measure speed,
and post both results.
--
vda
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-27 15:33 ` Denys Vlasenko
@ 2012-03-29 5:16 ` roma1390
2012-03-29 10:33 ` Denys Vlasenko
0 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-29 5:16 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On 2012.03.27 18:33, Denys Vlasenko wrote:
> Can you also let it run longer, to test at least up to
> first 5 billion numbers (which is > 2^32)?
First test after a ~20-hour run:
Tested 12514754560
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 11:23 ` Denys Vlasenko
@ 2012-03-29 5:23 ` roma1390
2012-03-29 10:33 ` Denys Vlasenko
0 siblings, 1 reply; 31+ messages in thread
From: roma1390 @ 2012-03-29 5:23 UTC (permalink / raw)
To: Denys Vlasenko
Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On 2012.03.28 14:23, Denys Vlasenko wrote:
> It looks OK - it is a bit faster than the original code;
> but I expect algorithm #2 will do better.
>
> BTW, please always also run the test of the original code when you measure speed,
> and post both results.
And the test results are:
./test_new-org-with-noinline_for_stack
Conversions per second: 8:5188000 123:4340000 123456:3536000 12345678:3164000
123456789:2036000 2^32:2032000 2^64:1544000
Conversions per second: 8:5016000 123:4320000 123456:3536000 12345678:3272000
123456789:2032000 2^32:1964000 2^64:1540000
Conversions per second: 8:5192000 123:4336000 123456:3420000 12345678:3268000
123456789:2036000 2^32:2032000 2^64:1544000
Conversions per second: 8:5164000 123:4336000 123456:3536000 12345678:3272000
123456789:1968000 2^32:2032000 2^64:1544000
Tested 524288 ^C
# ./test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm
Conversions per second: 8:5020000 123:4332000 123456:3536000 12345678:3272000
123456789:2828000 2^32:2260000 2^64:1536000
Conversions per second: 8:5192000 123:4332000 123456:3416000 12345678:3252000
123456789:2828000 2^32:2336000 2^64:1540000
Conversions per second: 8:5172000 123:4336000 123456:3540000 12345678:3272000
123456789:2736000 2^32:2336000 2^64:1536000
Conversions per second: 8:5192000 123:4192000 123456:3520000 12345678:3276000
123456789:2828000 2^32:2340000 2^64:1484000
Tested 524288 ^C
roma1390
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-29 5:16 ` roma1390
@ 2012-03-29 10:33 ` Denys Vlasenko
0 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-29 10:33 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Thu, Mar 29, 2012 at 7:16 AM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.27 18:33, Denys Vlasenko wrote:
>>
>> Can you also let it run longer, to test at least up to
>> first 5 billion numbers (which is > 2^32)?
>
> First test after ~20 hour run:
> Tested 12514754560
Looks like it's ok :)
In case you expect that it will finish: well, it won't.
It will roll over at 2^64-1 after a few years, and start from 0 :) :)
So please feel free to ^C
--
vda
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-29 5:23 ` roma1390
@ 2012-03-29 10:33 ` Denys Vlasenko
0 siblings, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-29 10:33 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Thu, Mar 29, 2012 at 7:23 AM, roma1390 <roma1390@gmail.com> wrote:
> On 2012.03.28 14:23, Denys Vlasenko wrote:
>>
>> It looks OK - it is a bit faster than the original code;
>> but I expect algorithm #2 will do better.
>>
>> BTW, please always also run the test of the original code when you measure speed,
>> and post both results.
>
>
> And the test results are:
>
> ./test_new-org-with-noinline_for_stack
> Conversions per second: 8:5188000 123:4340000 123456:3536000
> 12345678:3164000 123456789:2036000 2^32:2032000 2^64:1544000
> Conversions per second: 8:5016000 123:4320000 123456:3536000
> 12345678:3272000 123456789:2032000 2^32:1964000 2^64:1540000
> Conversions per second: 8:5192000 123:4336000 123456:3420000
> 12345678:3268000 123456789:2036000 2^32:2032000 2^64:1544000
> Conversions per second: 8:5164000 123:4336000 123456:3536000
> 12345678:3272000 123456789:1968000 2^32:2032000 2^64:1544000
> Tested 524288 ^C
>
> # ./test_new-org-with-noinline_for_stack_if-LONG_MAX_header_arm
> Conversions per second: 8:5020000 123:4332000 123456:3536000
> 12345678:3272000 123456789:2828000 2^32:2260000 2^64:1536000
> Conversions per second: 8:5192000 123:4332000 123456:3416000
> 12345678:3252000 123456789:2828000 2^32:2336000 2^64:1540000
> Conversions per second: 8:5172000 123:4336000 123456:3540000
> 12345678:3272000 123456789:2736000 2^32:2336000 2^64:1536000
> Conversions per second: 8:5192000 123:4192000 123456:3520000
> 12345678:3276000 123456789:2828000 2^32:2340000 2^64:1484000
> Tested 524288 ^C
Thanks! They look good now.
--
vda
* Re: [PATCH 0/1] vsprintf: optimize decimal conversion (again)
2012-03-28 10:33 ` Denys Vlasenko
2012-03-28 10:39 ` roma1390
@ 2012-03-29 10:35 ` Denys Vlasenko
1 sibling, 0 replies; 31+ messages in thread
From: Denys Vlasenko @ 2012-03-29 10:35 UTC (permalink / raw)
To: roma1390; +Cc: linux-kernel, Andrew Morton, Douglas W Jones, Michal Nazarewicz
On Wed, Mar 28, 2012 at 12:33 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Wednesday 28 March 2012 12:24, roma1390 wrote:
>> On 2012.03.28 13:13, Denys Vlasenko wrote:
>> > Second: run
>> > arm-linux-gnueabi-gcc -O2 -Wall test_{org,new}.c -S
>> > and email me resulting test_{org,new}.s files.
>>
>> test_{org,new}.s attached.
>
> Bingo.
>
> bl __aeabi_uidivmod
>
> Not good. Your gcc did not optimize division by constant.
>
> Can you add "noinline_for_stack":
>
> static noinline_for_stack <=== HERE
> char *put_dec(char *buf, unsigned long long n)
>
> and regenerate and resend the test_new.s?
Since you confirmed that noinlining helps,
I recommend you file a gcc bug report:
your gcc misses an optimization when it inlines
that function.
--
vda
end of thread, other threads:[~2012-03-29 10:35 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-26 18:47 [PATCH 0/1] vsprintf: optimize decimal conversion (again) Denys Vlasenko
2012-03-26 18:51 ` [PATCH 1/1] " Denys Vlasenko
2012-03-26 19:51 ` Andrew Morton
2012-03-26 19:56 ` Denys Vlasenko
2012-03-26 20:13 ` Andrew Morton
2012-03-26 20:18 ` Geert Uytterhoeven
2012-03-26 23:18 ` Denys Vlasenko
2012-03-27 0:30 ` Denys Vlasenko
2012-03-27 3:49 ` H. Peter Anvin
2012-03-26 20:20 ` H. Peter Anvin
2012-03-27 17:12 ` Michal Nazarewicz
2012-03-27 17:17 ` H. Peter Anvin
2012-03-27 0:26 ` Denys Vlasenko
2012-03-27 12:08 ` [PATCH 0/1] " roma1390
2012-03-27 15:32 ` Denys Vlasenko
2012-03-27 15:42 ` Denys Vlasenko
2012-03-28 5:56 ` roma1390
2012-03-28 10:13 ` Denys Vlasenko
2012-03-28 10:24 ` roma1390
2012-03-28 10:33 ` Denys Vlasenko
2012-03-28 10:39 ` roma1390
2012-03-28 11:20 ` Denys Vlasenko
2012-03-29 10:35 ` Denys Vlasenko
2012-03-28 10:31 ` roma1390
2012-03-28 11:23 ` Denys Vlasenko
2012-03-29 5:23 ` roma1390
2012-03-29 10:33 ` Denys Vlasenko
2012-03-27 13:49 ` roma1390
2012-03-27 15:33 ` Denys Vlasenko
2012-03-29 5:16 ` roma1390
2012-03-29 10:33 ` Denys Vlasenko