[PATCH] Reduce the number of expensive division instructions done by _parse

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
@ 2012-02-09 15:48 David Howells
  2012-02-09 16:28 ` Eric Dumazet
  0 siblings, 1 reply; 13+ messages in thread
From: David Howells @ 2012-02-09 15:48 UTC (permalink / raw)
  To: adobriyan; +Cc: torvalds, dhowells, linux-kernel

_parse_integer() does one or two division instructions (which are slow) per
digit parsed to perform the overflow check.

Furthermore, these are particularly expensive examples of division instruction
as the number of clock cycles required to complete them may go up with the
position of the most significant set bit in the dividend:

	if (*res > div_u64(ULLONG_MAX - val, base))

which is as maximal as possible.

Worse, on 32-bit arches, more than one of these division instructions may be
required per digit.

So, assuming we don't support a base of more than 16, skip the check if the
top nibble of the result is not set at this point.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 lib/kstrtox.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/lib/kstrtox.c b/lib/kstrtox.c
index 7a94c8f..f80c896 100644
--- a/lib/kstrtox.c
+++ b/lib/kstrtox.c
@@ -64,7 +64,7 @@ unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long
 
 		if (val >= base)
 			break;
-		if (*res > div_u64(ULLONG_MAX - val, base))
+		if (unlikely(*res >> 60) && *res > div_u64(ULLONG_MAX - val, base))
 			overflow = 1;
 		*res = *res * base + val;
 		rv++;


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 15:48 [PATCH] Reduce the number of expensive division instructions done by _parse_integer() David Howells
@ 2012-02-09 16:28 ` Eric Dumazet
  2012-02-09 16:42   ` Eric Dumazet
  2012-02-09 16:58   ` Linus Torvalds
  0 siblings, 2 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-02-09 16:28 UTC (permalink / raw)
  To: David Howells; +Cc: adobriyan, torvalds, linux-kernel

Le jeudi 09 février 2012 à 15:48 +0000, David Howells a écrit :
> _parse_integer() does one or two division instructions (which are slow) per
> digit parsed to perform the overflow check.
> 
> Furthermore, these are particularly expensive examples of division instruction
> as the number of clock cycles required to complete them may go up with the
> position of the most significant set bit in the dividend:
> 
> 	if (*res > div_u64(ULLONG_MAX - val, base))
> 
> which is as maximal as possible.
> 
> Worse, on 32-bit arches, more than one of these division instructions may be
> required per digit.
> 
> So, assuming we don't support a base of more than 16, skip the check if the
> top nibble of the result is not set at this point.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> ---
> 
>  lib/kstrtox.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/lib/kstrtox.c b/lib/kstrtox.c
> index 7a94c8f..f80c896 100644
> --- a/lib/kstrtox.c
> +++ b/lib/kstrtox.c
> @@ -64,7 +64,7 @@ unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long
>  
>  		if (val >= base)
>  			break;
> -		if (*res > div_u64(ULLONG_MAX - val, base))
> +		if (unlikely(*res >> 60) && *res > div_u64(ULLONG_MAX - val, base))
>  			overflow = 1;
>  		*res = *res * base + val;
>  		rv++;
> 
> -

You could avoid the divide and have cleaner code I think.

	unsigned long long next = *res * base + val;

	if (next < *res)
		overflow = 1;
	*res = next;



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 16:28 ` Eric Dumazet
@ 2012-02-09 16:42   ` Eric Dumazet
  2012-02-09 16:58   ` Linus Torvalds
  1 sibling, 0 replies; 13+ messages in thread
From: Eric Dumazet @ 2012-02-09 16:42 UTC (permalink / raw)
  To: David Howells; +Cc: adobriyan, torvalds, linux-kernel

Le jeudi 09 février 2012 à 17:28 +0100, Eric Dumazet a écrit :

> You could avoid the divide and have cleaner code I think.
> 
> 	unsigned long long next = *res * base + val;
> 
> 	if (next < *res)
> 		overflow = 1;
> 	*res = next;
> 

Oh well, this one is better.

        unsigned long long next = *res * base;

        if (next < *res)
                overflow = 1;
        *res = next + val;





^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 16:28 ` Eric Dumazet
  2012-02-09 16:42   ` Eric Dumazet
@ 2012-02-09 16:58   ` Linus Torvalds
  2012-02-09 17:08     ` Linus Torvalds
  2012-02-09 17:46     ` David Howells
  1 sibling, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2012-02-09 16:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Howells, adobriyan, linux-kernel

On Thu, Feb 9, 2012 at 8:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> You could avoid the divide and have cleaner code I think.
>
>        unsigned long long next = *res * base + val;
>
>        if (next < *res)
>                overflow = 1;
>        *res = next;

No. That pattern only works for a single addition.

For multiplication, you can overflow *and* still be bigger than the
starting value.

You either need to do the multiplication in a wider type (which would
be the natural way to do it on many 64-bit architectures, since they
often support 64x64->128 bit multiplies, and then doing a carry-add is
trivial), or you need to check the value before the multiplication. Or
you need to have a multiply that gives you overflow information, which
most don't.

So David's patch looks ok, although if this really is
performance-critical (and I could imagine some crazy /proc or /sys
access loads where the divide really does show up very clearly) I do
think it could be even done more efficiently. But probably only if we
start doing some arch-specific stuff.

                  Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 16:58   ` Linus Torvalds
@ 2012-02-09 17:08     ` Linus Torvalds
  2012-02-09 17:46     ` David Howells
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2012-02-09 17:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Howells, adobriyan, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 662 bytes --]

On Thu, Feb 9, 2012 at 8:58 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So David's patch looks ok, although if this really is
> performance-critical (and I could imagine some crazy /proc or /sys
> access loads where the divide really does show up very clearly) I do
> think it could be even done more efficiently. But probably only if we
> start doing some arch-specific stuff.

Looking at the code generated, the "val >> 60" thing actually does
generate a shift, and at least on x86-64, the attached patch generates
better code.

David, does this work for you? I also removed the nasty pointer
indirection crap.

                        Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1226 bytes --]

 lib/kstrtox.c |   18 +++++++++++++-----
 1 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/lib/kstrtox.c b/lib/kstrtox.c
index 7a94c8f14e29..b1dd3e7d88cb 100644
--- a/lib/kstrtox.c
+++ b/lib/kstrtox.c
@@ -44,12 +44,13 @@ const char *_parse_integer_fixup_radix(const char *s, unsigned int *base)
  *
  * Don't you dare use this function.
  */
-unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long *res)
+unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long *p)
 {
+	unsigned long long res;
 	unsigned int rv;
 	int overflow;
 
-	*res = 0;
+	res = 0;
 	rv = 0;
 	overflow = 0;
 	while (*s) {
@@ -64,12 +65,19 @@ unsigned int _parse_integer(const char *s, unsigned int base, unsigned long long
 
 		if (val >= base)
 			break;
-		if (*res > div_u64(ULLONG_MAX - val, base))
-			overflow = 1;
-		*res = *res * base + val;
+		/*
+		 * Check for overflow only if we are within range of
+		 * it in the max base we support (16)
+		 */
+		if (unlikely(res & (~0ull << 60))) {
+			if (res > div_u64(ULLONG_MAX - val, base))
+				overflow = 1;
+		}
+		res = res * base + val;
 		rv++;
 		s++;
 	}
+	*p = res;
 	if (overflow)
 		rv |= KSTRTOX_OVERFLOW;
 	return rv;

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 16:58   ` Linus Torvalds
  2012-02-09 17:08     ` Linus Torvalds
@ 2012-02-09 17:46     ` David Howells
  2012-02-09 17:56       ` Linus Torvalds
  2012-02-09 18:07       ` David Howells
  1 sibling, 2 replies; 13+ messages in thread
From: David Howells @ 2012-02-09 17:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, Eric Dumazet, adobriyan, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Looking at the code generated, the "val >> 60" thing actually does
> generate a shift, and at least on x86-64, the attached patch generates
> better code.

On fixed-size instruction arches, the runtime shift is probably the better
option, as simply loading 64-bit large constant would take likely take at
least four instructions - and might involve a shift anyway.  On the other
hand, it seems the compiler can optimise your suggestion fairly well.  In both
cases, the 64-bit arithmetic can be reduced to 32-bit arithmetic on the MSW
only on 32-bit arches.

On x86_64 we have:

  400649:       48 89 d8                mov    %rbx,%rax
  40064c:       48 c1 e8 3c             shr    $0x3c,%rax
  400650:       48 85 c0                test   %rax,%rax
  400653:       75 52                   jne    4006a7 <_parse_integer+0xa7>

And on i386 we have:

 8048532:       8b 54 24 14             mov    0x14(%esp),%edx
 ...
 8048538:       89 c7                   mov    %eax,%edi
 804853a:       c1 ea 1c                shr    $0x1c,%edx
 804853d:       85 d2                   test   %edx,%edx
 804853f:       75 79                   jne    80485ba <_parse_integer+0xda>

With your code, we have on x86_64:

  40062d:       49 bf 00 00 00 00 00    movabs $0xf000000000000000,%r15
  400634:       00 00 f0 
  ...
  400659:       4c 85 fb                test   %r15,%rbx
  40065c:       75 59                   jne    4006b7 <_parse_integer+0xb7>

And on i386:

 804853c:       89 c7                   mov    %eax,%edi
 804853e:       f7 44 24 1c 00 00 00    testl  $0xf0000000,0x1c(%esp)
 8048545:       f0 
 8048546:       75 79                   jne    80485c1 <_parse_integer+0xe1>

But it will work too.  And I like the pointer indirection removal as well.

I'm not sure there's a lot to choose between them, though I prefer mine as I
think it produces slightly smaller code.

Want me to wrap these changes up with my patch description?

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 17:46     ` David Howells
@ 2012-02-09 17:56       ` Linus Torvalds
  2012-02-09 18:07       ` David Howells
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2012-02-09 17:56 UTC (permalink / raw)
  To: David Howells; +Cc: Eric Dumazet, adobriyan, linux-kernel

On Thu, Feb 9, 2012 at 9:46 AM, David Howells <dhowells@redhat.com> wrote:
>
> On fixed-size instruction arches, the runtime shift is probably the better
> option, as simply loading 64-bit large constant would take likely take at
> least four instructions - and might involve a shift anyway.

Bah. The constant is hoisted out of the loop on anything that has
enough registers - and "cannot generate constants" is almost
universally associated with "fair amount of registers".

The exception to that would be architectures with small register sets,
and particularly 32-bit ones. There, the 64-bit value and the multiply
likely not only takes more registers, but also kills registers due to
a function call. On the other hand "few 32-bit registers" is mainly
32-bit 386, which has no problems with a 32-bit constant.

And even if you have "out-of-line 64-bit multiply killing registers so
that you can't hoist the constant", even then gcc knows all about
those "big constants can be expensive" and likely knows how to
optimize them.

In particular, this big constant takes two instructions to generate
even in the absolute *worst* case: one to generate the constant 15,
one to shift it left by 60 (or 28, on a 32-bit setup).

But I don't have access to some broken "I can't do constants well"
architecture to actually test. Gcc could generate crazy bad code, but
almost certainly won't.

                           Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 17:46     ` David Howells
  2012-02-09 17:56       ` Linus Torvalds
@ 2012-02-09 18:07       ` David Howells
  2012-02-09 18:08         ` Linus Torvalds
  2012-02-09 18:50         ` David Howells
  1 sibling, 2 replies; 13+ messages in thread
From: David Howells @ 2012-02-09 18:07 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, Eric Dumazet, adobriyan, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Bah. The constant is hoisted out of the loop on anything that has
> enough registers - and "cannot generate constants" is almost
> universally associated with "fair amount of registers".

True.

Do you want me to attach my patch description to your patch, or do you want to
do that?

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 18:07       ` David Howells
@ 2012-02-09 18:08         ` Linus Torvalds
  2012-02-09 18:50         ` David Howells
  1 sibling, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2012-02-09 18:08 UTC (permalink / raw)
  To: David Howells; +Cc: Eric Dumazet, adobriyan, linux-kernel

On Thu, Feb 9, 2012 at 10:07 AM, David Howells <dhowells@redhat.com> wrote:
>
> Do you want me to attach my patch description to your patch, or do you want to
> do that?

I'll commit it as yours and with your commentary, my changes were
purely cosmetic and just made the x86-64 inner loop look better.

                 Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 18:07       ` David Howells
  2012-02-09 18:08         ` Linus Torvalds
@ 2012-02-09 18:50         ` David Howells
  2012-02-09 19:14           ` Linus Torvalds
  2012-02-09 19:23           ` David Howells
  1 sibling, 2 replies; 13+ messages in thread
From: David Howells @ 2012-02-09 18:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, Eric Dumazet, adobriyan, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > Do you want me to attach my patch description to your patch, or do you
> > want to do that?
> 
> I'll commit it as yours and with your commentary, my changes were
> purely cosmetic and just made the x86-64 inner loop look better.

Thanks!

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 18:50         ` David Howells
@ 2012-02-09 19:14           ` Linus Torvalds
  2012-02-10 13:50             ` Alexey Dobriyan
  2012-02-09 19:23           ` David Howells
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2012-02-09 19:14 UTC (permalink / raw)
  To: David Howells; +Cc: Eric Dumazet, adobriyan, linux-kernel

Btw, what was the load that you noticed this on?

Because quite frankly, I think we only support bases 8/10/16 in the
kernel, and if you really have some case where this all is expensive,
it might be better to simply have three different functions for the
three bases. That would turn the multiplies into constants too, and
also simplify the character tests.

That said, I can't really see how this could ever be all that hot a
function. Did you ever see it in a profile, or was this all just from
looking at the code?

                         Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 18:50         ` David Howells
  2012-02-09 19:14           ` Linus Torvalds
@ 2012-02-09 19:23           ` David Howells
  1 sibling, 0 replies; 13+ messages in thread
From: David Howells @ 2012-02-09 19:23 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, Eric Dumazet, adobriyan, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Btw, what was the load that you noticed this on?
> 
> Because quite frankly, I think we only support bases 8/10/16 in the
> kernel, and if you really have some case where this all is expensive,
> it might be better to simply have three different functions for the
> three bases. That would turn the multiplies into constants too, and
> also simplify the character tests.
> 
> That said, I can't really see how this could ever be all that hot a
> function. Did you ever see it in a profile, or was this all just from
> looking at the code?

Just by looking at the code.  I can't think of anything particularly where
this is likely to be encountered in a critical path.

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Reduce the number of expensive division instructions done by _parse_integer()
  2012-02-09 19:14           ` Linus Torvalds
@ 2012-02-10 13:50             ` Alexey Dobriyan
  0 siblings, 0 replies; 13+ messages in thread
From: Alexey Dobriyan @ 2012-02-10 13:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Howells, Eric Dumazet, linux-kernel

On Thu, Feb 9, 2012 at 10:14 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:

Ugh, I'm late to the party.

> Because quite frankly, I think we only support bases 8/10/16 in the
> kernel, and if you really have some case where this all is expensive,
> it might be better to simply have three different functions for the
> three bases. That would turn the multiplies into constants too, and
> also simplify the character tests.

base 2 is used to sort of autolimit input to 0/1 characters.

> That said, I can't really see how this could ever be all that hot a
> function. Did you ever see it in a profile, or was this all just from
> looking at the code?

That's why there was no such check from the beginning -- not
performance critical.
We could maintain small table from which digit overflow can happen,
but since this is already committed...

There is CONFIG_TEST_KSTRTOX, does it still passes?

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-02-10 13:50 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-09 15:48 [PATCH] Reduce the number of expensive division instructions done by _parse_integer() David Howells
2012-02-09 16:28 ` Eric Dumazet
2012-02-09 16:42   ` Eric Dumazet
2012-02-09 16:58   ` Linus Torvalds
2012-02-09 17:08     ` Linus Torvalds
2012-02-09 17:46     ` David Howells
2012-02-09 17:56       ` Linus Torvalds
2012-02-09 18:07       ` David Howells
2012-02-09 18:08         ` Linus Torvalds
2012-02-09 18:50         ` David Howells
2012-02-09 19:14           ` Linus Torvalds
2012-02-10 13:50             ` Alexey Dobriyan
2012-02-09 19:23           ` David Howells

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.