util-linux.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] libuuid: improve uuid_unparse() performance
@ 2020-03-24 21:26 Aurelien LAJOIE
  2020-03-25 11:10 ` Karel Zak
  2020-03-25 14:16 ` Peter Cordes
  0 siblings, 2 replies; 7+ messages in thread
From: Aurelien LAJOIE @ 2020-03-24 21:26 UTC (permalink / raw)
  To: util-linux; +Cc: Aurelien LAJOIE

There is 2 improvements:

 * remove useless uuid_unpack,
 * directly print the hexa format from memory without using printf
   we can do this as the bytes order is the network byte order

The improvement is important, some results for 1000000 uuid_unparse calls:

Little Endian Ubuntu:
before took 382623 us
after  took  36740 us

Big Endian OpenBSD:
before took 3138172 us
after  took  180116 us

Signed-off-by: Aurelien LAJOIE <orel@melix.net>
---
 libuuid/src/unparse.c | 35 +++++++++++++++++------------------
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/libuuid/src/unparse.c b/libuuid/src/unparse.c
index a95bbb042..62bb3ef26 100644
--- a/libuuid/src/unparse.c
+++ b/libuuid/src/unparse.c
@@ -36,41 +36,40 @@
 
 #include "uuidP.h"
 
-static const char *fmt_lower =
-	"%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x";
-
-static const char *fmt_upper =
-	"%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X";
+char const __str_digits_lower[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
+char const __str_digits_upper[36] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
 
 #ifdef UUID_UNPARSE_DEFAULT_UPPER
-#define FMT_DEFAULT fmt_upper
+#define STR_DIGIT_DEFAULT __str_digits_upper
 #else
-#define FMT_DEFAULT fmt_lower
+#define STR_DIGIT_DEFAULT __str_digits_lower
 #endif
 
-static void uuid_unparse_x(const uuid_t uu, char *out, const char *fmt)
+static void uuid_fmt(char *buf, const uuid_t uuid, char const fmt[36])
 {
-	struct uuid uuid;
+	char *p = buf;
 
-	uuid_unpack(uu, &uuid);
-	sprintf(out, fmt,
-		uuid.time_low, uuid.time_mid, uuid.time_hi_and_version,
-		uuid.clock_seq >> 8, uuid.clock_seq & 0xFF,
-		uuid.node[0], uuid.node[1], uuid.node[2],
-		uuid.node[3], uuid.node[4], uuid.node[5]);
+	for (int i = 0; i < 16; i++) {
+		if (i == 4 || i == 6 || i == 8 || i == 10) {
+			*p++ = '-';
+		}
+		*p++ = fmt[uuid[i] >> 4];
+		*p++ = fmt[uuid[i] & 15];
+	}
+	*p = '\0';
 }
 
 void uuid_unparse_lower(const uuid_t uu, char *out)
 {
-	uuid_unparse_x(uu, out,	fmt_lower);
+	uuid_fmt(out, uu, __str_digits_lower);
 }
 
 void uuid_unparse_upper(const uuid_t uu, char *out)
 {
-	uuid_unparse_x(uu, out,	fmt_upper);
+	uuid_fmt(out, uu, __str_digits_upper);
 }
 
 void uuid_unparse(const uuid_t uu, char *out)
 {
-	uuid_unparse_x(uu, out, FMT_DEFAULT);
+	uuid_fmt(out, uu, STR_DIGIT_DEFAULT);
 }
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] libuuid: improve uuid_unparse() performance
  2020-03-24 21:26 [PATCH] libuuid: improve uuid_unparse() performance Aurelien LAJOIE
@ 2020-03-25 11:10 ` Karel Zak
  2020-03-26  0:54   ` Aurélien Lajoie
  2020-03-25 14:16 ` Peter Cordes
  1 sibling, 1 reply; 7+ messages in thread
From: Karel Zak @ 2020-03-25 11:10 UTC (permalink / raw)
  To: Aurelien LAJOIE; +Cc: util-linux

On Tue, Mar 24, 2020 at 10:26:25PM +0100, Aurelien LAJOIE wrote:
> There is 2 improvements:
> 
>  * remove useless uuid_unpack,
>  * directly print the hexa format from memory without using printf
>    we can do this as the bytes order is the network byte order

I'm not sure, but are you sure that whole UUID is in big-endian order? 
I think that last part (aka "node", 6 bytes) is not subject to swapping. 
It seems uuid_unpack() does nothing with the last part of the UUID.

But your patch works on LE as well as on BE, so I probably miss
something :-)

> before took 382623 us
> after  took  36740 us
> 
> Big Endian OpenBSD:
> before took 3138172 us
> after  took  180116 us

I guess all this is about sprintf(), another way would be to use
uuid_unpack() but avoid sprintf().

    Karel

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] libuuid: improve uuid_unparse() performance
  2020-03-24 21:26 [PATCH] libuuid: improve uuid_unparse() performance Aurelien LAJOIE
  2020-03-25 11:10 ` Karel Zak
@ 2020-03-25 14:16 ` Peter Cordes
  2020-03-26  1:06   ` Aurélien Lajoie
  1 sibling, 1 reply; 7+ messages in thread
From: Peter Cordes @ 2020-03-25 14:16 UTC (permalink / raw)
  To: Aurelien LAJOIE; +Cc: util-linux

Nice optimization, and yes converting to hex is something CPUs can do
really efficiently.  Not surprising that scanf and other overhead was
a huge part of the total time.  I have some suggestions to make it
even more efficient:

* Use static const for your lookup tables (avoiding GOT overhead vs.
globals), and make them only 16 bytes not 36
* Make the helper function take its first 2 args in the same order as
the external API so it doesn't need to shuffle registers around before
tailcalling, just put that 3rd arg in a register and jmp.  (The
function is big enough that GCC chooses not to inline it into all both
different callers at -O2, but will at -O3.)
* The variant with the fixed default could be defined as an alias for
the existing one, I think, so it can have the same address as the
lower or upper function.  This saves code size by not having another
lea / jmp, or worse a 3rd copy of the whole function if it inlines.
* Avoid extra loads of the source data by reading into a tmp var,
instead of re-accessing the same uuid[i] twice, the 2nd time after a
store to *p which might overlap; the compiler can't prove otherwise.

If you really are bottlenecking on UUID throughput, see my SIMD answer
on https://stackoverflow.com/questions/53823756/how-to-convert-a-binary-integer-number-to-a-hex-string
with x86 SSE2 (baseline for x86-64), SSSE3, AVX2 variable-shift, and
AVX512VBMI integer -> hex manual vectorization that can do 8 input
bytes -> 16 hex digits at once.  Or with YMM vectors, 32 hex digits.
The asm should be straightforward to translate to intrinsics.  (Remove
the part that reverses the byte-order from x86 little-endian to
printing order, since uuid bytes are apparently already in the right
order).  You'd need some shuffling to store to the right places around
the '-' formatting but that's doable.  Using an 8-byte store that
overlaps where you want a '-', *then* storing the '-', then a 4-byte
store of the bottom of the register, could work well to avoid one
shuffle with SSE2.  Or with SSSE3 or higher, use pshufb to shuffle in
the '-' bytes

On Tue, Mar 24, 2020 at 6:35 PM Aurelien LAJOIE <orel@melix.net> wrote:

>
>  libuuid/src/unparse.c | 35 +++++++++++++++++------------------
>  1 file changed, 17 insertions(+), 18 deletions(-)
>
> diff --git a/libuuid/src/unparse.c b/libuuid/src/unparse.c
> index a95bbb042..62bb3ef26 100644
> --- a/libuuid/src/unparse.c
> +++ b/libuuid/src/unparse.c
> @@ -36,41 +36,40 @@
>
>  #include "uuidP.h"
>
> -static const char *fmt_lower =
> -       "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x";
> -
> -static const char *fmt_upper =
> -       "%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X";
> +char const __str_digits_lower[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
> +char const __str_digits_upper[36] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";


Shouldn't these be static, not global, like the old format strings
were?  Unless/until we want them somewhere else as well, these should
probably be private, and only 16 bytes each because the only current
use is for mapping a nibble -> hex ASCII.

Also, leading double underscore identifiers are reserved for use by
the C implementation.  If we want to use a global name, it should
probably be uuid__str_digits_lower or something like that.  And if
not, the name can be hexdigits_lower / upper.

>
> -static void uuid_unparse_x(const uuid_t uu, char *out, const char *fmt)
> +static void uuid_fmt(char *buf, const uuid_t uuid, char const fmt[36])
>  {
> +       char *p = buf;
> +       for (int i = 0; i < 16; i++) {
> +               if (i == 4 || i == 6 || i == 8 || i == 10) {
> +                       *p++ = '-';
> +               }
> +               *p++ = fmt[uuid[i] >> 4];
> +               *p++ = fmt[uuid[i] & 15];


It's slightly more efficient to load into unsigned tmp; the compiler
can't prove that buf and uuid don't overlap so it actually reloads for
the 2nd statement.  This is bad because we're pretty much going to
bottleneck or close to it on throughput of load/store instructions, on
a typical modern x86 for example.

char *restrict out would solve the same problem, and presumably no
caller would ever pass overlapping buffers.  And if they did, would
rather have this function read the original bytes instead of reloading
hex-digit bytes as binary UUID bytes.  Although at that point we're
into UB territory.

>
> +       }
> +       *p = '\0';
>  }
>
>  void uuid_unparse_lower(const uuid_t uu, char *out)
>  {
> -       uuid_unparse_x(uu, out, fmt_lower);
> +       uuid_fmt(out, uu, __str_digits_lower);
>  }
>

Best to have uuid_fmt take args in the same order as these wrappers,
so if a compiler decides not to inline it, the wrapper functions can
just put the digit-table pointer into another register and tailcall.
For example, your version compiles like this for x86-64: (on the
Godbolt compiler explorer, clang -O2 -fPIC)

uuid_unparse_lower_orig:
    mov rax, rdi
    lea rdx, [rip + hexdigits_lower]    # add a 3rd arg
    mov rdi, rsi
    mov rsi, rax             # swap the first 2 args
    jmp uuid_fmt_orig # TAILCALL

vs.

uuid_unparse_lower:
    lea rdx, hexdigits_lower[rip]          # add a 3rd arg
    jmp uuid_fmt                          # and tailcall

uuid_t is a typedef for an array of unsigned char[16] so as a function
arg it's just a pointer.

https://godbolt.org/z/fcZFhi is what I've been playing around with, in
case I don't get back to this and actually make a patch myself.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] libuuid: improve uuid_unparse() performance
  2020-03-25 11:10 ` Karel Zak
@ 2020-03-26  0:54   ` Aurélien Lajoie
  0 siblings, 0 replies; 7+ messages in thread
From: Aurélien Lajoie @ 2020-03-26  0:54 UTC (permalink / raw)
  To: util-linux

On Wed, Mar 25, 2020 at 12:11 PM Karel Zak <kzak@redhat.com> wrote:
>
> On Tue, Mar 24, 2020 at 10:26:25PM +0100, Aurelien LAJOIE wrote:
> > There is 2 improvements:
> >
> >  * remove useless uuid_unpack,
> >  * directly print the hexa format from memory without using printf
> >    we can do this as the bytes order is the network byte order
>
> I'm not sure, but are you sure that whole UUID is in big-endian order?
> I think that last part (aka "node", 6 bytes) is not subject to swapping.
> It seems uuid_unpack() does nothing with the last part of the UUID.
>
> But your patch works on LE as well as on BE, so I probably miss
> something :-)
The RFC is quite clear on this "with each field encoded with the Most
Significant Byte first"
https://tools.ietf.org/html/rfc4122#section-4.1.2

I agree this is not clear for the node part
>From the RFC node should be an unsigned 48 bit integer

The parsing is done byte per byte
        for (i=0; i < 6; i++) {
                buf[0] = *cp++;
                buf[1] = *cp++;
                uuid.node[i] = strtoul(buf, NULL, 16);
        }
Then the sprintf is also done byte per byte.
So Big Endian and the swapping are hidden by the manipulation byte per byte
I cannot find any usage of the node field to set specific value only random.
As long the node is handled byte per byte it will work.

>
> > before took 382623 us
> > after  took  36740 us
> >
> > Big Endian OpenBSD:
> > before took 3138172 us
> > after  took  180116 us
>
> I guess all this is about sprintf(), another way would be to use
> uuid_unpack() but avoid sprintf().

Using uuid_unpack to fill a struct uuid, will trigger to handle the
endianness to print it
whereas uuid_t is matching the order needed to print it.
>
>     Karel
>
> --
>  Karel Zak  <kzak@redhat.com>
>  http://karelzak.blogspot.com
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] libuuid: improve uuid_unparse() performance
  2020-03-25 14:16 ` Peter Cordes
@ 2020-03-26  1:06   ` Aurélien Lajoie
  2020-03-26  2:13     ` Peter Cordes
  0 siblings, 1 reply; 7+ messages in thread
From: Aurélien Lajoie @ 2020-03-26  1:06 UTC (permalink / raw)
  To: util-linux

On Wed, Mar 25, 2020 at 3:16 PM Peter Cordes <peter@cordes.ca> wrote:
>
> Nice optimization, and yes converting to hex is something CPUs can do
> really efficiently.  Not surprising that scanf and other overhead was
> a huge part of the total time.  I have some suggestions to make it
> even more efficient:
Thanks, I will submit a new version.
>
> * Use static const for your lookup tables (avoiding GOT overhead vs.
> globals), and make them only 16 bytes not 36
done
> * Make the helper function take its first 2 args in the same order as
> the external API so it doesn't need to shuffle registers around before
> tailcalling, just put that 3rd arg in a register and jmp.  (The
> function is big enough that GCC chooses not to inline it into all both
> different callers at -O2, but will at -O3.)
done
> * The variant with the fixed default could be defined as an alias for
> the existing one, I think, so it can have the same address as the
> lower or upper function.  This saves code size by not having another
> lea / jmp, or worse a 3rd copy of the whole function if it inlines.

I have done it using __attribute__ alias like:

void uuid_unparse(const uuid_t uu, char *out)
        __attribute__((alias("uuid_unparse_lower")));

> * Avoid extra loads of the source data by reading into a tmp var,
> instead of re-accessing the same uuid[i] twice, the 2nd time after a
> store to *p which might overlap; the compiler can't prove otherwise.
Done

>
> If you really are bottlenecking on UUID throughput, see my SIMD answer
> on https://stackoverflow.com/questions/53823756/how-to-convert-a-binary-integer-number-to-a-hex-string
> with x86 SSE2 (baseline for x86-64), SSSE3, AVX2 variable-shift, and
> AVX512VBMI integer -> hex manual vectorization that can do 8 input
> bytes -> 16 hex digits at once.  Or with YMM vectors, 32 hex digits.
> The asm should be straightforward to translate to intrinsics.  (Remove
> the part that reverses the byte-order from x86 little-endian to
> printing order, since uuid bytes are apparently already in the right
> order).  You'd need some shuffling to store to the right places around
> the '-' formatting but that's doable.  Using an 8-byte store that
> overlaps where you want a '-', *then* storing the '-', then a 4-byte
> store of the bottom of the register, could work well to avoid one
> shuffle with SSE2.  Or with SSSE3 or higher, use pshufb to shuffle in
> the '-' bytes

I will take a look at it, but in a second time, I get your idea.
I am not familiar with this, nice way to jumb on SIMD operations.

>
> On Tue, Mar 24, 2020 at 6:35 PM Aurelien LAJOIE <orel@melix.net> wrote:
>
> >
> >  libuuid/src/unparse.c | 35 +++++++++++++++++------------------
> >  1 file changed, 17 insertions(+), 18 deletions(-)
> >
> > diff --git a/libuuid/src/unparse.c b/libuuid/src/unparse.c
> > index a95bbb042..62bb3ef26 100644
> > --- a/libuuid/src/unparse.c
> > +++ b/libuuid/src/unparse.c
> > @@ -36,41 +36,40 @@
> >
> >  #include "uuidP.h"
> >
> > -static const char *fmt_lower =
> > -       "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x";
> > -
> > -static const char *fmt_upper =
> > -       "%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X";
> > +char const __str_digits_lower[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
> > +char const __str_digits_upper[36] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
>
>
> Shouldn't these be static, not global, like the old format strings
> were?  Unless/until we want them somewhere else as well, these should
> probably be private, and only 16 bytes each because the only current
> use is for mapping a nibble -> hex ASCII.
>
> Also, leading double underscore identifiers are reserved for use by
> the C implementation.  If we want to use a global name, it should
> probably be uuid__str_digits_lower or something like that.  And if
> not, the name can be hexdigits_lower / upper.
>
> >
> > -static void uuid_unparse_x(const uuid_t uu, char *out, const char *fmt)
> > +static void uuid_fmt(char *buf, const uuid_t uuid, char const fmt[36])
> >  {
> > +       char *p = buf;
> > +       for (int i = 0; i < 16; i++) {
> > +               if (i == 4 || i == 6 || i == 8 || i == 10) {
> > +                       *p++ = '-';
> > +               }
> > +               *p++ = fmt[uuid[i] >> 4];
> > +               *p++ = fmt[uuid[i] & 15];
>
>
> It's slightly more efficient to load into unsigned tmp; the compiler
> can't prove that buf and uuid don't overlap so it actually reloads for
> the 2nd statement.  This is bad because we're pretty much going to
> bottleneck or close to it on throughput of load/store instructions, on
> a typical modern x86 for example.
>
> char *restrict out would solve the same problem, and presumably no
> caller would ever pass overlapping buffers.  And if they did, would
> rather have this function read the original bytes instead of reloading
> hex-digit bytes as binary UUID bytes.  Although at that point we're
> into UB territory.
>
> >
> > +       }
> > +       *p = '\0';
> >  }
> >
> >  void uuid_unparse_lower(const uuid_t uu, char *out)
> >  {
> > -       uuid_unparse_x(uu, out, fmt_lower);
> > +       uuid_fmt(out, uu, __str_digits_lower);
> >  }
> >
>
> Best to have uuid_fmt take args in the same order as these wrappers,
> so if a compiler decides not to inline it, the wrapper functions can
> just put the digit-table pointer into another register and tailcall.
> For example, your version compiles like this for x86-64: (on the
> Godbolt compiler explorer, clang -O2 -fPIC)
>
> uuid_unparse_lower_orig:
>     mov rax, rdi
>     lea rdx, [rip + hexdigits_lower]    # add a 3rd arg
>     mov rdi, rsi
>     mov rsi, rax             # swap the first 2 args
>     jmp uuid_fmt_orig # TAILCALL
>
> vs.
>
> uuid_unparse_lower:
>     lea rdx, hexdigits_lower[rip]          # add a 3rd arg
>     jmp uuid_fmt                          # and tailcall
>
> uuid_t is a typedef for an array of unsigned char[16] so as a function
> arg it's just a pointer.
>
> https://godbolt.org/z/fcZFhi is what I've been playing around with, in
> case I don't get back to this and actually make a patch myself.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] libuuid: improve uuid_unparse() performance
  2020-03-26  1:06   ` Aurélien Lajoie
@ 2020-03-26  2:13     ` Peter Cordes
  2020-03-26 23:22       ` Peter Cordes
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Cordes @ 2020-03-26  2:13 UTC (permalink / raw)
  To: Aurélien Lajoie; +Cc: util-linux

On Wed, Mar 25, 2020 at 10:07 PM Aurélien Lajoie <orel@melix.net> wrote:
>
> On Wed, Mar 25, 2020 at 3:16 PM Peter Cordes <peter@cordes.ca> wrote:
> > If you really are bottlenecking on UUID throughput, see my SIMD answer
> > on https://stackoverflow.com/questions/53823756/how-to-convert-a-binary-integer-number-to-a-hex-string
> > with x86 SSE2 (baseline for x86-64), SSSE3, AVX2 variable-shift, and
> > AVX512VBMI integer -> hex manual vectorization
>
> I will take a look at it, but in a second time, I get your idea.
> I am not familiar with this, nice way to jumb on SIMD operations.

I can write that code with _mm_cmpgt_epi8 intrinsics from immintrin.h
if libuuid actually wants a patch add an #ifdef __SSE2__ version that
x86-64 can use all the time instead of the scalar version.  I'm very
familiar with x86 SIMD intrinsics so it would be easy for me to write
the code I'm already imagining in my head.  But it might not be worth
the trouble if it won't get merged because nobody wants to maintain
it.

 Also __SSSE3__,  __AVX2__, and __AVX512VBMI__ versions if we want
them, but those would only get enabled for people compiling libuuid
with  -march=native on their machines, or stuff like that.

Or we could even to runtime CPU detection to set a function pointer to
the version that's best for the current CPU.  SSSE3 helps a lot (byte
shuffle as a hexdigit LUT, and to line up data for the '-' gaps).  And
AVX512VBMI is fantastic for this on IceLake client/server.  It's only
called internally so we don't need to use the dynamic-link-time CPU
detection that glibc uses to resolve memset to for example
__memset_avx2_unaligned_erms, using a custom symbol resolver function.
We can see how much speedup we get from using more than SSE2 and
decide if it's worth the trouble.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] libuuid: improve uuid_unparse() performance
  2020-03-26  2:13     ` Peter Cordes
@ 2020-03-26 23:22       ` Peter Cordes
  0 siblings, 0 replies; 7+ messages in thread
From: Peter Cordes @ 2020-03-26 23:22 UTC (permalink / raw)
  To: Aurélien Lajoie; +Cc: util-linux

On Wed, Mar 25, 2020 at 11:13 PM Peter Cordes <peter@cordes.ca> wrote:
> I can write that code with _mm_cmpgt_epi8 intrinsics from immintrin.h
> if libuuid actually wants a patch add an #ifdef __SSE2__ version that
> x86-64 can use all the time instead of the scalar version.

I wrote SSE2 and SSSE3 versions, and timed them vs. scalar unrolled by
2 or fully rolled, on Skylake and Core 2.
https://godbolt.org/z/WB-C_y  self-contained with a main() you can
test.  Still a mess of comments and wrestling the compiler into
submission, not ready to make a patch yet.  But it looks like it works
as far as correctness goes.  Also an untested AVX512VBMI version that
might be 1.5 to 2x as fast as the SSSE3 version on CPUs that support
it (Ice Lake).  Inserting the dashes with a vpermb that does
merge-masking into a vector of dashes while putting the hex digits in
the right place into the first 32 bytes of the UUID is fun. :)

With everything hot in cache I get a nice 5.5x speedup from just SSE2
vs. partially-unrolled scalar loop, for gcc -O2 on Skylake.  That's
something distros can use for real.  It does need more static data,
but the total amount of static data still fits in one 64-byte cache
line, I think.  I-cache footprint is similar to the partially-unrolled
scalar loop, like 2 or 3 cache lines.

TIMES in seconds for 100 000 000 (100M) iterations, recorded with perf
stat on the whole executable because that's easy.

LUT is the table-lookup like Aurélien's code; CMOV is using c < 10 ?
c+'0' : c + (alpha_base-10); which compiles to a CMOV + ADD for x86,
or a CSEL + ADD for AArch64.  CMOV is 2 uops on Intel before
Broadwell, but Intel since Sandybridge can do 2 loads per clock cycle
so despite only loading single bytes (rather than a whole register and
shifting) the LUT version is faster and may not bottleneck on cache
load throughput.  non-x86 microarchitectures with fewer cache read
ports and more efficient conditionals might see a different tradeoff.

* Skylake @4.1GHz gcc9.3 -march=core2 -O2
(energy_performance_preference=performance so turbo ramps up near
instantly to ~4.1GHz)
* SSSE3 version: 0.175
* SSE2 version: 0.30
* scalar version: unrolled x2, LUT: 1.69
* scalar version: unrolled x2, CMOV: 2.19 (half each cmova 2 uops / cmovb 1 uop)
* scalar version: rolled up, LUT: 2.43
* scalar version: rolled up, CMOV: 2.64

The dash positions are at even i values so unrolling helps
significantly, doing that if() check less often.  i.e. the loop
overhead is much larger than most loops.  But this is might be fast
*enough* that spending more code size here could be unwise for overall
I-cache pressure when a large program uses this function.

So we have speedups from Aurélien's plain rolled up LUT version on
modern Intel, should be representative of Sandybridge-family at least
since Haswell, and probably similar on Zen:
* LUT rolled: 1x baseline
* LUT unrolled by 2:  1.4x on Skylake, possibly similar on some other
ISAs where I haven't written SIMD versions
* SSE2: 8x
* SSSE3: 13.8 x

So it might well be worth doing dynamic dispatch for the SSSE3 version
that can use x86's SIMD byte shuffle.  It has smaller code + data
footprints than the SSE2 version.  And I think the SSE2 version is
worth using; on x86-64 we never need a fallback to scalar.


Conroe (first-gen Core 2) @2.4GHz, same gcc9.3 -march=core2 -O2  (ran
the same binary over NFS)
* SSSE3: 1.21 (clang 1.08)
* SSE2 version: 1.00 (clang 0.880) clang9.0.1
* scalar version: unrolled x2, LUT: 2.88
* scalar version: unrolled x2, CMOV: 6.50
* scalar version: rolled up, LUT: 6.40
* scalar version: rolled up, CMOV: 7.64

Core 2 is sensitive to code alignment quirks that lead to decoding
bottlenecks; this is why Intel introduced a uop cache with
Sandybridge.  And various other stalls.  I expect the unroll vs.
rolled and LUT vs. CMOV numbers have a large error margin as far as
predicting performance on similar CPUs for other compiler versions or
for different quirks of code alignment.  But we still see a 3x
advantage for SIMD over the best scalar version.

Slow shuffle CPUs like Conroe/Merom do better with the SSE2 version.
Pre-Nehalem also have slow unaligned 16-byte loads/stores.
 We should prob. require SSE4 if dynamic dispatching; that would
exclude K8 and first-gen Core 2, although would exclude AMD K10 which
has fast shuffles.

-O3 typically fully unrolls the scalar loops. (clang does that even at
-O2 if you manually unroll the source by 2, probably undesirable).
SIMD doesn't change (much), scalar improves significantly in this
microbenchmark, but at the cost of larger I-cache footprint which
would hurt in real life in programs that care about doing other things
between calls to uuid_unparse() :P

Skylake @4.1GHz gcc9.3 -march=core2 -O3
 * SSSE3 version: 0.173
 * SSE2 version: 0.273
 * scalar version: unrolled x2, LUT: 0.95  // all are fully unrolled I assume
 * scalar version: unrolled x2, CMOV: 2.10 (half each cmova 2 uops /
cmovb 1 uop)
 * scalar version: unrolled x2, mixed: 1.34
 * scalar version: LUT: 0.95
 * scalar version: rolled up, CMOV: 2.54

Conroe @2.4GHz running gcc9.3 -march=core2 -O3
* SSSE3: 1.21
* SSE2 version: 0.965
* scalar version: unrolled x2, LUT: 2.37    // actually fully unrolled
* scalar version: unrolled x2, CMOV: 4.90
* scalar version: unrolled x2, mixed: 2.72

I'll hopefully get back to this and submit a patch, but if I don't,
anyone else is more than welcome to clean up my code and use it.  I
have ADHD and have a bad habit of leaving things unfinished. >.<

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-03-26 23:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-24 21:26 [PATCH] libuuid: improve uuid_unparse() performance Aurelien LAJOIE
2020-03-25 11:10 ` Karel Zak
2020-03-26  0:54   ` Aurélien Lajoie
2020-03-25 14:16 ` Peter Cordes
2020-03-26  1:06   ` Aurélien Lajoie
2020-03-26  2:13     ` Peter Cordes
2020-03-26 23:22       ` Peter Cordes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).