From: "Aurélien Lajoie" <orel@melix.net>
To: util-linux@vger.kernel.org
Subject: Re: [PATCH] libuuid: improve uuid_unparse() performance
Date: Thu, 26 Mar 2020 02:06:38 +0100 [thread overview]
Message-ID: <CAA0A08WYct_BhybReWMkK_LXBS2L0DmmyNh2W8cp=kys9Q0R7g@mail.gmail.com> (raw)
In-Reply-To: <CA+bjHUQtPfk5vDNbxSbxPzashLNG+MdLzy8HtZLbsOJ0C_fnzQ@mail.gmail.com>
On Wed, Mar 25, 2020 at 3:16 PM Peter Cordes <peter@cordes.ca> wrote:
>
> Nice optimization, and yes converting to hex is something CPUs can do
> really efficiently. Not surprising that scanf and other overhead was
> a huge part of the total time. I have some suggestions to make it
> even more efficient:
Thanks, I will submit a new version.
>
> * Use static const for your lookup tables (avoiding GOT overhead vs.
> globals), and make them only 16 bytes not 36
done
> * Make the helper function take its first 2 args in the same order as
> the external API so it doesn't need to shuffle registers around before
> tailcalling, just put that 3rd arg in a register and jmp. (The
> function is big enough that GCC chooses not to inline it into all both
> different callers at -O2, but will at -O3.)
done
> * The variant with the fixed default could be defined as an alias for
> the existing one, I think, so it can have the same address as the
> lower or upper function. This saves code size by not having another
> lea / jmp, or worse a 3rd copy of the whole function if it inlines.
I have done it using __attribute__ alias like:
void uuid_unparse(const uuid_t uu, char *out)
__attribute__((alias("uuid_unparse_lower")));
> * Avoid extra loads of the source data by reading into a tmp var,
> instead of re-accessing the same uuid[i] twice, the 2nd time after a
> store to *p which might overlap; the compiler can't prove otherwise.
Done
>
> If you really are bottlenecking on UUID throughput, see my SIMD answer
> on https://stackoverflow.com/questions/53823756/how-to-convert-a-binary-integer-number-to-a-hex-string
> with x86 SSE2 (baseline for x86-64), SSSE3, AVX2 variable-shift, and
> AVX512VBMI integer -> hex manual vectorization that can do 8 input
> bytes -> 16 hex digits at once. Or with YMM vectors, 32 hex digits.
> The asm should be straightforward to translate to intrinsics. (Remove
> the part that reverses the byte-order from x86 little-endian to
> printing order, since uuid bytes are apparently already in the right
> order). You'd need some shuffling to store to the right places around
> the '-' formatting but that's doable. Using an 8-byte store that
> overlaps where you want a '-', *then* storing the '-', then a 4-byte
> store of the bottom of the register, could work well to avoid one
> shuffle with SSE2. Or with SSSE3 or higher, use pshufb to shuffle in
> the '-' bytes
I will take a look at it, but in a second time, I get your idea.
I am not familiar with this, nice way to jumb on SIMD operations.
>
> On Tue, Mar 24, 2020 at 6:35 PM Aurelien LAJOIE <orel@melix.net> wrote:
>
> >
> > libuuid/src/unparse.c | 35 +++++++++++++++++------------------
> > 1 file changed, 17 insertions(+), 18 deletions(-)
> >
> > diff --git a/libuuid/src/unparse.c b/libuuid/src/unparse.c
> > index a95bbb042..62bb3ef26 100644
> > --- a/libuuid/src/unparse.c
> > +++ b/libuuid/src/unparse.c
> > @@ -36,41 +36,40 @@
> >
> > #include "uuidP.h"
> >
> > -static const char *fmt_lower =
> > - "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x";
> > -
> > -static const char *fmt_upper =
> > - "%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X";
> > +char const __str_digits_lower[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
> > +char const __str_digits_upper[36] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
>
>
> Shouldn't these be static, not global, like the old format strings
> were? Unless/until we want them somewhere else as well, these should
> probably be private, and only 16 bytes each because the only current
> use is for mapping a nibble -> hex ASCII.
>
> Also, leading double underscore identifiers are reserved for use by
> the C implementation. If we want to use a global name, it should
> probably be uuid__str_digits_lower or something like that. And if
> not, the name can be hexdigits_lower / upper.
>
> >
> > -static void uuid_unparse_x(const uuid_t uu, char *out, const char *fmt)
> > +static void uuid_fmt(char *buf, const uuid_t uuid, char const fmt[36])
> > {
> > + char *p = buf;
> > + for (int i = 0; i < 16; i++) {
> > + if (i == 4 || i == 6 || i == 8 || i == 10) {
> > + *p++ = '-';
> > + }
> > + *p++ = fmt[uuid[i] >> 4];
> > + *p++ = fmt[uuid[i] & 15];
>
>
> It's slightly more efficient to load into unsigned tmp; the compiler
> can't prove that buf and uuid don't overlap so it actually reloads for
> the 2nd statement. This is bad because we're pretty much going to
> bottleneck or close to it on throughput of load/store instructions, on
> a typical modern x86 for example.
>
> char *restrict out would solve the same problem, and presumably no
> caller would ever pass overlapping buffers. And if they did, would
> rather have this function read the original bytes instead of reloading
> hex-digit bytes as binary UUID bytes. Although at that point we're
> into UB territory.
>
> >
> > + }
> > + *p = '\0';
> > }
> >
> > void uuid_unparse_lower(const uuid_t uu, char *out)
> > {
> > - uuid_unparse_x(uu, out, fmt_lower);
> > + uuid_fmt(out, uu, __str_digits_lower);
> > }
> >
>
> Best to have uuid_fmt take args in the same order as these wrappers,
> so if a compiler decides not to inline it, the wrapper functions can
> just put the digit-table pointer into another register and tailcall.
> For example, your version compiles like this for x86-64: (on the
> Godbolt compiler explorer, clang -O2 -fPIC)
>
> uuid_unparse_lower_orig:
> mov rax, rdi
> lea rdx, [rip + hexdigits_lower] # add a 3rd arg
> mov rdi, rsi
> mov rsi, rax # swap the first 2 args
> jmp uuid_fmt_orig # TAILCALL
>
> vs.
>
> uuid_unparse_lower:
> lea rdx, hexdigits_lower[rip] # add a 3rd arg
> jmp uuid_fmt # and tailcall
>
> uuid_t is a typedef for an array of unsigned char[16] so as a function
> arg it's just a pointer.
>
> https://godbolt.org/z/fcZFhi is what I've been playing around with, in
> case I don't get back to this and actually make a patch myself.
next prev parent reply other threads:[~2020-03-26 1:06 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-24 21:26 [PATCH] libuuid: improve uuid_unparse() performance Aurelien LAJOIE
2020-03-25 11:10 ` Karel Zak
2020-03-26 0:54 ` Aurélien Lajoie
2020-03-25 14:16 ` Peter Cordes
2020-03-26 1:06 ` Aurélien Lajoie [this message]
2020-03-26 2:13 ` Peter Cordes
2020-03-26 23:22 ` Peter Cordes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAA0A08WYct_BhybReWMkK_LXBS2L0DmmyNh2W8cp=kys9Q0R7g@mail.gmail.com' \
--to=orel@melix.net \
--cc=util-linux@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).