All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: "René Scharfe" <l.s.r@web.de>
Cc: Git Mailing List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] use strpbrk(3) to search for characters from a given set
Date: Mon, 24 Feb 2020 01:38:02 -0500	[thread overview]
Message-ID: <20200224063802.GB1015967@coredump.intra.peff.net> (raw)
In-Reply-To: <4140dade-d999-a74a-1f8e-06eedb84ed20@web.de>

On Sat, Feb 22, 2020 at 07:51:19PM +0100, René Scharfe wrote:

> We can check if certain characters are present in a string by calling
> strchr(3) on each of them, or we can pass them all to a single
> strpbrk(3) call.  The latter is shorter, less repetitive and slightly
> more efficient, so let's do that instead.

I think the resulting code is more readable, and this is worth doing on
those grounds alone.

I was curious whether it's always more efficient (tldr, it is; read on
only if you're also curious). The default glibc implementation of
strpbrk() is built on strcspn(). That in turn creates a 256-byte table
with a boolean flag set for each char in the accept string, and then
walks the haystack string doing O(1) lookups in the table for a hit. So
we have to memset() the table to zeroes at the start. Depending on the
length of the input string, that could be more expensive than just
walking the haystack string three times (one for each char in the accept
string).

But of course in practice there's all kinds of weird SSE assembly going
on behind the scenes. So in most of my tests, strpbrk() beats strchr()
handily. The one exception is when the first strchr() matches early in
the string and can short-circuit the rest of them.

The test I used was:

  $ cat foo.c
  #include <string.h>
  
  int check(const char *s)
  {
  #if USE_STRCHR
  	return strchr(s, '<') || strchr(s, '>') || strchr(s, '@');
  #else
  	return !!strpbrk(s, "@<>");
  #endif
  }

  $ cat bar.c 
  int check(const char *s);
  
  int main(int argc, const char **argv)
  {
  	int ret;
  	int i;
  
  	for (i = 0; i < 100000000; i++)
  		ret += check(argv[1]);
  	return ret & 0x7f;
  }
  

I put it into two separate files to avoid check() being hoisted out of
the loop. There's something a bit interesting, though. If you combine
those two files, gcc _does_ hoist the strpbrk() version, generating
this as the complete code:

	  subq    $8, %rsp
          .cfi_def_cfa_offset 16
          movq    8(%rsi), %rdi
          leaq    .LC0(%rip), %rsi
          call    strpbrk@PLT
          testq   %rax, %rax
          setne   %al
          addq    $8, %rsp
          .cfi_def_cfa_offset 8
          movzbl  %al, %eax
          movl    %eax, %edx
          imull   $99999999, %eax, %eax
          addl    %edx, %eax
          andl    $127, %eax
          ret

So it calls strpbrk() exactly once, then recognizes that the loop is
just adding up the same variable over and over and turns it into a
multiply. No loop at all; very cool (though I am puzzled why it doesn't
multiply by the full loop count; instead if multiplies by one less and
then adds back in one iteration).

But with strchr() it was less willing to do that (curiously, it will if
the function _isn't_ static, but won't if it is).

Anyway, that's all compiler arcana subject to change between versions,
and it's a dumb toy example anyway. But it does seem like the single
call is more likely to get inlined or hoisted by the compiler, which is
a point in its favor.

-Peff

  parent reply	other threads:[~2020-02-24  6:38 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-22 18:51 [PATCH] use strpbrk(3) to search for characters from a given set René Scharfe
2020-02-23  9:01 ` Martin Ågren
2020-02-24  6:38 ` Jeff King [this message]
2020-02-24 17:10 ` Junio C Hamano
2020-02-24 19:01   ` René Scharfe
2020-02-24 19:10     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200224063802.GB1015967@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=l.s.r@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.