From: David Laight <David.Laight@ACULAB.COM>
To: 'Willy Tarreau' <w@1wt.eu>
Cc: Douglas Gilbert <dgilbert@interlog.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: RE: how many memset(,0,) calls in kernel ?
Date: Tue, 14 Sep 2021 08:23:40 +0000
Message-ID: <15cd0a8e72b3460db939060db25dd59a@AcuMS.aculab.com>
In-Reply-To: <20210913160945.GA2456@1wt.eu>

From: Willy Tarreau
> Sent: 13 September 2021 17:10
> 
> On Mon, Sep 13, 2021 at 04:03:09PM +0000, David Laight wrote:
> > >   36:   b9 06 00 00 00          mov    $0x6,%ecx
> > >   3b:   4c 89 e7                mov    %r12,%rdi
> > >   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
> > >
> > > The last line does exactly "memset(%rdi, %eax, %ecx)". Just two bytes
> > > for some code that modern processors are even able to optimize.
> >
> > Hmmm I'd bet that 6 stores will be faster on ~everything.
> > 'modern' processors do better than some older ones [1], but 6
> > writes isn't enough to get into the really fast paths.
> > So you'll still take a few cycles of setup.
> 
> The exact point is, here it's up to the compiler to decide thanks to
> its builtin what it considers best for the target CPU. It already
> knows the fixed size and the code is emitted accordingly. It may
> very well be a call to the memset() function when the size is large
> and a power of two because it knows alternate variants are available
> for example.
> 
> The compiler might even decide to shrink that area if other bytes
> are written just after the memset(), leaving only holes touched by
> memset().

You might think the compiler will make sane choices for the target CPU.
But it often makes a complete pig's breakfast of it.
I'm pretty sure a 'rep stos' with a count of 6 is slower than 6 writes
on absolutely everything - with the possible exception of an 8088.
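
For illustration, a minimal sketch of the two forms being compared. The
struct and function names are hypothetical and not taken from any kernel
code; the only detail borrowed from the thread is the count of six 32-bit
words. Which instructions either function actually compiles to depends on
the compiler, its flags and the target CPU, so the generated code needs
checking with a disassembler rather than assuming.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 24-byte structure: six 32-bit words, matching the
 * count of 6 loaded into %ecx in the disassembly quoted above. */
struct six_words {
	uint32_t w[6];
};

/* Fixed-size memset(): the size is a compile-time constant, so the
 * compiler's builtin chooses the expansion (rep stos, plain stores,
 * or a call to the out-of-line memset). */
static inline void clear_with_memset(struct six_words *p)
{
	memset(p, 0, sizeof(*p));
}

/* Six explicit stores: the alternative argued above to be cheaper
 * for such a small count. The compiler is still free to merge or
 * widen these stores. */
static inline void clear_with_stores(struct six_words *p)
{
	p->w[0] = 0;
	p->w[1] = 0;
	p->w[2] = 0;
	p->w[3] = 0;
	p->w[4] = 0;
	p->w[5] = 0;
}
```

Both functions leave the structure in the same state; the disagreement in
this thread is only about which instruction sequence is cheaper.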

By far the worst cases are when the compiler decides to pessimise
a loop by using SIMD (e.g. AVX-512) instructions to do 4 (or 8)
loop iterations in one pass.
That might be fine if the loop count is in the 100s - but not when it is 3.
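
A sketch of the kind of loop meant here, with hypothetical names. The
first function has a trip count that is unknown at compile time, so an
optimising compiler may vectorise it; a vectorised loop typically adds
size checks and a scalar remainder loop around the vector body. If n is
always 3, the vector body may never run and only that overhead is left.
The second function assumes the caller knows the count is 3 and simply
writes the three operations out.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic loop: n is a run-time value, so the compiler may emit a
 * SIMD body plus the checks and remainder loop that go with it. */
static void add_arrays(uint32_t *dst, const uint32_t *src, size_t n)
{
	for (size_t i = 0; i < n; i++)
		dst[i] += src[i];
}

/* Hypothetical variant for a caller that knows the count is 3:
 * three scalar operations and no loop overhead at all. */
static void add_three(uint32_t *dst, const uint32_t *src)
{
	dst[0] += src[0];
	dst[1] += src[1];
	dst[2] += src[2];
}
```

Whether the generic version really is slower for n == 3 depends on the
compiler and CPU; this only shows the shape of the problem.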

One compiler I've used nicely converted any byte copy loop
into a 'rep movsb' instruction.
That was contemporary with P4 netburst - where it was terribly slow.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


Thread overview: 6+ messages
2021-09-12  3:36 how many memset(,0,) calls in kernel ? Douglas Gilbert
2021-09-12  4:56 ` Willy Tarreau
2021-09-13 16:03   ` David Laight
2021-09-13 16:09     ` Willy Tarreau
2021-09-14  8:23       ` David Laight [this message]
2021-09-14 16:46         ` Willy Tarreau
