* how many memset(,0,) calls in kernel ?
@ 2021-09-12  3:36 Douglas Gilbert
  2021-09-12  4:56 ` Willy Tarreau
  0 siblings, 1 reply; 6+ messages in thread
From: Douglas Gilbert @ 2021-09-12  3:36 UTC (permalink / raw)
  To: LKML

Here is a pretty rough estimate:
$ find . -name '*.c' -exec fgrep "memset(" {} \; > memset_in_kern.txt

$ cat memset_in_kern.txt | wc -l
     20159

Some of those are in comments, EXPORTs, etc, but the vast majority are
in code. Plus there will be memset()s in header files not counted by
that find. Checking in that output file I see:

$ grep ", 0," memset_in_kern.txt | wc -l
     18107
$ grep ", 0" memset_in_kern.txt | wc -l
     19349
$ grep ", 0x" memset_in_kern.txt | wc -l
     1210
$ grep ", 0x01" memset_in_kern.txt | wc -l
     3
$ grep ", 0x0," memset_in_kern.txt | wc -l
     199
$ grep ",0," memset_in_kern.txt | wc -l
     72

$ grep "= memset" memset_in_kern.txt | wc -l
      11

It seems only 11 invocations use the return value of memset().

If the BSD flavours of Unix had not given us:
    void bzero(void *s, size_t n);
would the Linux kernel have something similar in common usage (e.g.
memzero() or mem0() ), that was less wasteful than the standard:
    void *memset(void *s, int c, size_t n);
in the extremely common case where c=0 and the return value is
not used?
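
As a rough illustration of the kind of helper being asked about, a
bzero()-style wrapper could be as small as the sketch below. The name
memzero() is taken from the question and is used here as a hypothetical
user-space helper; all_zero() is only a checking function added for the
example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (illustrative name): like bzero(), it takes no
 * fill value and returns nothing. It simply forwards to memset().
 */
static inline void memzero(void *s, size_t n)
{
	memset(s, 0, n);
}

/* Checking helper for the example: returns 1 if the first n bytes
 * at s are all zero, otherwise 0. */
int all_zero(const void *s, size_t n)
{
	const unsigned char *p = s;
	size_t i;

	for (i = 0; i < n; i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Since such a wrapper is just an inline call to memset(), a compiler that
treats memset() as a builtin would be expected to generate the same code
for both spellings.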

Doug Gilbert




* Re: how many memset(,0,) calls in kernel ?
  2021-09-12  3:36 how many memset(,0,) calls in kernel ? Douglas Gilbert
@ 2021-09-12  4:56 ` Willy Tarreau
  2021-09-13 16:03   ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Willy Tarreau @ 2021-09-12  4:56 UTC (permalink / raw)
  To: Douglas Gilbert; +Cc: LKML

On Sat, Sep 11, 2021 at 11:36:07PM -0400, Douglas Gilbert wrote:
> Here is a pretty rough estimate:
> $ find . -name '*.c' -exec fgrep "memset(" {} \; > memset_in_kern.txt
> 
> $ cat memset_in_kern.txt | wc -l
>     20159
> 
> Some of those are in comments, EXPORTs, etc, but the vast majority are
> in code. Plus there will be memset()s in header files not counted by
> that find. Checking in that output file I see:
> 
> $ grep ", 0," memset_in_kern.txt | wc -l
>     18107
> $ grep ", 0" memset_in_kern.txt | wc -l
>     19349
> $ grep ", 0x" memset_in_kern.txt | wc -l
>     1210
> $ grep ", 0x01" memset_in_kern.txt | wc -l
>     3
> $ grep ", 0x0," memset_in_kern.txt | wc -l
>     199
> $ grep ",0," memset_in_kern.txt | wc -l
>     72

Note that in order to get something faster and slightly more accurate,
you can use 'git grep':

   $ git grep 'memset([^,]*,\s*0\(\|x0*\),' |wc -l
   18822

> If the BSD flavours of Unix had not given us:
>    void bzero(void *s, size_t n);
> would the Linux kernel have something similar in common usage (e.g.
> memzero() or mem0() ), that was less wasteful than the standard:
>    void *memset(void *s, int c, size_t n);
> in the extremely common case where c=0 and the return value is
> not used?

What do you mean by "wasteful" here? What are you trying to save,
characters in the source code maybe? Because the generated code is already
adapted to the context thanks to memset() being a builtin. Let's take one
of the first instances I found that's easy to match against asm code:

net/core/dev.c:

  int __init netdev_boot_setup(char *str)
  {
        int ints[5];
        struct ifmap map;

        str = get_options(str, ARRAY_SIZE(ints), ints);
        if (!str || !*str)
                return 0;

        /* Save settings */
        memset(&map, 0, sizeof(map));
        ...
  }

It gives this:

  16:   e8 00 00 00 00          callq  1b <netdev_boot_setup+0x1b>
                        17: R_X86_64_PC32       get_options-0x4
  1b:   48 89 c6                mov    %rax,%rsi

Note that we're zeroing %eax below in preparation for the "return 0"
statement:

  1e:   31 c0                   xor    %eax,%eax

This is the "if (!str || !*str)":

  20:   48 85 f6                test   %rsi,%rsi
  23:   0f 84 98 00 00 00       je     c1 <netdev_boot_setup+0xc1>
  29:   80 3e 00                cmpb   $0x0,(%rsi)
  2c:   0f 84 8f 00 00 00       je     c1 <netdev_boot_setup+0xc1>

%r12 is set to &map:

  32:   4c 8d 65 d0             lea    -0x30(%rbp),%r12

And this is the memset "call" itself, which reuses the zero from
the %eax register:

  36:   b9 06 00 00 00          mov    $0x6,%ecx
  3b:   4c 89 e7                mov    %r12,%rdi
  3e:   f3 ab                   rep stos %eax,%es:(%rdi)

The last line does the equivalent of "memset(%rdi, %eax, 4 * %ecx)": it
stores %ecx (here 6) 32-bit words, i.e. 24 bytes. Just two bytes for an
instruction that modern processors are even able to optimize.

As you can see there's not much waste here in the output code, and
in fact using any dedicated function would be larger and likely
slower.
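
For readers who want to repeat this kind of inspection outside the kernel
tree, a small stand-alone file is enough. This is only a sketch:
struct demo_map, clear_demo_map() and demo_map_or() are made-up names, and
the structure merely mimics a 24-byte object like the one zeroed above.
Compiling it with something like "gcc -O2 -c" and disassembling with
"objdump -d" should show how the compiler expands the fixed-size memset();
the exact instructions depend on the compiler version and target.

```c
#include <stdint.h>
#include <string.h>

/* Made-up structure: six 32-bit fields, 24 bytes in total. */
struct demo_map {
	uint32_t field[6];
};

/*
 * The size is a compile-time constant, so a compiler that treats
 * memset() as a builtin is free to expand this inline instead of
 * emitting a function call.
 */
void clear_demo_map(struct demo_map *m)
{
	memset(m, 0, sizeof(*m));
}

/* Returns the bitwise OR of all fields: zero only if every field is zero. */
uint32_t demo_map_or(const struct demo_map *m)
{
	uint32_t acc = 0;
	size_t i;

	for (i = 0; i < 6; i++)
		acc |= m->field[i];
	return acc;
}
```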

Hoping this helps,
Willy


* RE: how many memset(,0,) calls in kernel ?
  2021-09-12  4:56 ` Willy Tarreau
@ 2021-09-13 16:03   ` David Laight
  2021-09-13 16:09     ` Willy Tarreau
  0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2021-09-13 16:03 UTC (permalink / raw)
  To: 'Willy Tarreau', Douglas Gilbert; +Cc: LKML

>   36:   b9 06 00 00 00          mov    $0x6,%ecx
>   3b:   4c 89 e7                mov    %r12,%rdi
>   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
> 
> The last line does the equivalent of "memset(%rdi, %eax, 4 * %ecx)": it
> stores %ecx (here 6) 32-bit words, i.e. 24 bytes. Just two bytes for an
> instruction that modern processors are even able to optimize.

Hmmm I'd bet that 6 stores will be faster on ~everything.
'modern' processors do better than some older ones [1], but 6
writes isn't enough to get into the really fast paths.
So you'll still take a few cycles of setup.

[1] P4 netburst had a ~40 clock setup for any 'rep' operation.
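
To make the comparison concrete, the two alternatives for a 24-byte area
look like this in C. This is a sketch with made-up names
(struct six_words, clear_with_memset(), clear_with_stores()); an
optimizing compiler is free to turn either form into the other, so the
source form alone does not decide which instructions are emitted.

```c
#include <stdint.h>
#include <string.h>

/* Made-up 24-byte area: six 32-bit words. */
struct six_words {
	uint32_t w[6];
};

/* Variant 1: fixed-size memset(), which the compiler may expand
 * to "rep stos", to plain stores, or to a library call. */
void clear_with_memset(struct six_words *p)
{
	memset(p, 0, sizeof(*p));
}

/* Variant 2: six explicit 32-bit stores written out by hand. */
void clear_with_stores(struct six_words *p)
{
	p->w[0] = 0;
	p->w[1] = 0;
	p->w[2] = 0;
	p->w[3] = 0;
	p->w[4] = 0;
	p->w[5] = 0;
}
```

Both functions leave the area in the same state; only the generated code,
and therefore the cost, can differ.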

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



* Re: how many memset(,0,) calls in kernel ?
  2021-09-13 16:03   ` David Laight
@ 2021-09-13 16:09     ` Willy Tarreau
  2021-09-14  8:23       ` David Laight
  0 siblings, 1 reply; 6+ messages in thread
From: Willy Tarreau @ 2021-09-13 16:09 UTC (permalink / raw)
  To: David Laight; +Cc: Douglas Gilbert, LKML

On Mon, Sep 13, 2021 at 04:03:09PM +0000, David Laight wrote:
> >   36:   b9 06 00 00 00          mov    $0x6,%ecx
> >   3b:   4c 89 e7                mov    %r12,%rdi
> >   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
> > 
> > The last line does the equivalent of "memset(%rdi, %eax, 4 * %ecx)": it
> > stores %ecx (here 6) 32-bit words, i.e. 24 bytes. Just two bytes for an
> > instruction that modern processors are even able to optimize.
> 
> Hmmm I'd bet that 6 stores will be faster on ~everything.
> 'modern' processors do better than some older ones [1], but 6
> writes isn't enough to get into the really fast paths.
> So you'll still take a few cycles of setup.

The exact point is, here it's up to the compiler to decide thanks to
its builtin what it considers best for the target CPU. It already
knows the fixed size and the code is emitted accordingly. It may
very well be a call to the memset() function when the size is large
and a power of two because it knows alternate variants are available
for example.

The compiler might even decide to shrink that area if other bytes
are written just after the memset(), leaving only holes touched by
memset().

Willy


* RE: how many memset(,0,) calls in kernel ?
  2021-09-13 16:09     ` Willy Tarreau
@ 2021-09-14  8:23       ` David Laight
  2021-09-14 16:46         ` Willy Tarreau
  0 siblings, 1 reply; 6+ messages in thread
From: David Laight @ 2021-09-14  8:23 UTC (permalink / raw)
  To: 'Willy Tarreau'; +Cc: Douglas Gilbert, LKML

From: Willy Tarreau
> Sent: 13 September 2021 17:10
> 
> On Mon, Sep 13, 2021 at 04:03:09PM +0000, David Laight wrote:
> > >   36:   b9 06 00 00 00          mov    $0x6,%ecx
> > >   3b:   4c 89 e7                mov    %r12,%rdi
> > >   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
> > >
> > > The last line does the equivalent of "memset(%rdi, %eax, 4 * %ecx)": it
> > > stores %ecx (here 6) 32-bit words, i.e. 24 bytes. Just two bytes for an
> > > instruction that modern processors are even able to optimize.
> >
> > Hmmm I'd bet that 6 stores will be faster on ~everything.
> > 'modern' processors do better than some older ones [1], but 6
> > writes isn't enough to get into the really fast paths.
> > So you'll still take a few cycles of setup.
> 
> The exact point is, here it's up to the compiler to decide thanks to
> its builtin what it considers best for the target CPU. It already
> knows the fixed size and the code is emitted accordingly. It may
> very well be a call to the memset() function when the size is large
> and a power of two because it knows alternate variants are available
> for example.
> 
> The compiler might even decide to shrink that area if other bytes
> are written just after the memset(), leaving only holes touched by
> memset().

You might think the compiler will make sane choices for the target CPU.
But it often makes a complete pig's breakfast of it.
I'm pretty sure a 6-word 'rep stos' is slower than 6 writes on absolutely
everything - with the possible exception of an 8088.

By far the worst ones are when the compiler decides to pessimise
a loop by using SIMD (e.g. AVX-512) instructions to do 4 (or 8)
loop iterations in one pass.
It might be fine if the loop count is in the 100s - but not when it is 3.

One compiler I've used nicely converted any byte copy loop
into a 'rep movsb' instruction.
That was contemporary with P4 netburst - where it was terribly slow.

	David


* Re: how many memset(,0,) calls in kernel ?
  2021-09-14  8:23       ` David Laight
@ 2021-09-14 16:46         ` Willy Tarreau
  0 siblings, 0 replies; 6+ messages in thread
From: Willy Tarreau @ 2021-09-14 16:46 UTC (permalink / raw)
  To: David Laight; +Cc: Douglas Gilbert, LKML

On Tue, Sep 14, 2021 at 08:23:40AM +0000, David Laight wrote:
> > The exact point is, here it's up to the compiler to decide thanks to
> > its builtin what it considers best for the target CPU. It already
> > knows the fixed size and the code is emitted accordingly. It may
> > very well be a call to the memset() function when the size is large
> > and a power of two because it knows alternate variants are available
> > for example.
> > 
> > The compiler might even decide to shrink that area if other bytes
> > are written just after the memset(), leaving only holes touched by
> > memset().
> 
> You might think the compiler will make sane choices for the target CPU.
> But it often makes a complete pig's breakfast of it.
> I'm pretty sure a 6-word 'rep stos' is slower than 6 writes on absolutely
> everything - with the possible exception of an 8088.

It can be suboptimal (especially with the moderate latencies required
for small areas), but my point is that in plenty of cases the memset()
call will be totally eliminated. Example:

The file:
  #include <string.h>

  int f(int a, int b)
  {
        struct {
                int n1;
                int n2;
                int n3;
                int n4;
        } s;

        memset(&s, 0, sizeof(s));

        s.n2 = a;
        s.n3 = b;

        return s.n1 + s.n2 + s.n3 + s.n4;
  }

gives:

  0000000000000000 <f>:
   0:   8d 04 37                lea    (%rdi,%rsi,1),%eax
   3:   c3                      retq   

See? The builtin allowed the compiler to *know* that these areas
were zeroes and could optimize them away. More importantly, this
can avoid some reads entirely, so that the memory is only written
to:

  #include <string.h>

  struct {
        int n1;
        int n2;
  } s;

  void f(int a, int b)
  {

        memset(&s, 0, sizeof(s));

        s.n1 |= a;
        s.n2 |= b;
  }

Gives:

  0000000000000000 <f>:
   0:   89 3d 00 00 00 00       mov    %edi,0x0(%rip)        # 6 <f+0x6>
   6:   89 35 00 00 00 00       mov    %esi,0x0(%rip)        # c <f+0xc>
   c:   c3                      retq   

See? Just plain writes, no read-modify-write of the memory area.
If you called an external memset() function, you would instantly lose
all these optimizations:

  0000000000000000 <f>:
   0:   55                      push   %rbp
   1:   ba 08 00 00 00          mov    $0x8,%edx
   6:   89 fd                   mov    %edi,%ebp
   8:   bf 00 00 00 00          mov    $0x0,%edi
   d:   53                      push   %rbx
   e:   89 f3                   mov    %esi,%ebx
  10:   31 f6                   xor    %esi,%esi
  12:   48 83 ec 08             sub    $0x8,%rsp
  16:   e8 00 00 00 00          callq  1b <f+0x1b>
  1b:   09 2d 00 00 00 00       or     %ebp,0x0(%rip)        # 21 <f+0x21>
  21:   09 1d 00 00 00 00       or     %ebx,0x0(%rip)        # 27 <f+0x27>
  27:   48 83 c4 08             add    $0x8,%rsp
  2b:   5b                      pop    %rbx
  2c:   5d                      pop    %rbp
  2d:   c3                      retq   

Thus the fact that the compiler has knowledge of the memset() is useful.

Willy
