* how many memset(,0,) calls in kernel ?
From: Douglas Gilbert @ 2021-09-12  3:36 UTC
To: LKML

Here is a pretty rough estimate:

$ find . -name '*.c' -exec fgrep "memset(" {} \; > memset_in_kern.txt

$ cat memset_in_kern.txt | wc -l
20159

Some of those are in comments, EXPORTs, etc, but the vast majority are
in code. Plus there will be memset()s in header files not counted by
that find. Checking in that output file I see:

$ grep ", 0," memset_in_kern.txt | wc -l
18107
$ grep ", 0" memset_in_kern.txt | wc -l
19349
$ grep ", 0x" memset_in_kern.txt | wc -l
1210
$ grep ", 0x01" memset_in_kern.txt | wc -l
3
$ grep ", 0x0," memset_in_kern.txt | wc -l
199
$ grep ",0," memset_in_kern.txt | wc -l
72

$ grep "= memset" memset_in_kern.txt | wc -l
11

It seems only 11 invocations use the return value of memset().

If the BSD flavours of Unix had not given us:
    void bzero(void *s, size_t n);
would the Linux kernel have something similar in common usage (e.g.
memzero() or mem0()), that was less wasteful than the standard:
    void *memset(void *s, int c, size_t n);
in the extremely common case where c=0 and the return value is
not used?

Doug Gilbert

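For reference, the kind of helper the question asks about can be sketched in a few lines. This is only an illustration: the name memzero() is taken from the question and is hypothetical here, as are struct sample and zeroed_ok(). It is a thin inline wrapper around memset(), so an optimizing compiler would normally be expected to generate the same code for both.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, named after the suggestion in the question. */
static inline void memzero(void *s, size_t n)
{
	memset(s, 0, n);	/* return value deliberately ignored */
}

/* Illustrative struct only, not taken from the kernel. */
struct sample {
	unsigned long start;
	unsigned long end;
	unsigned short base;
	unsigned char irq;
};

/* Returns 1 if every byte of a struct cleared with memzero() is 0. */
int zeroed_ok(void)
{
	struct sample m;
	const unsigned char *p = (const unsigned char *)&m;
	size_t i;

	memzero(&m, sizeof(m));
	for (i = 0; i < sizeof(m); i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

The wrapper saves one argument in the source, but it has to be inline for the compiler to keep treating the call as a memset() it understands.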
* Re: how many memset(,0,) calls in kernel ?
From: Willy Tarreau @ 2021-09-12  4:56 UTC
To: Douglas Gilbert; +Cc: LKML

On Sat, Sep 11, 2021 at 11:36:07PM -0400, Douglas Gilbert wrote:
> Here is a pretty rough estimate:
> $ find . -name '*.c' -exec fgrep "memset(" {} \; > memset_in_kern.txt
>
> $ cat memset_in_kern.txt | wc -l
> 20159
>
> Some of those are in comments, EXPORTs, etc, but the vast majority are
> in code. Plus there will be memset()s in header files not counted by
> that find. Checking in that output file I see:
>
> $ grep ", 0," memset_in_kern.txt | wc -l
> 18107
> $ grep ", 0" memset_in_kern.txt | wc -l
> 19349
> $ grep ", 0x" memset_in_kern.txt | wc -l
> 1210
> $ grep ", 0x01" memset_in_kern.txt | wc -l
> 3
> $ grep ", 0x0," memset_in_kern.txt | wc -l
> 199
> $ grep ",0," memset_in_kern.txt | wc -l
> 72

Note that in order to get something faster and slightly more accurate,
you can use 'git grep':

  $ git grep 'memset([^,]*,\s*0\(\|x0*\),' |wc -l
  18822

> If the BSD flavours of Unix had not given us:
>    void bzero(void *s, size_t n);
> would the Linux kernel have something similar in common usage (e.g.
> memzero() or mem0() ), that was less wasteful than the standard:
>    void *memset(void *s, int c, size_t n);
> in the extremely common case where c=0 and the return value is
> not used?

What do you mean by "wasteful" here? What are you trying to preserve,
characters in the source code maybe? Because the output code is already
adapted to the context thanks to memset() being builtin.

Let's take one of the first instances I found that's easy to match
against asm code:

net/core/dev.c:

  int __init netdev_boot_setup(char *str)
  {
        int ints[5];
        struct ifmap map;

        str = get_options(str, ARRAY_SIZE(ints), ints);
        if (!str || !*str)
                return 0;

        /* Save settings */
        memset(&map, 0, sizeof(map));
        ...
  }

It gives this:

    16:   e8 00 00 00 00          callq  1b <netdev_boot_setup+0x1b>
                  17: R_X86_64_PC32       get_options-0x4
    1b:   48 89 c6                mov    %rax,%rsi

Note that we're zeroing %eax below in preparation for the "return 0"
statement:

    1e:   31 c0                   xor    %eax,%eax

This is the "if (!str || !*str)":

    20:   48 85 f6                test   %rsi,%rsi
    23:   0f 84 98 00 00 00       je     c1 <netdev_boot_setup+0xc1>
    29:   80 3e 00                cmpb   $0x0,(%rsi)
    2c:   0f 84 8f 00 00 00       je     c1 <netdev_boot_setup+0xc1>

%r12 is set to &map:

    32:   4c 8d 65 d0             lea    -0x30(%rbp),%r12

And this is the memset "call" itself, which reuses the zero from the
%eax register:

    36:   b9 06 00 00 00          mov    $0x6,%ecx
    3b:   4c 89 e7                mov    %r12,%rdi
    3e:   f3 ab                   rep stos %eax,%es:(%rdi)

The last line does exactly "memset(%rdi, %eax, %ecx)". Just two bytes
for some code that modern processors are even able to optimize.

As you can see there's not much waste here in the output code, and
in fact using any dedicated function would be larger and likely slower.

Hoping this helps,
Willy

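One detail worth spelling out about the listing above: "rep stos %eax" stores 4 bytes per iteration, so the count of 6 in %ecx covers 6 * 4 = 24 bytes. The sketch below models that instruction in C and compares it with memset(). It assumes a stand-in struct map_like with a 24-byte layout similar to what struct ifmap appears to have on x86-64; rep_stosl() and stos_matches_memset() are hypothetical names used only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in struct; the field layout is an assumption, chosen to be 24 bytes. */
struct map_like {
	uint64_t mem_start;
	uint64_t mem_end;
	uint16_t base_addr;
	uint8_t irq, dma, port;
};

/* C model of "rep stos %eax,%es:(%rdi)": store eax, 4 bytes at a time, ecx times. */
static void rep_stosl(void *rdi, uint32_t eax, size_t ecx)
{
	unsigned char *dst = rdi;

	while (ecx--) {
		memcpy(dst, &eax, sizeof(eax));
		dst += sizeof(eax);
	}
}

/* Returns 1 if six 4-byte stores clear exactly what memset() clears. */
int stos_matches_memset(void)
{
	struct map_like a, b;

	/* Fill both with a non-zero pattern so a missed byte would show up. */
	memset(&a, 0xff, sizeof(a));
	memset(&b, 0xff, sizeof(b));

	rep_stosl(&a, 0, sizeof(a) / 4);
	memset(&b, 0, sizeof(b));

	return sizeof(a) == 6 * 4 && memcmp(&a, &b, sizeof(a)) == 0;
}
```

So the equivalence with memset() holds with the length counted in 4-byte units rather than bytes.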
* RE: how many memset(,0,) calls in kernel ?
From: David Laight @ 2021-09-13 16:03 UTC
To: 'Willy Tarreau', Douglas Gilbert; +Cc: LKML

>   36:   b9 06 00 00 00          mov    $0x6,%ecx
>   3b:   4c 89 e7                mov    %r12,%rdi
>   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
>
> The last line does exactly "memset(%rdi, %eax, %ecx)". Just two bytes
> for some code that modern processors are even able to optimize.

Hmmm I'd bet that 6 stores will be faster on ~everything.
'modern' processors do better than some older ones [1], but 6
writes isn't enough to get into the really fast paths.
So you'll still take a few cycles of setup.

[1] P4 netburst had a ~40 clock setup for any 'rep' operation.

        David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

* Re: how many memset(,0,) calls in kernel ?
From: Willy Tarreau @ 2021-09-13 16:09 UTC
To: David Laight; +Cc: Douglas Gilbert, LKML

On Mon, Sep 13, 2021 at 04:03:09PM +0000, David Laight wrote:
> >   36:   b9 06 00 00 00          mov    $0x6,%ecx
> >   3b:   4c 89 e7                mov    %r12,%rdi
> >   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
> >
> > The last line does exactly "memset(%rdi, %eax, %ecx)". Just two bytes
> > for some code that modern processors are even able to optimize.
>
> Hmmm I'd bet that 6 stores will be faster on ~everything.
> 'modern' processors do better than some older ones [1], but 6
> writes isn't enough to get into the really fast paths.
> So you'll still take a few cycles of setup.

The exact point is, here it's up to the compiler to decide thanks to
its builtin what it considers best for the target CPU. It already
knows the fixed size and the code is emitted accordingly. It may
very well be a call to the memset() function when the size is large
and a power of two because it knows alternate variants are available
for example.

The compiler might even decide to shrink that area if other bytes
are written just after the memset(), leaving only holes touched by
memset().

Willy

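The remark about writes that immediately follow a memset() can be tried on a small function. This is a minimal sketch, assuming a hypothetical struct conf and conf_init(); neither comes from the kernel. Whether the compiler trims the clear down to the two fields that are not overwritten, merges stores, or keeps a full clear depends on compiler version, flags and target, so the generated code should be inspected rather than assumed.

```c
#include <string.h>

/* Hypothetical four-field struct used only for this illustration. */
struct conf {
	int a;
	int b;
	int c;
	int d;
};

/* Clear the whole struct, then immediately overwrite two of its fields. */
void conf_init(struct conf *c, int a, int d)
{
	memset(c, 0, sizeof(*c));
	c->a = a;	/* this store makes the zeroing of c->a redundant */
	c->d = d;	/* likewise for c->d; only b and c need the zeroes */
}
```

Compiling with something like "gcc -O2 -c conf.c" and reading "objdump -d conf.o" (the file name is illustrative) shows what a given compiler actually chose.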
* RE: how many memset(,0,) calls in kernel ?
From: David Laight @ 2021-09-14  8:23 UTC
To: 'Willy Tarreau'; +Cc: Douglas Gilbert, LKML

From: Willy Tarreau
> Sent: 13 September 2021 17:10
>
> On Mon, Sep 13, 2021 at 04:03:09PM +0000, David Laight wrote:
> > >   36:   b9 06 00 00 00          mov    $0x6,%ecx
> > >   3b:   4c 89 e7                mov    %r12,%rdi
> > >   3e:   f3 ab                   rep stos %eax,%es:(%rdi)
> > >
> > > The last line does exactly "memset(%rdi, %eax, %ecx)". Just two bytes
> > > for some code that modern processors are even able to optimize.
> >
> > Hmmm I'd bet that 6 stores will be faster on ~everything.
> > 'modern' processors do better than some older ones [1], but 6
> > writes isn't enough to get into the really fast paths.
> > So you'll still take a few cycles of setup.
>
> The exact point is, here it's up to the compiler to decide thanks to
> its builtin what it considers best for the target CPU. It already
> knows the fixed size and the code is emitted accordingly. It may
> very well be a call to the memset() function when the size is large
> and a power of two because it knows alternate variants are available
> for example.
>
> The compiler might even decide to shrink that area if other bytes
> are written just after the memset(), leaving only holes touched by
> memset().

You might think the compiler will make sane choices for the target CPU.
But it often makes a complete pig's breakfast of it.
I'm pretty sure 6 'rep stos' is slower than 6 writes on absolutely
everything - with the possible exception of an 8088.

By far the worst ones are when the compiler decides to pessimise
a loop by using the simd (eg avx512) instructions to do 4 (or 8)
loop iterations in one pass.
It might be fine if the loop count is in the 100s - but not when it is 3.

One compiler I've used nicely converted any byte copy loop into
a 'rep movsb' instruction.
That was contemporary with P4 netburst - where it was terribly slow.

        David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

* Re: how many memset(,0,) calls in kernel ?
From: Willy Tarreau @ 2021-09-14 16:46 UTC
To: David Laight; +Cc: Douglas Gilbert, LKML

On Tue, Sep 14, 2021 at 08:23:40AM +0000, David Laight wrote:
> > The exact point is, here it's up to the compiler to decide thanks to
> > its builtin what it considers best for the target CPU. It already
> > knows the fixed size and the code is emitted accordingly. It may
> > very well be a call to the memset() function when the size is large
> > and a power of two because it knows alternate variants are available
> > for example.
> >
> > The compiler might even decide to shrink that area if other bytes
> > are written just after the memset(), leaving only holes touched by
> > memset().
>
> You might think the compiler will make sane choices for the target CPU.
> But it often makes a complete pig's breakfast of it.
> I'm pretty sure 6 'rep stos' is slower than 6 writes on absolutely
> everything - with the possible exception of an 8088.

It can be suboptimal (especially with the moderate latencies required
for small areas), but my point is that in plenty of cases the memset()
call will be totally eliminated. Example:

The file:

  #include <string.h>

  int f(int a, int b)
  {
        struct {
                int n1;
                int n2;
                int n3;
                int n4;
        } s;

        memset(&s, 0, sizeof(s));
        s.n2 = a;
        s.n3 = b;
        return s.n1 + s.n2 + s.n3 + s.n4;
  }

gives:

  0000000000000000 <f>:
     0:   8d 04 37                lea    (%rdi,%rsi,1),%eax
     3:   c3                      retq

See? The builtin allowed the compiler to *know* that these areas
were zeroes and could optimize them away.

More importantly this can save some reads from being performed, with
the data being only written into:

  #include <string.h>

  struct {
        int n1;
        int n2;
  } s;

  void f(int a, int b)
  {
        memset(&s, 0, sizeof(s));
        s.n1 |= a;
        s.n2 |= b;
  }

Gives:

  0000000000000000 <f>:
     0:   89 3d 00 00 00 00       mov    %edi,0x0(%rip)        # 6 <f+0x6>
     6:   89 35 00 00 00 00       mov    %esi,0x0(%rip)        # c <f+0xc>
     c:   c3                      retq

See? Just plain writes, no read-modify-write of the memory area.

If you'd call an external memset() function, you'd instantly lose all
these possibilities:

  0000000000000000 <f>:
     0:   55                      push   %rbp
     1:   ba 08 00 00 00          mov    $0x8,%edx
     6:   89 fd                   mov    %edi,%ebp
     8:   bf 00 00 00 00          mov    $0x0,%edi
     d:   53                      push   %rbx
     e:   89 f3                   mov    %esi,%ebx
    10:   31 f6                   xor    %esi,%esi
    12:   48 83 ec 08             sub    $0x8,%rsp
    16:   e8 00 00 00 00          callq  1b <f+0x1b>
    1b:   09 2d 00 00 00 00       or     %ebp,0x0(%rip)        # 21 <f+0x21>
    21:   09 1d 00 00 00 00       or     %ebx,0x0(%rip)        # 27 <f+0x27>
    27:   48 83 c4 08             add    $0x8,%rsp
    2b:   5b                      pop    %rbx
    2c:   5d                      pop    %rbp
    2d:   c3                      retq

Thus the fact that the compiler has knowledge of the memset() is useful.

Willy
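The builtin versus external contrast can be reproduced in a single file. The sketch below is an illustration under stated assumptions, not the setup used in the message above: opaque_memset, sum_builtin() and sum_opaque() are hypothetical names, and routing the call through a volatile function pointer is just one way to hide the callee from the optimizer (building with GCC's -fno-builtin option is another). Both functions return the same value; only the generated code is expected to differ.

```c
#include <string.h>

struct pair {
	int n1;
	int n2;
};

/* Direct call: the compiler can treat memset() as a builtin and
 * knows that s is all zeroes afterwards. */
int sum_builtin(int a, int b)
{
	struct pair s;

	memset(&s, 0, sizeof(s));
	s.n1 |= a;
	s.n2 |= b;
	return s.n1 + s.n2;
}

/* The pointer is volatile, so it must be reloaded at each call and the
 * compiler cannot assume which function it ends up calling. */
static void *(*volatile opaque_memset)(void *, int, size_t) = memset;

int sum_opaque(int a, int b)
{
	struct pair s;

	opaque_memset(&s, 0, sizeof(s));
	s.n1 |= a;
	s.n2 |= b;
	return s.n1 + s.n2;
}
```

Disassembling an optimized build of this file should typically show sum_builtin() with no trace of the clear, while sum_opaque() keeps a real call followed by the read-modify-write operations.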