* gcc miscompiles csum_tcpudp_magic() on ARMv5 @ 2013-12-12 12:14 Maxime Bizon 2013-12-12 12:40 ` Russell King - ARM Linux 0 siblings, 1 reply; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 12:14 UTC (permalink / raw) To: linux-arm-kernel Hello, I tried using csum_tcpudp_magic() like this: csum_tcpudp_magic(src, dst, ntohs(udph->len), IPPROTO_UDP, csum); instead of the more common: len = ntohs(udhp->len); [...] csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum); the first one gives a bad checksum while the second one is ok. I've tracked down the problem to csum_tcpudp_nofold(), which uses inline asm and an unsigned short for len. If the len value is say 0x3412, and I pass ntohs(len), then the assigned register for len contains 0x00341234 instead of the expected 0x1234. The ntohs() expand to a dumb swab16 on my arch, and gcc does the swap but does not clear the high nibble, I guess it expects the assembly code to only use the low 16 bits. Is there a missing constraint or gcc is doing something wrong here ? -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 12:14 gcc miscompiles csum_tcpudp_magic() on ARMv5 Maxime Bizon @ 2013-12-12 12:40 ` Russell King - ARM Linux 2013-12-12 13:36 ` Maxime Bizon 0 siblings, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 12:40 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 01:14:04PM +0100, Maxime Bizon wrote: > > Hello, > > I tried using csum_tcpudp_magic() like this: > > csum_tcpudp_magic(src, dst, ntohs(udph->len), IPPROTO_UDP, csum); > > instead of the more common: > > len = ntohs(udhp->len); > [...] > csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum); > > the first one gives a bad checksum while the second one is ok. > > > I've tracked down the problem to csum_tcpudp_nofold(), which uses inline > asm and an unsigned short for len. > > If the len value is say 0x3412, and I pass ntohs(len), then the assigned > register for len contains 0x00341234 instead of the expected 0x1234. > > The ntohs() expand to a dumb swab16 on my arch, and gcc does the swap > but does not clear the high nibble, I guess it expects the assembly code > to only use the low 16 bits. > > Is there a missing constraint or gcc is doing something wrong here ? Depends which swab16 you mean by "dumb swab16". If it is a gcc bug then you need to submit a bug report to gcc people. ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 12:40 ` Russell King - ARM Linux @ 2013-12-12 13:36 ` Maxime Bizon 2013-12-12 13:48 ` Måns Rullgård 2013-12-12 14:48 ` Russell King - ARM Linux 0 siblings, 2 replies; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 13:36 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote: > Depends which swab16 you mean by "dumb swab16". If it is a gcc bug then that one: #define __swab16(x) ((uint16_t)( \ (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \ (((uint16_t)(x) & (uint16_t)0xff00U) >> 8))) usually expands to this: 24: e1a00800 lsl r0, r0, #16 28: e1a03c20 lsr r3, r0, #24 2c: e1833420 orr r3, r3, r0, lsr #8 30: e1a03803 lsl r3, r3, #16 34: e1a00823 lsr r0, r3, #16 but in my case, the two last shifts are missing. > you need to submit a bug report to gcc people. but is it for sure ? I couldn't find any working gcc version so it does not look like a regression, hence my doubt. basically if you use inline asm with a variable that is smaller than register width (32 bits here), can you assume the value in the register will be zero extended ? I could not find the answer in the gcc manual. -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 13:36 ` Maxime Bizon @ 2013-12-12 13:48 ` Måns Rullgård 2013-12-12 14:10 ` Maxime Bizon 2013-12-12 14:48 ` Russell King - ARM Linux 1 sibling, 1 reply; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 13:48 UTC (permalink / raw) To: linux-arm-kernel Maxime Bizon <mbizon@freebox.fr> writes: > On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote: > >> Depends which swab16 you mean by "dumb swab16". If it is a gcc bug then > > that one: > > #define __swab16(x) ((uint16_t)( \ > (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \ > (((uint16_t)(x) & (uint16_t)0xff00U) >> 8))) > > usually expands to this: > > 24: e1a00800 lsl r0, r0, #16 > 28: e1a03c20 lsr r3, r0, #24 > 2c: e1833420 orr r3, r3, r0, lsr #8 > 30: e1a03803 lsl r3, r3, #16 > 34: e1a00823 lsr r0, r3, #16 > > but in my case, the two last shifts are missing. > >> you need to submit a bug report to gcc people. > > but is it for sure ? > > I couldn't find any working gcc version so it does not look like a > regression, hence my doubt. > > basically if you use inline asm with a variable that is smaller than > register width (32 bits here), can you assume the value in the register > will be zero extended ? I could not find the answer in the gcc manual. In the code above, the outer (uint16_t) cast should clear the top half, as should passing the value to a function (inline doesn't alter the semantics) as a 16-bit type, so there's something fishy here. Which gcc versions did you try? -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 13:48 ` Måns Rullgård @ 2013-12-12 14:10 ` Maxime Bizon 2013-12-12 14:19 ` Willy Tarreau 2013-12-12 14:26 ` Måns Rullgård 0 siblings, 2 replies; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 14:10 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 13:48 +0000, M?ns Rullg?rd wrote: > In the code above, the outer (uint16_t) cast should clear the top half, > as should passing the value to a function (inline doesn't alter the > semantics) as a 16-bit type, so there's something fishy here. using __attribute__((noinline)), or putting the function in another file makes the bug disappear But I'm not convinced inline doesn't change the semantic, since gcc is merging the function inside another one the rules of calling convention should not matter anymore. I attached a second test case without a separate function that has the same bug. > Which gcc versions did you try? 4.3.2, 4.6.0, 4.7.2, 4.8-2013 (linaro) Here is a simple userspace test case. #include <stdint.h> #include <stdio.h> static inline uint32_t asm_add(uint16_t len, uint32_t sum) { __asm__( "add %0, %1, %2 \n" : "=&r"(sum) : "r" (sum), "r" (len) ); return sum; } #define local_swab16(x) ((uint16_t)( \ (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \ (((uint16_t)(x) & (uint16_t)0xff00U) >> 8))) int main(int argc, char *argv[]) { uint16_t foo; foo = strtoul(argv[1], NULL, 0); printf("%08x\n", asm_add(local_swab16(foo), 0)); return 0; } without optimization, or with noinline: # ./a.out 0x3412 00001234 with optimization: # ./a.out 0x3412 00341234 And the second test case without the inline function. int main(int argc, char *argv[]) { uint32_t sum = 0; uint16_t foo; foo = strtoul(argv[1], NULL, 0); __asm__ ( "add %0, %1, %2 \n" : "=&r"(sum) : "r" (sum), "r" (local_swab16(foo)) ); printf("%08x\n", sum); return 0; } -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:10 ` Maxime Bizon @ 2013-12-12 14:19 ` Willy Tarreau 2013-12-12 14:28 ` Maxime Bizon 2013-12-12 14:37 ` Måns Rullgård 2013-12-12 14:26 ` Måns Rullgård 1 sibling, 2 replies; 37+ messages in thread From: Willy Tarreau @ 2013-12-12 14:19 UTC (permalink / raw) To: linux-arm-kernel Hi Maxime, On Thu, Dec 12, 2013 at 03:10:10PM +0100, Maxime Bizon wrote: > > On Thu, 2013-12-12 at 13:48 +0000, M?ns Rullg?rd wrote: > > > In the code above, the outer (uint16_t) cast should clear the top half, > > as should passing the value to a function (inline doesn't alter the > > semantics) as a 16-bit type, so there's something fishy here. > > using __attribute__((noinline)), or putting the function in another file > makes the bug disappear > > But I'm not convinced inline doesn't change the semantic, since gcc is > merging the function inside another one the rules of calling convention > should not matter anymore. I disagree here, since gcc may decide by itself to inline or not, it must not impact the validity of the emitted code. Inline functions have input and output types for a reason! (...) > #include <stdint.h> > #include <stdio.h> > > static inline uint32_t asm_add(uint16_t len, uint32_t sum) > { > __asm__( > "add %0, %1, %2 \n" > : "=&r"(sum) > : "r" (sum), "r" (len) > ); > return sum; > } Hmmm aren't you passing a 16-bit register directly to the ASM for being used as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how you're going to use the register. Could you check if that fixes it : static inline uint32_t asm_add(uint16_t len, uint32_t sum) { uint32_t len32 = len; __asm__( "add %0, %1, %2 \n" : "=&r"(sum) : "r" (sum), "r" (len32) ); return sum; } Or maybe simply : static inline uint32_t asm_add(uint16_t len, uint32_t sum) { __asm__( "add %0, %1, %2 \n" : "=&r"(sum) : "r" (sum), "r" (uint32_t)(len) ); return sum; } Willy ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:19 ` Willy Tarreau @ 2013-12-12 14:28 ` Maxime Bizon 2013-12-12 14:42 ` Måns Rullgård 2013-12-12 14:37 ` Måns Rullgård 1 sibling, 1 reply; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 14:28 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 15:19 +0100, Willy Tarreau wrote: > I disagree here, since gcc may decide by itself to inline or not, it must > not impact the validity of the emitted code. Inline functions have input > and output types for a reason! but the code emitted is completely different for an inlined "occurence" of the function and the non-inlined case. if a function does not use one of its argument, and it is inlined, then in the resulting code you see no trace of it > Hmmm aren't you passing a 16-bit register directly to the ASM for being used > as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how > you're going to use the register. this is exactly what I'm complaining about, the arm code for csum_tcpudp_nofold() in the kernel does exactly that. > Could you check if that fixes it : > > static inline uint32_t asm_add(uint16_t len, uint32_t sum) > { > uint32_t len32 = len; or change the asm_add() proto to take an "uint32_t len" instead, and yes of course that fixes it. -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:28 ` Maxime Bizon @ 2013-12-12 14:42 ` Måns Rullgård 2013-12-12 14:52 ` Maxime Bizon 2013-12-12 15:07 ` Willy Tarreau 0 siblings, 2 replies; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 14:42 UTC (permalink / raw) To: linux-arm-kernel Maxime Bizon <mbizon@freebox.fr> writes: > On Thu, 2013-12-12 at 15:19 +0100, Willy Tarreau wrote: > >> I disagree here, since gcc may decide by itself to inline or not, it must >> not impact the validity of the emitted code. Inline functions have input >> and output types for a reason! > > but the code emitted is completely different for an inlined "occurence" > of the function and the non-inlined case. > > if a function does not use one of its argument, and it is inlined, then > in the resulting code you see no trace of it Again, that's an optimisation that does not alter the semantics of the code. Although the generated code looks very different, it does the same thing. >> Hmmm aren't you passing a 16-bit register directly to the ASM for being used >> as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how >> you're going to use the register. > > this is exactly what I'm complaining about, the arm code for > csum_tcpudp_nofold() in the kernel does exactly that. > >> Could you check if that fixes it : >> >> static inline uint32_t asm_add(uint16_t len, uint32_t sum) >> { >> uint32_t len32 = len; > > or change the asm_add() proto to take an "uint32_t len" instead, and yes > of course that fixes it. It's a bug. Please report it to the gcc developers. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:42 ` Måns Rullgård @ 2013-12-12 14:52 ` Maxime Bizon 2013-12-12 14:58 ` Måns Rullgård 2013-12-12 15:00 ` Russell King - ARM Linux 2013-12-12 15:07 ` Willy Tarreau 1 sibling, 2 replies; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 14:52 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 14:42 +0000, M?ns Rullg?rd wrote: > > Again, that's an optimisation that does not alter the semantics of the > code. Although the generated code looks very different, it does the > same thing. > It cannot do the same thing as there are possibly nothing to do after inline. static __attribute__((noinline)) unsigned int do_nothing(unsigned char foo) { foo += 42; return 0; } int func(int a) { return do_nothing(a); } 00000000 <do_nothing>: 0: e3a00000 mov r0, #0 4: e12fff1e bx lr 00000008 <func>: 8: e52de004 push {lr} ; (str lr, [sp, #-4]!) c: e24dd004 sub sp, sp, #4 10: e20000ff and r0, r0, #255 ; 0xff 14: ebfffff9 bl 0 <do_nothing> 18: e28dd004 add sp, sp, #4 1c: e8bd8000 ldmfd sp!, {pc} static inline unsigned int do_nothing(unsigned char foo) { foo += 42; return 0; } int func(int a) { return do_nothing(a); } 00000000 <func>: 0: e3a00000 mov r0, #0 4: e12fff1e bx lr In the first case, the compiler narrows "int a" to char and call the uninlined function. In the second case, there is absolutely no generated code to push any arguments as the function that does nothing is inlined into func(). > -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:52 ` Maxime Bizon @ 2013-12-12 14:58 ` Måns Rullgård 2013-12-12 15:00 ` Russell King - ARM Linux 1 sibling, 0 replies; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 14:58 UTC (permalink / raw) To: linux-arm-kernel Maxime Bizon <mbizon@freebox.fr> writes: > On Thu, 2013-12-12 at 14:42 +0000, M?ns Rullg?rd wrote: >> >> Again, that's an optimisation that does not alter the semantics of the >> code. Although the generated code looks very different, it does the >> same thing. >> > It cannot do the same thing as there are possibly nothing to do after > inline. > > static __attribute__((noinline)) unsigned int do_nothing(unsigned char foo) > { > foo += 42; > return 0; > } > > int func(int a) > { > return do_nothing(a); > } > > 00000000 <do_nothing>: > 0: e3a00000 mov r0, #0 > 4: e12fff1e bx lr > > 00000008 <func>: > 8: e52de004 push {lr} ; (str lr, [sp, #-4]!) > c: e24dd004 sub sp, sp, #4 > 10: e20000ff and r0, r0, #255 ; 0xff > 14: ebfffff9 bl 0 <do_nothing> > 18: e28dd004 add sp, sp, #4 > 1c: e8bd8000 ldmfd sp!, {pc} > > static inline unsigned int do_nothing(unsigned char foo) > { > foo += 42; > return 0; > } > > int func(int a) > { > return do_nothing(a); > } > > 00000000 <func>: > 0: e3a00000 mov r0, #0 > 4: e12fff1e bx lr > > In the first case, the compiler narrows "int a" to char and call the > uninlined function. > > In the second case, there is absolutely no generated code to push any > arguments as the function that does nothing is inlined into func(). In both cases, the effects on the global state of the program are exactly the same. That's all that matters. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:52 ` Maxime Bizon 2013-12-12 14:58 ` Måns Rullgård @ 2013-12-12 15:00 ` Russell King - ARM Linux 2013-12-12 15:26 ` Maxime Bizon 1 sibling, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 15:00 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 03:52:43PM +0100, Maxime Bizon wrote: > > On Thu, 2013-12-12 at 14:42 +0000, M?ns Rullg?rd wrote: > > > > Again, that's an optimisation that does not alter the semantics of the > > code. Although the generated code looks very different, it does the > > same thing. > > > It cannot do the same thing as there are possibly nothing to do after > inline. > > > static __attribute__((noinline)) unsigned int do_nothing(unsigned char foo) > { > foo += 42; > return 0; > } > > int func(int a) > { > return do_nothing(a); > } > > 00000000 <do_nothing>: > 0: e3a00000 mov r0, #0 > 4: e12fff1e bx lr > > 00000008 <func>: > 8: e52de004 push {lr} ; (str lr, [sp, #-4]!) > c: e24dd004 sub sp, sp, #4 > 10: e20000ff and r0, r0, #255 ; 0xff > 14: ebfffff9 bl 0 <do_nothing> > 18: e28dd004 add sp, sp, #4 > 1c: e8bd8000 ldmfd sp!, {pc} > > > static inline unsigned int do_nothing(unsigned char foo) > { > foo += 42; > return 0; > } > > int func(int a) > { > return do_nothing(a); > } > > > 00000000 <func>: > 0: e3a00000 mov r0, #0 > 4: e12fff1e bx lr > > > In the first case, the compiler narrows "int a" to char and call the > uninlined function. > > In the second case, there is absolutely no generated code to push any > arguments as the function that does nothing is inlined into func(). This is different - the compiler has recognised in both cases that the addition od 42 to foo is useless as the result is not used, and therefore has optimised the addition away. In the second case, it has realised that the narrowing cast used to then add 42 to is also not used, and it has also optimised that away. A better test case would be do to do this: foo += 42; return foo; so that "foo" is actually used. Or, if you don't feel happy with that: extern void use_result(unsigned int); foo += 42; use_result(foo); return 0; so that the compiler can't decide that 'foo' is never used. ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:00 ` Russell King - ARM Linux @ 2013-12-12 15:26 ` Maxime Bizon 0 siblings, 0 replies; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 15:26 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 15:00 +0000, Russell King - ARM Linux wrote: > This is different - the compiler has recognised in both cases that the > addition od 42 to foo is useless as the result is not used, and > therefore has optimised the addition away. In the second case, it has > realised that the narrowing cast used to then add 42 to is also not > used, and it has also optimised that away. I was just trying to show that in the inline case the generated code was not the same (apparently there was a misunderstanding on what I was arguing about). Now if you want a more interesting/on topic example: #include <stdint.h> static __attribute__((noinline)) unsigned int set_bit_2(unsigned int foo) { foo |= 0x4; return foo; } #define local_swab16(x) ((uint16_t)( \ (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \ (((uint16_t)(x) & (uint16_t)0xff00U) >> 8))) int func(uint16_t a) { a = set_bit_2(local_swab16(a)); if ((a & 0xffff) == 0x1234) return 1; return 0; } 00000000 <set_bit_2>: 0: e3800004 orr r0, r0, #4 4: e12fff1e bx lr 00000008 <func>: 8: e92d4008 push {r3, lr} c: e1a03420 lsr r3, r0, #8 10: e1830400 orr r0, r3, r0, lsl #8 14: e1a00800 lsl r0, r0, #16 18: e1a00820 lsr r0, r0, #16 1c: ebfffff7 bl 0 <set_bit_2> 20: e3a03c12 mov r3, #4608 ; 0x1200 24: e1a00800 lsl r0, r0, #16 28: e2833034 add r3, r3, #52 ; 0x34 2c: e1530820 cmp r3, r0, lsr #16 30: 03a00001 moveq r0, #1 34: 13a00000 movne r0, #0 38: e8bd8008 pop {r3, pc} you can see here the full swab16 code, with the narrowing before calling set_bit_2() per calling convention. Now remove the noninline: 00000000 <func>: 0: e3a03c12 mov r3, #4608 ; 0x1200 4: e1a02420 lsr r2, r0, #8 8: e1820400 orr r0, r2, r0, lsl #8 c: e3802004 orr r2, r0, #4 10: e1a02802 lsl r2, r2, #16 14: e2833034 add r3, r3, #52 ; 0x34 18: e1530822 cmp r3, r2, lsr #16 1c: 03a00001 moveq r0, #1 20: 13a00000 movne r0, #0 24: e12fff1e bx lr and you see that the double shift is removed because gcc knows it is not needed by the remaining code, which uses only the lower 16 bits. This is IMO exactly what happen in the csum case, the inline assembly rule (the one we are not aware of) must be that you cannot use the 16 higher bits of the register in the assembly code, and so gcc takes the liberty not to zero extend the register. -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:42 ` Måns Rullgård 2013-12-12 14:52 ` Maxime Bizon @ 2013-12-12 15:07 ` Willy Tarreau 2013-12-12 15:18 ` Måns Rullgård 1 sibling, 1 reply; 37+ messages in thread From: Willy Tarreau @ 2013-12-12 15:07 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 02:42:23PM +0000, M?ns Rullg?rd wrote: > Maxime Bizon <mbizon@freebox.fr> writes: > > > On Thu, 2013-12-12 at 15:19 +0100, Willy Tarreau wrote: > > > >> I disagree here, since gcc may decide by itself to inline or not, it must > >> not impact the validity of the emitted code. Inline functions have input > >> and output types for a reason! > > > > but the code emitted is completely different for an inlined "occurence" > > of the function and the non-inlined case. > > > > if a function does not use one of its argument, and it is inlined, then > > in the resulting code you see no trace of it > > Again, that's an optimisation that does not alter the semantics of the code. > Although the generated code looks very different, it does the same thing. Agreed. > >> Hmmm aren't you passing a 16-bit register directly to the ASM for being used > >> as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how > >> you're going to use the register. > > > > this is exactly what I'm complaining about, the arm code for > > csum_tcpudp_nofold() in the kernel does exactly that. > > > >> Could you check if that fixes it : > >> > >> static inline uint32_t asm_add(uint16_t len, uint32_t sum) > >> { > >> uint32_t len32 = len; > > > > or change the asm_add() proto to take an "uint32_t len" instead, and yes > > of course that fixes it. > > It's a bug. Please report it to the gcc developers. Here I don't agree with the generalization (and believe me I swear all the day about gcc's bugs). It's a matter of ABI and availability or not of 16 bit registers or not. If the ASM supports 16-bit regs and the compiler is allowed to emit 16-bit regs, then gcc will have no way to know whether it must zero-extend the value first. If it's specified that "r" is necessarily a 32-bit register then it should. Maybe the issue is in the ABI itself, I don't know if 16-bit values are supposed to be zero-extended only when they're converted to 32-bit or also when they're passed as arguments. The fact that it works without inline may simply be a side effect of the different code (eg: 16 lower bits of the register copied into another one). So one needs to look at the specs of the ABI to know where the 16-bit value is supposed to be converted to 32-bit, then the faulty component must be fixed (gcc or kernel code). Regards, Willy ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:07 ` Willy Tarreau @ 2013-12-12 15:18 ` Måns Rullgård 2013-12-12 15:28 ` Willy Tarreau 0 siblings, 1 reply; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 15:18 UTC (permalink / raw) To: linux-arm-kernel Willy Tarreau <w@1wt.eu> writes: >> >> Hmmm aren't you passing a 16-bit register directly to the ASM for >> >> being used as a 32-bit one ? This seems hasardous to me since >> >> nowhere you tell gcc how you're going to use the register. >> > >> > this is exactly what I'm complaining about, the arm code for >> > csum_tcpudp_nofold() in the kernel does exactly that. >> > >> >> Could you check if that fixes it : >> >> >> >> static inline uint32_t asm_add(uint16_t len, uint32_t sum) >> >> { >> >> uint32_t len32 = len; >> > >> > or change the asm_add() proto to take an "uint32_t len" instead, and yes >> > of course that fixes it. >> >> It's a bug. Please report it to the gcc developers. > > Here I don't agree with the generalization (and believe me I swear all the > day about gcc's bugs). It's a matter of ABI and availability or not of 16 > bit registers or not. If the ASM supports 16-bit regs and the compiler is > allowed to emit 16-bit regs, then gcc will have no way to know whether it > must zero-extend the value first. If it's specified that "r" is necessarily > a 32-bit register then it should. ARM has *only* 32-bit registers. > Maybe the issue is in the ABI itself, I don't know if 16-bit values > are supposed to be zero-extended only when they're converted to 32-bit > or also when they're passed as arguments. The fact that it works > without inline may simply be a side effect of the different code (eg: > 16 lower bits of the register copied into another one). > > So one needs to look at the specs of the ABI to know where the 16-bit value > is supposed to be converted to 32-bit, then the faulty component must be > fixed (gcc or kernel code). The ABI for function calls sign/zero-extends all arguments prior to the call. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:18 ` Måns Rullgård @ 2013-12-12 15:28 ` Willy Tarreau 2013-12-12 15:43 ` Russell King - ARM Linux 0 siblings, 1 reply; 37+ messages in thread From: Willy Tarreau @ 2013-12-12 15:28 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 03:18:18PM +0000, M?ns Rullg?rd wrote: > Willy Tarreau <w@1wt.eu> writes: > > >> >> Hmmm aren't you passing a 16-bit register directly to the ASM for > >> >> being used as a 32-bit one ? This seems hasardous to me since > >> >> nowhere you tell gcc how you're going to use the register. > >> > > >> > this is exactly what I'm complaining about, the arm code for > >> > csum_tcpudp_nofold() in the kernel does exactly that. > >> > > >> >> Could you check if that fixes it : > >> >> > >> >> static inline uint32_t asm_add(uint16_t len, uint32_t sum) > >> >> { > >> >> uint32_t len32 = len; > >> > > >> > or change the asm_add() proto to take an "uint32_t len" instead, and yes > >> > of course that fixes it. > >> > >> It's a bug. Please report it to the gcc developers. > > > > Here I don't agree with the generalization (and believe me I swear all the > > day about gcc's bugs). It's a matter of ABI and availability or not of 16 > > bit registers or not. If the ASM supports 16-bit regs and the compiler is > > allowed to emit 16-bit regs, then gcc will have no way to know whether it > > must zero-extend the value first. If it's specified that "r" is necessarily > > a 32-bit register then it should. > > ARM has *only* 32-bit registers. > > > Maybe the issue is in the ABI itself, I don't know if 16-bit values > > are supposed to be zero-extended only when they're converted to 32-bit > > or also when they're passed as arguments. The fact that it works > > without inline may simply be a side effect of the different code (eg: > > 16 lower bits of the register copied into another one). > > > > So one needs to look at the specs of the ABI to know where the 16-bit value > > is supposed to be converted to 32-bit, then the faulty component must be > > fixed (gcc or kernel code). > > The ABI for function calls sign/zero-extends all arguments prior to the > call. OK then that's pretty clear, there's no ambiguity, thanks for the precision. Willy ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:28 ` Willy Tarreau @ 2013-12-12 15:43 ` Russell King - ARM Linux 2013-12-12 15:50 ` Måns Rullgård 0 siblings, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 15:43 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 04:28:49PM +0100, Willy Tarreau wrote: > On Thu, Dec 12, 2013 at 03:18:18PM +0000, M?ns Rullg?rd wrote: > > The ABI for function calls sign/zero-extends all arguments prior to the > > call. > > OK then that's pretty clear, there's no ambiguity, thanks for the precision. Let's be clear about this: there is no GCC bug here - the kernel code unfortunately what is buggy. asm() input arguments are not subject to this promotion irrespective of the ABI. ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:43 ` Russell King - ARM Linux @ 2013-12-12 15:50 ` Måns Rullgård 0 siblings, 0 replies; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 15:50 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux <linux@arm.linux.org.uk> writes: > On Thu, Dec 12, 2013 at 04:28:49PM +0100, Willy Tarreau wrote: >> On Thu, Dec 12, 2013 at 03:18:18PM +0000, M?ns Rullg?rd wrote: >> > The ABI for function calls sign/zero-extends all arguments prior to the >> > call. >> >> OK then that's pretty clear, there's no ambiguity, thanks for the precision. > > Let's be clear about this: there is no GCC bug here - the kernel code > unfortunately what is buggy. asm() input arguments are not subject to > this promotion irrespective of the ABI. At the asm() the value should already have been zero-extended. Since the code does not do anything with undefined behaviour, the fact that different optimisation levels give different results clearly indicates a compiler bug. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:19 ` Willy Tarreau 2013-12-12 14:28 ` Maxime Bizon @ 2013-12-12 14:37 ` Måns Rullgård 2013-12-12 14:40 ` Maxime Bizon 1 sibling, 1 reply; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 14:37 UTC (permalink / raw) To: linux-arm-kernel Willy Tarreau <w@1wt.eu> writes: >> #include <stdint.h> >> #include <stdio.h> >> >> static inline uint32_t asm_add(uint16_t len, uint32_t sum) >> { >> __asm__( >> "add %0, %1, %2 \n" >> : "=&r"(sum) >> : "r" (sum), "r" (len) >> ); >> return sum; >> } > > Hmmm aren't you passing a 16-bit register directly to the ASM for being used > as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how > you're going to use the register. There is no such thing as a 16-bit register on ARM. What this code does is ask the compiler to take the 16-bit value 'len' and put it in a register for use in the asm code. All registers are 32-bit, so there is no ambiguity. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:37 ` Måns Rullgård @ 2013-12-12 14:40 ` Maxime Bizon 2013-12-12 14:47 ` Måns Rullgård 0 siblings, 1 reply; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 14:40 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 14:37 +0000, M?ns Rullg?rd wrote: > There is no such thing as a 16-bit register on ARM. What this code does > is ask the compiler to take the 16-bit value 'len' and put it in a register > for use in the asm code. All registers are 32-bit, so there is no ambiguity. which strictly speaking the compiler did ;) it took the 16 bits value and put it inside the lower 16 bits of the 32 bits register, it just didn't touch the higher ones -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:40 ` Maxime Bizon @ 2013-12-12 14:47 ` Måns Rullgård 0 siblings, 0 replies; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 14:47 UTC (permalink / raw) To: linux-arm-kernel Maxime Bizon <mbizon@freebox.fr> writes: > On Thu, 2013-12-12 at 14:37 +0000, M?ns Rullg?rd wrote: > >> There is no such thing as a 16-bit register on ARM. What this code does >> is ask the compiler to take the 16-bit value 'len' and put it in a register >> for use in the asm code. All registers are 32-bit, so there is no ambiguity. > > which strictly speaking the compiler did ;) > > it took the 16 bits value and put it inside the lower 16 bits of the 32 > bits register, it just didn't touch the higher ones That's not where things went wrong. The entire computation is done in registers, and the swab16 macro casts the result to uint16_t. This cast should clear the high half of the register (that's the lsl/lsr pair you saw), but for some reason this isn't happening. Now please file a gcc bug report. That's the proper place to be discussing this. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:10 ` Maxime Bizon 2013-12-12 14:19 ` Willy Tarreau @ 2013-12-12 14:26 ` Måns Rullgård 1 sibling, 0 replies; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 14:26 UTC (permalink / raw) To: linux-arm-kernel Maxime Bizon <mbizon@freebox.fr> writes: > On Thu, 2013-12-12 at 13:48 +0000, M?ns Rullg?rd wrote: > >> In the code above, the outer (uint16_t) cast should clear the top half, >> as should passing the value to a function (inline doesn't alter the >> semantics) as a 16-bit type, so there's something fishy here. > > using __attribute__((noinline)), or putting the function in another file > makes the bug disappear > > But I'm not convinced inline doesn't change the semantic, since gcc is > merging the function inside another one the rules of calling convention > should not matter anymore. Function inlining is just an optimisation and like any optimisation it must not alter the semantics of the program. What happens inside the function also should not matter. > I attached a second test case without a separate function that has the > same bug. > >> Which gcc versions did you try? > > 4.3.2, 4.6.0, 4.7.2, 4.8-2013 (linaro) That's a decent spread. Recently, I don't recall the exact version, gcc started recognising typical byte-reverse patters, which could have been related to this, but since it goes back to 4.3, that seems not to be the case. > Here is a simple userspace test case. > > #include <stdint.h> > #include <stdio.h> > > static inline uint32_t asm_add(uint16_t len, uint32_t sum) > { > __asm__( > "add %0, %1, %2 \n" > : "=&r"(sum) > : "r" (sum), "r" (len) > ); > return sum; > } > > #define local_swab16(x) ((uint16_t)( \ > (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \ > (((uint16_t)(x) & (uint16_t)0xff00U) >> 8))) > > int main(int argc, char *argv[]) > { > uint16_t foo; > > foo = strtoul(argv[1], NULL, 0); > printf("%08x\n", asm_add(local_swab16(foo), 0)); > return 0; > } > > without optimization, or with noinline: > > # ./a.out 0x3412 > 00001234 > > with optimization: > > # ./a.out 0x3412 > 00341234 This looks like a gcc bug to me. Report it in the gcc bugzilla and see what they say. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 13:36 ` Maxime Bizon 2013-12-12 13:48 ` Måns Rullgård @ 2013-12-12 14:48 ` Russell King - ARM Linux 2013-12-12 15:00 ` Måns Rullgård 2013-12-12 15:04 ` Maxime Bizon 1 sibling, 2 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 14:48 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 02:36:30PM +0100, Maxime Bizon wrote: > > On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote: > > > Depends which swab16 you mean by "dumb swab16". If it is a gcc bug then > > that one: > > #define __swab16(x) ((uint16_t)( \ > (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \ > (((uint16_t)(x) & (uint16_t)0xff00U) >> 8))) > > usually expands to this: > > 24: e1a00800 lsl r0, r0, #16 > 28: e1a03c20 lsr r3, r0, #24 > 2c: e1833420 orr r3, r3, r0, lsr #8 > 30: e1a03803 lsl r3, r3, #16 > 34: e1a00823 lsr r0, r3, #16 > > but in my case, the two last shifts are missing. > > > you need to submit a bug report to gcc people. > > but is it for sure ? Well, in the above code, we're being quite explicit about wanting only bits 7-0 to be moved to bits 15-8, whereas what the compiler is actually doing is moving bits 15-0 to 23-8. In other words, the compiler has completely lost sight of that & 0x00ff in there. > I couldn't find any working gcc version so it does not look like a > regression, hence my doubt. Well, looking at the builds I have here, it seems that my gcc also produces wrong code here in __udp4_lib_rcv(), which makes me wonder how NFS works. Ah, that's how - in the standard kernel, it's not "len" which usually gets swabbed, it's the protocol ID - which for UDP is 17. __swab16(17) produces 0x1100 even with the bug. Even so, the code _is_ buggy, because if the protocol value had bits 15-8 set, then this would go wrong for all the same reasons that you've found. GCC is definitely ignoring the outter (uint16_t) cast in the above. So yes, afaics it's a GCC bug which appears to affect many GCC versions. The question is... what to do about this. I don't think we want to wait for a gcc version to be fixed... ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:48 ` Russell King - ARM Linux @ 2013-12-12 15:00 ` Måns Rullgård 2013-12-12 15:04 ` Maxime Bizon 1 sibling, 0 replies; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 15:00 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux <linux@arm.linux.org.uk> writes: > On Thu, Dec 12, 2013 at 02:36:30PM +0100, Maxime Bizon wrote: >> >> On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote: >> >> > Depends which swab16 you mean by "dumb swab16". If it is a gcc bug then >> >> that one: >> >> #define __swab16(x) ((uint16_t)( \ >> (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) | \ >> (((uint16_t)(x) & (uint16_t)0xff00U) >> 8))) >> >> usually expands to this: >> >> 24: e1a00800 lsl r0, r0, #16 >> 28: e1a03c20 lsr r3, r0, #24 >> 2c: e1833420 orr r3, r3, r0, lsr #8 >> 30: e1a03803 lsl r3, r3, #16 >> 34: e1a00823 lsr r0, r3, #16 >> >> but in my case, the two last shifts are missing. >> >> > you need to submit a bug report to gcc people. >> >> but is it for sure ? > > Well, in the above code, we're being quite explicit about wanting only > bits 7-0 to be moved to bits 15-8, whereas what the compiler is actually > doing is moving bits 15-0 to 23-8. > > In other words, the compiler has completely lost sight of that & 0x00ff > in there. > >> I couldn't find any working gcc version so it does not look like a >> regression, hence my doubt. > > Well, looking at the builds I have here, it seems that my gcc also > produces wrong code here in __udp4_lib_rcv(), which makes me wonder > how NFS works. Ah, that's how - in the standard kernel, it's not > "len" which usually gets swabbed, it's the protocol ID - which for > UDP is 17. __swab16(17) produces 0x1100 even with the bug. > > Even so, the code _is_ buggy, because if the protocol value had bits > 15-8 set, then this would go wrong for all the same reasons that > you've found. GCC is definitely ignoring the outter (uint16_t) cast > in the above. > > So yes, afaics it's a GCC bug which appears to affect many GCC versions. > > The question is... what to do about this. I don't think we want to wait > for a gcc version to be fixed... First we need to figure out exactly pattern triggers the bug. Once that is known, we can work around it. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 14:48 ` Russell King - ARM Linux 2013-12-12 15:00 ` Måns Rullgård @ 2013-12-12 15:04 ` Maxime Bizon 2013-12-12 15:41 ` Russell King - ARM Linux 1 sibling, 1 reply; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 15:04 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 14:48 +0000, Russell King - ARM Linux wrote: > Even so, the code _is_ buggy, because if the protocol value had bits > 15-8 set, then this would go wrong for all the same reasons that > you've found. GCC is definitely ignoring the outter (uint16_t) cast > in the above. ok I'll file a gcc bug report to get their input on this. -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:04 ` Maxime Bizon @ 2013-12-12 15:41 ` Russell King - ARM Linux 2013-12-12 16:04 ` Måns Rullgård ` (2 more replies) 0 siblings, 3 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 15:41 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 04:04:17PM +0100, Maxime Bizon wrote: > > On Thu, 2013-12-12 at 14:48 +0000, Russell King - ARM Linux wrote: > > > Even so, the code _is_ buggy, because if the protocol value had bits > > 15-8 set, then this would go wrong for all the same reasons that > > you've found. GCC is definitely ignoring the outter (uint16_t) cast > > in the above. > > ok I'll file a gcc bug report to get their input on this. Actually, it may not be a gcc bug at all. One of the issues here is that internally, GCC tracks the "machine mode" of a register, where "mode" basically means the type of register it is. In this case, the register will be in "half integer" mode when it is passed into the assembler, and gcc does _not_ promote it. We can see this in the RTL: (insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ]) (asm_operands:SI ("mov %0, %1") ("=&r") 0 [ (subreg:HI (reg:SI 148) 0) ] [ (asm_input:HI ("r") (null):0) ] [] ../t.c:12)) -1 (nil)) Note that "HI" for the input register 148. What this means is that the compiler "knows" that it's supposed to be a 16-bit quantity, and has recorded that fact, and _will_ truncate it down to 16-bit when it needs to be promoted. Since the asm() only asks for a "register", that's what we get - a register which _happens_ to be in HI mode containing the value. However, there's no way to tell that from the assembly code itself (GCC doesn't have the facility to do conditional assembly generation depending on the argument type passed into it.) What this means is that passing "unsigned short"s into an asm is dodgy. What's more, I've just had this confirmed by compiler people in ARM Ltd - we can't rely on the state of the upper 16 bits when passing a u16 into an asm(). So, it seems that it _is_ a kernel bug after all. As I said, we need to fix it in the kernel because of the huge number of GCC versions which show this behaviour _even_ if it turns out to be a GCC bug (which it seems it *isn't*.) ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:41 ` Russell King - ARM Linux @ 2013-12-12 16:04 ` Måns Rullgård 2013-12-12 16:04 ` Willy Tarreau 2013-12-12 17:34 ` Maxime Bizon 2 siblings, 0 replies; 37+ messages in thread From: Måns Rullgård @ 2013-12-12 16:04 UTC (permalink / raw) To: linux-arm-kernel Russell King - ARM Linux <linux@arm.linux.org.uk> writes: > On Thu, Dec 12, 2013 at 04:04:17PM +0100, Maxime Bizon wrote: >> >> On Thu, 2013-12-12 at 14:48 +0000, Russell King - ARM Linux wrote: >> >> > Even so, the code _is_ buggy, because if the protocol value had bits >> > 15-8 set, then this would go wrong for all the same reasons that >> > you've found. GCC is definitely ignoring the outter (uint16_t) cast >> > in the above. >> >> ok I'll file a gcc bug report to get their input on this. > > Actually, it may not be a gcc bug at all. > > One of the issues here is that internally, GCC tracks the "machine mode" > of a register, where "mode" basically means the type of register it is. > In this case, the register will be in "half integer" mode when it is > passed into the assembler, and gcc does _not_ promote it. > > We can see this in the RTL: > > (insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ]) > (asm_operands:SI ("mov %0, %1") ("=&r") 0 [ > (subreg:HI (reg:SI 148) 0) > ] > [ > (asm_input:HI ("r") (null):0) > ] > [] ../t.c:12)) -1 (nil)) > > Note that "HI" for the input register 148. What this means is that the > compiler "knows" that it's supposed to be a 16-bit quantity, and has > recorded that fact, and _will_ truncate it down to 16-bit when it needs > to be promoted. > > Since the asm() only asks for a "register", that's what we get - a > register which _happens_ to be in HI mode containing the value. However, > there's no way to tell that from the assembly code itself (GCC doesn't > have the facility to do conditional assembly generation depending on > the argument type passed into it.) > > What this means is that passing "unsigned short"s into an asm is dodgy. If that's the intended behaviour, the compiler really ought to error or warn when this happens. It's far too easy to accidentally fall foul of this. > What's more, I've just had this confirmed by compiler people in ARM Ltd - > we can't rely on the state of the upper 16 bits when passing a u16 into > an asm(). So, it seems that it _is_ a kernel bug after all. With that being the official stance, the proper fix appears to be to explicitly cast all 8/16-bit asm operands to a 32-bit type. Good luck finding them. -- M?ns Rullg?rd mans at mansr.com ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:41 ` Russell King - ARM Linux 2013-12-12 16:04 ` Måns Rullgård @ 2013-12-12 16:04 ` Willy Tarreau 2013-12-12 16:47 ` Russell King - ARM Linux 2013-12-12 17:34 ` Maxime Bizon 2 siblings, 1 reply; 37+ messages in thread From: Willy Tarreau @ 2013-12-12 16:04 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 03:41:10PM +0000, Russell King - ARM Linux wrote: > One of the issues here is that internally, GCC tracks the "machine mode" > of a register, where "mode" basically means the type of register it is. > In this case, the register will be in "half integer" mode when it is > passed into the assembler, and gcc does _not_ promote it. > > We can see this in the RTL: > > (insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ]) > (asm_operands:SI ("mov %0, %1") ("=&r") 0 [ > (subreg:HI (reg:SI 148) 0) > ] > [ > (asm_input:HI ("r") (null):0) > ] > [] ../t.c:12)) -1 (nil)) > > Note that "HI" for the input register 148. What this means is that the > compiler "knows" that it's supposed to be a 16-bit quantity, and has > recorded that fact, and _will_ truncate it down to 16-bit when it needs > to be promoted. > > Since the asm() only asks for a "register", that's what we get - a > register which _happens_ to be in HI mode containing the value. However, > there's no way to tell that from the assembly code itself (GCC doesn't > have the facility to do conditional assembly generation depending on > the argument type passed into it.) OK so that means that it's just like having true 16-bit registers except that it's just a temporary representation and not a feature of the CPU. The compiler has the right to expect the asm instructions to only use the lower 16 bits and has no way to verify that. Then changing the type of the function argument would probably be safer! Regards, Willy ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 16:04 ` Willy Tarreau @ 2013-12-12 16:47 ` Russell King - ARM Linux 2013-12-12 17:11 ` Willy Tarreau 2013-12-12 22:30 ` Maxime Bizon 0 siblings, 2 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 16:47 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 05:04:26PM +0100, Willy Tarreau wrote: > On Thu, Dec 12, 2013 at 03:41:10PM +0000, Russell King - ARM Linux wrote: > > One of the issues here is that internally, GCC tracks the "machine mode" > > of a register, where "mode" basically means the type of register it is. > > In this case, the register will be in "half integer" mode when it is > > passed into the assembler, and gcc does _not_ promote it. > > > > We can see this in the RTL: > > > > (insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ]) > > (asm_operands:SI ("mov %0, %1") ("=&r") 0 [ > > (subreg:HI (reg:SI 148) 0) > > ] > > [ > > (asm_input:HI ("r") (null):0) > > ] > > [] ../t.c:12)) -1 (nil)) > > > > Note that "HI" for the input register 148. What this means is that the > > compiler "knows" that it's supposed to be a 16-bit quantity, and has > > recorded that fact, and _will_ truncate it down to 16-bit when it needs > > to be promoted. > > > > Since the asm() only asks for a "register", that's what we get - a > > register which _happens_ to be in HI mode containing the value. However, > > there's no way to tell that from the assembly code itself (GCC doesn't > > have the facility to do conditional assembly generation depending on > > the argument type passed into it.) > > OK so that means that it's just like having true 16-bit registers except > that it's just a temporary representation and not a feature of the CPU. > The compiler has the right to expect the asm instructions to only use > the lower 16 bits and has no way to verify that. Quite. > Then changing the type of the function argument would probably be safer! Actually, I think we can do a bit better with this code. We really don't need much of this messing around here, we can combine some of these steps. We have: 16-bit protocol in host endian 16-bit length in host endian and we need to combine them into a 32-bit checksum which is then subsequently folded down to 16-bits by adding the top and bottom halves. Now, what we can do is this: 1. Construct a combined 32-bit protocol and length: unsigned lenproto = len | proto << 16; 2. Pass this into the assembly thusly: __asm__( "adds %0, %1, %2 @ csum_tcpudp_nofold \n\t" "adcs %0, %0, %3 \n\t" #ifdef __ARMEB__ "adcs %0, %0, %4 \n\t" #else "adcs %0, %0, %4, ror #8 \n\t" #endif "adc %0, %0, #0" : "=&r"(sum) : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot) : "cc"); with no swabbing@this stage. Well, where do we get the endian conversion? See that ror #8 - that a 32 bit rotate by 8 bits. As these are two 16-bit quantities, we end up with this: original: 31..24 23..16 15..8 7..0 len_h len_l pro_h pro_l accumulated: 31..24 23..16 15..8 7..0 pro_l len_h len_l pro_h And now when we fold it down to 16-bit: 15..8 7..0 len_l pro_h pro_l len_h There we go - the low and high bytes end up correctly placed. Now, the advantage to having "len" and "proto" combined by the compiler is that the compiler itself can make the decision about how to combine these two 16-bit HI-mode quantities in the most efficient way. Here's some examples from udp.c - I have another optimisation in here too which makes this code sequence slightly shorter. Some intermediary instructions have been omitted. mov r6, r6, asl #16 @ tmp204, len, mov r6, r6, lsr #16 @ tmp203, tmp204, orr r6, r6, #1114112 @ tmp205, tmp203, @ 92 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1 adds r3, r8, r7 @ csum_tcpudp_nofold0 @ sum, dst, src adcs r3, r3, r6, ror #8 @ sum, tmp205 adc r3, r3, #0 @ sum mov r6, r6, asl #16 @ tmp220, len, mov r6, r6, lsr #16 @ tmp219, tmp220, orr r6, r6, #1114112 @ tmp221, tmp219, @ 104 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1 adds r3, r0, r8 @ csum_tcpudp_nofold @ sum,, dst adcs r3, r3, r7 @ sum, src adcs r3, r3, r6, ror #8 @ sum, tmp221 adc r3, r3, #0 @ sum Note this one sourcing an 8-bit protocol value. ldrb r3, [r7, #269] @ zero_extendqisi2 @ tmp321, sk_5->sk_protocol orr sl, sl, r3, asl #16 @, tmp324, D.48540, tmp321, @ 104 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1 adds r3, r0, r2 @ csum_tcpudp_nofold @ sum, csum, fl4_19(D)->daddr adcs r3, r3, r1 @ sum, fl4_19(D)->saddr adcs r3, r3, sl, ror #8 @ sum, tmp324 adc r3, r3, #0 @ sum ldrh lr, [r4, #76] @ tmp503, skb_10(D)->len orr lr, lr, r7, asl #16 @, tmp505, tmp503, proto, @ 104 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1 adds r1, sl, r0 @ csum_tcpudp_nofold @ sum, skb_10(D)->D.22802.csum, iph_354->daddr adcs r1, r1, ip @ sum, iph_354->saddr adcs r1, r1, lr, ror #8 @ sum, tmp505 adc r1, r1, #0 @ sum ldrh r0, [r4, #76] @ tmp529, skb_10(D)->len orr r0, r0, r7, asl #16 @, tmp531, tmp529, proto, @ 92 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1 adds r3, r2, r1 @ csum_tcpudp_nofold0 @ sum, iph_354->daddr, iph_354->saddr adcs r3, r3, r0, ror #8 @ sum, tmp531 adc r3, r3, #0 @ sum This one is a little more complicated because it uses skb->len - we can see the narrowing cast being performed: ldr r0, [r4, #76] @ skb_1->len, skb_1->len rsb r0, r9, r0 @ tmp335, D.53494, skb_1->len mov r0, r0, asl #16 @ tmp337, tmp335, mov r0, r0, lsr #16 @ tmp336, tmp337, orr r0, r0, #1114112 @ tmp338, tmp336, @ 92 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1 adds r3, r2, r1 @ csum_tcpudp_nofold0 @ sum, iph_252->daddr, iph_252->saddr adcs r3, r3, r0, ror #8 @ sum, tmp338 adc r3, r3, #0 @ sum Looking at whether it's better to shift len or proto to the upper 16 bits reveals no real advantage to either: we end up with the same overall number of instructions between the UDP and TCP code combined, individually there's only one line between them. So here's a patch for Maxime to try - I've not run it yet but it seems to do the right thing by code inspection. diff --git a/arch/arm/include/asm/checksum.h b/arch/arm/include/asm/checksum.h index 6dcc16430868..523315115478 100644 --- a/arch/arm/include/asm/checksum.h +++ b/arch/arm/include/asm/checksum.h @@ -87,19 +87,34 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr, unsigned short len, unsigned short proto, __wsum sum) { - __asm__( - "adds %0, %1, %2 @ csum_tcpudp_nofold \n\ - adcs %0, %0, %3 \n" + u32 lenprot = len | proto << 16; + + if (__builtin_constant_p(sum) && sum == 0) { + __asm__( + "adds %0, %1, %2 @ csum_tcpudp_nofold0 \n\t" #ifdef __ARMEB__ - "adcs %0, %0, %4 \n" + "adcs %0, %0, %3 \n\t" #else - "adcs %0, %0, %4, lsl #8 \n" + "adcs %0, %0, %3, ror #8 \n\t" #endif - "adcs %0, %0, %5 \n\ - adc %0, %0, #0" - : "=&r"(sum) - : "r" (sum), "r" (daddr), "r" (saddr), "r" (len), "Ir" (htons(proto)) - : "cc"); + "adc %0, %0, #0" + : "=&r" (sum) + : "r" (daddr), "r" (saddr), "r" (lenprot) + : "cc"); + } else { + __asm__( + "adds %0, %1, %2 @ csum_tcpudp_nofold \n\t" + "adcs %0, %0, %3 \n\t" +#ifdef __ARMEB__ + "adcs %0, %0, %4 \n\t" +#else + "adcs %0, %0, %4, ror #8 \n\t" +#endif + "adc %0, %0, #0" + : "=&r"(sum) + : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot) + : "cc"); + } return sum; } /* ^ permalink raw reply related [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 16:47 ` Russell King - ARM Linux @ 2013-12-12 17:11 ` Willy Tarreau 2013-12-12 17:20 ` Russell King - ARM Linux 2013-12-12 22:30 ` Maxime Bizon 1 sibling, 1 reply; 37+ messages in thread From: Willy Tarreau @ 2013-12-12 17:11 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 04:47:48PM +0000, Russell King - ARM Linux wrote: > > Then changing the type of the function argument would probably be safer! > > Actually, I think we can do a bit better with this code. We really don't > need much of this messing around here, we can combine some of these steps. > > We have: > > 16-bit protocol in host endian > 16-bit length in host endian > > and we need to combine them into a 32-bit checksum which is then > subsequently folded down to 16-bits by adding the top and bottom halves. > > Now, what we can do is this: > > 1. Construct a combined 32-bit protocol and length: > > unsigned lenproto = len | proto << 16; > > 2. Pass this into the assembly thusly: > > __asm__( > "adds %0, %1, %2 @ csum_tcpudp_nofold \n\t" > "adcs %0, %0, %3 \n\t" > #ifdef __ARMEB__ > "adcs %0, %0, %4 \n\t" > #else > "adcs %0, %0, %4, ror #8 \n\t" > #endif > "adc %0, %0, #0" > : "=&r"(sum) > : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot) > : "cc"); > > with no swabbing at this stage. Well, where do we get the endian > conversion? See that ror #8 - that a 32 bit rotate by 8 bits. As > these are two 16-bit quantities, we end up with this: > > original: > 31..24 23..16 15..8 7..0 > len_h len_l pro_h pro_l > > accumulated: > 31..24 23..16 15..8 7..0 > pro_l len_h len_l pro_h > > And now when we fold it down to 16-bit: > > 15..8 7..0 > len_l pro_h > pro_l len_h Amusing, I've used the same optimization yesterday when computing a TCP pseudo-header checksum. Another thing that can be done to improve the folding of the 16-bit checksum is to swap the values to be added, sum them and only keep the high half integer which already contains the carry. At least on x86 I save some cycles doing this : 31:24 23:16 15:8 7:0 sum32 = D C B A To fold this into 16-bit at a time, I just do this : 31:24 23:16 15:8 7:0 sum32 D C B A + sum32swapped B A D C = A+B C+A+carry(B+D/C+A) B+D C+A so just take the upper result and you get the final 16-bit word at once. In C it does : fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16 When the CPU has a rotate instruction, it's fast :-) Cheers, Willy ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 17:11 ` Willy Tarreau @ 2013-12-12 17:20 ` Russell King - ARM Linux 2013-12-12 17:35 ` Willy Tarreau 2013-12-12 18:07 ` Nicolas Pitre 0 siblings, 2 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 17:20 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote: > Another thing that can be done to improve the folding of the 16-bit > checksum is to swap the values to be added, sum them and only keep > the high half integer which already contains the carry. At least on > x86 I save some cycles doing this : > > 31:24 23:16 15:8 7:0 > sum32 = D C B A > > To fold this into 16-bit at a time, I just do this : > > 31:24 23:16 15:8 7:0 > sum32 D C B A > + sum32swapped B A D C > = A+B C+A+carry(B+D/C+A) B+D C+A > > so just take the upper result and you get the final 16-bit word at > once. > > In C it does : > > fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16 > > When the CPU has a rotate instruction, it's fast :-) Indeed - and if your CPU can do the rotate and add at the same time, it's just a singe instruction, and it ends up looking remarkably similar to this: static inline __sum16 csum_fold(__wsum sum) { __asm__( "add %0, %1, %1, ror #16 @ csum_fold" : "=r" (sum) : "r" (sum) : "cc"); return (__force __sum16)(~(__force u32)sum >> 16); } ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 17:20 ` Russell King - ARM Linux @ 2013-12-12 17:35 ` Willy Tarreau 2013-12-12 18:07 ` Nicolas Pitre 1 sibling, 0 replies; 37+ messages in thread From: Willy Tarreau @ 2013-12-12 17:35 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 05:20:49PM +0000, Russell King - ARM Linux wrote: > On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote: > > Another thing that can be done to improve the folding of the 16-bit > > checksum is to swap the values to be added, sum them and only keep > > the high half integer which already contains the carry. At least on > > x86 I save some cycles doing this : > > > > 31:24 23:16 15:8 7:0 > > sum32 = D C B A > > > > To fold this into 16-bit at a time, I just do this : > > > > 31:24 23:16 15:8 7:0 > > sum32 D C B A > > + sum32swapped B A D C > > = A+B C+A+carry(B+D/C+A) B+D C+A > > > > so just take the upper result and you get the final 16-bit word at > > once. > > > > In C it does : > > > > fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16 > > > > When the CPU has a rotate instruction, it's fast :-) > > Indeed - and if your CPU can do the rotate and add at the same time, > it's just a singe instruction, and it ends up looking remarkably > similar to this: > > static inline __sum16 csum_fold(__wsum sum) > { > __asm__( > "add %0, %1, %1, ror #16 @ csum_fold" > : "=r" (sum) > : "r" (sum) > : "cc"); > return (__force __sum16)(~(__force u32)sum >> 16); > } Marvelous :-) Willy ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 17:20 ` Russell King - ARM Linux 2013-12-12 17:35 ` Willy Tarreau @ 2013-12-12 18:07 ` Nicolas Pitre 1 sibling, 0 replies; 37+ messages in thread From: Nicolas Pitre @ 2013-12-12 18:07 UTC (permalink / raw) To: linux-arm-kernel On Thu, 12 Dec 2013, Russell King - ARM Linux wrote: > static inline __sum16 csum_fold(__wsum sum) > { > __asm__( > "add %0, %1, %1, ror #16 @ csum_fold" > : "=r" (sum) > : "r" (sum) > : "cc"); > return (__force __sum16)(~(__force u32)sum >> 16); > } There is no longer any flag updates so you should remove "cc" from the clobber list as well. Nicolas ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 16:47 ` Russell King - ARM Linux 2013-12-12 17:11 ` Willy Tarreau @ 2013-12-12 22:30 ` Maxime Bizon 2013-12-12 22:36 ` Russell King - ARM Linux 1 sibling, 1 reply; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 22:30 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 16:47 +0000, Russell King - ARM Linux wrote: > > So here's a patch for Maxime to try - I've not run it yet but it seems > to do the right thing by code inspection. Kernel built with this patch seems to work fine. Also I called it manually with a few test patterns and the results are correct. Tested-by: Maxime Bizon <mbizon@freebox.fr> Thanks Russell, -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 22:30 ` Maxime Bizon @ 2013-12-12 22:36 ` Russell King - ARM Linux 2013-12-12 22:44 ` Maxime Bizon 0 siblings, 1 reply; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 22:36 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 11:30:02PM +0100, Maxime Bizon wrote: > > On Thu, 2013-12-12 at 16:47 +0000, Russell King - ARM Linux wrote: > > > > So here's a patch for Maxime to try - I've not run it yet but it seems > > to do the right thing by code inspection. > > Kernel built with this patch seems to work fine. Also I called it > manually with a few test patterns and the results are correct. > > Tested-by: Maxime Bizon <mbizon@freebox.fr> > > Thanks Russell, Thanks Maxime. Does this bug show up in existing unmodified mainline kernels? ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 22:36 ` Russell King - ARM Linux @ 2013-12-12 22:44 ` Maxime Bizon 2013-12-12 22:48 ` Russell King - ARM Linux 0 siblings, 1 reply; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 22:44 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 22:36 +0000, Russell King - ARM Linux wrote: > Does this bug show up in existing unmodified mainline kernels? not for me, I found this while writing new code. though all possible kernel callers are not enabled in my .config. -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 22:44 ` Maxime Bizon @ 2013-12-12 22:48 ` Russell King - ARM Linux 0 siblings, 0 replies; 37+ messages in thread From: Russell King - ARM Linux @ 2013-12-12 22:48 UTC (permalink / raw) To: linux-arm-kernel On Thu, Dec 12, 2013 at 11:44:05PM +0100, Maxime Bizon wrote: > > On Thu, 2013-12-12 at 22:36 +0000, Russell King - ARM Linux wrote: > > > Does this bug show up in existing unmodified mainline kernels? > > not for me, I found this while writing new code. > > though all possible kernel callers are not enabled in my .config. Okay, as we have no reports so far of this affecting anything in any mainline kernel, I think I'm going to queue the fix up for v3.14 and not -rc - we know that although the assembly is not correct in the IP code, we know that it's completely harmless there because the overflowed byte is always zero. So... I'll queue it for v3.14. If anyone does find it affects existing kernels, we can always apply it for -rc and stable kernels. Thanks for your patience Maxime. ^ permalink raw reply [flat|nested] 37+ messages in thread
* gcc miscompiles csum_tcpudp_magic() on ARMv5 2013-12-12 15:41 ` Russell King - ARM Linux 2013-12-12 16:04 ` Måns Rullgård 2013-12-12 16:04 ` Willy Tarreau @ 2013-12-12 17:34 ` Maxime Bizon 2 siblings, 0 replies; 37+ messages in thread From: Maxime Bizon @ 2013-12-12 17:34 UTC (permalink / raw) To: linux-arm-kernel On Thu, 2013-12-12 at 15:41 +0000, Russell King - ARM Linux wrote: > > What's more, I've just had this confirmed by compiler people in ARM > Ltd - we can't rely on the state of the upper 16 bits when passing a > u16 into an asm(). So, it seems that it _is_ a kernel bug after all. > ok, I was not crazy thanks for the deeper analysis, I will test your patch ASAP -- Maxime ^ permalink raw reply [flat|nested] 37+ messages in thread
end of thread, other threads:[~2013-12-12 22:48 UTC | newest] Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2013-12-12 12:14 gcc miscompiles csum_tcpudp_magic() on ARMv5 Maxime Bizon 2013-12-12 12:40 ` Russell King - ARM Linux 2013-12-12 13:36 ` Maxime Bizon 2013-12-12 13:48 ` Måns Rullgård 2013-12-12 14:10 ` Maxime Bizon 2013-12-12 14:19 ` Willy Tarreau 2013-12-12 14:28 ` Maxime Bizon 2013-12-12 14:42 ` Måns Rullgård 2013-12-12 14:52 ` Maxime Bizon 2013-12-12 14:58 ` Måns Rullgård 2013-12-12 15:00 ` Russell King - ARM Linux 2013-12-12 15:26 ` Maxime Bizon 2013-12-12 15:07 ` Willy Tarreau 2013-12-12 15:18 ` Måns Rullgård 2013-12-12 15:28 ` Willy Tarreau 2013-12-12 15:43 ` Russell King - ARM Linux 2013-12-12 15:50 ` Måns Rullgård 2013-12-12 14:37 ` Måns Rullgård 2013-12-12 14:40 ` Maxime Bizon 2013-12-12 14:47 ` Måns Rullgård 2013-12-12 14:26 ` Måns Rullgård 2013-12-12 14:48 ` Russell King - ARM Linux 2013-12-12 15:00 ` Måns Rullgård 2013-12-12 15:04 ` Maxime Bizon 2013-12-12 15:41 ` Russell King - ARM Linux 2013-12-12 16:04 ` Måns Rullgård 2013-12-12 16:04 ` Willy Tarreau 2013-12-12 16:47 ` Russell King - ARM Linux 2013-12-12 17:11 ` Willy Tarreau 2013-12-12 17:20 ` Russell King - ARM Linux 2013-12-12 17:35 ` Willy Tarreau 2013-12-12 18:07 ` Nicolas Pitre 2013-12-12 22:30 ` Maxime Bizon 2013-12-12 22:36 ` Russell King - ARM Linux 2013-12-12 22:44 ` Maxime Bizon 2013-12-12 22:48 ` Russell King - ARM Linux 2013-12-12 17:34 ` Maxime Bizon
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.