All of lore.kernel.org
 help / color / mirror / Atom feed
* gcc miscompiles csum_tcpudp_magic() on ARMv5
@ 2013-12-12 12:14 Maxime Bizon
  2013-12-12 12:40 ` Russell King - ARM Linux
  0 siblings, 1 reply; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 12:14 UTC (permalink / raw)
  To: linux-arm-kernel


Hello,

I tried using csum_tcpudp_magic() like this:

csum_tcpudp_magic(src, dst, ntohs(udph->len), IPPROTO_UDP, csum);

instead of the more common:

len = ntohs(udhp->len);
[...]
csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum);

the first one gives a bad checksum while the second one is ok.


I've tracked down the problem to csum_tcpudp_nofold(), which uses inline
asm and an unsigned short for len.

If the len value is say 0x3412, and I pass ntohs(len), then the assigned
register for len contains 0x00341234 instead of the expected 0x1234.

The ntohs() expand to a dumb swab16 on my arch, and gcc does the swap
but does not clear the high nibble, I guess it expects the assembly code
to only use the low 16 bits.

Is there a missing constraint or gcc is doing something wrong here ?

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 12:14 gcc miscompiles csum_tcpudp_magic() on ARMv5 Maxime Bizon
@ 2013-12-12 12:40 ` Russell King - ARM Linux
  2013-12-12 13:36   ` Maxime Bizon
  0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 12:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 01:14:04PM +0100, Maxime Bizon wrote:
> 
> Hello,
> 
> I tried using csum_tcpudp_magic() like this:
> 
> csum_tcpudp_magic(src, dst, ntohs(udph->len), IPPROTO_UDP, csum);
> 
> instead of the more common:
> 
> len = ntohs(udhp->len);
> [...]
> csum_tcpudp_magic(src, dst, len, IPPROTO_UDP, csum);
> 
> the first one gives a bad checksum while the second one is ok.
> 
> 
> I've tracked down the problem to csum_tcpudp_nofold(), which uses inline
> asm and an unsigned short for len.
> 
> If the len value is say 0x3412, and I pass ntohs(len), then the assigned
> register for len contains 0x00341234 instead of the expected 0x1234.
> 
> The ntohs() expand to a dumb swab16 on my arch, and gcc does the swap
> but does not clear the high nibble, I guess it expects the assembly code
> to only use the low 16 bits.
> 
> Is there a missing constraint or gcc is doing something wrong here ?

Depends which swab16 you mean by "dumb swab16".  If it is a gcc bug then
you need to submit a bug report to gcc people.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 12:40 ` Russell King - ARM Linux
@ 2013-12-12 13:36   ` Maxime Bizon
  2013-12-12 13:48     ` Måns Rullgård
  2013-12-12 14:48     ` Russell King - ARM Linux
  0 siblings, 2 replies; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 13:36 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote:

> Depends which swab16 you mean by "dumb swab16".  If it is a gcc bug then

that one:

#define __swab16(x) ((uint16_t)(                                      \
        (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) |                  \
        (((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))

usually expands to this:

  24:	e1a00800 	lsl	r0, r0, #16
  28:	e1a03c20 	lsr	r3, r0, #24
  2c:	e1833420 	orr	r3, r3, r0, lsr #8
  30:	e1a03803 	lsl	r3, r3, #16
  34:	e1a00823 	lsr	r0, r3, #16

but in my case, the two last shifts are missing.

> you need to submit a bug report to gcc people.

but is it for sure ?

I couldn't find any working gcc version so it does not look like a
regression, hence my doubt.

basically if you use inline asm with a variable that is smaller than
register width (32 bits here), can you assume the value in the register
will be zero extended ? I could not find the answer in the gcc manual.

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 13:36   ` Maxime Bizon
@ 2013-12-12 13:48     ` Måns Rullgård
  2013-12-12 14:10       ` Maxime Bizon
  2013-12-12 14:48     ` Russell King - ARM Linux
  1 sibling, 1 reply; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 13:48 UTC (permalink / raw)
  To: linux-arm-kernel

Maxime Bizon <mbizon@freebox.fr> writes:

> On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote:
>
>> Depends which swab16 you mean by "dumb swab16".  If it is a gcc bug then
>
> that one:
>
> #define __swab16(x) ((uint16_t)(                                      \
>         (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) |                  \
>         (((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))
>
> usually expands to this:
>
>   24:	e1a00800 	lsl	r0, r0, #16
>   28:	e1a03c20 	lsr	r3, r0, #24
>   2c:	e1833420 	orr	r3, r3, r0, lsr #8
>   30:	e1a03803 	lsl	r3, r3, #16
>   34:	e1a00823 	lsr	r0, r3, #16
>
> but in my case, the two last shifts are missing.
>
>> you need to submit a bug report to gcc people.
>
> but is it for sure ?
>
> I couldn't find any working gcc version so it does not look like a
> regression, hence my doubt.
>
> basically if you use inline asm with a variable that is smaller than
> register width (32 bits here), can you assume the value in the register
> will be zero extended ? I could not find the answer in the gcc manual.

In the code above, the outer (uint16_t) cast should clear the top half,
as should passing the value to a function (inline doesn't alter the
semantics) as a 16-bit type, so there's something fishy here.

Which gcc versions did you try?

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 13:48     ` Måns Rullgård
@ 2013-12-12 14:10       ` Maxime Bizon
  2013-12-12 14:19         ` Willy Tarreau
  2013-12-12 14:26         ` Måns Rullgård
  0 siblings, 2 replies; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 14:10 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 13:48 +0000, M?ns Rullg?rd wrote:

> In the code above, the outer (uint16_t) cast should clear the top half,
> as should passing the value to a function (inline doesn't alter the
> semantics) as a 16-bit type, so there's something fishy here.

using __attribute__((noinline)), or putting the function in another file
makes the bug disappear

But I'm not convinced inline doesn't change the semantic, since gcc is
merging the function inside another one the rules of calling convention
should not matter anymore.

I attached a second test case without a separate function that has the
same bug.

> Which gcc versions did you try?

4.3.2, 4.6.0, 4.7.2, 4.8-2013 (linaro)

Here is a simple userspace test case.


#include <stdint.h>
#include <stdio.h>

static inline uint32_t asm_add(uint16_t len, uint32_t sum)
{
        __asm__(
        "add    %0, %1, %2 \n"
        : "=&r"(sum)
        : "r" (sum), "r" (len)
        );
        return sum;
}

#define local_swab16(x) ((uint16_t)(                                    \
        (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) |                  \
        (((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))

int main(int argc, char *argv[])
{
        uint16_t foo;

        foo = strtoul(argv[1], NULL, 0);
        printf("%08x\n", asm_add(local_swab16(foo), 0));
        return 0;
}

without optimization, or with noinline:

# ./a.out 0x3412
00001234


with optimization:

# ./a.out 0x3412
00341234


And the second test case without the inline function.

int main(int argc, char *argv[])
{
        uint32_t sum = 0;
        uint16_t foo;

        foo = strtoul(argv[1], NULL, 0);

        __asm__ (
        "add    %0, %1, %2 \n"
        : "=&r"(sum)
        : "r" (sum), "r" (local_swab16(foo))
        );

        printf("%08x\n", sum);
        return 0;
}


-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:10       ` Maxime Bizon
@ 2013-12-12 14:19         ` Willy Tarreau
  2013-12-12 14:28           ` Maxime Bizon
  2013-12-12 14:37           ` Måns Rullgård
  2013-12-12 14:26         ` Måns Rullgård
  1 sibling, 2 replies; 37+ messages in thread
From: Willy Tarreau @ 2013-12-12 14:19 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Maxime,

On Thu, Dec 12, 2013 at 03:10:10PM +0100, Maxime Bizon wrote:
> 
> On Thu, 2013-12-12 at 13:48 +0000, M?ns Rullg?rd wrote:
> 
> > In the code above, the outer (uint16_t) cast should clear the top half,
> > as should passing the value to a function (inline doesn't alter the
> > semantics) as a 16-bit type, so there's something fishy here.
> 
> using __attribute__((noinline)), or putting the function in another file
> makes the bug disappear
> 
> But I'm not convinced inline doesn't change the semantic, since gcc is
> merging the function inside another one the rules of calling convention
> should not matter anymore.

I disagree here, since gcc may decide by itself to inline or not, it must
not impact the validity of the emitted code. Inline functions have input
and output types for a reason!

(...)
> #include <stdint.h>
> #include <stdio.h>
> 
> static inline uint32_t asm_add(uint16_t len, uint32_t sum)
> {
>         __asm__(
>         "add    %0, %1, %2 \n"
>         : "=&r"(sum)
>         : "r" (sum), "r" (len)
>         );
>         return sum;
> }

Hmmm aren't you passing a 16-bit register directly to the ASM for being used
as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how
you're going to use the register.

Could you check if that fixes it :

 static inline uint32_t asm_add(uint16_t len, uint32_t sum)
 {
         uint32_t len32 = len;

         __asm__(
         "add    %0, %1, %2 \n"
         : "=&r"(sum)
         : "r" (sum), "r" (len32)
         );
         return sum;
 }

Or maybe simply :

 static inline uint32_t asm_add(uint16_t len, uint32_t sum)
 {
         __asm__(
         "add    %0, %1, %2 \n"
         : "=&r"(sum)
         : "r" (sum), "r" (uint32_t)(len)
         );
         return sum;
 }

Willy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:10       ` Maxime Bizon
  2013-12-12 14:19         ` Willy Tarreau
@ 2013-12-12 14:26         ` Måns Rullgård
  1 sibling, 0 replies; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 14:26 UTC (permalink / raw)
  To: linux-arm-kernel

Maxime Bizon <mbizon@freebox.fr> writes:

> On Thu, 2013-12-12 at 13:48 +0000, M?ns Rullg?rd wrote:
>
>> In the code above, the outer (uint16_t) cast should clear the top half,
>> as should passing the value to a function (inline doesn't alter the
>> semantics) as a 16-bit type, so there's something fishy here.
>
> using __attribute__((noinline)), or putting the function in another file
> makes the bug disappear
>
> But I'm not convinced inline doesn't change the semantic, since gcc is
> merging the function inside another one the rules of calling convention
> should not matter anymore.

Function inlining is just an optimisation and like any optimisation it
must not alter the semantics of the program.  What happens inside the
function also should not matter.

> I attached a second test case without a separate function that has the
> same bug.
>
>> Which gcc versions did you try?
>
> 4.3.2, 4.6.0, 4.7.2, 4.8-2013 (linaro)

That's a decent spread.  Recently, I don't recall the exact version, gcc
started recognising typical byte-reverse patters, which could have been
related to this, but since it goes back to 4.3, that seems not to be the
case.

> Here is a simple userspace test case.
>
> #include <stdint.h>
> #include <stdio.h>
>
> static inline uint32_t asm_add(uint16_t len, uint32_t sum)
> {
>         __asm__(
>         "add    %0, %1, %2 \n"
>         : "=&r"(sum)
>         : "r" (sum), "r" (len)
>         );
>         return sum;
> }
>
> #define local_swab16(x) ((uint16_t)(                                    \
>         (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) |                  \
>         (((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))
>
> int main(int argc, char *argv[])
> {
>         uint16_t foo;
>
>         foo = strtoul(argv[1], NULL, 0);
>         printf("%08x\n", asm_add(local_swab16(foo), 0));
>         return 0;
> }
>
> without optimization, or with noinline:
>
> # ./a.out 0x3412
> 00001234
>
> with optimization:
>
> # ./a.out 0x3412
> 00341234

This looks like a gcc bug to me.  Report it in the gcc bugzilla and
see what they say.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:19         ` Willy Tarreau
@ 2013-12-12 14:28           ` Maxime Bizon
  2013-12-12 14:42             ` Måns Rullgård
  2013-12-12 14:37           ` Måns Rullgård
  1 sibling, 1 reply; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 14:28 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 15:19 +0100, Willy Tarreau wrote:

> I disagree here, since gcc may decide by itself to inline or not, it must
> not impact the validity of the emitted code. Inline functions have input
> and output types for a reason!

but the code emitted is completely different for an inlined "occurence"
of the function and the non-inlined case.

if a function does not use one of its argument, and it is inlined, then
in the resulting code you see no trace of it

> Hmmm aren't you passing a 16-bit register directly to the ASM for being used
> as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how
> you're going to use the register.

this is exactly what I'm complaining about, the arm code for
csum_tcpudp_nofold() in the kernel does exactly that.

> Could you check if that fixes it :
> 
>  static inline uint32_t asm_add(uint16_t len, uint32_t sum)
>  {
>          uint32_t len32 = len;

or change the asm_add() proto to take an "uint32_t len" instead, and yes
of course that fixes it.

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:19         ` Willy Tarreau
  2013-12-12 14:28           ` Maxime Bizon
@ 2013-12-12 14:37           ` Måns Rullgård
  2013-12-12 14:40             ` Maxime Bizon
  1 sibling, 1 reply; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 14:37 UTC (permalink / raw)
  To: linux-arm-kernel

Willy Tarreau <w@1wt.eu> writes:

>> #include <stdint.h>
>> #include <stdio.h>
>> 
>> static inline uint32_t asm_add(uint16_t len, uint32_t sum)
>> {
>>         __asm__(
>>         "add    %0, %1, %2 \n"
>>         : "=&r"(sum)
>>         : "r" (sum), "r" (len)
>>         );
>>         return sum;
>> }
>
> Hmmm aren't you passing a 16-bit register directly to the ASM for being used
> as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how
> you're going to use the register.

There is no such thing as a 16-bit register on ARM.  What this code does
is ask the compiler to take the 16-bit value 'len' and put it in a register
for use in the asm code.  All registers are 32-bit, so there is no ambiguity.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:37           ` Måns Rullgård
@ 2013-12-12 14:40             ` Maxime Bizon
  2013-12-12 14:47               ` Måns Rullgård
  0 siblings, 1 reply; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 14:40 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 14:37 +0000, M?ns Rullg?rd wrote:

> There is no such thing as a 16-bit register on ARM.  What this code does
> is ask the compiler to take the 16-bit value 'len' and put it in a register
> for use in the asm code.  All registers are 32-bit, so there is no ambiguity.

which strictly speaking the compiler did ;)

it took the 16 bits value and put it inside the lower 16 bits of the 32
bits register, it just didn't touch the higher ones

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:28           ` Maxime Bizon
@ 2013-12-12 14:42             ` Måns Rullgård
  2013-12-12 14:52               ` Maxime Bizon
  2013-12-12 15:07               ` Willy Tarreau
  0 siblings, 2 replies; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 14:42 UTC (permalink / raw)
  To: linux-arm-kernel

Maxime Bizon <mbizon@freebox.fr> writes:

> On Thu, 2013-12-12 at 15:19 +0100, Willy Tarreau wrote:
>
>> I disagree here, since gcc may decide by itself to inline or not, it must
>> not impact the validity of the emitted code. Inline functions have input
>> and output types for a reason!
>
> but the code emitted is completely different for an inlined "occurence"
> of the function and the non-inlined case.
>
> if a function does not use one of its argument, and it is inlined, then
> in the resulting code you see no trace of it

Again, that's an optimisation that does not alter the semantics of the code.
Although the generated code looks very different, it does the same thing.

>> Hmmm aren't you passing a 16-bit register directly to the ASM for being used
>> as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how
>> you're going to use the register.
>
> this is exactly what I'm complaining about, the arm code for
> csum_tcpudp_nofold() in the kernel does exactly that.
>
>> Could you check if that fixes it :
>> 
>>  static inline uint32_t asm_add(uint16_t len, uint32_t sum)
>>  {
>>          uint32_t len32 = len;
>
> or change the asm_add() proto to take an "uint32_t len" instead, and yes
> of course that fixes it.

It's a bug.  Please report it to the gcc developers.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:40             ` Maxime Bizon
@ 2013-12-12 14:47               ` Måns Rullgård
  0 siblings, 0 replies; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 14:47 UTC (permalink / raw)
  To: linux-arm-kernel

Maxime Bizon <mbizon@freebox.fr> writes:

> On Thu, 2013-12-12 at 14:37 +0000, M?ns Rullg?rd wrote:
>
>> There is no such thing as a 16-bit register on ARM.  What this code does
>> is ask the compiler to take the 16-bit value 'len' and put it in a register
>> for use in the asm code.  All registers are 32-bit, so there is no ambiguity.
>
> which strictly speaking the compiler did ;)
>
> it took the 16 bits value and put it inside the lower 16 bits of the 32
> bits register, it just didn't touch the higher ones

That's not where things went wrong.  The entire computation is done in
registers, and the swab16 macro casts the result to uint16_t.  This cast
should clear the high half of the register (that's the lsl/lsr pair you
saw), but for some reason this isn't happening.

Now please file a gcc bug report.  That's the proper place to be
discussing this.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 13:36   ` Maxime Bizon
  2013-12-12 13:48     ` Måns Rullgård
@ 2013-12-12 14:48     ` Russell King - ARM Linux
  2013-12-12 15:00       ` Måns Rullgård
  2013-12-12 15:04       ` Maxime Bizon
  1 sibling, 2 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 14:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 02:36:30PM +0100, Maxime Bizon wrote:
> 
> On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote:
> 
> > Depends which swab16 you mean by "dumb swab16".  If it is a gcc bug then
> 
> that one:
> 
> #define __swab16(x) ((uint16_t)(                                      \
>         (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) |                  \
>         (((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))
> 
> usually expands to this:
> 
>   24:	e1a00800 	lsl	r0, r0, #16
>   28:	e1a03c20 	lsr	r3, r0, #24
>   2c:	e1833420 	orr	r3, r3, r0, lsr #8
>   30:	e1a03803 	lsl	r3, r3, #16
>   34:	e1a00823 	lsr	r0, r3, #16
> 
> but in my case, the two last shifts are missing.
> 
> > you need to submit a bug report to gcc people.
> 
> but is it for sure ?

Well, in the above code, we're being quite explicit about wanting only
bits 7-0 to be moved to bits 15-8, whereas what the compiler is actually
doing is moving bits 15-0 to 23-8.

In other words, the compiler has completely lost sight of that & 0x00ff
in there.

> I couldn't find any working gcc version so it does not look like a
> regression, hence my doubt.

Well, looking at the builds I have here, it seems that my gcc also
produces wrong code here in __udp4_lib_rcv(), which makes me wonder
how NFS works.  Ah, that's how - in the standard kernel, it's not
"len" which usually gets swabbed, it's the protocol ID - which for
UDP is 17.  __swab16(17) produces 0x1100 even with the bug.

Even so, the code _is_ buggy, because if the protocol value had bits
15-8 set, then this would go wrong for all the same reasons that
you've found.  GCC is definitely ignoring the outter (uint16_t) cast
in the above.

So yes, afaics it's a GCC bug which appears to affect many GCC versions.

The question is... what to do about this.  I don't think we want to wait
for a gcc version to be fixed...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:42             ` Måns Rullgård
@ 2013-12-12 14:52               ` Maxime Bizon
  2013-12-12 14:58                 ` Måns Rullgård
  2013-12-12 15:00                 ` Russell King - ARM Linux
  2013-12-12 15:07               ` Willy Tarreau
  1 sibling, 2 replies; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 14:52 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 14:42 +0000, M?ns Rullg?rd wrote:
> 
> Again, that's an optimisation that does not alter the semantics of the
> code. Although the generated code looks very different, it does the
> same thing.
> 
It cannot do the same thing as there are possibly nothing to do after
inline.


static __attribute__((noinline)) unsigned int do_nothing(unsigned char foo)                                 
{
        foo += 42;
        return 0;
}

int func(int a)
{
        return do_nothing(a);
}

00000000 <do_nothing>:
   0:	e3a00000 	mov	r0, #0
   4:	e12fff1e 	bx	lr

00000008 <func>:
   8:	e52de004 	push	{lr}		; (str lr, [sp, #-4]!)
   c:	e24dd004 	sub	sp, sp, #4
  10:	e20000ff 	and	r0, r0, #255	; 0xff
  14:	ebfffff9 	bl	0 <do_nothing>
  18:	e28dd004 	add	sp, sp, #4
  1c:	e8bd8000 	ldmfd	sp!, {pc}


static inline unsigned int do_nothing(unsigned char foo)                                 
{
        foo += 42;
        return 0;
}

int func(int a)
{
        return do_nothing(a);
}


00000000 <func>:
   0:	e3a00000 	mov	r0, #0
   4:	e12fff1e 	bx	lr


In the first case, the compiler narrows "int a" to char and call the
uninlined function.

In the second case, there is absolutely no generated code to push any
arguments as the function that does nothing is inlined into func().
> 
-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:52               ` Maxime Bizon
@ 2013-12-12 14:58                 ` Måns Rullgård
  2013-12-12 15:00                 ` Russell King - ARM Linux
  1 sibling, 0 replies; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 14:58 UTC (permalink / raw)
  To: linux-arm-kernel

Maxime Bizon <mbizon@freebox.fr> writes:

> On Thu, 2013-12-12 at 14:42 +0000, M?ns Rullg?rd wrote:
>> 
>> Again, that's an optimisation that does not alter the semantics of the
>> code. Although the generated code looks very different, it does the
>> same thing.
>> 
> It cannot do the same thing as there are possibly nothing to do after
> inline.
>
> static __attribute__((noinline)) unsigned int do_nothing(unsigned char foo)                                 
> {
>         foo += 42;
>         return 0;
> }
>
> int func(int a)
> {
>         return do_nothing(a);
> }
>
> 00000000 <do_nothing>:
>    0:	e3a00000 	mov	r0, #0
>    4:	e12fff1e 	bx	lr
>
> 00000008 <func>:
>    8:	e52de004 	push	{lr}		; (str lr, [sp, #-4]!)
>    c:	e24dd004 	sub	sp, sp, #4
>   10:	e20000ff 	and	r0, r0, #255	; 0xff
>   14:	ebfffff9 	bl	0 <do_nothing>
>   18:	e28dd004 	add	sp, sp, #4
>   1c:	e8bd8000 	ldmfd	sp!, {pc}
>
> static inline unsigned int do_nothing(unsigned char foo)                                 
> {
>         foo += 42;
>         return 0;
> }
>
> int func(int a)
> {
>         return do_nothing(a);
> }
>
> 00000000 <func>:
>    0:	e3a00000 	mov	r0, #0
>    4:	e12fff1e 	bx	lr
>
> In the first case, the compiler narrows "int a" to char and call the
> uninlined function.
>
> In the second case, there is absolutely no generated code to push any
> arguments as the function that does nothing is inlined into func().

In both cases, the effects on the global state of the program are
exactly the same.  That's all that matters.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:52               ` Maxime Bizon
  2013-12-12 14:58                 ` Måns Rullgård
@ 2013-12-12 15:00                 ` Russell King - ARM Linux
  2013-12-12 15:26                   ` Maxime Bizon
  1 sibling, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 15:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 03:52:43PM +0100, Maxime Bizon wrote:
> 
> On Thu, 2013-12-12 at 14:42 +0000, M?ns Rullg?rd wrote:
> > 
> > Again, that's an optimisation that does not alter the semantics of the
> > code. Although the generated code looks very different, it does the
> > same thing.
> > 
> It cannot do the same thing as there are possibly nothing to do after
> inline.
> 
> 
> static __attribute__((noinline)) unsigned int do_nothing(unsigned char foo)                                 
> {
>         foo += 42;
>         return 0;
> }
> 
> int func(int a)
> {
>         return do_nothing(a);
> }
> 
> 00000000 <do_nothing>:
>    0:	e3a00000 	mov	r0, #0
>    4:	e12fff1e 	bx	lr
> 
> 00000008 <func>:
>    8:	e52de004 	push	{lr}		; (str lr, [sp, #-4]!)
>    c:	e24dd004 	sub	sp, sp, #4
>   10:	e20000ff 	and	r0, r0, #255	; 0xff
>   14:	ebfffff9 	bl	0 <do_nothing>
>   18:	e28dd004 	add	sp, sp, #4
>   1c:	e8bd8000 	ldmfd	sp!, {pc}
> 
> 
> static inline unsigned int do_nothing(unsigned char foo)                                 
> {
>         foo += 42;
>         return 0;
> }
> 
> int func(int a)
> {
>         return do_nothing(a);
> }
> 
> 
> 00000000 <func>:
>    0:	e3a00000 	mov	r0, #0
>    4:	e12fff1e 	bx	lr
> 
> 
> In the first case, the compiler narrows "int a" to char and call the
> uninlined function.
> 
> In the second case, there is absolutely no generated code to push any
> arguments as the function that does nothing is inlined into func().

This is different - the compiler has recognised in both cases that the
addition od 42 to foo is useless as the result is not used, and therefore
has optimised the addition away.  In the second case, it has realised that
the narrowing cast used to then add 42 to is also not used, and it has
also optimised that away.

A better test case would be do to do this:

	foo += 42;
	return foo;

so that "foo" is actually used.  Or, if you don't feel happy with that:

extern void use_result(unsigned int);

	foo += 42;
	use_result(foo);
	return 0;

so that the compiler can't decide that 'foo' is never used.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:48     ` Russell King - ARM Linux
@ 2013-12-12 15:00       ` Måns Rullgård
  2013-12-12 15:04       ` Maxime Bizon
  1 sibling, 0 replies; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 15:00 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

> On Thu, Dec 12, 2013 at 02:36:30PM +0100, Maxime Bizon wrote:
>> 
>> On Thu, 2013-12-12 at 12:40 +0000, Russell King - ARM Linux wrote:
>> 
>> > Depends which swab16 you mean by "dumb swab16".  If it is a gcc bug then
>> 
>> that one:
>> 
>> #define __swab16(x) ((uint16_t)(                                      \
>>         (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) |                  \
>>         (((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))
>> 
>> usually expands to this:
>> 
>>   24:	e1a00800 	lsl	r0, r0, #16
>>   28:	e1a03c20 	lsr	r3, r0, #24
>>   2c:	e1833420 	orr	r3, r3, r0, lsr #8
>>   30:	e1a03803 	lsl	r3, r3, #16
>>   34:	e1a00823 	lsr	r0, r3, #16
>> 
>> but in my case, the two last shifts are missing.
>> 
>> > you need to submit a bug report to gcc people.
>> 
>> but is it for sure ?
>
> Well, in the above code, we're being quite explicit about wanting only
> bits 7-0 to be moved to bits 15-8, whereas what the compiler is actually
> doing is moving bits 15-0 to 23-8.
>
> In other words, the compiler has completely lost sight of that & 0x00ff
> in there.
>
>> I couldn't find any working gcc version so it does not look like a
>> regression, hence my doubt.
>
> Well, looking at the builds I have here, it seems that my gcc also
> produces wrong code here in __udp4_lib_rcv(), which makes me wonder
> how NFS works.  Ah, that's how - in the standard kernel, it's not
> "len" which usually gets swabbed, it's the protocol ID - which for
> UDP is 17.  __swab16(17) produces 0x1100 even with the bug.
>
> Even so, the code _is_ buggy, because if the protocol value had bits
> 15-8 set, then this would go wrong for all the same reasons that
> you've found.  GCC is definitely ignoring the outter (uint16_t) cast
> in the above.
>
> So yes, afaics it's a GCC bug which appears to affect many GCC versions.
>
> The question is... what to do about this.  I don't think we want to wait
> for a gcc version to be fixed...

First we need to figure out exactly pattern triggers the bug.  Once that
is known, we can work around it.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:48     ` Russell King - ARM Linux
  2013-12-12 15:00       ` Måns Rullgård
@ 2013-12-12 15:04       ` Maxime Bizon
  2013-12-12 15:41         ` Russell King - ARM Linux
  1 sibling, 1 reply; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 15:04 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 14:48 +0000, Russell King - ARM Linux wrote:

> Even so, the code _is_ buggy, because if the protocol value had bits
> 15-8 set, then this would go wrong for all the same reasons that
> you've found.  GCC is definitely ignoring the outter (uint16_t) cast
> in the above.

ok I'll file a gcc bug report to get their input on this.

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 14:42             ` Måns Rullgård
  2013-12-12 14:52               ` Maxime Bizon
@ 2013-12-12 15:07               ` Willy Tarreau
  2013-12-12 15:18                 ` Måns Rullgård
  1 sibling, 1 reply; 37+ messages in thread
From: Willy Tarreau @ 2013-12-12 15:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 02:42:23PM +0000, M?ns Rullg?rd wrote:
> Maxime Bizon <mbizon@freebox.fr> writes:
> 
> > On Thu, 2013-12-12 at 15:19 +0100, Willy Tarreau wrote:
> >
> >> I disagree here, since gcc may decide by itself to inline or not, it must
> >> not impact the validity of the emitted code. Inline functions have input
> >> and output types for a reason!
> >
> > but the code emitted is completely different for an inlined "occurence"
> > of the function and the non-inlined case.
> >
> > if a function does not use one of its argument, and it is inlined, then
> > in the resulting code you see no trace of it
> 
> Again, that's an optimisation that does not alter the semantics of the code.
> Although the generated code looks very different, it does the same thing.

Agreed.

> >> Hmmm aren't you passing a 16-bit register directly to the ASM for being used
> >> as a 32-bit one ? This seems hasardous to me since nowhere you tell gcc how
> >> you're going to use the register.
> >
> > this is exactly what I'm complaining about, the arm code for
> > csum_tcpudp_nofold() in the kernel does exactly that.
> >
> >> Could you check if that fixes it :
> >> 
> >>  static inline uint32_t asm_add(uint16_t len, uint32_t sum)
> >>  {
> >>          uint32_t len32 = len;
> >
> > or change the asm_add() proto to take an "uint32_t len" instead, and yes
> > of course that fixes it.
> 
> It's a bug.  Please report it to the gcc developers.

Here I don't agree with the generalization (and believe me I swear all the
day about gcc's bugs). It's a matter of ABI and availability or not of 16
bit registers or not. If the ASM supports 16-bit regs and the compiler is
allowed to emit 16-bit regs, then gcc will have no way to know whether it
must zero-extend the value first. If it's specified that "r" is necessarily
a 32-bit register then it should. Maybe the issue is in the ABI itself, I
don't know if 16-bit values are supposed to be zero-extended only when
they're converted to 32-bit or also when they're passed as arguments. The
fact that it works without inline may simply be a side effect of the different
code (eg: 16 lower bits of the register copied into another one).

So one needs to look at the specs of the ABI to know where the 16-bit value
is supposed to be converted to 32-bit, then the faulty component must be
fixed (gcc or kernel code).

Regards,
Willy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:07               ` Willy Tarreau
@ 2013-12-12 15:18                 ` Måns Rullgård
  2013-12-12 15:28                   ` Willy Tarreau
  0 siblings, 1 reply; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 15:18 UTC (permalink / raw)
  To: linux-arm-kernel

Willy Tarreau <w@1wt.eu> writes:

>> >> Hmmm aren't you passing a 16-bit register directly to the ASM for
>> >> being used as a 32-bit one ? This seems hasardous to me since
>> >> nowhere you tell gcc how you're going to use the register.
>> >
>> > this is exactly what I'm complaining about, the arm code for
>> > csum_tcpudp_nofold() in the kernel does exactly that.
>> >
>> >> Could you check if that fixes it :
>> >> 
>> >>  static inline uint32_t asm_add(uint16_t len, uint32_t sum)
>> >>  {
>> >>          uint32_t len32 = len;
>> >
>> > or change the asm_add() proto to take an "uint32_t len" instead, and yes
>> > of course that fixes it.
>> 
>> It's a bug.  Please report it to the gcc developers.
>
> Here I don't agree with the generalization (and believe me I swear all the
> day about gcc's bugs). It's a matter of ABI and availability or not of 16
> bit registers or not. If the ASM supports 16-bit regs and the compiler is
> allowed to emit 16-bit regs, then gcc will have no way to know whether it
> must zero-extend the value first. If it's specified that "r" is necessarily
> a 32-bit register then it should.

ARM has *only* 32-bit registers.

> Maybe the issue is in the ABI itself, I don't know if 16-bit values
> are supposed to be zero-extended only when they're converted to 32-bit
> or also when they're passed as arguments. The fact that it works
> without inline may simply be a side effect of the different code (eg:
> 16 lower bits of the register copied into another one).
>
> So one needs to look at the specs of the ABI to know where the 16-bit value
> is supposed to be converted to 32-bit, then the faulty component must be
> fixed (gcc or kernel code).

The ABI for function calls sign/zero-extends all arguments prior to the
call.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:00                 ` Russell King - ARM Linux
@ 2013-12-12 15:26                   ` Maxime Bizon
  0 siblings, 0 replies; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 15:26 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 15:00 +0000, Russell King - ARM Linux wrote:

> This is different - the compiler has recognised in both cases that the
> addition od 42 to foo is useless as the result is not used, and
> therefore has optimised the addition away.  In the second case, it has
> realised that the narrowing cast used to then add 42 to is also not
> used, and it has also optimised that away.

I was just trying to show that in the inline case the generated code was
not the same (apparently there was a misunderstanding on what I was
arguing about).

Now if you want a more interesting/on topic example:

#include <stdint.h>

static __attribute__((noinline)) unsigned int set_bit_2(unsigned int foo)
{
        foo |= 0x4;
        return foo;
}

#define local_swab16(x) ((uint16_t)(                                    \
        (((uint16_t)(x) & (uint16_t)0x00ffU) << 8) |                  \
        (((uint16_t)(x) & (uint16_t)0xff00U) >> 8)))

int func(uint16_t a)
{
        a = set_bit_2(local_swab16(a));
        if ((a & 0xffff) == 0x1234)
                return 1;
        return 0;
}

00000000 <set_bit_2>:
   0:	e3800004 	orr	r0, r0, #4
   4:	e12fff1e 	bx	lr

00000008 <func>:
   8:	e92d4008 	push	{r3, lr}
   c:	e1a03420 	lsr	r3, r0, #8
  10:	e1830400 	orr	r0, r3, r0, lsl #8
  14:	e1a00800 	lsl	r0, r0, #16
  18:	e1a00820 	lsr	r0, r0, #16
  1c:	ebfffff7 	bl	0 <set_bit_2>
  20:	e3a03c12 	mov	r3, #4608	; 0x1200
  24:	e1a00800 	lsl	r0, r0, #16
  28:	e2833034 	add	r3, r3, #52	; 0x34
  2c:	e1530820 	cmp	r3, r0, lsr #16
  30:	03a00001 	moveq	r0, #1
  34:	13a00000 	movne	r0, #0
  38:	e8bd8008 	pop	{r3, pc}

you can see here the full swab16 code, with the narrowing before calling 
set_bit_2() per calling convention.


Now remove the noninline:

00000000 <func>:
   0:	e3a03c12 	mov	r3, #4608	; 0x1200
   4:	e1a02420 	lsr	r2, r0, #8
   8:	e1820400 	orr	r0, r2, r0, lsl #8
   c:	e3802004 	orr	r2, r0, #4
  10:	e1a02802 	lsl	r2, r2, #16
  14:	e2833034 	add	r3, r3, #52	; 0x34
  18:	e1530822 	cmp	r3, r2, lsr #16
  1c:	03a00001 	moveq	r0, #1
  20:	13a00000 	movne	r0, #0
  24:	e12fff1e 	bx	lr

and you see that the double shift is removed because gcc knows it is not
needed by the remaining code, which uses only the lower 16 bits.

This is IMO exactly what happen in the csum case, the inline assembly
rule (the one we are not aware of) must be that you cannot use the 16
higher bits of the register in the assembly code, and so gcc takes the
liberty not to zero extend the register.

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:18                 ` Måns Rullgård
@ 2013-12-12 15:28                   ` Willy Tarreau
  2013-12-12 15:43                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 37+ messages in thread
From: Willy Tarreau @ 2013-12-12 15:28 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 03:18:18PM +0000, M?ns Rullg?rd wrote:
> Willy Tarreau <w@1wt.eu> writes:
> 
> >> >> Hmmm aren't you passing a 16-bit register directly to the ASM for
> >> >> being used as a 32-bit one ? This seems hasardous to me since
> >> >> nowhere you tell gcc how you're going to use the register.
> >> >
> >> > this is exactly what I'm complaining about, the arm code for
> >> > csum_tcpudp_nofold() in the kernel does exactly that.
> >> >
> >> >> Could you check if that fixes it :
> >> >> 
> >> >>  static inline uint32_t asm_add(uint16_t len, uint32_t sum)
> >> >>  {
> >> >>          uint32_t len32 = len;
> >> >
> >> > or change the asm_add() proto to take an "uint32_t len" instead, and yes
> >> > of course that fixes it.
> >> 
> >> It's a bug.  Please report it to the gcc developers.
> >
> > Here I don't agree with the generalization (and believe me I swear all the
> > day about gcc's bugs). It's a matter of ABI and availability or not of 16
> > bit registers or not. If the ASM supports 16-bit regs and the compiler is
> > allowed to emit 16-bit regs, then gcc will have no way to know whether it
> > must zero-extend the value first. If it's specified that "r" is necessarily
> > a 32-bit register then it should.
> 
> ARM has *only* 32-bit registers.
> 
> > Maybe the issue is in the ABI itself, I don't know if 16-bit values
> > are supposed to be zero-extended only when they're converted to 32-bit
> > or also when they're passed as arguments. The fact that it works
> > without inline may simply be a side effect of the different code (eg:
> > 16 lower bits of the register copied into another one).
> >
> > So one needs to look at the specs of the ABI to know where the 16-bit value
> > is supposed to be converted to 32-bit, then the faulty component must be
> > fixed (gcc or kernel code).
> 
> The ABI for function calls sign/zero-extends all arguments prior to the
> call.

OK then that's pretty clear, there's no ambiguity, thanks for the precision.

Willy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:04       ` Maxime Bizon
@ 2013-12-12 15:41         ` Russell King - ARM Linux
  2013-12-12 16:04           ` Måns Rullgård
                             ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 15:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 04:04:17PM +0100, Maxime Bizon wrote:
> 
> On Thu, 2013-12-12 at 14:48 +0000, Russell King - ARM Linux wrote:
> 
> > Even so, the code _is_ buggy, because if the protocol value had bits
> > 15-8 set, then this would go wrong for all the same reasons that
> > you've found.  GCC is definitely ignoring the outter (uint16_t) cast
> > in the above.
> 
> ok I'll file a gcc bug report to get their input on this.

Actually, it may not be a gcc bug at all.

One of the issues here is that internally, GCC tracks the "machine mode"
of a register, where "mode" basically means the type of register it is.
In this case, the register will be in "half integer" mode when it is
passed into the assembler, and gcc does _not_ promote it.

We can see this in the RTL:

(insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ])
        (asm_operands:SI ("mov  %0, %1") ("=&r") 0 [
                (subreg:HI (reg:SI 148) 0)
            ]
             [
                (asm_input:HI ("r") (null):0)
            ]
             [] ../t.c:12)) -1 (nil))

Note that "HI" for the input register 148.  What this means is that the
compiler "knows" that it's supposed to be a 16-bit quantity, and has
recorded that fact, and _will_ truncate it down to 16-bit when it needs
to be promoted.

Since the asm() only asks for a "register", that's what we get - a
register which _happens_ to be in HI mode containing the value.  However,
there's no way to tell that from the assembly code itself (GCC doesn't
have the facility to do conditional assembly generation depending on
the argument type passed into it.)

What this means is that passing "unsigned short"s into an asm is dodgy.

What's more, I've just had this confirmed by compiler people in ARM Ltd -
we can't rely on the state of the upper 16 bits when passing a u16 into
an asm().  So, it seems that it _is_ a kernel bug after all.

As I said, we need to fix it in the kernel because of the huge number of
GCC versions which show this behaviour _even_ if it turns out to be a GCC
bug (which it seems it *isn't*.)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:28                   ` Willy Tarreau
@ 2013-12-12 15:43                     ` Russell King - ARM Linux
  2013-12-12 15:50                       ` Måns Rullgård
  0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 15:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 04:28:49PM +0100, Willy Tarreau wrote:
> On Thu, Dec 12, 2013 at 03:18:18PM +0000, M?ns Rullg?rd wrote:
> > The ABI for function calls sign/zero-extends all arguments prior to the
> > call.
> 
> OK then that's pretty clear, there's no ambiguity, thanks for the precision.

Let's be clear about this: there is no GCC bug here - the kernel code
unfortunately what is buggy.  asm() input arguments are not subject to
this promotion irrespective of the ABI.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:43                     ` Russell King - ARM Linux
@ 2013-12-12 15:50                       ` Måns Rullgård
  0 siblings, 0 replies; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 15:50 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

> On Thu, Dec 12, 2013 at 04:28:49PM +0100, Willy Tarreau wrote:
>> On Thu, Dec 12, 2013 at 03:18:18PM +0000, M?ns Rullg?rd wrote:
>> > The ABI for function calls sign/zero-extends all arguments prior to the
>> > call.
>> 
>> OK then that's pretty clear, there's no ambiguity, thanks for the precision.
>
> Let's be clear about this: there is no GCC bug here - the kernel code
> unfortunately what is buggy.  asm() input arguments are not subject to
> this promotion irrespective of the ABI.

At the asm() the value should already have been zero-extended.  Since
the code does not do anything with undefined behaviour, the fact that
different optimisation levels give different results clearly indicates a
compiler bug.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:41         ` Russell King - ARM Linux
@ 2013-12-12 16:04           ` Måns Rullgård
  2013-12-12 16:04           ` Willy Tarreau
  2013-12-12 17:34           ` Maxime Bizon
  2 siblings, 0 replies; 37+ messages in thread
From: Måns Rullgård @ 2013-12-12 16:04 UTC (permalink / raw)
  To: linux-arm-kernel

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

> On Thu, Dec 12, 2013 at 04:04:17PM +0100, Maxime Bizon wrote:
>> 
>> On Thu, 2013-12-12 at 14:48 +0000, Russell King - ARM Linux wrote:
>> 
>> > Even so, the code _is_ buggy, because if the protocol value had bits
>> > 15-8 set, then this would go wrong for all the same reasons that
>> > you've found.  GCC is definitely ignoring the outter (uint16_t) cast
>> > in the above.
>> 
>> ok I'll file a gcc bug report to get their input on this.
>
> Actually, it may not be a gcc bug at all.
>
> One of the issues here is that internally, GCC tracks the "machine mode"
> of a register, where "mode" basically means the type of register it is.
> In this case, the register will be in "half integer" mode when it is
> passed into the assembler, and gcc does _not_ promote it.
>
> We can see this in the RTL:
>
> (insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ])
>         (asm_operands:SI ("mov  %0, %1") ("=&r") 0 [
>                 (subreg:HI (reg:SI 148) 0)
>             ]
>              [
>                 (asm_input:HI ("r") (null):0)
>             ]
>              [] ../t.c:12)) -1 (nil))
>
> Note that "HI" for the input register 148.  What this means is that the
> compiler "knows" that it's supposed to be a 16-bit quantity, and has
> recorded that fact, and _will_ truncate it down to 16-bit when it needs
> to be promoted.
>
> Since the asm() only asks for a "register", that's what we get - a
> register which _happens_ to be in HI mode containing the value.  However,
> there's no way to tell that from the assembly code itself (GCC doesn't
> have the facility to do conditional assembly generation depending on
> the argument type passed into it.)
>
> What this means is that passing "unsigned short"s into an asm is dodgy.

If that's the intended behaviour, the compiler really ought to error or
warn when this happens.  It's far too easy to accidentally fall foul of
this.

> What's more, I've just had this confirmed by compiler people in ARM Ltd -
> we can't rely on the state of the upper 16 bits when passing a u16 into
> an asm().  So, it seems that it _is_ a kernel bug after all.

With that being the official stance, the proper fix appears to be to
explicitly cast all 8/16-bit asm operands to a 32-bit type.  Good luck
finding them.

-- 
M?ns Rullg?rd
mans at mansr.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:41         ` Russell King - ARM Linux
  2013-12-12 16:04           ` Måns Rullgård
@ 2013-12-12 16:04           ` Willy Tarreau
  2013-12-12 16:47             ` Russell King - ARM Linux
  2013-12-12 17:34           ` Maxime Bizon
  2 siblings, 1 reply; 37+ messages in thread
From: Willy Tarreau @ 2013-12-12 16:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 03:41:10PM +0000, Russell King - ARM Linux wrote:
> One of the issues here is that internally, GCC tracks the "machine mode"
> of a register, where "mode" basically means the type of register it is.
> In this case, the register will be in "half integer" mode when it is
> passed into the assembler, and gcc does _not_ promote it.
> 
> We can see this in the RTL:
> 
> (insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ])
>         (asm_operands:SI ("mov  %0, %1") ("=&r") 0 [
>                 (subreg:HI (reg:SI 148) 0)
>             ]
>              [
>                 (asm_input:HI ("r") (null):0)
>             ]
>              [] ../t.c:12)) -1 (nil))
> 
> Note that "HI" for the input register 148.  What this means is that the
> compiler "knows" that it's supposed to be a 16-bit quantity, and has
> recorded that fact, and _will_ truncate it down to 16-bit when it needs
> to be promoted.
> 
> Since the asm() only asks for a "register", that's what we get - a
> register which _happens_ to be in HI mode containing the value.  However,
> there's no way to tell that from the assembly code itself (GCC doesn't
> have the facility to do conditional assembly generation depending on
> the argument type passed into it.)

OK so that means that it's just like having true 16-bit registers except
that it's just a temporary representation and not a feature of the CPU.
The compiler has the right to expect the asm instructions to only use
the lower 16 bits and has no way to verify that.

Then changing the type of the function argument would probably be safer!

Regards,
Willy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 16:04           ` Willy Tarreau
@ 2013-12-12 16:47             ` Russell King - ARM Linux
  2013-12-12 17:11               ` Willy Tarreau
  2013-12-12 22:30               ` Maxime Bizon
  0 siblings, 2 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 16:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 05:04:26PM +0100, Willy Tarreau wrote:
> On Thu, Dec 12, 2013 at 03:41:10PM +0000, Russell King - ARM Linux wrote:
> > One of the issues here is that internally, GCC tracks the "machine mode"
> > of a register, where "mode" basically means the type of register it is.
> > In this case, the register will be in "half integer" mode when it is
> > passed into the assembler, and gcc does _not_ promote it.
> > 
> > We can see this in the RTL:
> > 
> > (insn 11 10 12 3 ../t.c:9 (set (reg:SI 143 [ sum ])
> >         (asm_operands:SI ("mov  %0, %1") ("=&r") 0 [
> >                 (subreg:HI (reg:SI 148) 0)
> >             ]
> >              [
> >                 (asm_input:HI ("r") (null):0)
> >             ]
> >              [] ../t.c:12)) -1 (nil))
> > 
> > Note that "HI" for the input register 148.  What this means is that the
> > compiler "knows" that it's supposed to be a 16-bit quantity, and has
> > recorded that fact, and _will_ truncate it down to 16-bit when it needs
> > to be promoted.
> > 
> > Since the asm() only asks for a "register", that's what we get - a
> > register which _happens_ to be in HI mode containing the value.  However,
> > there's no way to tell that from the assembly code itself (GCC doesn't
> > have the facility to do conditional assembly generation depending on
> > the argument type passed into it.)
> 
> OK so that means that it's just like having true 16-bit registers except
> that it's just a temporary representation and not a feature of the CPU.
> The compiler has the right to expect the asm instructions to only use
> the lower 16 bits and has no way to verify that.

Quite.

> Then changing the type of the function argument would probably be safer!

Actually, I think we can do a bit better with this code.  We really don't
need much of this messing around here, we can combine some of these steps.

We have:

16-bit protocol in host endian
16-bit length in host endian

and we need to combine them into a 32-bit checksum which is then
subsequently folded down to 16-bits by adding the top and bottom halves.

Now, what we can do is this:

1. Construct a combined 32-bit protocol and length:

	unsigned lenproto = len | proto << 16;

2. Pass this into the assembly thusly:

                __asm__(
                "adds   %0, %1, %2      @ csum_tcpudp_nofold    \n\t"
                "adcs   %0, %0, %3                              \n\t"
#ifdef __ARMEB__
                "adcs   %0, %0, %4                              \n\t"
#else
                "adcs   %0, %0, %4, ror #8                      \n\t"
#endif
                "adc    %0, %0, #0"
                : "=&r"(sum)
                : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
                : "cc");

with no swabbing@this stage.  Well, where do we get the endian
conversion?  See that ror #8 - that a 32 bit rotate by 8 bits.  As
these are two 16-bit quantities, we end up with this:

original:
	31..24	23..16	15..8	7..0
	len_h	len_l	pro_h	pro_l

accumulated:
	31..24	23..16	15..8	7..0
	pro_l	len_h	len_l	pro_h

And now when we fold it down to 16-bit:

			15..8	7..0
			len_l	pro_h
			pro_l	len_h

There we go - the low and high bytes end up correctly placed.  Now, the
advantage to having "len" and "proto" combined by the compiler is that
the compiler itself can make the decision about how to combine these two
16-bit HI-mode quantities in the most efficient way.

Here's some examples from udp.c - I have another optimisation in here
too which makes this code sequence slightly shorter.  Some intermediary
instructions have been omitted.

        mov     r6, r6, asl #16 @ tmp204, len,
        mov     r6, r6, lsr #16 @ tmp203, tmp204,
        orr     r6, r6, #1114112        @ tmp205, tmp203,
@ 92 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1
        adds    r3, r8, r7      @ csum_tcpudp_nofold0           @ sum, dst, src
        adcs    r3, r3, r6, ror #8                              @ sum, tmp205
        adc     r3, r3, #0      @ sum


        mov     r6, r6, asl #16 @ tmp220, len,
        mov     r6, r6, lsr #16 @ tmp219, tmp220,
        orr     r6, r6, #1114112        @ tmp221, tmp219,
@ 104 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1
        adds    r3, r0, r8      @ csum_tcpudp_nofold            @ sum,, dst
        adcs    r3, r3, r7                                      @ sum, src
        adcs    r3, r3, r6, ror #8                              @ sum, tmp221
        adc     r3, r3, #0      @ sum

Note this one sourcing an 8-bit protocol value.

        ldrb    r3, [r7, #269]  @ zero_extendqisi2      @ tmp321, sk_5->sk_protocol
        orr     sl, sl, r3, asl #16     @, tmp324, D.48540, tmp321,
@ 104 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1
        adds    r3, r0, r2      @ csum_tcpudp_nofold            @ sum, csum, fl4_19(D)->daddr
        adcs    r3, r3, r1                                      @ sum, fl4_19(D)->saddr
        adcs    r3, r3, sl, ror #8                              @ sum, tmp324
        adc     r3, r3, #0      @ sum


        ldrh    lr, [r4, #76]   @ tmp503, skb_10(D)->len
        orr     lr, lr, r7, asl #16     @, tmp505, tmp503, proto,
@ 104 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1
        adds    r1, sl, r0      @ csum_tcpudp_nofold            @ sum, skb_10(D)->D.22802.csum, iph_354->daddr
        adcs    r1, r1, ip                                      @ sum, iph_354->saddr
        adcs    r1, r1, lr, ror #8                              @ sum, tmp505
        adc     r1, r1, #0      @ sum


        ldrh    r0, [r4, #76]   @ tmp529, skb_10(D)->len
        orr     r0, r0, r7, asl #16     @, tmp531, tmp529, proto,
@ 92 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1
        adds    r3, r2, r1      @ csum_tcpudp_nofold0           @ sum, iph_354->daddr, iph_354->saddr
        adcs    r3, r3, r0, ror #8                              @ sum, tmp531
        adc     r3, r3, #0      @ sum

This one is a little more complicated because it uses skb->len - we
can see the narrowing cast being performed:

        ldr     r0, [r4, #76]   @ skb_1->len, skb_1->len
        rsb     r0, r9, r0      @ tmp335, D.53494, skb_1->len
        mov     r0, r0, asl #16 @ tmp337, tmp335,
        mov     r0, r0, lsr #16 @ tmp336, tmp337,
        orr     r0, r0, #1114112        @ tmp338, tmp336,
@ 92 "/home/rmk/git/linux-rmk/arch/arm/include/asm/checksum.h" 1
        adds    r3, r2, r1      @ csum_tcpudp_nofold0           @ sum, iph_252->daddr, iph_252->saddr
        adcs    r3, r3, r0, ror #8                              @ sum, tmp338
        adc     r3, r3, #0      @ sum

Looking at whether it's better to shift len or proto to the upper 16 bits
reveals no real advantage to either: we end up with the same overall number
of instructions between the UDP and TCP code combined, individually there's
only one line between them.

So here's a patch for Maxime to try - I've not run it yet but it seems to
do the right thing by code inspection.

diff --git a/arch/arm/include/asm/checksum.h b/arch/arm/include/asm/checksum.h
index 6dcc16430868..523315115478 100644
--- a/arch/arm/include/asm/checksum.h
+++ b/arch/arm/include/asm/checksum.h
@@ -87,19 +87,34 @@ static inline __wsum
 csum_tcpudp_nofold(__be32 saddr, __be32 daddr, unsigned short len,
 		   unsigned short proto, __wsum sum)
 {
-	__asm__(
-	"adds	%0, %1, %2		@ csum_tcpudp_nofold	\n\
-	adcs	%0, %0, %3					\n"
+	u32 lenprot = len | proto << 16;
+
+	if (__builtin_constant_p(sum) && sum == 0) {
+		__asm__(
+		"adds	%0, %1, %2	@ csum_tcpudp_nofold0	\n\t"
 #ifdef __ARMEB__
-	"adcs	%0, %0, %4					\n"
+		"adcs	%0, %0, %3				\n\t"
 #else
-	"adcs	%0, %0, %4, lsl #8				\n"
+		"adcs	%0, %0, %3, ror #8			\n\t"
 #endif
-	"adcs	%0, %0, %5					\n\
-	adc	%0, %0, #0"
-	: "=&r"(sum)
-	: "r" (sum), "r" (daddr), "r" (saddr), "r" (len), "Ir" (htons(proto))
-	: "cc");
+		"adc	%0, %0, #0"
+		: "=&r" (sum)
+		: "r" (daddr), "r" (saddr), "r" (lenprot)
+		: "cc");
+	} else {
+		__asm__(
+		"adds	%0, %1, %2	@ csum_tcpudp_nofold	\n\t"
+		"adcs	%0, %0, %3				\n\t"
+#ifdef __ARMEB__
+		"adcs	%0, %0, %4				\n\t"
+#else
+		"adcs	%0, %0, %4, ror #8			\n\t"
+#endif
+		"adc	%0, %0, #0"
+		: "=&r"(sum)
+		: "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
+		: "cc");
+	}
 	return sum;
 }	
 /*

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 16:47             ` Russell King - ARM Linux
@ 2013-12-12 17:11               ` Willy Tarreau
  2013-12-12 17:20                 ` Russell King - ARM Linux
  2013-12-12 22:30               ` Maxime Bizon
  1 sibling, 1 reply; 37+ messages in thread
From: Willy Tarreau @ 2013-12-12 17:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 04:47:48PM +0000, Russell King - ARM Linux wrote:
> > Then changing the type of the function argument would probably be safer!
> 
> Actually, I think we can do a bit better with this code.  We really don't
> need much of this messing around here, we can combine some of these steps.
> 
> We have:
> 
> 16-bit protocol in host endian
> 16-bit length in host endian
> 
> and we need to combine them into a 32-bit checksum which is then
> subsequently folded down to 16-bits by adding the top and bottom halves.
> 
> Now, what we can do is this:
> 
> 1. Construct a combined 32-bit protocol and length:
> 
> 	unsigned lenproto = len | proto << 16;
> 
> 2. Pass this into the assembly thusly:
> 
>                 __asm__(
>                 "adds   %0, %1, %2      @ csum_tcpudp_nofold    \n\t"
>                 "adcs   %0, %0, %3                              \n\t"
> #ifdef __ARMEB__
>                 "adcs   %0, %0, %4                              \n\t"
> #else
>                 "adcs   %0, %0, %4, ror #8                      \n\t"
> #endif
>                 "adc    %0, %0, #0"
>                 : "=&r"(sum)
>                 : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
>                 : "cc");
> 
> with no swabbing at this stage.  Well, where do we get the endian
> conversion?  See that ror #8 - that a 32 bit rotate by 8 bits.  As
> these are two 16-bit quantities, we end up with this:
> 
> original:
> 	31..24	23..16	15..8	7..0
> 	len_h	len_l	pro_h	pro_l
> 
> accumulated:
> 	31..24	23..16	15..8	7..0
> 	pro_l	len_h	len_l	pro_h
> 
> And now when we fold it down to 16-bit:
> 
> 			15..8	7..0
> 			len_l	pro_h
> 			pro_l	len_h

Amusing, I've used the same optimization yesterday when computing a
TCP pseudo-header checksum.

Another thing that can be done to improve the folding of the 16-bit
checksum is to swap the values to be added, sum them and only keep
the high half integer which already contains the carry. At least on
x86 I save some cycles doing this :

              31:24  23:16  15:8  7:0
     sum32 =    D      C      B    A

     To fold this into 16-bit at a time, I just do this :

                   31:24     23:16          15:8  7:0
     sum32           D         C              B    A
  +  sum32swapped    B         A              D    C
  =                 A+B  C+A+carry(B+D/C+A)  B+D  C+A

so just take the upper result and you get the final 16-bit word at
once.

In C it does :

       fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16

When the CPU has a rotate instruction, it's fast :-)

Cheers,
Willy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 17:11               ` Willy Tarreau
@ 2013-12-12 17:20                 ` Russell King - ARM Linux
  2013-12-12 17:35                   ` Willy Tarreau
  2013-12-12 18:07                   ` Nicolas Pitre
  0 siblings, 2 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 17:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> Another thing that can be done to improve the folding of the 16-bit
> checksum is to swap the values to be added, sum them and only keep
> the high half integer which already contains the carry. At least on
> x86 I save some cycles doing this :
> 
>               31:24  23:16  15:8  7:0
>      sum32 =    D      C      B    A
> 
>      To fold this into 16-bit at a time, I just do this :
> 
>                    31:24     23:16          15:8  7:0
>      sum32           D         C              B    A
>   +  sum32swapped    B         A              D    C
>   =                 A+B  C+A+carry(B+D/C+A)  B+D  C+A
> 
> so just take the upper result and you get the final 16-bit word at
> once.
> 
> In C it does :
> 
>        fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
> 
> When the CPU has a rotate instruction, it's fast :-)

Indeed - and if your CPU can do the rotate and add at the same time,
it's just a singe instruction, and it ends up looking remarkably
similar to this:

static inline __sum16 csum_fold(__wsum sum)
{
        __asm__(
        "add    %0, %1, %1, ror #16     @ csum_fold"
        : "=r" (sum)
        : "r" (sum)
        : "cc");
        return (__force __sum16)(~(__force u32)sum >> 16);
}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 15:41         ` Russell King - ARM Linux
  2013-12-12 16:04           ` Måns Rullgård
  2013-12-12 16:04           ` Willy Tarreau
@ 2013-12-12 17:34           ` Maxime Bizon
  2 siblings, 0 replies; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 17:34 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 15:41 +0000, Russell King - ARM Linux wrote:
> 
> What's more, I've just had this confirmed by compiler people in ARM
> Ltd - we can't rely on the state of the upper 16 bits when passing a
> u16 into an asm().  So, it seems that it _is_ a kernel bug after all.
> 
ok, I was not crazy

thanks for the deeper analysis, I will test your patch ASAP

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 17:20                 ` Russell King - ARM Linux
@ 2013-12-12 17:35                   ` Willy Tarreau
  2013-12-12 18:07                   ` Nicolas Pitre
  1 sibling, 0 replies; 37+ messages in thread
From: Willy Tarreau @ 2013-12-12 17:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 05:20:49PM +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> > Another thing that can be done to improve the folding of the 16-bit
> > checksum is to swap the values to be added, sum them and only keep
> > the high half integer which already contains the carry. At least on
> > x86 I save some cycles doing this :
> > 
> >               31:24  23:16  15:8  7:0
> >      sum32 =    D      C      B    A
> > 
> >      To fold this into 16-bit at a time, I just do this :
> > 
> >                    31:24     23:16          15:8  7:0
> >      sum32           D         C              B    A
> >   +  sum32swapped    B         A              D    C
> >   =                 A+B  C+A+carry(B+D/C+A)  B+D  C+A
> > 
> > so just take the upper result and you get the final 16-bit word at
> > once.
> > 
> > In C it does :
> > 
> >        fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
> > 
> > When the CPU has a rotate instruction, it's fast :-)
> 
> Indeed - and if your CPU can do the rotate and add at the same time,
> it's just a singe instruction, and it ends up looking remarkably
> similar to this:
> 
> static inline __sum16 csum_fold(__wsum sum)
> {
>         __asm__(
>         "add    %0, %1, %1, ror #16     @ csum_fold"
>         : "=r" (sum)
>         : "r" (sum)
>         : "cc");
>         return (__force __sum16)(~(__force u32)sum >> 16);
> }

Marvelous :-)

Willy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 17:20                 ` Russell King - ARM Linux
  2013-12-12 17:35                   ` Willy Tarreau
@ 2013-12-12 18:07                   ` Nicolas Pitre
  1 sibling, 0 replies; 37+ messages in thread
From: Nicolas Pitre @ 2013-12-12 18:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 12 Dec 2013, Russell King - ARM Linux wrote:

> static inline __sum16 csum_fold(__wsum sum)
> {
>         __asm__(
>         "add    %0, %1, %1, ror #16     @ csum_fold"
>         : "=r" (sum)
>         : "r" (sum)
>         : "cc");
>         return (__force __sum16)(~(__force u32)sum >> 16);
> }

There is no longer any flag updates so you should remove "cc" from the 
clobber list as well.


Nicolas

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 16:47             ` Russell King - ARM Linux
  2013-12-12 17:11               ` Willy Tarreau
@ 2013-12-12 22:30               ` Maxime Bizon
  2013-12-12 22:36                 ` Russell King - ARM Linux
  1 sibling, 1 reply; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 22:30 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 16:47 +0000, Russell King - ARM Linux wrote:
> 
> So here's a patch for Maxime to try - I've not run it yet but it seems
> to do the right thing by code inspection. 

Kernel built with this patch seems to work fine. Also I called it
manually with a few test patterns and the results are correct.

Tested-by: Maxime Bizon <mbizon@freebox.fr>

Thanks Russell,

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 22:30               ` Maxime Bizon
@ 2013-12-12 22:36                 ` Russell King - ARM Linux
  2013-12-12 22:44                   ` Maxime Bizon
  0 siblings, 1 reply; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 22:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 11:30:02PM +0100, Maxime Bizon wrote:
> 
> On Thu, 2013-12-12 at 16:47 +0000, Russell King - ARM Linux wrote:
> > 
> > So here's a patch for Maxime to try - I've not run it yet but it seems
> > to do the right thing by code inspection. 
> 
> Kernel built with this patch seems to work fine. Also I called it
> manually with a few test patterns and the results are correct.
> 
> Tested-by: Maxime Bizon <mbizon@freebox.fr>
> 
> Thanks Russell,

Thanks Maxime.

Does this bug show up in existing unmodified mainline kernels?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 22:36                 ` Russell King - ARM Linux
@ 2013-12-12 22:44                   ` Maxime Bizon
  2013-12-12 22:48                     ` Russell King - ARM Linux
  0 siblings, 1 reply; 37+ messages in thread
From: Maxime Bizon @ 2013-12-12 22:44 UTC (permalink / raw)
  To: linux-arm-kernel


On Thu, 2013-12-12 at 22:36 +0000, Russell King - ARM Linux wrote:

> Does this bug show up in existing unmodified mainline kernels?

not for me, I found this while writing new code.

though all possible kernel callers are not enabled in my .config.

-- 
Maxime

^ permalink raw reply	[flat|nested] 37+ messages in thread

* gcc miscompiles csum_tcpudp_magic() on ARMv5
  2013-12-12 22:44                   ` Maxime Bizon
@ 2013-12-12 22:48                     ` Russell King - ARM Linux
  0 siblings, 0 replies; 37+ messages in thread
From: Russell King - ARM Linux @ 2013-12-12 22:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Dec 12, 2013 at 11:44:05PM +0100, Maxime Bizon wrote:
> 
> On Thu, 2013-12-12 at 22:36 +0000, Russell King - ARM Linux wrote:
> 
> > Does this bug show up in existing unmodified mainline kernels?
> 
> not for me, I found this while writing new code.
> 
> though all possible kernel callers are not enabled in my .config.

Okay, as we have no reports so far of this affecting anything in any
mainline kernel, I think I'm going to queue the fix up for v3.14 and
not -rc - we know that although the assembly is not correct in the IP
code, we know that it's completely harmless there because the overflowed
byte is always zero.

So... I'll queue it for v3.14.  If anyone does find it affects existing
kernels, we can always apply it for -rc and stable kernels.

Thanks for your patience Maxime.

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2013-12-12 22:48 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-12 12:14 gcc miscompiles csum_tcpudp_magic() on ARMv5 Maxime Bizon
2013-12-12 12:40 ` Russell King - ARM Linux
2013-12-12 13:36   ` Maxime Bizon
2013-12-12 13:48     ` Måns Rullgård
2013-12-12 14:10       ` Maxime Bizon
2013-12-12 14:19         ` Willy Tarreau
2013-12-12 14:28           ` Maxime Bizon
2013-12-12 14:42             ` Måns Rullgård
2013-12-12 14:52               ` Maxime Bizon
2013-12-12 14:58                 ` Måns Rullgård
2013-12-12 15:00                 ` Russell King - ARM Linux
2013-12-12 15:26                   ` Maxime Bizon
2013-12-12 15:07               ` Willy Tarreau
2013-12-12 15:18                 ` Måns Rullgård
2013-12-12 15:28                   ` Willy Tarreau
2013-12-12 15:43                     ` Russell King - ARM Linux
2013-12-12 15:50                       ` Måns Rullgård
2013-12-12 14:37           ` Måns Rullgård
2013-12-12 14:40             ` Maxime Bizon
2013-12-12 14:47               ` Måns Rullgård
2013-12-12 14:26         ` Måns Rullgård
2013-12-12 14:48     ` Russell King - ARM Linux
2013-12-12 15:00       ` Måns Rullgård
2013-12-12 15:04       ` Maxime Bizon
2013-12-12 15:41         ` Russell King - ARM Linux
2013-12-12 16:04           ` Måns Rullgård
2013-12-12 16:04           ` Willy Tarreau
2013-12-12 16:47             ` Russell King - ARM Linux
2013-12-12 17:11               ` Willy Tarreau
2013-12-12 17:20                 ` Russell King - ARM Linux
2013-12-12 17:35                   ` Willy Tarreau
2013-12-12 18:07                   ` Nicolas Pitre
2013-12-12 22:30               ` Maxime Bizon
2013-12-12 22:36                 ` Russell King - ARM Linux
2013-12-12 22:44                   ` Maxime Bizon
2013-12-12 22:48                     ` Russell King - ARM Linux
2013-12-12 17:34           ` Maxime Bizon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.