linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
@ 2021-03-09  4:18 Tiezhu Yang
  2021-03-14 20:49 ` Maciej W. Rozycki
  2021-03-15 10:24 ` Alexander Lobakin
  0 siblings, 2 replies; 10+ messages in thread
From: Tiezhu Yang @ 2021-03-09  4:18 UTC (permalink / raw)
  To: Thomas Bogendoerfer; +Cc: linux-mips, linux-kernel, Xuefeng Li, David Laight

The asm code in csum_tcpudp_nofold() is performance-critical, I am sorry
for the poorly considered implementation about the performance influence
with GCC in the commit 198688edbf77 ("MIPS: Fix inline asm input/output
type mismatch in checksum.h used with Clang").

With this patch, we can build successfully by both GCC and Clang,
at the same time, we can avoid the potential performance influence
with GCC.

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
---
 arch/mips/include/asm/checksum.h | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
index 1e6c135..80eddd4 100644
--- a/arch/mips/include/asm/checksum.h
+++ b/arch/mips/include/asm/checksum.h
@@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 
 static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
 					__u32 len, __u8 proto,
-					__wsum sum)
+					__wsum sum_in)
 {
-	unsigned long tmp = (__force unsigned long)sum;
+#ifdef __clang__
+	unsigned long sum = (__force unsigned long)sum_in;
+#else
+	__wsum sum = sum_in;
+#endif
 
 	__asm__(
 	"	.set	push		# csum_tcpudp_nofold\n"
@@ -159,7 +163,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
 	"	addu	%0, $1		\n"
 #endif
 	"	.set	pop"
-	: "=r" (tmp)
+	: "=r" (sum)
 	: "0" ((__force unsigned long)daddr),
 	  "r" ((__force unsigned long)saddr),
 #ifdef __MIPSEL__
@@ -169,7 +173,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
 #endif
 	  "r" ((__force unsigned long)sum));
 
-	return (__force __wsum)tmp;
+	return (__force __wsum)sum;
 }
 #define csum_tcpudp_nofold csum_tcpudp_nofold
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-09  4:18 [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold() Tiezhu Yang
@ 2021-03-14 20:49 ` Maciej W. Rozycki
  2021-03-15 12:26   ` Tiezhu Yang
  2021-03-15 10:24 ` Alexander Lobakin
  1 sibling, 1 reply; 10+ messages in thread
From: Maciej W. Rozycki @ 2021-03-14 20:49 UTC (permalink / raw)
  To: Tiezhu Yang
  Cc: Thomas Bogendoerfer, linux-mips, linux-kernel, Xuefeng Li, David Laight

On Tue, 9 Mar 2021, Tiezhu Yang wrote:

> diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
> index 1e6c135..80eddd4 100644
> --- a/arch/mips/include/asm/checksum.h
> +++ b/arch/mips/include/asm/checksum.h
> @@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  
>  static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>  					__u32 len, __u8 proto,
> -					__wsum sum)
> +					__wsum sum_in)
>  {
> -	unsigned long tmp = (__force unsigned long)sum;
> +#ifdef __clang__
> +	unsigned long sum = (__force unsigned long)sum_in;
> +#else
> +	__wsum sum = sum_in;
> +#endif

 This looks much better to me, but I'd keep the variable names unchanged 
as `sum_in' isn't used beyond the initial assignment anyway (you'll have 
to update the references with asm operands accordingly of course).

 Have you verified that code produced by GCC remains the same with your 
change in place as it used to be up to commit 198688edbf77?  I can see no
such information in the commit description whether here or in the said 
commit.

  Maciej

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-09  4:18 [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold() Tiezhu Yang
  2021-03-14 20:49 ` Maciej W. Rozycki
@ 2021-03-15 10:24 ` Alexander Lobakin
  2021-03-15 12:10   ` Tiezhu Yang
  1 sibling, 1 reply; 10+ messages in thread
From: Alexander Lobakin @ 2021-03-15 10:24 UTC (permalink / raw)
  To: Tiezhu Yang
  Cc: Alexander Lobakin, Thomas Bogendoerfer, Maciej W. Rozycki,
	linux-mips, linux-kernel, Xuefeng Li, David Laight

From: Tiezhu Yang <yangtiezhu@loongson.cn>
Date: Tue, 9 Mar 2021 12:18:13 +0800

> The asm code in csum_tcpudp_nofold() is performance-critical, I am sorry
> for the poorly considered implementation about the performance influence
> with GCC in the commit 198688edbf77 ("MIPS: Fix inline asm input/output
> type mismatch in checksum.h used with Clang").
>
> With this patch, we can build successfully by both GCC and Clang,
> at the same time, we can avoid the potential performance influence
> with GCC.
>
> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
> ---
>  arch/mips/include/asm/checksum.h | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
> index 1e6c135..80eddd4 100644
> --- a/arch/mips/include/asm/checksum.h
> +++ b/arch/mips/include/asm/checksum.h
> @@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>
>  static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>  					__u32 len, __u8 proto,
> -					__wsum sum)
> +					__wsum sum_in)
>  {
> -	unsigned long tmp = (__force unsigned long)sum;
> +#ifdef __clang__

Why not rely on CONFIG_CC_IS_CLANG here?

> +	unsigned long sum = (__force unsigned long)sum_in;
> +#else
> +	__wsum sum = sum_in;
> +#endif
>
>  	__asm__(
>  	"	.set	push		# csum_tcpudp_nofold\n"
> @@ -159,7 +163,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>  	"	addu	%0, $1		\n"
>  #endif
>  	"	.set	pop"
> -	: "=r" (tmp)
> +	: "=r" (sum)
>  	: "0" ((__force unsigned long)daddr),
>  	  "r" ((__force unsigned long)saddr),
>  #ifdef __MIPSEL__
> @@ -169,7 +173,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>  #endif
>  	  "r" ((__force unsigned long)sum));
>
> -	return (__force __wsum)tmp;
> +	return (__force __wsum)sum;
>  }
>  #define csum_tcpudp_nofold csum_tcpudp_nofold
>
> --
> 2.1.0

Al


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-15 10:24 ` Alexander Lobakin
@ 2021-03-15 12:10   ` Tiezhu Yang
  0 siblings, 0 replies; 10+ messages in thread
From: Tiezhu Yang @ 2021-03-15 12:10 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Thomas Bogendoerfer, Maciej W. Rozycki, linux-mips, linux-kernel,
	Xuefeng Li, David Laight

On 03/15/2021 06:24 PM, Alexander Lobakin wrote:
> From: Tiezhu Yang <yangtiezhu@loongson.cn>
> Date: Tue, 9 Mar 2021 12:18:13 +0800
>
>> The asm code in csum_tcpudp_nofold() is performance-critical, I am sorry
>> for the poorly considered implementation about the performance influence
>> with GCC in the commit 198688edbf77 ("MIPS: Fix inline asm input/output
>> type mismatch in checksum.h used with Clang").
>>
>> With this patch, we can build successfully by both GCC and Clang,
>> at the same time, we can avoid the potential performance influence
>> with GCC.
>>
>> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
>> ---
>>   arch/mips/include/asm/checksum.h | 12 ++++++++----
>>   1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
>> index 1e6c135..80eddd4 100644
>> --- a/arch/mips/include/asm/checksum.h
>> +++ b/arch/mips/include/asm/checksum.h
>> @@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>>
>>   static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>>   					__u32 len, __u8 proto,
>> -					__wsum sum)
>> +					__wsum sum_in)
>>   {
>> -	unsigned long tmp = (__force unsigned long)sum;
>> +#ifdef __clang__
> Why not rely on CONFIG_CC_IS_CLANG here?

Hi,

Thanks for your suggestion, I once considered that way:
https://lore.kernel.org/patchwork/patch/1371666/#1587127

But it still occurs build error under CC_IS_GCC when
make M=samples/bpf which used with Clang compiler,
so use __clang__ is better.

Thanks,
Tiezhu

>
>> +	unsigned long sum = (__force unsigned long)sum_in;
>> +#else
>> +	__wsum sum = sum_in;
>> +#endif
>>
>>   	__asm__(
>>   	"	.set	push		# csum_tcpudp_nofold\n"
>> @@ -159,7 +163,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>>   	"	addu	%0, $1		\n"
>>   #endif
>>   	"	.set	pop"
>> -	: "=r" (tmp)
>> +	: "=r" (sum)
>>   	: "0" ((__force unsigned long)daddr),
>>   	  "r" ((__force unsigned long)saddr),
>>   #ifdef __MIPSEL__
>> @@ -169,7 +173,7 @@ static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>>   #endif
>>   	  "r" ((__force unsigned long)sum));
>>
>> -	return (__force __wsum)tmp;
>> +	return (__force __wsum)sum;
>>   }
>>   #define csum_tcpudp_nofold csum_tcpudp_nofold
>>
>> --
>> 2.1.0
> Al


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-14 20:49 ` Maciej W. Rozycki
@ 2021-03-15 12:26   ` Tiezhu Yang
  2021-03-15 12:42     ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Tiezhu Yang @ 2021-03-15 12:26 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Thomas Bogendoerfer, linux-mips, linux-kernel, Xuefeng Li, David Laight

On 03/15/2021 04:49 AM, Maciej W. Rozycki wrote:
> On Tue, 9 Mar 2021, Tiezhu Yang wrote:
>
>> diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
>> index 1e6c135..80eddd4 100644
>> --- a/arch/mips/include/asm/checksum.h
>> +++ b/arch/mips/include/asm/checksum.h
>> @@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>>   
>>   static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>>   					__u32 len, __u8 proto,
>> -					__wsum sum)
>> +					__wsum sum_in)
>>   {
>> -	unsigned long tmp = (__force unsigned long)sum;
>> +#ifdef __clang__
>> +	unsigned long sum = (__force unsigned long)sum_in;
>> +#else
>> +	__wsum sum = sum_in;
>> +#endif
>   This looks much better to me, but I'd keep the variable names unchanged
> as `sum_in' isn't used beyond the initial assignment anyway (you'll have
> to update the references with asm operands accordingly of course).
>
>   Have you verified that code produced by GCC remains the same with your
> change in place as it used to be up to commit 198688edbf77?  I can see no
> such information in the commit description whether here or in the said
> commit.
>
>    Maciej

Hi Maciej,

Thanks for your reply.

gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110

net/ipv4/tcp_ipv4.c
tcp_v4_send_reset()
   csum_tcpudp_nofold()

objdump -d vmlinux > vmlinux.dump

(1) Before commit 198688edbf77
("MIPS: Fix inline asm input/output type mismatch in checksum.h used 
with Clang"):

ffffffff80aa835c:       00004025        move    a4,zero
ffffffff80aa8360:       92020012        lbu     v0,18(s0)
ffffffff80aa8364:       de140030        ld      s4,48(s0)
ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
ffffffff80aa8378:       afa70038        sw      a3,56(sp)
ffffffff80aa837c:       24071a00        li      a3,6656
ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
ffffffff80aa8390:       0081202d        daddu   a0,a0,at
ffffffff80aa8394:       0081082b        sltu    at,a0,at
ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
ffffffff80aa839c:       00812021        addu    a0,a0,at

(2) After commit 198688edbf77
("MIPS: Fix inline asm input/output type mismatch in checksum.h used 
with Clang"):

ffffffff80aa836c:       00004025        move    a4,zero
ffffffff80aa8370:       92040012        lbu     a0,18(s0)
ffffffff80aa8374:       de140030        ld      s4,48(s0)
ffffffff80aa8378:       0062182d        daddu   v1,v1,v0
ffffffff80aa837c:       308400ff        andi    a0,a0,0xff
ffffffff80aa8380:       9c62000c        lwu     v0,12(v1)
ffffffff80aa8384:       9c660010        lwu     a2,16(v1)
ffffffff80aa8388:       afa70038        sw      a3,56(sp)
ffffffff80aa838c:       24071a00        li      a3,6656
ffffffff80aa8390:       0046102d        daddu   v0,v0,a2
ffffffff80aa8394:       0047102d        daddu   v0,v0,a3
ffffffff80aa8398:       0048102d        daddu   v0,v0,a4
ffffffff80aa839c:       0002083c        dsll32  at,v0,0x0
ffffffff80aa83a0:       0041102d        daddu   v0,v0,at
ffffffff80aa83a4:       0041082b        sltu    at,v0,at
ffffffff80aa83a8:       0002103f        dsra32  v0,v0,0x0
ffffffff80aa83ac:       00411021        addu    v0,v0,at

(3) With this patch:

ffffffff80aa835c:       00004025        move    a4,zero
ffffffff80aa8360:       92020012        lbu     v0,18(s0)
ffffffff80aa8364:       de140030        ld      s4,48(s0)
ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
ffffffff80aa8378:       afa70038        sw      a3,56(sp)
ffffffff80aa837c:       24071a00        li      a3,6656
ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
ffffffff80aa8390:       0081202d        daddu   a0,a0,at
ffffffff80aa8394:       0081082b        sltu    at,a0,at
ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
ffffffff80aa839c:       00812021        addu    a0,a0,at

(4) With the following changes based on commit 198688edbf77
("MIPS: Fix inline asm input/output type mismatch in checksum.h used 
with Clang"):

diff --git a/arch/mips/include/asm/checksum.h 
b/arch/mips/include/asm/checksum.h
index 1e6c135..e1f80407 100644
--- a/arch/mips/include/asm/checksum.h
+++ b/arch/mips/include/asm/checksum.h
@@ -130,7 +130,11 @@ static inline __wsum csum_tcpudp_nofold(__be32 
saddr, __be32 daddr,
                      __u32 len, __u8 proto,
                      __wsum sum)
  {
+#ifdef __clang__
      unsigned long tmp = (__force unsigned long)sum;
+#else
+    __wsum tmp = sum;
+#endif

      __asm__(
      "    .set    push        # csum_tcpudp_nofold\n"

ffffffff80aa835c:       00004025        move    a4,zero
ffffffff80aa8360:       92020012        lbu     v0,18(s0)
ffffffff80aa8364:       de140030        ld      s4,48(s0)
ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
ffffffff80aa8378:       afa70038        sw      a3,56(sp)
ffffffff80aa837c:       24071a00        li      a3,6656
ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
ffffffff80aa8390:       0081202d        daddu   a0,a0,at
ffffffff80aa8394:       0081082b        sltu    at,a0,at
ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
ffffffff80aa839c:       00812021        addu    a0,a0,at

The code produced by GCC remains the same between (1), (3) and (4),
the last changes looks like better (with less changes based on commit
198688edbf77), so I will send v3 later.

Thanks,
Tiezhu


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* RE: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-15 12:26   ` Tiezhu Yang
@ 2021-03-15 12:42     ` David Laight
  2021-03-17  0:26       ` Tiezhu Yang
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2021-03-15 12:42 UTC (permalink / raw)
  To: 'Tiezhu Yang', Maciej W. Rozycki
  Cc: Thomas Bogendoerfer, linux-mips, linux-kernel, Xuefeng Li

From: Tiezhu Yang <yangtiezhu@loongson.cn>
> Sent: 15 March 2021 12:26
> On 03/15/2021 04:49 AM, Maciej W. Rozycki wrote:
> > On Tue, 9 Mar 2021, Tiezhu Yang wrote:
> >
> >> diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
> >> index 1e6c135..80eddd4 100644
> >> --- a/arch/mips/include/asm/checksum.h
> >> +++ b/arch/mips/include/asm/checksum.h
> >> @@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
> >>
> >>   static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
> >>   					__u32 len, __u8 proto,
> >> -					__wsum sum)
> >> +					__wsum sum_in)
> >>   {
> >> -	unsigned long tmp = (__force unsigned long)sum;
> >> +#ifdef __clang__
> >> +	unsigned long sum = (__force unsigned long)sum_in;
> >> +#else
> >> +	__wsum sum = sum_in;
> >> +#endif
> >   This looks much better to me, but I'd keep the variable names unchanged
> > as `sum_in' isn't used beyond the initial assignment anyway (you'll have
> > to update the references with asm operands accordingly of course).
> >
> >   Have you verified that code produced by GCC remains the same with your
> > change in place as it used to be up to commit 198688edbf77?  I can see no
> > such information in the commit description whether here or in the said
> > commit.
> >
> >    Maciej
> 
> Hi Maciej,
> 
> Thanks for your reply.
> 
> gcc --version
> gcc (Debian 10.2.1-6) 10.2.1 20210110
> 
> net/ipv4/tcp_ipv4.c
> tcp_v4_send_reset()
>    csum_tcpudp_nofold()
> 
> objdump -d vmlinux > vmlinux.dump
> 
> (1) Before commit 198688edbf77
> ("MIPS: Fix inline asm input/output type mismatch in checksum.h used
> with Clang"):
> 
> ffffffff80aa835c:       00004025        move    a4,zero
> ffffffff80aa8360:       92020012        lbu     v0,18(s0)
> ffffffff80aa8364:       de140030        ld      s4,48(s0)
> ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
> ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
> ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
> ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
> ffffffff80aa8378:       afa70038        sw      a3,56(sp)
> ffffffff80aa837c:       24071a00        li      a3,6656
> ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
> ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
> ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
> ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
> ffffffff80aa8390:       0081202d        daddu   a0,a0,at
> ffffffff80aa8394:       0081082b        sltu    at,a0,at
> ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
> ffffffff80aa839c:       00812021        addu    a0,a0,at
> 
> (2) After commit 198688edbf77
> ("MIPS: Fix inline asm input/output type mismatch in checksum.h used
> with Clang"):
> 
> ffffffff80aa836c:       00004025        move    a4,zero
> ffffffff80aa8370:       92040012        lbu     a0,18(s0)
> ffffffff80aa8374:       de140030        ld      s4,48(s0)
> ffffffff80aa8378:       0062182d        daddu   v1,v1,v0
> ffffffff80aa837c:       308400ff        andi    a0,a0,0xff
> ffffffff80aa8380:       9c62000c        lwu     v0,12(v1)
> ffffffff80aa8384:       9c660010        lwu     a2,16(v1)
> ffffffff80aa8388:       afa70038        sw      a3,56(sp)
> ffffffff80aa838c:       24071a00        li      a3,6656
> ffffffff80aa8390:       0046102d        daddu   v0,v0,a2
> ffffffff80aa8394:       0047102d        daddu   v0,v0,a3
> ffffffff80aa8398:       0048102d        daddu   v0,v0,a4
> ffffffff80aa839c:       0002083c        dsll32  at,v0,0x0
> ffffffff80aa83a0:       0041102d        daddu   v0,v0,at
> ffffffff80aa83a4:       0041082b        sltu    at,v0,at
> ffffffff80aa83a8:       0002103f        dsra32  v0,v0,0x0
> ffffffff80aa83ac:       00411021        addu    v0,v0,at
> 
> (3) With this patch:
> 
> ffffffff80aa835c:       00004025        move    a4,zero
> ffffffff80aa8360:       92020012        lbu     v0,18(s0)
> ffffffff80aa8364:       de140030        ld      s4,48(s0)
> ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
> ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
> ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
> ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
> ffffffff80aa8378:       afa70038        sw      a3,56(sp)
> ffffffff80aa837c:       24071a00        li      a3,6656
> ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
> ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
> ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
> ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
> ffffffff80aa8390:       0081202d        daddu   a0,a0,at
> ffffffff80aa8394:       0081082b        sltu    at,a0,at
> ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
> ffffffff80aa839c:       00812021        addu    a0,a0,at
> 
> (4) With the following changes based on commit 198688edbf77
> ("MIPS: Fix inline asm input/output type mismatch in checksum.h used
> with Clang"):
> 
> diff --git a/arch/mips/include/asm/checksum.h
> b/arch/mips/include/asm/checksum.h
> index 1e6c135..e1f80407 100644
> --- a/arch/mips/include/asm/checksum.h
> +++ b/arch/mips/include/asm/checksum.h
> @@ -130,7 +130,11 @@ static inline __wsum csum_tcpudp_nofold(__be32
> saddr, __be32 daddr,
>                       __u32 len, __u8 proto,
>                       __wsum sum)
>   {
> +#ifdef __clang__
>       unsigned long tmp = (__force unsigned long)sum;
> +#else
> +    __wsum tmp = sum;
> +#endif
> 
>       __asm__(
>       "    .set    push        # csum_tcpudp_nofold\n"
> 
> ffffffff80aa835c:       00004025        move    a4,zero
> ffffffff80aa8360:       92020012        lbu     v0,18(s0)
> ffffffff80aa8364:       de140030        ld      s4,48(s0)
> ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
> ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
> ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
> ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
> ffffffff80aa8378:       afa70038        sw      a3,56(sp)
> ffffffff80aa837c:       24071a00        li      a3,6656
> ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
> ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
> ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
> ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
> ffffffff80aa8390:       0081202d        daddu   a0,a0,at
> ffffffff80aa8394:       0081082b        sltu    at,a0,at
> ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
> ffffffff80aa839c:       00812021        addu    a0,a0,at
> 
> The code produced by GCC remains the same between (1), (3) and (4),
> the last changes looks like better (with less changes based on commit
> 198688edbf77), so I will send v3 later.

Aren't those all the same - apart from register selection.
Not that I grok the mips opcodes.
But that code has horridness on its side.

The only obvious difference is that something else changes the
code offset from xxxx5c to xxxx6c.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-15 12:42     ` David Laight
@ 2021-03-17  0:26       ` Tiezhu Yang
  2021-03-17 15:35         ` Maciej W. Rozycki
  0 siblings, 1 reply; 10+ messages in thread
From: Tiezhu Yang @ 2021-03-17  0:26 UTC (permalink / raw)
  To: David Laight, Maciej W. Rozycki
  Cc: Thomas Bogendoerfer, linux-mips, linux-kernel, Xuefeng Li

On 03/15/2021 08:42 PM, David Laight wrote:
> From: Tiezhu Yang <yangtiezhu@loongson.cn>
>> Sent: 15 March 2021 12:26
>> On 03/15/2021 04:49 AM, Maciej W. Rozycki wrote:
>>> On Tue, 9 Mar 2021, Tiezhu Yang wrote:
>>>
>>>> diff --git a/arch/mips/include/asm/checksum.h b/arch/mips/include/asm/checksum.h
>>>> index 1e6c135..80eddd4 100644
>>>> --- a/arch/mips/include/asm/checksum.h
>>>> +++ b/arch/mips/include/asm/checksum.h
>>>> @@ -128,9 +128,13 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>>>>
>>>>    static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
>>>>    					__u32 len, __u8 proto,
>>>> -					__wsum sum)
>>>> +					__wsum sum_in)
>>>>    {
>>>> -	unsigned long tmp = (__force unsigned long)sum;
>>>> +#ifdef __clang__
>>>> +	unsigned long sum = (__force unsigned long)sum_in;
>>>> +#else
>>>> +	__wsum sum = sum_in;
>>>> +#endif
>>>    This looks much better to me, but I'd keep the variable names unchanged
>>> as `sum_in' isn't used beyond the initial assignment anyway (you'll have
>>> to update the references with asm operands accordingly of course).
>>>
>>>    Have you verified that code produced by GCC remains the same with your
>>> change in place as it used to be up to commit 198688edbf77?  I can see no
>>> such information in the commit description whether here or in the said
>>> commit.
>>>
>>>     Maciej
>> Hi Maciej,
>>
>> Thanks for your reply.
>>
>> gcc --version
>> gcc (Debian 10.2.1-6) 10.2.1 20210110
>>
>> net/ipv4/tcp_ipv4.c
>> tcp_v4_send_reset()
>>     csum_tcpudp_nofold()
>>
>> objdump -d vmlinux > vmlinux.dump
>>
>> (1) Before commit 198688edbf77
>> ("MIPS: Fix inline asm input/output type mismatch in checksum.h used
>> with Clang"):
>>
>> ffffffff80aa835c:       00004025        move    a4,zero
>> ffffffff80aa8360:       92020012        lbu     v0,18(s0)
>> ffffffff80aa8364:       de140030        ld      s4,48(s0)
>> ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
>> ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
>> ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
>> ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
>> ffffffff80aa8378:       afa70038        sw      a3,56(sp)
>> ffffffff80aa837c:       24071a00        li      a3,6656
>> ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
>> ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
>> ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
>> ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
>> ffffffff80aa8390:       0081202d        daddu   a0,a0,at
>> ffffffff80aa8394:       0081082b        sltu    at,a0,at
>> ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
>> ffffffff80aa839c:       00812021        addu    a0,a0,at
>>
>> (2) After commit 198688edbf77
>> ("MIPS: Fix inline asm input/output type mismatch in checksum.h used
>> with Clang"):
>>
>> ffffffff80aa836c:       00004025        move    a4,zero
>> ffffffff80aa8370:       92040012        lbu     a0,18(s0)
>> ffffffff80aa8374:       de140030        ld      s4,48(s0)
>> ffffffff80aa8378:       0062182d        daddu   v1,v1,v0
>> ffffffff80aa837c:       308400ff        andi    a0,a0,0xff
>> ffffffff80aa8380:       9c62000c        lwu     v0,12(v1)
>> ffffffff80aa8384:       9c660010        lwu     a2,16(v1)
>> ffffffff80aa8388:       afa70038        sw      a3,56(sp)
>> ffffffff80aa838c:       24071a00        li      a3,6656
>> ffffffff80aa8390:       0046102d        daddu   v0,v0,a2
>> ffffffff80aa8394:       0047102d        daddu   v0,v0,a3
>> ffffffff80aa8398:       0048102d        daddu   v0,v0,a4
>> ffffffff80aa839c:       0002083c        dsll32  at,v0,0x0
>> ffffffff80aa83a0:       0041102d        daddu   v0,v0,at
>> ffffffff80aa83a4:       0041082b        sltu    at,v0,at
>> ffffffff80aa83a8:       0002103f        dsra32  v0,v0,0x0
>> ffffffff80aa83ac:       00411021        addu    v0,v0,at
>>
>> (3) With this patch:
>>
>> ffffffff80aa835c:       00004025        move    a4,zero
>> ffffffff80aa8360:       92020012        lbu     v0,18(s0)
>> ffffffff80aa8364:       de140030        ld      s4,48(s0)
>> ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
>> ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
>> ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
>> ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
>> ffffffff80aa8378:       afa70038        sw      a3,56(sp)
>> ffffffff80aa837c:       24071a00        li      a3,6656
>> ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
>> ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
>> ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
>> ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
>> ffffffff80aa8390:       0081202d        daddu   a0,a0,at
>> ffffffff80aa8394:       0081082b        sltu    at,a0,at
>> ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
>> ffffffff80aa839c:       00812021        addu    a0,a0,at
>>
>> (4) With the following changes based on commit 198688edbf77
>> ("MIPS: Fix inline asm input/output type mismatch in checksum.h used
>> with Clang"):
>>
>> diff --git a/arch/mips/include/asm/checksum.h
>> b/arch/mips/include/asm/checksum.h
>> index 1e6c135..e1f80407 100644
>> --- a/arch/mips/include/asm/checksum.h
>> +++ b/arch/mips/include/asm/checksum.h
>> @@ -130,7 +130,11 @@ static inline __wsum csum_tcpudp_nofold(__be32
>> saddr, __be32 daddr,
>>                        __u32 len, __u8 proto,
>>                        __wsum sum)
>>    {
>> +#ifdef __clang__
>>        unsigned long tmp = (__force unsigned long)sum;
>> +#else
>> +    __wsum tmp = sum;
>> +#endif
>>
>>        __asm__(
>>        "    .set    push        # csum_tcpudp_nofold\n"
>>
>> ffffffff80aa835c:       00004025        move    a4,zero
>> ffffffff80aa8360:       92020012        lbu     v0,18(s0)
>> ffffffff80aa8364:       de140030        ld      s4,48(s0)
>> ffffffff80aa8368:       0064182d        daddu   v1,v1,a0
>> ffffffff80aa836c:       304200ff        andi    v0,v0,0xff
>> ffffffff80aa8370:       9c64000c        lwu     a0,12(v1)
>> ffffffff80aa8374:       9c660010        lwu     a2,16(v1)
>> ffffffff80aa8378:       afa70038        sw      a3,56(sp)
>> ffffffff80aa837c:       24071a00        li      a3,6656
>> ffffffff80aa8380:       0086202d        daddu   a0,a0,a2
>> ffffffff80aa8384:       0087202d        daddu   a0,a0,a3
>> ffffffff80aa8388:       0088202d        daddu   a0,a0,a4
>> ffffffff80aa838c:       0004083c        dsll32  at,a0,0x0
>> ffffffff80aa8390:       0081202d        daddu   a0,a0,at
>> ffffffff80aa8394:       0081082b        sltu    at,a0,at
>> ffffffff80aa8398:       0004203f        dsra32  a0,a0,0x0
>> ffffffff80aa839c:       00812021        addu    a0,a0,at
>>
>> The code produced by GCC remains the same between (1), (3) and (4),
>> the last changes looks like better (with less changes based on commit
>> 198688edbf77), so I will send v3 later.
> Aren't those all the same - apart from register selection.
> Not that I grok the mips opcodes.
> But that code has horridness on its side.
>
> The only obvious difference is that something else changes the
> code offset from xxxx5c to xxxx6c.
>
> 	David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

Hi David,

Yes, it seems no much obvious differences.
Let me wait for other feedback.

Hi Thomas and Maciej,

Is this patch necessary? If no, we can ignore it.
If yes, I will send v3 with the above (4) changes.

Thanks,
Tiezhu


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-17  0:26       ` Tiezhu Yang
@ 2021-03-17 15:35         ` Maciej W. Rozycki
  2021-03-17 16:09           ` David Laight
  0 siblings, 1 reply; 10+ messages in thread
From: Maciej W. Rozycki @ 2021-03-17 15:35 UTC (permalink / raw)
  To: Tiezhu Yang
  Cc: David Laight, Thomas Bogendoerfer, linux-mips, linux-kernel, Xuefeng Li

On Wed, 17 Mar 2021, Tiezhu Yang wrote:

> > > The code produced by GCC remains the same between (1), (3) and (4),
> > > the last changes looks like better (with less changes based on commit
> > > 198688edbf77), so I will send v3 later.
> > Aren't those all the same - apart from register selection.

 I did a quick size check with my MIPS64 config between (1) and (2):

       text       data        bss      total filename
-   6129704    2997475     215280    9342459 vmlinux
+   6129752    2997475     215280    9342507 vmlinux

so obviously there's more to that and the code snippets quoted did not 
show the full picture.  The size difference is in `tcp_v4_send_reset' 
AFAICT, maybe elsewhere as well.

 FAOD the former binary is as at 198688edbf77^ and the latter is as at 
198688edbf77, and the command used was `size --format=gnu' to prevent 
`.rodata' from interfering with the text size.

> > Not that I grok the mips opcodes.
> > But that code has horridness on its side.

 It's a 32-bit one's-complement addition.  The use of 64-bit operations 
reduces the number of calculations as any 32-bit carries accumulate in the 
high 32-bit word allowing one instruction to be saved total compared to 
the 32-bit variant.  Nothing particularly unusual for me here; I've seen 
worse stuff with x86.

> Is this patch necessary? If no, we can ignore it.
> If yes, I will send v3 with the above (4) changes.

 I have also verified this patch directly on top of 198688edbf77 and it 
does not bring the old `vmlinux' size back (there's no difference in code 
produced at all here actually), so you need to investigate this further.  
This is with an older checkout of GCC 11.

 I've attached a stripped version of my .config that you can use for 
reference.

  Maciej

^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-17 15:35         ` Maciej W. Rozycki
@ 2021-03-17 16:09           ` David Laight
  2021-03-17 21:36             ` Maciej W. Rozycki
  0 siblings, 1 reply; 10+ messages in thread
From: David Laight @ 2021-03-17 16:09 UTC (permalink / raw)
  To: 'Maciej W. Rozycki', Tiezhu Yang
  Cc: Thomas Bogendoerfer, linux-mips, linux-kernel, Xuefeng Li

From: Maciej W. Rozycki
> Sent: 17 March 2021 15:36
..
> > > Not that I grok the mips opcodes.
> > > But that code has horridness on its side.
> 
>  It's a 32-bit one's-complement addition.  The use of 64-bit operations
> reduces the number of calculations as any 32-bit carries accumulate in the
> high 32-bit word allowing one instruction to be saved total compared to
> the 32-bit variant.  Nothing particularly unusual for me here; I've seen
> worse stuff with x86.

The 'problem' is that mips doesn't have a carry flag.
So the 64-bit maths is 'tricky'.
It may well be that a loop based on:
	do {
		val = *ptr++;
		sum += val;
		carry += sum < val;
	} while (ptr != limit)
will generate much better code.
I think there is a 'setlt' instruction for the compare.
It certainly would on the nios (which is mips-like).
That is (probably) 6 instructions for 4 bytes.
I suspect there may be a data stall after the memory read.
So an interleaved unroll would remove that stall.
That would be 10 clocks for 8 bytes.

The x86-64 code is 'interesting'.
It has repeated 'add carry' instructions.
On Intel cpus prior to (at least) Haswell they take two clocks each.
So the code is no faster than adding 32bit values to a 64bit sum.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold()
  2021-03-17 16:09           ` David Laight
@ 2021-03-17 21:36             ` Maciej W. Rozycki
  0 siblings, 0 replies; 10+ messages in thread
From: Maciej W. Rozycki @ 2021-03-17 21:36 UTC (permalink / raw)
  To: David Laight
  Cc: Tiezhu Yang, Thomas Bogendoerfer, linux-mips, linux-kernel, Xuefeng Li

On Wed, 17 Mar 2021, David Laight wrote:

> > > > Not that I grok the mips opcodes.
> > > > But that code has horridness on its side.
> > 
> >  It's a 32-bit one's-complement addition.  The use of 64-bit operations
> > reduces the number of calculations as any 32-bit carries accumulate in the
> > high 32-bit word allowing one instruction to be saved total compared to
> > the 32-bit variant.  Nothing particularly unusual for me here; I've seen
> > worse stuff with x86.
> 
> The 'problem' is that mips doesn't have a carry flag.
> So the 64-bit maths is 'tricky'.
> It may well be that a loop based on:
> 	do {
> 		val = *ptr++;
> 		sum += val;
> 		carry += sum < val;
> 	} while (ptr != limit)
> will generate much better code.

 This piece of assembly appears to me as good as you can get, but it is 
somewhat dated, going back to 1999 and LMO commit 0458ce25ec4e ("Fix 
MIPS64 IP checksums.") as far as the 64-bit variant is concerned, with a 
much later bug fix applied for a corner case with commit 66fd848cadaa 
("MIPS: Fix special case in 64 bit IP checksumming.") back in 2017 (!), 
which added two instructions.  It may well be that GCC has since improved 
and would produce code of similar or better quality.  Anyone is welcome to 
try it of course and submit a patch if it turns out beneficial.

 NB I note there's an earlyclobber specifier missing on the output 
operand, lost with the merge of the separate mips and mips64 ports with 
LMO commit a69fb3990ea9 ("Goodbye mips64.  31704 lines of code bite the 
dust."), back in 2003, which I previously added myself, back in 2002 with 
LMO commit acc75ed18471 ("Bug fixes for constraints and type casts.").  
Sigh.  I think the overlap with operand #1 has prevented damage from 
happening as a result of the missing specifier, but IMO it would best be 
reinstated just in case.

> I think there is a 'setlt' instruction for the compare.

 Yes and it's used in the piece of code quoted (SLT).

> It certainly would on the nios (which is mips-like).
> That is (probably) 6 instructions for 4 bytes.
> I suspect there may be a data stall after the memory read.
> So an interleaved unroll would remove that stall.
> That would be 10 clocks for 8 bytes.

 There's no memory read involved in this code; of course there could be in 
the paster of the inline function as shown in the piece shown by Tiezhu.  
There are a couple of instructions in between though, which should keep 
the pipeline occupied as long as data retrieved is hot in the cache 
(prefetch instructions could be used to assist with that).

 OTOH if data were to come from the main memory, I doubt anything could 
help here.

> The x86-64 code is 'interesting'.
> It has repeated 'add carry' instructions.
> On Intel cpus prior to (at least) Haswell they take two clocks each.
> So the code is no faster than adding 32bit values to a 64bit sum.

 GCC has DFA pipeline descriptions for various MIPS processors, so the 
details may vary, but it is assumed overall that plain ALU operations 
cause no pipeline stalls or slips, so nominally they take 1 pipeline clock 
each for scalar implementations.  This includes all the instructions in 
the piece of code discussed here.

 (Actual clock counts will of course depend on the number of pipeline 
stages a given microarchitecture implements and are not a concern here. 
The earliest MIPS chips had as few as four stages only and the inclusion 
of branch and load delay slots prevented pipeline slips from occuring or 
even actually being implemented, as in what MIPS used to stand for, that 
is Microprocessor without Interlocked Pipeline Stages.  Stalls could still 
happen of course in the case of cache misses.)

  Maciej

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-03-17 21:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-09  4:18 [PATCH v2] MIPS: Check __clang__ to avoid performance influence with GCC in csum_tcpudp_nofold() Tiezhu Yang
2021-03-14 20:49 ` Maciej W. Rozycki
2021-03-15 12:26   ` Tiezhu Yang
2021-03-15 12:42     ` David Laight
2021-03-17  0:26       ` Tiezhu Yang
2021-03-17 15:35         ` Maciej W. Rozycki
2021-03-17 16:09           ` David Laight
2021-03-17 21:36             ` Maciej W. Rozycki
2021-03-15 10:24 ` Alexander Lobakin
2021-03-15 12:10   ` Tiezhu Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).