linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C
@ 2022-05-22 20:48 Maciej W. Rozycki
  2022-05-23  9:40 ` Thomas Bogendoerfer
  2022-05-24 16:38 ` Florian Fainelli
  0 siblings, 2 replies; 6+ messages in thread
From: Maciej W. Rozycki @ 2022-05-22 20:48 UTC (permalink / raw)
  To: Thomas Bogendoerfer
  Cc: Paul Cercueil, Nathan Chancellor, Tiezhu Yang, linux-mips, linux-kernel

Recent commit 198688edbf77 ("MIPS: Fix inline asm input/output type 
mismatch in checksum.h used with Clang") introduced a code size and 
performance regression with 64-bit code emitted for `csum_tcpudp_nofold' 
by GCC, caused by a redundant truncation operation produced due to a 
data type change made to the variable associated with the inline 
assembly's output operand.

The intent previously expressed here with operands and constraints for 
optimal code was to have the output operand share a register with one 
inputs, both of a different integer type each.  This is perfectly valid 
with the MIPS psABI where a register can hold integer data of different 
types and the assembly code used here makes data stored in the output 
register match the data type used with the output operand, however it 
has turned out impossible to express this arrangement in source code 
such as to satisfy LLVM, apparently due to the compiler's internal 
limitations.

There is nothing peculiar about the inline assembly `csum_tcpudp_nofold' 
includes however, though it does choose assembly instructions carefully.

Rewrite this piece of assembly in plain C then, using corresponding C 
language operations, making GCC produce the same assembly instructions, 
possibly shuffled, in the general case and sometimes actually fewer of 
them where an input is constant, because the compiler does not have to 
reload it to a register (operand constraints could be adjusted for that, 
but the plain C approach is cleaner anyway).

Example code size changes are as follows, for a 32-bit configuration:

      text       data        bss      total filename
   5920480    1347236     126592    7394308 vmlinux-old
   5920480    1347236     126592    7394308 vmlinux-now
   5919728    1347236     126592    7393556 vmlinux-c

and for a 64-bit configuration:

      text       data        bss      total filename
   6024112    1790828     225728    8040668 vmlinux-old
   6024128    1790828     225728    8040684 vmlinux-now
   6023760    1790828     225728    8040316 vmlinux-c

respectively, where "old" is with the commit referred reverted, "now" is 
with no change, and "c" is with this change applied.

Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
---
Hi,

 I have visually inspected code produced and verified this change to boot 
with TCP networking performing just fine, both with a 32-bit and a 64-bit 
configuration.  Sadly with the little endianness only, because in the 
course of this verification I have discovered the core card of my Malta 
board bit the dust a few days ago, apparently in a permanent manner, and I 
have no other big-endian MIPS system available here to try.

 The only difference between the two endiannesses is the left-shift 
operation on (proto + len) however, which doesn't happen for big-endian 
configurations, so the little endianness should in principle provide 
enough coverage.

 Also I'm leaving it to LLVM folks to verify, however this is plain C, so 
it is expected to just work.

 Please apply.

  Maciej
---
 arch/mips/include/asm/checksum.h |   71 ++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 37 deletions(-)

linux-mips-csum-tcpudp-nofold-proper-fix.diff
Index: linux-macro/arch/mips/include/asm/checksum.h
===================================================================
--- linux-macro.orig/arch/mips/include/asm/checksum.h
+++ linux-macro/arch/mips/include/asm/checksum.h
@@ -128,48 +128,45 @@ static inline __sum16 ip_fast_csum(const
 
 static inline __wsum csum_tcpudp_nofold(__be32 saddr, __be32 daddr,
 					__u32 len, __u8 proto,
-					__wsum sum)
+					__wsum isum)
 {
-	unsigned long tmp = (__force unsigned long)sum;
+	const unsigned int sh32 = IS_ENABLED(CONFIG_64BIT) ? 32 : 0;
+	unsigned long sum = (__force unsigned long)daddr;
+	unsigned long tmp;
+	__u32 osum;
 
-	__asm__(
-	"	.set	push		# csum_tcpudp_nofold\n"
-	"	.set	noat		\n"
-#ifdef CONFIG_32BIT
-	"	addu	%0, %2		\n"
-	"	sltu	$1, %0, %2	\n"
-	"	addu	%0, $1		\n"
+	tmp = (__force unsigned long)saddr;
+	sum += tmp;
 
-	"	addu	%0, %3		\n"
-	"	sltu	$1, %0, %3	\n"
-	"	addu	%0, $1		\n"
+	if (IS_ENABLED(CONFIG_32BIT))
+		sum += sum < tmp;
 
-	"	addu	%0, %4		\n"
-	"	sltu	$1, %0, %4	\n"
-	"	addu	%0, $1		\n"
-#endif
-#ifdef CONFIG_64BIT
-	"	daddu	%0, %2		\n"
-	"	daddu	%0, %3		\n"
-	"	daddu	%0, %4		\n"
-	"	dsll32	$1, %0, 0	\n"
-	"	daddu	%0, $1		\n"
-	"	sltu	$1, %0, $1	\n"
-	"	dsra32	%0, %0, 0	\n"
-	"	addu	%0, $1		\n"
-#endif
-	"	.set	pop"
-	: "=r" (tmp)
-	: "0" ((__force unsigned long)daddr),
-	  "r" ((__force unsigned long)saddr),
-#ifdef __MIPSEL__
-	  "r" ((proto + len) << 8),
-#else
-	  "r" (proto + len),
-#endif
-	  "r" ((__force unsigned long)sum));
+	/*
+	 * We know PROTO + LEN has the sign bit clear, so cast to a signed
+	 * type to avoid an extraneous zero-extension where TMP is 64-bit.
+	 */
+	tmp = (__s32)(proto + len);
+	tmp <<= IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN) ? 8 : 0;
+	sum += tmp;
+	if (IS_ENABLED(CONFIG_32BIT))
+		sum += sum < tmp;
 
-	return (__force __wsum)tmp;
+	tmp = (__force unsigned long)isum;
+	sum += tmp;
+
+	if (IS_ENABLED(CONFIG_32BIT)) {
+		sum += sum < tmp;
+		osum = sum;
+	} else if (IS_ENABLED(CONFIG_64BIT)) {
+		tmp = sum << sh32;
+		sum += tmp;
+		osum = sum < tmp;
+		osum += sum >> sh32;
+	} else {
+		BUILD_BUG();
+	}
+
+	return (__force __wsum)osum;
 }
 #define csum_tcpudp_nofold csum_tcpudp_nofold
 

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C
  2022-05-22 20:48 [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C Maciej W. Rozycki
@ 2022-05-23  9:40 ` Thomas Bogendoerfer
  2022-05-24 16:38 ` Florian Fainelli
  1 sibling, 0 replies; 6+ messages in thread
From: Thomas Bogendoerfer @ 2022-05-23  9:40 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Paul Cercueil, Nathan Chancellor, Tiezhu Yang, linux-mips, linux-kernel

On Sun, May 22, 2022 at 09:48:14PM +0100, Maciej W. Rozycki wrote:
> Recent commit 198688edbf77 ("MIPS: Fix inline asm input/output type 
> mismatch in checksum.h used with Clang") introduced a code size and 
> performance regression with 64-bit code emitted for `csum_tcpudp_nofold' 
> by GCC, caused by a redundant truncation operation produced due to a 
> data type change made to the variable associated with the inline 
> assembly's output operand.
> 
> The intent previously expressed here with operands and constraints for 
> optimal code was to have the output operand share a register with one 
> inputs, both of a different integer type each.  This is perfectly valid 
> with the MIPS psABI where a register can hold integer data of different 
> types and the assembly code used here makes data stored in the output 
> register match the data type used with the output operand, however it 
> has turned out impossible to express this arrangement in source code 
> such as to satisfy LLVM, apparently due to the compiler's internal 
> limitations.
> 
> There is nothing peculiar about the inline assembly `csum_tcpudp_nofold' 
> includes however, though it does choose assembly instructions carefully.
> 
> Rewrite this piece of assembly in plain C then, using corresponding C 
> language operations, making GCC produce the same assembly instructions, 
> possibly shuffled, in the general case and sometimes actually fewer of 
> them where an input is constant, because the compiler does not have to 
> reload it to a register (operand constraints could be adjusted for that, 
> but the plain C approach is cleaner anyway).
> 
> Example code size changes are as follows, for a 32-bit configuration:
> 
>       text       data        bss      total filename
>    5920480    1347236     126592    7394308 vmlinux-old
>    5920480    1347236     126592    7394308 vmlinux-now
>    5919728    1347236     126592    7393556 vmlinux-c
> 
> and for a 64-bit configuration:
> 
>       text       data        bss      total filename
>    6024112    1790828     225728    8040668 vmlinux-old
>    6024128    1790828     225728    8040684 vmlinux-now
>    6023760    1790828     225728    8040316 vmlinux-c
> 
> respectively, where "old" is with the commit referred reverted, "now" is 
> with no change, and "c" is with this change applied.
> 
> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
> ---
> Hi,
> 
>  I have visually inspected code produced and verified this change to boot 
> with TCP networking performing just fine, both with a 32-bit and a 64-bit 
> configuration.  Sadly with the little endianness only, because in the 
> course of this verification I have discovered the core card of my Malta 
> board bit the dust a few days ago, apparently in a permanent manner, and I 
> have no other big-endian MIPS system available here to try.
> 
>  The only difference between the two endiannesses is the left-shift 
> operation on (proto + len) however, which doesn't happen for big-endian 
> configurations, so the little endianness should in principle provide 
> enough coverage.
> 
>  Also I'm leaving it to LLVM folks to verify, however this is plain C, so 
> it is expected to just work.
> 
>  Please apply.
> 
>   Maciej
> ---
>  arch/mips/include/asm/checksum.h |   71 ++++++++++++++++++---------------------

applied to mips-next.

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C
  2022-05-22 20:48 [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C Maciej W. Rozycki
  2022-05-23  9:40 ` Thomas Bogendoerfer
@ 2022-05-24 16:38 ` Florian Fainelli
  2022-05-24 17:18   ` Maciej W. Rozycki
  1 sibling, 1 reply; 6+ messages in thread
From: Florian Fainelli @ 2022-05-24 16:38 UTC (permalink / raw)
  To: Maciej W. Rozycki, Thomas Bogendoerfer
  Cc: Paul Cercueil, Nathan Chancellor, Tiezhu Yang, linux-mips, linux-kernel

On 5/22/22 13:48, Maciej W. Rozycki wrote:
> Recent commit 198688edbf77 ("MIPS: Fix inline asm input/output type
> mismatch in checksum.h used with Clang") introduced a code size and
> performance regression with 64-bit code emitted for `csum_tcpudp_nofold'
> by GCC, caused by a redundant truncation operation produced due to a
> data type change made to the variable associated with the inline
> assembly's output operand.
> 
> The intent previously expressed here with operands and constraints for
> optimal code was to have the output operand share a register with one
> inputs, both of a different integer type each.  This is perfectly valid
> with the MIPS psABI where a register can hold integer data of different
> types and the assembly code used here makes data stored in the output
> register match the data type used with the output operand, however it
> has turned out impossible to express this arrangement in source code
> such as to satisfy LLVM, apparently due to the compiler's internal
> limitations.
> 
> There is nothing peculiar about the inline assembly `csum_tcpudp_nofold'
> includes however, though it does choose assembly instructions carefully.
> 
> Rewrite this piece of assembly in plain C then, using corresponding C
> language operations, making GCC produce the same assembly instructions,
> possibly shuffled, in the general case and sometimes actually fewer of
> them where an input is constant, because the compiler does not have to
> reload it to a register (operand constraints could be adjusted for that,
> but the plain C approach is cleaner anyway).
> 
> Example code size changes are as follows, for a 32-bit configuration:
> 
>        text       data        bss      total filename
>     5920480    1347236     126592    7394308 vmlinux-old
>     5920480    1347236     126592    7394308 vmlinux-now
>     5919728    1347236     126592    7393556 vmlinux-c
> 
> and for a 64-bit configuration:
> 
>        text       data        bss      total filename
>     6024112    1790828     225728    8040668 vmlinux-old
>     6024128    1790828     225728    8040684 vmlinux-now
>     6023760    1790828     225728    8040316 vmlinux-c
> 
> respectively, where "old" is with the commit referred reverted, "now" is
> with no change, and "c" is with this change applied.
> 
> Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
> ---
> Hi,
> 
>   I have visually inspected code produced and verified this change to boot
> with TCP networking performing just fine, both with a 32-bit and a 64-bit
> configuration.  Sadly with the little endianness only, because in the
> course of this verification I have discovered the core card of my Malta
> board bit the dust a few days ago, apparently in a permanent manner, and I
> have no other big-endian MIPS system available here to try.

How about QEMU is not that a viable option for testing big/little endian 
configurations?
-- 
Florian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C
  2022-05-24 16:38 ` Florian Fainelli
@ 2022-05-24 17:18   ` Maciej W. Rozycki
  2022-05-24 18:30     ` Florian Fainelli
  0 siblings, 1 reply; 6+ messages in thread
From: Maciej W. Rozycki @ 2022-05-24 17:18 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Thomas Bogendoerfer, Paul Cercueil, Nathan Chancellor,
	Tiezhu Yang, linux-mips, linux-kernel

On Tue, 24 May 2022, Florian Fainelli wrote:

> >   I have visually inspected code produced and verified this change to boot
> > with TCP networking performing just fine, both with a 32-bit and a 64-bit
> > configuration.  Sadly with the little endianness only, because in the
> > course of this verification I have discovered the core card of my Malta
> > board bit the dust a few days ago, apparently in a permanent manner, and I
> > have no other big-endian MIPS system available here to try.
> 
> How about QEMU is not that a viable option for testing big/little endian
> configurations?

 Yeah, for this particular change, sure.  I don't have QEMU set up however 
at the moment and would have to take some time to sort it, and it won't do 
for peripherals it doesn't implement.  The failure is a fresh problem and 
I yet need to figure out what to do about it.  A bad coincidence I guess 
as I have MIPS hardware 10 years older that still goes strong.

  Maciej

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C
  2022-05-24 17:18   ` Maciej W. Rozycki
@ 2022-05-24 18:30     ` Florian Fainelli
  2022-05-25 15:01       ` Maciej W. Rozycki
  0 siblings, 1 reply; 6+ messages in thread
From: Florian Fainelli @ 2022-05-24 18:30 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Thomas Bogendoerfer, Paul Cercueil, Nathan Chancellor,
	Tiezhu Yang, linux-mips, linux-kernel

On 5/24/22 10:18, Maciej W. Rozycki wrote:
> On Tue, 24 May 2022, Florian Fainelli wrote:
> 
>>>    I have visually inspected code produced and verified this change to boot
>>> with TCP networking performing just fine, both with a 32-bit and a 64-bit
>>> configuration.  Sadly with the little endianness only, because in the
>>> course of this verification I have discovered the core card of my Malta
>>> board bit the dust a few days ago, apparently in a permanent manner, and I
>>> have no other big-endian MIPS system available here to try.
>>
>> How about QEMU is not that a viable option for testing big/little endian
>> configurations?
> 
>   Yeah, for this particular change, sure.  I don't have QEMU set up however
> at the moment and would have to take some time to sort it, and it won't do
> for peripherals it doesn't implement.  The failure is a fresh problem and
> I yet need to figure out what to do about it.  A bad coincidence I guess
> as I have MIPS hardware 10 years older that still goes strong.

If that makes it any easier, OpenWrt has 4 configurations of Malta for 
QEMU which allows you to have at least networking (relevant here) for 
32/64 and le/be:

https://git.openwrt.org/?p=openwrt/openwrt.git;a=tree;f=target/linux/malta;h=90b2913dec291a1926eefc332b90b5842746c6e6;hb=HEAD

Along with a readme file on how to start those platforms:

https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=target/linux/malta/README;h=bbe806de3d6671d69ecc3db0fcfccf9f9176de13;hb=HEAD

It's really easy.
-- 
Florian

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C
  2022-05-24 18:30     ` Florian Fainelli
@ 2022-05-25 15:01       ` Maciej W. Rozycki
  0 siblings, 0 replies; 6+ messages in thread
From: Maciej W. Rozycki @ 2022-05-25 15:01 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Thomas Bogendoerfer, Paul Cercueil, Nathan Chancellor,
	Tiezhu Yang, linux-mips, linux-kernel

On Tue, 24 May 2022, Florian Fainelli wrote:

> >   Yeah, for this particular change, sure.  I don't have QEMU set up however
> > at the moment and would have to take some time to sort it, and it won't do
> > for peripherals it doesn't implement.  The failure is a fresh problem and
> > I yet need to figure out what to do about it.  A bad coincidence I guess
> > as I have MIPS hardware 10 years older that still goes strong.
> 
> If that makes it any easier, OpenWrt has 4 configurations of Malta for QEMU
> which allows you to have at least networking (relevant here) for 32/64 and
> le/be:
> 
> https://git.openwrt.org/?p=openwrt/openwrt.git;a=tree;f=target/linux/malta;h=90b2913dec291a1926eefc332b90b5842746c6e6;hb=HEAD
> 
> Along with a readme file on how to start those platforms:
> 
> https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=target/linux/malta/README;h=bbe806de3d6671d69ecc3db0fcfccf9f9176de13;hb=HEAD
> 
> It's really easy.

 Thanks.  I worked with QEMU in the past, up to contributing to the MIPS 
target, so I'm quite familiar with this stuff, but it does require some 
time to set up, which I'd rather use for something else right now.

 Also I'd rather I had the core card repaired or replaced really as the 
preferred solution before resorting to alternatives.  The baseboard seems 
fine and it would be a waste of a valuable resource to just have it lying 
around collecting dust.  I wonder if the problem could be a BGA failure 
(this is a pre-RoHS device, so no issue with early lead-free solder) or an 
actual component fault.

  Maciej

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-05-25 15:02 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-22 20:48 [PATCH] MIPS: Rewrite `csum_tcpudp_nofold' in plain C Maciej W. Rozycki
2022-05-23  9:40 ` Thomas Bogendoerfer
2022-05-24 16:38 ` Florian Fainelli
2022-05-24 17:18   ` Maciej W. Rozycki
2022-05-24 18:30     ` Florian Fainelli
2022-05-25 15:01       ` Maciej W. Rozycki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).