* [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: George Spelvin @ 2016-05-11 10:24 UTC
  To: geert, linux-m68k; +Cc: linux

The __mulsi3 function implements 32x32-bit multiply when the processor
(mc68000 or mc68010) does not have native support.
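
In C terms it computes, roughly (a sketch, not part of the patch;
xl/xh and yl/yh name the 16-bit halves):

	unsigned long __mulsi3(unsigned long x, unsigned long y)
	{
		unsigned long xl = x & 0xffff, xh = x >> 16;
		unsigned long yl = y & 0xffff, yh = y >> 16;

		/* xh*yh would only affect bits 32..63, so it is omitted */
		return xl * yl + ((xl * yh + xh * yl) << 16);
	}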

This rewrite shrinks the code from 16 words to 13 (32 bytes to 26), and
eliminates one stack frame access, saving in total 4 words and 16 cycles.

The improvement is small (about 1/16 of the cost of the call overall),
but multiply is a popular enough operation that it's nice to have.

One open question is why the code includes a ColdFire option.
Every ColdFire processor back to the original ISA A includes a 32x32-bit
multiply, and so does not need this routine at all.

Signed-off-by: George Spelvin <linux@horizon.com>
---
 arch/m68k/lib/mulsi3.S | 132 ++++++++++++-------------------------------------
 1 file changed, 31 insertions(+), 101 deletions(-)

diff --git a/arch/m68k/lib/mulsi3.S b/arch/m68k/lib/mulsi3.S
index c39ad4e..b4204fc 100644
--- a/arch/m68k/lib/mulsi3.S
+++ b/arch/m68k/lib/mulsi3.S
@@ -1,105 +1,35 @@
-/* libgcc1 routines for 68000 w/o floating-point hardware.
-   Copyright (C) 1994, 1996, 1997, 1998 Free Software Foundation, Inc.
-
-This file is part of GNU CC.
-
-GNU CC is free software; you can redistribute it and/or modify it
-under the terms of the GNU General Public License as published by the
-Free Software Foundation; either version 2, or (at your option) any
-later version.
-
-In addition to the permissions in the GNU General Public License, the
-Free Software Foundation gives you unlimited permission to link the
-compiled version of this file with other programs, and to distribute
-those programs without any restriction coming from the use of this
-file.  (The General Public License restrictions do apply in other
-respects; for example, they cover modification of the file, and
-distribution when not linked into another program.)
-
-This file is distributed in the hope that it will be useful, but
-WITHOUT ANY WARRANTY; without even the implied warranty of
-MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-General Public License for more details. */
-
-/* As a special exception, if you link this library with files
-   compiled with GCC to produce an executable, this does not cause
-   the resulting executable to be covered by the GNU General Public License.
-   This exception does not however invalidate any other reasons why
-   the executable file might be covered by the GNU General Public License.  */
-
-/* Use this one for any 680x0; assumes no floating point hardware.
-   The trailing " '" appearing on some lines is for ANSI preprocessors.  Yuk.
-   Some of this code comes from MINIX, via the folks at ericsson.
-   D. V. Henkel-Wallace (gumby@cygnus.com) Fete Bastille, 1992
-*/
-
-/* These are predefined by new versions of GNU cpp.  */
-
-#ifndef __USER_LABEL_PREFIX__
-#define __USER_LABEL_PREFIX__ _
-#endif
-
-#ifndef __REGISTER_PREFIX__
-#define __REGISTER_PREFIX__
-#endif
-
-#ifndef __IMMEDIATE_PREFIX__
-#define __IMMEDIATE_PREFIX__ #
-#endif
-
-/* ANSI concatenation macros.  */
-
-#define CONCAT1(a, b) CONCAT2(a, b)
-#define CONCAT2(a, b) a ## b
-
-/* Use the right prefix for global labels.  */
-
-#define SYM(x) CONCAT1 (__USER_LABEL_PREFIX__, x)
-
-/* Use the right prefix for registers.  */
-
-#define REG(x) CONCAT1 (__REGISTER_PREFIX__, x)
-
-/* Use the right prefix for immediate values.  */
-
-#define IMM(x) CONCAT1 (__IMMEDIATE_PREFIX__, x)
-
-#define d0 REG (d0)
-#define d1 REG (d1)
-#define d2 REG (d2)
-#define d3 REG (d3)
-#define d4 REG (d4)
-#define d5 REG (d5)
-#define d6 REG (d6)
-#define d7 REG (d7)
-#define a0 REG (a0)
-#define a1 REG (a1)
-#define a2 REG (a2)
-#define a3 REG (a3)
-#define a4 REG (a4)
-#define a5 REG (a5)
-#define a6 REG (a6)
-#define fp REG (fp)
-#define sp REG (sp)
-
+/*
+ * 32x32->32-bit multiply.  This is made up of three 16x16->32-bit
+ * partial products.
+ *
+ * Assuming the average argument has half its bits set, this takes
+ * 238 68000 cycles.
+ */
 	.text
 	.proc
-	.globl	SYM (__mulsi3)
-SYM (__mulsi3):
-	movew	sp@(4), d0	/* x0 -> d0 */
-	muluw	sp@(10), d0	/* x0*y1 */
-	movew	sp@(6), d1	/* x1 -> d1 */
-	muluw	sp@(8), d1	/* x1*y0 */
-#if !(defined(__mcf5200__) || defined(__mcoldfire__))
-	addw	d1, d0
-#else
-	addl	d1, d0
-#endif
+	.globl	___mulsi3
+___mulsi3:
+	lea	4(sp), a0
+	move.w	(a0)+, d0	/* xhigh */
+	move.w	(a0)+, d1	/* xlow */
+	move.w	d1, a1
+	mulu.w	(a0)+, d1	/* yhigh */
+	mulu.w	(a0),  d0	/* ylow */
+
+#if defined(__mcf5200__) || defined(__mcoldfire__)
+	/* Why do we need this?  All ColdFire have MULU.L */
+	add.l	d0, d1
+	move.w	a1, d0		/* xlow */
+	mulu.w	(a0),d0		/* ylow */
+	swap	d1
+	clr.w	d1
+	add.l	d1, d0
+#else /* 68000 or 68010 */
+	add.w	d0, d1
+	move.w	a1, d0		/* xlow */
+	mulu.w	(a0),d0		/* ylow */
 	swap	d0
-	clrw	d0
-	movew	sp@(6), d1	/* x1 -> d1 */
-	muluw	sp@(10), d1	/* x1*y1 */
-	addl	d1, d0
-
+	add.w	d1, d0
+	swap	d0
+#endif
 	rts
-
-- 
2.8.1


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: Greg Ungerer @ 2016-05-11 12:38 UTC
  To: George Spelvin, geert, linux-m68k

Hi George,

On 11/05/16 20:24, George Spelvin wrote:
> The __mulsi3 function implements 32x32-bit multiply when the processor
> (mc68000 or mc68010) does not have native support.
>
> This rewrite shrinks the code from 16 words to 13 (32 bytes to 26), and
> eliminates one stack frame access, saving in total 4 words and 16 cycles.
>
> The improvement is small (about 1/16 of the cost of the call overall),
> but multiply is a popular enough operation that it's nice to have.
>
> One open question is why the code includes a ColdFire option.
> Every ColdFire processor back to the original ISA A includes a 32x32-bit
> multiply, and so does not need this routine at all.

That mulsi3.S code was taken directly from gcc's lb1sf68.S (and that
was lb1sf68.asm in older versions of gcc). It has always had that
coldfire conditional code in it. Not sure why it was in there though.

Regards
Greg


> Signed-off-by: George Spelvin <linux@horizon.com>
> ---
>   arch/m68k/lib/mulsi3.S | 132 ++++++++++++-------------------------------------
>   1 file changed, 31 insertions(+), 101 deletions(-)
>
> diff --git a/arch/m68k/lib/mulsi3.S b/arch/m68k/lib/mulsi3.S
> index c39ad4e..b4204fc 100644
> --- a/arch/m68k/lib/mulsi3.S
> +++ b/arch/m68k/lib/mulsi3.S
> @@ -1,105 +1,35 @@
> -/* libgcc1 routines for 68000 w/o floating-point hardware.
> -   Copyright (C) 1994, 1996, 1997, 1998 Free Software Foundation, Inc.
> -
> -This file is part of GNU CC.
> -
> -GNU CC is free software; you can redistribute it and/or modify it
> -under the terms of the GNU General Public License as published by the
> -Free Software Foundation; either version 2, or (at your option) any
> -later version.
> -
> -In addition to the permissions in the GNU General Public License, the
> -Free Software Foundation gives you unlimited permission to link the
> -compiled version of this file with other programs, and to distribute
> -those programs without any restriction coming from the use of this
> -file.  (The General Public License restrictions do apply in other
> -respects; for example, they cover modification of the file, and
> -distribution when not linked into another program.)
> -
> -This file is distributed in the hope that it will be useful, but
> -WITHOUT ANY WARRANTY; without even the implied warranty of
> -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> -General Public License for more details. */
> -
> -/* As a special exception, if you link this library with files
> -   compiled with GCC to produce an executable, this does not cause
> -   the resulting executable to be covered by the GNU General Public License.
> -   This exception does not however invalidate any other reasons why
> -   the executable file might be covered by the GNU General Public License.  */
> -
> -/* Use this one for any 680x0; assumes no floating point hardware.
> -   The trailing " '" appearing on some lines is for ANSI preprocessors.  Yuk.
> -   Some of this code comes from MINIX, via the folks at ericsson.
> -   D. V. Henkel-Wallace (gumby@cygnus.com) Fete Bastille, 1992
> -*/
> -
> -/* These are predefined by new versions of GNU cpp.  */
> -
> -#ifndef __USER_LABEL_PREFIX__
> -#define __USER_LABEL_PREFIX__ _
> -#endif
> -
> -#ifndef __REGISTER_PREFIX__
> -#define __REGISTER_PREFIX__
> -#endif
> -
> -#ifndef __IMMEDIATE_PREFIX__
> -#define __IMMEDIATE_PREFIX__ #
> -#endif
> -
> -/* ANSI concatenation macros.  */
> -
> -#define CONCAT1(a, b) CONCAT2(a, b)
> -#define CONCAT2(a, b) a ## b
> -
> -/* Use the right prefix for global labels.  */
> -
> -#define SYM(x) CONCAT1 (__USER_LABEL_PREFIX__, x)
> -
> -/* Use the right prefix for registers.  */
> -
> -#define REG(x) CONCAT1 (__REGISTER_PREFIX__, x)
> -
> -/* Use the right prefix for immediate values.  */
> -
> -#define IMM(x) CONCAT1 (__IMMEDIATE_PREFIX__, x)
> -
> -#define d0 REG (d0)
> -#define d1 REG (d1)
> -#define d2 REG (d2)
> -#define d3 REG (d3)
> -#define d4 REG (d4)
> -#define d5 REG (d5)
> -#define d6 REG (d6)
> -#define d7 REG (d7)
> -#define a0 REG (a0)
> -#define a1 REG (a1)
> -#define a2 REG (a2)
> -#define a3 REG (a3)
> -#define a4 REG (a4)
> -#define a5 REG (a5)
> -#define a6 REG (a6)
> -#define fp REG (fp)
> -#define sp REG (sp)
> -
> +/*
> + * 32x32->32-bit multiply.  This is made up of three 16x16->32-bit
> + * partial products.
> + *
> + * Assuming the average argument has half its bits set, this takes
> + * 238 68000 cycles.
> + */
>   	.text
>   	.proc
> -	.globl	SYM (__mulsi3)
> -SYM (__mulsi3):
> -	movew	sp@(4), d0	/* x0 -> d0 */
> -	muluw	sp@(10), d0	/* x0*y1 */
> -	movew	sp@(6), d1	/* x1 -> d1 */
> -	muluw	sp@(8), d1	/* x1*y0 */
> -#if !(defined(__mcf5200__) || defined(__mcoldfire__))
> -	addw	d1, d0
> -#else
> -	addl	d1, d0
> -#endif
> +	.globl	___mulsi3
> +___mulsi3:
> +	lea	4(sp), a0
> +	move.w	(a0)+, d0	/* xhigh */
> +	move.w	(a0)+, d1	/* xlow */
> +	move.w	d1, a1
> +	mulu.w	(a0)+, d1	/* yhigh */
> +	mulu.w	(a0),  d0	/* ylow */
> +
> +#if defined(__mcf5200__) || defined(__mcoldfire__)
> +	/* Why do we need this?  All ColdFire have MULU.L */
> +	add.l	d0, d1
> +	move.w	a1, d0		/* xlow */
> +	mulu.w	(a0),d0		/* ylow */
> +	swap	d1
> +	clr.w	d1
> +	add.l	d1, d0
> +#else /* 68000 or 68010 */
> +	add.w	d0, d1
> +	move.w	a1, d0		/* xlow */
> +	mulu.w	(a0),d0		/* ylow */
>   	swap	d0
> -	clrw	d0
> -	movew	sp@(6), d1	/* x1 -> d1 */
> -	muluw	sp@(10), d1	/* x1*y1 */
> -	addl	d1, d0
> -
> +	add.w	d1, d0
> +	swap	d0
> +#endif
>   	rts
> -
>


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: George Spelvin @ 2016-05-12  8:04 UTC
  To: geert, gerg, linux-m68k, linux

Thanks for the response!

> That mulsi3.S code was taken directly from gcc's lb1sf68.S (and that
> was lb1sf68.asm in older versions of gcc). It has always had that
> coldfire conditional code in it. Not sure why it was in there though.

I went digging through a lot of ColdFire documentation to see if
there were ever any that needed it, and apparently it's been in there
since the beginning.

It did occur to me that perhaps someone wanted to support the
intersection of all 68k family CPUs, so that meant only MULU.W
and ADD.L.  But I can't see a gcc option to ask for that.


Mostly I just thought that saving a few cycles was a nice win.


This is tangential to some work I'm doing on the hash functions
in <linux/hash.h>: there are hash_32(x, bits) and hash_64(x, bits)
which basically do "return (x * CONSTANT) >> (32 - bits)", or
64 - bits for the 64-bit version.
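
Concretely, the 32-bit one amounts to (a sketch; 0x9e370001 is the
current GOLDEN_RATIO_PRIME_32):

	static inline unsigned hash_32(unsigned val, unsigned bits)
	{
		return (val * 0x9e370001u) >> (32 - bits);
	}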

The CONSTANTs were chosen to be efficient when computed using
shift-and-add sequences, but that made them less effective hashes.

The 64-bit constant was recently revised to avoid a bad case found when
hashing page-aligned pointers; the shift-and-add optimization is
pointless there, as all 64-bit processors have fast multipliers.  But
the 32-bit constant was left alone for fear of regressing performance.

After some research (partly by searching for __mulsi3), I found that m68k
is the only platform that doesn't have a fast multiplier.  SPARC used
to be in that boat too, but SPARCv7 support was dropped a few years
ago.

That led to me optimizing the multiply-by-constant code for m68k.



Do you have any interest in the matter?  Basically, the options are
to optimize the implementation of the multiplication, or come up with
a different hash function that works better in that environment.


The current multiply by 0x9e370001 compiles to, depending on compiler
flags, either a call to __mulsi3 or (omitting prologue & epilogue,
since it's designed as an in-line function):

	move.l	%d0,%d1		 4
	lsl.l	#6,%d1		20  24
	move.l	%d1,%d2		 4  28
	lsl.l	#3,%d2		14  42
	sub.l	%d1,%d2		 8  50
	sub.l	%d0,%d2		 8  58
	move.l	%d2,%d1		 4  62
	lsl.l	#3,%d1		14  76
	sub.l	%d2,%d1		 8  84
	lsl.l	#3,%d1		14  98
	add.l	%d0,%d1		 8 106
	swap	%d1		 4 110
	clr.w	%d1		 4 114
	sub.l	%d1,%d0		 8 122

(The columns on the right are cycle counts.)

Tangent: I'm surprised I can't coax it into the equivalent:
	move.w %d0,%d1		 4
	mulu.w #0x9e37,%d1	58  62
	swap %d1		 4  66
	clr.w %d1		 4  70
	add.l %d1,%d0		 8  78

Anyway, the problem is the low Hamming weight of the low 16 bits,
which has to be solved somehow.  The "more random" high 16 bits of the
multiplier are only multiplied by the low 16 bits of the argument.
The high 16 bits of the argument are just multiplied by 1.  If the
argument is a page-aligned pointer, with 12 low-order bits zero, we have
very poor mixing.
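
(To spell that out: 0x9e370001 == (0x9e37 << 16) + 1, so mod 2^32

	x * 0x9e370001 == x + ((xlow * 0x9e37) << 16)

and the high half of x never enters the multiply; if the low 12 bits
of x are zero as well, only four bits of x feed it.)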

Using the better constant 0x61C88647 (both this and the previous one
are based on the fractional part of the golden ratio, (sqrt(5)-1)/2,
scaled to 32 bits; this one is the negative mod 2^32), gcc struggles.

I searched for an efficient shift-and-add sequence,
and came up with the 68000-optimized:

/* Multiply by 0x61C88647 */
unsigned hash_32(unsigned x)
{
	unsigned a, b, c;

	a = b = c = x;	
	c <<= 19;
	a += c;		/* a = (x << 19) + x */
	b <<= 9;
	b += a;		/* b = (x << 9) + a */
	c <<= 4;
	c += b;		/* c = (x << 23) + b */
	b <<= 5;
	b += c;		/* (b << 5) + c */
	b <<= 3;
	b += a;		/* ((b << 5) + c << 3) + a */
	b <<= 3;
	b -= c;		/* (((b << 5) + c << 3) + a << 3) - c */
	return b;
}

which gcc turns into:
	move.l	%d1,%d2		 4
	lsl.w	#3,%d2		12  16
	swap	%d2		 4  20
	clr.w	%d2		 4  24
	move.l	%d1,%a1		 4  28
	add.l	%d2,%a1		 8  36
	moveq	#9,%d0		 4  40
	lsl.l	%d0,%d1		26  66
	add.l	%a1,%d1		 8  74
	lsl.l	#4,%d2		16  90
	move.l	%d1,%a0		 4  94
	add.l	%d2,%a0		 8 102
	lsl.l	#5,%d1		18 120
	move.l	%a0,%d0		 4 124
	add.l	%d1,%d0		 8 132
	move.l	%d0,%d1		 4 136
	lsl.l	#3,%d1		14 150
	move.l	%a1,%d0		 4 154
	add.l	%d1,%d0		 8 162
	lsl.l	#3,%d0		14 176
	sub.l	%a0,%d0		 8 184

I can shave a few cycles with hand-optimized code, but the result is
barely faster than the much smaller multiply-based:

	; y = #k * x, with z as a temp
	move.w  #klow, y        ;  8
	move.w  x, z            ;  4     12
	swap    x               ;  4     16
	mulu.w  y, x            ; 52     68	(Using klow = 0x8647)
	mulu.w  z, y            ; 54    122	(Assuming 50% ones)
	mulu.w  #khigh, z       ; 54    176	(Using khigh = 0x61C8)
	swap    y               ;  4    180
	add.w   x, y            ;  4    184
	add.w   z, y            ;  4    188
	swap    y               ;  4    192
	;; 12 words

Another way to divide it up is to compute xlow * khigh with a multiply,
and x * klow with shifts and adds.  x * 0x8647 can be evaluated as:

	unsigned a = (x << 9) + x;
	unsigned b = (x << 1) + a;
	unsigned c = (b << 1) + (a << 6) + a;
	return c + ((uint16_t)(x * 0x61C8) << 16);
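
(Sanity check on the constants: a = 513*x, b = 515*x, and
c = 2*515*x + 64*513*x + 513*x = 34375*x = 0x8647*x.)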

If I write it in the following form:
	register unsigned a, b;

	b = x << 1;
	asm("" : "=g" (a) : "0" (b));	/* "a = b, DAMN IT" */
	a <<= 8;
	a += x;
	b += a;
	b += b;
	b += a;
	a <<= 6;
	return a + b + ((uint16_t)(x * 0x61c8) << 16);

gcc can then be coaxed into generating the desired code.
(The "asm" is the only way I found to stop GCC from collapsing
"(x << 1) << 8" to "x << 9", which is slower.)

baz:
	move.l 4(%sp),%d1
	move.l %d1,%a0		 4
	add.l %d1,%a0		 8  12
	move.l %a0,%d0		 4  16
	lsl.l #8,%d0		24  40
	add.l %d1,%d0		 8  48
	add.l %d0,%a0		 8  56
	muls.w #25032,%d1	56 112
	swap %d1		 4 116
	clr.w %d1		 4 120
	add.l %d0,%d1		 8 128
	add.l %a0,%d1		 8 136
	add.l %a0,%d1		 8 144
	lsl.l #6,%d0		20 164
	add.l %d1,%d0		 8 172
	rts

Well, that's a *little* bit faster.



Alternatively, I could just invent a whole different hash function for
the case.  One idea is to use two, rather than three, 16-bit multiplies.
Split the input into two 16-bit halves, multiply them by separate
constants, sum the results, and then swap the halves.

	move.w	x, y		 4
	swap	x		 4   8
	mulu.w	#K1, y		58  66
	mulu.w	#K2, x		58 124
	add.l	y, x		 8 132
	swap	x		 4 136

The swap is important; hash_32() is documented as putting the "most mixed"
bits in the most significant bits, as hash tables are indexed by the most
significant bits.
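
In C, the two-multiply idea is roughly (a sketch; K1 and K2 are
placeholders for constants still to be chosen):

	unsigned hash2(unsigned x, unsigned K1, unsigned K2)
	{
		unsigned y = (x & 0xffff) * K1 + (x >> 16) * K2;

		return (y << 16) | (y >> 16);	/* the final swap */
	}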

If you have any thoughts on the matter, they'd definitely be appreciated.


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: Andreas Schwab @ 2016-05-12  8:35 UTC
  To: George Spelvin; +Cc: geert, gerg, linux-m68k

"George Spelvin" <linux@horizon.com> writes:

> I went digging through a lot of ColdFire documentation to see if
> there were ever any that needed it, and apparently it's been in there
> since the beginning.
>
> It did occur to me that perhaps someone wanted to support the
> intersection of all 68k family CPUs, so that meant only MULU.W
> and ADD.L.  But I can't see a gcc option to ask for that.

My guess is that someone added the support just to get it through the
assembler, without thinking about not compiling it at all for coldfire
configurations.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: Greg Ungerer @ 2016-05-12 12:46 UTC
  To: George Spelvin, geert, linux-m68k

Hi George,

Comments inline.


On 11/05/16 20:24, George Spelvin wrote:
> The __mulsi3 function implements 32x32-bit multiply when the processor
> (mc68000 or mc68010) does not have native support.
>
> This rewrite shrinks the code from 16 words to 13 (32 bytes to 26), and
> eliminates one stack frame access, saving in total 4 words and 16 cycles.
>
> The improvement is small (about 1/16 of the cost of the call overall),
> but multiply is a popular enough operation that it's nice to have.
>
> One open question is why the code includes a ColdFire option.
> Every ColdFire processor back to the original ISA A includes a 32x32-bit
> multiply, and so does not need this routine at all.
>
> Signed-off-by: George Spelvin <linux@horizon.com>
> ---
>   arch/m68k/lib/mulsi3.S | 132 ++++++++++++-------------------------------------
>   1 file changed, 31 insertions(+), 101 deletions(-)
>
> diff --git a/arch/m68k/lib/mulsi3.S b/arch/m68k/lib/mulsi3.S
> index c39ad4e..b4204fc 100644
> --- a/arch/m68k/lib/mulsi3.S
> +++ b/arch/m68k/lib/mulsi3.S
> @@ -1,105 +1,35 @@
> -/* libgcc1 routines for 68000 w/o floating-point hardware.
> -   Copyright (C) 1994, 1996, 1997, 1998 Free Software Foundation, Inc.
> -
> -This file is part of GNU CC.
> -
> -GNU CC is free software; you can redistribute it and/or modify it
> -under the terms of the GNU General Public License as published by the
> -Free Software Foundation; either version 2, or (at your option) any
> -later version.
> -
> -In addition to the permissions in the GNU General Public License, the
> -Free Software Foundation gives you unlimited permission to link the
> -compiled version of this file with other programs, and to distribute
> -those programs without any restriction coming from the use of this
> -file.  (The General Public License restrictions do apply in other
> -respects; for example, they cover modification of the file, and
> -distribution when not linked into another program.)
> -
> -This file is distributed in the hope that it will be useful, but
> -WITHOUT ANY WARRANTY; without even the implied warranty of
> -MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> -General Public License for more details. */
> -
> -/* As a special exception, if you link this library with files
> -   compiled with GCC to produce an executable, this does not cause
> -   the resulting executable to be covered by the GNU General Public License.
> -   This exception does not however invalidate any other reasons why
> -   the executable file might be covered by the GNU General Public License.  */
> -
> -/* Use this one for any 680x0; assumes no floating point hardware.
> -   The trailing " '" appearing on some lines is for ANSI preprocessors.  Yuk.
> -   Some of this code comes from MINIX, via the folks at ericsson.
> -   D. V. Henkel-Wallace (gumby@cygnus.com) Fete Bastille, 1992
> -*/
> -
> -/* These are predefined by new versions of GNU cpp.  */
> -
> -#ifndef __USER_LABEL_PREFIX__
> -#define __USER_LABEL_PREFIX__ _
> -#endif
> -
> -#ifndef __REGISTER_PREFIX__
> -#define __REGISTER_PREFIX__
> -#endif
> -
> -#ifndef __IMMEDIATE_PREFIX__
> -#define __IMMEDIATE_PREFIX__ #
> -#endif
> -
> -/* ANSI concatenation macros.  */
> -
> -#define CONCAT1(a, b) CONCAT2(a, b)
> -#define CONCAT2(a, b) a ## b
> -
> -/* Use the right prefix for global labels.  */
> -
> -#define SYM(x) CONCAT1 (__USER_LABEL_PREFIX__, x)
> -
> -/* Use the right prefix for registers.  */
> -
> -#define REG(x) CONCAT1 (__REGISTER_PREFIX__, x)
> -
> -/* Use the right prefix for immediate values.  */
> -
> -#define IMM(x) CONCAT1 (__IMMEDIATE_PREFIX__, x)
> -
> -#define d0 REG (d0)
> -#define d1 REG (d1)
> -#define d2 REG (d2)
> -#define d3 REG (d3)
> -#define d4 REG (d4)
> -#define d5 REG (d5)
> -#define d6 REG (d6)
> -#define d7 REG (d7)
> -#define a0 REG (a0)
> -#define a1 REG (a1)
> -#define a2 REG (a2)
> -#define a3 REG (a3)
> -#define a4 REG (a4)
> -#define a5 REG (a5)
> -#define a6 REG (a6)
> -#define fp REG (fp)
> -#define sp REG (sp)
> -
> +/*
> + * 32x32->32-bit multiply.  This is made up of three 16x16->32-bit
> + * partial products.
> + *
> + * Assuming the average argument has half its bits set, this takes
> + * 238 68000 cycles.
> + */
>   	.text
>   	.proc
> -	.globl	SYM (__mulsi3)
> -SYM (__mulsi3):
> -	movew	sp@(4), d0	/* x0 -> d0 */
> -	muluw	sp@(10), d0	/* x0*y1 */
> -	movew	sp@(6), d1	/* x1 -> d1 */
> -	muluw	sp@(8), d1	/* x1*y0 */
> -#if !(defined(__mcf5200__) || defined(__mcoldfire__))
> -	addw	d1, d0
> -#else
> -	addl	d1, d0
> -#endif
> +	.globl	___mulsi3
> +___mulsi3:
> +	lea	4(sp), a0

This syntax fails for me (using a binutils-2.25.1 based toolchain).
Registers must be prefixed with a "%", so here %sp and %a0.

   arch/m68k/lib/mulsi3.S: Assembler messages:
   arch/m68k/lib/mulsi3.S:12: Error: syntax error -- statement `lea 4(sp),a0' ignored

That was compiling with just this one patch for a ColdFire target.

Regards
Greg


> +	move.w	(a0)+, d0	/* xhigh */
> +	move.w	(a0)+, d1	/* xlow */
> +	move.w	d1, a1
> +	mulu.w	(a0)+, d1	/* yhigh */
> +	mulu.w	(a0),  d0	/* ylow */
> +
> +#if defined(__mcf5200__) || defined(__mcoldfire__)
> +	/* Why do we need this?  All ColdFire have MULU.L */
> +	add.l	d0, d1
> +	move.w	a1, d0		/* xlow */
> +	mulu.w	(a0),d0		/* ylow */
> +	swap	d1
> +	clr.w	d1
> +	add.l	d1, d0
> +#else /* 68000 or 68010 */
> +	add.w	d0, d1
> +	move.w	a1, d0		/* xlow */
> +	mulu.w	(a0),d0		/* ylow */
>   	swap	d0
> -	clrw	d0
> -	movew	sp@(6), d1	/* x1 -> d1 */
> -	muluw	sp@(10), d1	/* x1*y1 */
> -	addl	d1, d0
> -
> +	add.w	d1, d0
> +	swap	d0
> +#endif
>   	rts
> -
>


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: Greg Ungerer @ 2016-05-12 13:14 UTC
  To: George Spelvin, geert, linux-m68k

Hi George,

On 12/05/16 18:04, George Spelvin wrote:
> Thanks for the response!
>
>> That mulsi3.S code was taken directly from gcc's lb1sf68.S (and that
>> was lb1sf68.asm in older versions of gcc). It has always had that
>> coldfire conditional code in it. Not sure why it was in there though.
>
> I went digging through a lot of ColdFire documentation to see if
> there were ever any that needed it, and apparently it's been in there
> since the beginning.
>
> It did occur to me that perhaps someone wanted to support the
> intersection of all 68k family CPUs, so that meant only MULU.W
> and ADD.L.  But I can't see a gcc option to ask for that.
>
>
> Mostly I just thought that saving a few cycles was a nice win.

Yep, that is always nice.

After a few test builds I am pretty sure that we never need or
use mulsi3 on ColdFire. So I think we can move to not even bothering
to compile it for that case.


> This is tangential to some work I'm doing on the hash functions
> in <linux/hash.h>: there are hash_32(x, bits) and hash_64(x, bits)
> which basically do "return (x * CONSTANT) >> (32 - bits)", or
> 64 - bits for the 64-bit version.
>
> The CONSTANTs were chosen to be efficient when computed using
> shift-and-add sequences, but that made them less effective hashes.
>
> The 64-bit constant was recently revised to avoid a bad case found when
> hashing page-aligned pointers; the shift-and-add optimization is
> pointless there, as all 64-bit processors have fast multipliers.  But
> the 32-bit constant was left alone for fear of regressing performance.
>
> After some research (partly by searching for __mulsi3), I found that m68k
> is the only platform that doesn't have a fast multiplier.  SPARC used
> to be in that boat too, but SPARCv7 support was dropped a few years
> ago.
>
> That led to me optimizing the multiply-by-constant code for m68k.
>
>
>
> Do you have any interest in the matter?  Basically, the options are
> to optimize the implementation of the multiplication, or come up with
> a different hash function that works better in that environment.

It looks like only the original m68000 targets will benefit from
a faster mulsi3. But there is nothing wrong with making it faster
for that case in general.

I don't have any feel for how much of a performance problem the
hash functions are on m68000 though. Keep in mind that those are
all non-MMU platforms.

Regards
Greg


> The current multiply by 0x9e370001 compiles to, depending on compiler
> flags, either a call to __mulsi3 or (omitting prologue & epilogue,
> since it's designed as an in-line function):
>
> 	move.l	%d0,%d1		 4
> 	lsl.l	#6,%d1		20  24
> 	move.l	%d1,%d2		 4  28
> 	lsl.l	#3,%d2		14  42
> 	sub.l	%d1,%d2		 8  50
> 	sub.l	%d0,%d2		 8  58
> 	move.l	%d2,%d1		 4  62
> 	lsl.l	#3,%d1		14  76
> 	sub.l	%d2,%d1		 8  84
> 	lsl.l	#3,%d1		14  98
> 	add.l	%d0,%d1		 8 106
> 	swap	%d1		 4 110
> 	clr.w	%d1		 4 114
> 	sub.l	%d1,%d0		 8 122
>
> (The columns on the right are cycle counts.)
>
> Tangent: I'm surprised I can't coax it into the equivalent:
> 	move.w %d0,%d1		 4
> 	mulu.w #0x9e37,%d1	58  62
> 	swap %d1		 4  66
> 	clr.w %d1		 4  70
> 	add.l %d1,%d0		 8  78
>
> Anyway, the problem is the low Hamming weight of the low 16 bits,
> which has to be solved somehow.  The "more random" high 16 bits of the
> multiplier are only multiplied by the low 16 bits of the argument.
> The high 16 bits of the argument are just multiplied by 1.  If the
> argument is a page-aligned pointer, with 12 low-order bits zero, we have
> very poor mixing.
>
> Using the better constant 0x61C88647 (both this and the previous one
> are based on the fractional part of the golden ratio, (sqrt(5)-1)/2,
> scaled to 32 bits; this one is the negative mod 2^32), gcc struggles.
>
> I searched for an efficient shift-and-add sequence,
> and came up with the 68000-optimized:
>
> /* Multiply by 0x61C88647 */
> unsigned hash_32(unsigned x)
> {
> 	unsigned a, b, c;
>
> 	a = b = c = x;	
> 	c <<= 19;
> 	a += c;		/* a = (x << 19) + x */
> 	b <<= 9;
> 	b += a;		/* b = (x << 9) + a */
> 	c <<= 4;
> 	c += b;		/* c = (x << 23) + b */
> 	b <<= 5;
> 	b += c;		/* (b << 5) + c */
> 	b <<= 3;
> 	b += a;		/* ((b << 5) + c << 3) + a */
> 	b <<= 3;
> 	b -= c;		/* (((b << 5) + c << 3) + a << 3) - c */
> 	return b;
> }
>
> which gcc turns into:
> 	move.l	%d1,%d2		 4
> 	lsl.w	#3,%d2		12  16
> 	swap	%d2		 4  20
> 	clr.w	%d2		 4  24
> 	move.l	%d1,%a1		 4  28
> 	add.l	%d2,%a1		 8  36
> 	moveq	#9,%d0		 4  40
> 	lsl.l	%d0,%d1		26  66
> 	add.l	%a1,%d1		 8  74
> 	lsl.l	#4,%d2		16  90
> 	move.l	%d1,%a0		 4  94
> 	add.l	%d2,%a0		 8 102
> 	lsl.l	#5,%d1		18 120
> 	move.l	%a0,%d0		 4 124
> 	add.l	%d1,%d0		 8 132
> 	move.l	%d0,%d1		 4 136
> 	lsl.l	#3,%d1		14 150
> 	move.l	%a1,%d0		 4 154
> 	add.l	%d1,%d0		 8 162
> 	lsl.l	#3,%d0		14 176
> 	sub.l	%a0,%d0		 8 184
>
> I can shave a few cycles with hand-optimized code, but the result is
> barely faster than the much smaller multiply-based:
>
> 	; y = #k * x, with z as a temp
> 	move.w  #klow, y        ;  8
> 	move.w  x, z            ;  4     12
> 	swap    x               ;  4     16
> 	mulu.w  y, x            ; 52     68	(Using klow = 0x8647)
> 	mulu.w  z, y            ; 54    122	(Assuming 50% ones)
> 	mulu.w  #khigh, z       ; 54    176	(Using khigh = 0x61C8)
> 	swap    y               ;  4    180
> 	add.w   x, y            ;  4    184
> 	add.w   z, y            ;  4    188
> 	swap    y               ;  4    192
> 	;; 12 words
>
> Another way to divide it up is to compute xlow * khigh with a multiply,
> and x * klow with shifts and adds.  x * 0x8647 can be evaluated as:
>
> 	unsigned a = (x << 9) + x;
> 	unsigned b = (x << 1) + a;
> 	unsigned c = (b << 1) + (a << 6) + a;
> 	return c + ((uint16_t)(x * 0x61C8) << 16);
>
> If I write it in the following form:
> 	register unsigned a, b;
>
> 	b = x << 1;
> 	asm("" : "=g" (a) : "0" (b));	/* "a = b, DAMN IT" */
> 	a <<= 8;
> 	a += x;
> 	b += a;
> 	b += b;
> 	b += a;
> 	a <<= 6;
> 	return a + b + ((uint16_t)(x * 0x61c8) << 16);
>
> gcc can then be coaxed into generating the desired code.
> (The "asm" is the only way I found to stop GCC from collapsing
> "(x << 1) << 8" to "x << 9", which is slower.)
>
> baz:
> 	move.l 4(%sp),%d1
> 	move.l %d1,%a0		 4
> 	add.l %d1,%a0		 8  12
> 	move.l %a0,%d0		 4  16
> 	lsl.l #8,%d0		24  40
> 	add.l %d1,%d0		 8  48
> 	add.l %d0,%a0		 8  56
> 	muls.w #25032,%d1	56 112
> 	swap %d1		 4 116
> 	clr.w %d1		 4 120
> 	add.l %d0,%d1		 8 128
> 	add.l %a0,%d1		 8 136
> 	add.l %a0,%d1		 8 144
> 	lsl.l #6,%d0		20 164
> 	add.l %d1,%d0		 8 172
> 	rts
>
> Well, that's a *little* bit faster.
>
>
>
> Alternatively, I could just invent a whole different hash function for
> the case.  One idea is to use two, rather than three, 16-bit multiplies.
> Split the input into two 16-bit halves, multiply them by separate
> constants, sum the results, and then swap the halves.
>
> 	move.w	x, y		 4
> 	swap	x		 4   8
> 	mulu.w	#K1, y		58  66
> 	mulu.w	#K2, x		58 124
> 	add.l	y, x		 8 132
> 	swap	x		 4 136
>
> The swap is important; hash_32() is documented as putting the "most mixed"
> bits in the most significant bits, as hash tables are indexed by the most
> significant bits.
>
> If you have any thoughts on the matter, they'd definitely be appreciated.
>


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: George Spelvin @ 2016-05-12 20:52 UTC
  To: geert, gregungerer, linux-m68k, linux

Thank you very much!

Greg Ungerer wrote:
> This syntax fails for me (using a binutils-2.25.1 based toolchain).
> Registers must be prefixed with a "%", so here %sp and %a0.
>
>   arch/m68k/lib/mulsi3.S: Assembler messages:
>   arch/m68k/lib/mulsi3.S:12: Error: syntax error -- statement `lea 4(sp),a0' ignored
> 
> That was compiling with just this one patch for a ColdFire target.

Well, *that* is an embarrassing oversight.  If you fix that obvious typo
(I used ":%s/[ad][01]/%&/g"), do you have a way of testing it?

Setting up a suitable Qemu environment is many times the effort needed
to write the code, so I was hoping someone with the facilities already
in place would be willing to test.

I've stared at the code and am convinced it's right, but I remember
Knuth's words: "Beware of bugs in the above code; I have only proved it
correct, not tried it."


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: Greg Ungerer @ 2016-05-13  1:07 UTC
  To: George Spelvin, geert, linux-m68k

On 13/05/16 06:52, George Spelvin wrote:
> Thank you very much!
> 
> Greg Ungerer wrote:
>> This syntax fails for me (using a binutils-2.25.1 based toolchain).
>> Registers must be prefixed with a "%", so here %sp and %a0.
>>
>>   arch/m68k/lib/mulsi3.S: Assembler messages:
>>   arch/m68k/lib/mulsi3.S:12: Error: syntax error -- statement `lea 4(sp),a0' ignored
>>
>> That was compiling with just this one patch for a ColdFire target.
> 
> Well, *that* is an embarrassing oversight.  If you fix that obvious typo
> (I used ":%s/[ad][01]/%&/g"), do you have a way of testing it?
> 
> Setting up a suitable Qemu environment is many times the effort needed
> to write the code, so I was hoping someone with the facilities already
> in place would be willing to test.
> 
> I've stared at the code and am convinced it's right, but I remember
> Knuth's words: "Beware of bugs in the above code; I have only proved it
> correct, not tried it."

I have many test setups for ColdFire (qemu and real hardware) but
none of them actually use the mulsi3 code. I don't have anything
for testing classic m68000 builds.

So other than compiling it, I don't currently have an easy way to
test it.

Regards
Greg


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: George Spelvin @ 2016-05-13  2:36 UTC
  To: geert, gregungerer, linux-m68k, linux

> I have many test setups for ColdFire (qemu and real hardware) but
> none of them actually use the mulsi3 code. I don't have anything
> for testing classic m68000 builds.
> 
> So other than compiling it, I don't currently have an easy way to
> test it.

You couldn't write a user-level test program which calls __mulsi3(x,y)
explicitly and compares the result to x*y?

I'll write it for you if you like.

Even if we can only test the ColdFire branch, that reduces the number
of untested lines considerably.


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: Greg Ungerer @ 2016-05-13  6:45 UTC
  To: George Spelvin, geert, linux-m68k

Hi George,

On 13/05/16 12:36, George Spelvin wrote:
>> I have many test setups for ColdFire (qemu and real hardware) but
>> none of them actually use the mulsi3 code. I don't have anything
>> for testing classic m68000 builds.
>>
>> So other than compiling it, I don't currently have an easy way to
>> test it.
> 
> You couldn't write a user-level test program which calls __mulsi3(x,y)
> explicitly and compares the result to x*y?

I was hoping you would write the code :-)


> I'll write it for you if you like.
> 
> Even if we can only test the ColdFire branch, that reduces the number
> of untested lines considerably.

So is something like this what you had in mind?


    #include <unistd.h>
    #include <limits.h>

    #define STEP1   (99991)
    #define MIN1    (INT_MIN + STEP1)
    #define MAX1    (INT_MAX - STEP1)

    #define STEP2   (12345)
    #define MIN2    (INT_MIN + STEP2)
    #define MAX2    (INT_MAX - STEP2)

    int __mulsi3(int x, int y);

    int main(int argc, char *argv[])
    {
        int i, j;
        for (i = MIN1; i < MAX1; i += STEP1) {
                for (j = MIN2; j < MAX2; j += STEP2) {
                        if ((i * j) != __mulsi3(i, j)) {
                                write(1, "FAIL\n", 5);
                                return 1;
                        }
                }
        }
        return 0;
    }


I minimized the loop counts to make the run time reasonable.
Maybe a rand() version wouldn't hurt either.
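
Something like this, maybe (untested sketch):

    #include <stdlib.h>
    #include <unistd.h>

    int __mulsi3(int x, int y);

    int main(int argc, char *argv[])
    {
        int i, x, y;
        srandom(1);
        for (i = 0; i < 1000000; i++) {
                /* random() only returns 31 bits, so fold in a few more */
                x = (int)((unsigned)random() ^ ((unsigned)random() << 16));
                y = (int)((unsigned)random() ^ ((unsigned)random() << 16));
                if ((x * y) != __mulsi3(x, y)) {
                        write(1, "FAIL\n", 5);
                        return 1;
                }
        }
        return 0;
    }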

Anyway, compiled as a flat binary for execution, that gives:

00000000 <main>:
   0:	518f           	subql #8,%sp
   2:	2f0d           	movel %a5,%sp@-
   4:	2f02           	movel %d2,%sp@-
   6:	203c 8001 8697 	movel #-2147383657,%d0
   c:	2f40 000c      	movel %d0,%sp@(12)
  10:	606c           	bras 7e <main+0x7e>
  12:	207c 8000 3039 	moveal #-2147471303,%a0
  18:	2f48 0008      	movel %a0,%sp@(8)
  1c:	604a           	bras 68 <main+0x68>
  1e:	242f 000c      	movel %sp@(12),%d2
  22:	41ef 0008      	lea %sp@(8),%a0
  26:	4c10 2800      	mulsl %a0@,%d2
  2a:	2f2f 0008      	movel %sp@(8),%sp@-
  2e:	2f2f 0010      	movel %sp@(16),%sp@-
  32:	202d 0000      	movel %a5@(0),%d0
  36:	2040           	moveal %d0,%a0
  38:	4e90           	jsr %a0@
  3a:	508f           	addql #8,%sp
  3c:	b082           	cmpl %d2,%d0
  3e:	671e           	beqs 5e <main+0x5e>
  40:	4878 0005      	pea 5 <main+0x5>
  44:	202d 0000      	movel %a5@(0),%d0
  48:	2f00           	movel %d0,%sp@-
  4a:	4878 0001      	pea 1 <main+0x1>
  4e:	202d 0000      	movel %a5@(0),%d0
  52:	2040           	moveal %d0,%a0
  54:	4e90           	jsr %a0@
  56:	4fef 000c      	lea %sp@(12),%sp
  5a:	7001           	moveq #1,%d0
  5c:	602e           	bras 8c <main+0x8c>
  5e:	203c 0000 3039 	movel #12345,%d0
  64:	d1af 0008      	addl %d0,%sp@(8)
  68:	207c 7fff cfc5 	moveal #2147471301,%a0
  6e:	b1ef 0008      	cmpal %sp@(8),%a0
  72:	6caa           	bges 1e <main+0x1e>
  74:	203c 0001 8697 	movel #99991,%d0
  7a:	d1af 000c      	addl %d0,%sp@(12)
  7e:	207c 7ffe 7967 	moveal #2147383655,%a0
  84:	b1ef 000c      	cmpal %sp@(12),%a0
  88:	6c88           	bges 12 <main+0x12>
  8a:	4280           	clrl %d0
  8c:	241f           	movel %sp@+,%d2
  8e:	2a5f           	moveal %sp@+,%a5
  90:	508f           	addql #8,%sp
  92:	4e75           	rts

And the mulsi3.S gives:

00000000 <___mulsi3>:
   0:	41ef 0004      	lea %sp@(4),%a0
   4:	3018           	movew %a0@+,%d0
   6:	3218           	movew %a0@+,%d1
   8:	3241           	moveaw %d1,%a1
   a:	c2d8           	muluw %a0@+,%d1
   c:	c0d0           	muluw %a0@,%d0
   e:	d280           	addl %d0,%d1
  10:	3009           	movew %a1,%d0
  12:	c0d0           	muluw %a0@,%d0
  14:	4841           	swap %d1
  16:	4241           	clrw %d1
  18:	d081           	addl %d1,%d0
  1a:	4e75           	rts

That runs with no fails in qemu and on real ColdFire hardware.
I guess it wouldn't hurt to specifically check the corner cases
either (at INT_MAX, INT_MIN and 0, for example).

Regards
Greg


* Re: [RFC PATCH 1/2] arch/m68k/lib/mulsi3.S: Optimize
From: George Spelvin @ 2016-05-13  9:02 UTC
  To: geert, gerg, linux-m68k, linux

>> I'll write it for you if you like.
> I was hoping you would write the code :-)

And then you beat me to it anyway.

> So is something like this what you had in mind?

I would probably have done it slightly differently,
but basically yes.

> That runs with no fails in qemu and on real ColdFire hardware.

Yay!  Thank you very much.

> I guess it wouldn't hurt to specifically check the corner cases
> either (at INT_MAX, INT_MIN and 0, for example).

Not really necessary, as those aren't special-cased in the code
in any way.  An error would be multiplying the wrong parts of
the inputs or summing the partial products wrong.

Which, like most math and crypto code, would result in immediate
massive errors.  Just a handful of test cases is enough.

(Typical code coverage tests want to hit every execution path,
meaning both sides of every conditional branch.  Since this is
straight-line code, it's very easy to test.)

FWIW, I've severely rethought that second patch, but don't have time
to write it up in the detail it needs just now.
