* [PATCH v4] riscv: Fixed misaligned memory access.  Fixed pointer comparison.
@ 2022-01-29  2:24 Michael T. Kloos
  2022-01-29 13:41 ` David Laight
From: Michael T. Kloos @ 2022-01-29  2:24 UTC (permalink / raw)
  To: David Laight, Palmer Dabbelt, Paul Walmsley, Albert Ou
  Cc: linux-riscv, linux-kernel, Michael T. Kloos

Rewrote the RISC-V memmove() assembly implementation.  The
previous implementation did not check memory alignment and it
compared 2 pointers with a signed comparison.  The misaligned
memory access would cause the kernel to crash on systems that
did not emulate it in firmware and did not support it in hardware.
Firmware emulation is slow and may not exist.  The RISC-V spec
does not guarantee that support for misaligned memory accesses
will exist.  It should not be depended on.

This patch now checks whether the pointers are co-aligned at XLEN
granularity.  Failing that, copying is done by loading from the two
contiguous, naturally aligned XLEN-sized memory locations that
contain the overlapping XLEN-sized data to be copied.  Only one load
from memory is needed per loop iteration because the value loaded is
carried by cpu register into the next iteration, minimizing the
memory accesses required.  On each iteration, the two values are
shifted into place and OR'ed together, and the result is stored into
the corresponding naturally aligned XLEN-sized location in the
destination.  For unaligned data at either end of the regions to be
copied, and for copies less than XLEN in size, byte copy is used.
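
In rough C terms, the forward half of the misaligned fixup copy
amounts to the sketch below.  This is illustrative only: the function
name is made up, the byte-copy head/tail and the reverse direction
are omitted, it assumes little-endian, dest is already SZREG-aligned,
and the offset is never zero on this path because the path is only
taken when the pointers are not co-aligned.

#include <stddef.h>
#include <stdint.h>

static void fixup_copy_fwd(unsigned long *dst, unsigned long *dst_end,
			   const unsigned char *src)
{
	/* off is in [1, SZREG - 1] here: dst is aligned, src is not */
	size_t off = (uintptr_t)src & (sizeof(unsigned long) - 1);
	unsigned int sh  = off * 8;				/* a6 */
	unsigned int ish = sizeof(unsigned long) * 8 - sh;	/* a7 */
	const unsigned long *s =
		(const unsigned long *)((uintptr_t)src - off);	/* align src down */
	unsigned long cur = s[0];	/* the only load outside the loop */

	while (dst != dst_end) {
		unsigned long prev = cur;	/* carried by register */
		cur = s[1];			/* one load per iteration */
		*dst++ = (prev >> sh) | (cur << ish);
		s++;
	}
}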

This patch also now uses unsigned comparison for the pointers and
migrates to the newer assembler annotations from the now deprecated
ones.

[v3]

Fixed the build issue reported by the test robot.  Changed the
copy implementation based on the suggestion by David Laight.

One change that could potentially still be made is to roll one
of the values loaded in the copy loop into the next iteration
of the fixup loop, rather than reloading both values from memory
on each iteration.  It would require some more logic and I'm
really not sure that it is worth it.  It could be added in a
later patch.  For now, this fixes the issues I set out to fix.

[v4]

I could not resist implementing the optimization I mentioned in
my v3 notes.  I have implemented the roll over of data by cpu
register in the misaligned fixup copy loops.  Now, only one load
from memory is required per iteration of the loop.

Signed-off-by: Michael T. Kloos <michael@michaelkloos.com>
---
 arch/riscv/lib/memmove.S | 327 ++++++++++++++++++++++++++++++++-------
 1 file changed, 267 insertions(+), 60 deletions(-)

diff --git a/arch/riscv/lib/memmove.S b/arch/riscv/lib/memmove.S
index 07d1d2152ba5..13f17affe115 100644
--- a/arch/riscv/lib/memmove.S
+++ b/arch/riscv/lib/memmove.S
@@ -1,64 +1,271 @@
-/* SPDX-License-Identifier: GPL-2.0 */
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2022 Michael T. Kloos <michael@michaelkloos.com>
+ */
 
 #include <linux/linkage.h>
 #include <asm/asm.h>
 
-ENTRY(__memmove)
-WEAK(memmove)
-        move    t0, a0
-        move    t1, a1
-
-        beq     a0, a1, exit_memcpy
-        beqz    a2, exit_memcpy
-        srli    t2, a2, 0x2
-
-        slt     t3, a0, a1
-        beqz    t3, do_reverse
-
-        andi    a2, a2, 0x3
-        li      t4, 1
-        beqz    t2, byte_copy
-
-word_copy:
-        lw      t3, 0(a1)
-        addi    t2, t2, -1
-        addi    a1, a1, 4
-        sw      t3, 0(a0)
-        addi    a0, a0, 4
-        bnez    t2, word_copy
-        beqz    a2, exit_memcpy
-        j       byte_copy
-
-do_reverse:
-        add     a0, a0, a2
-        add     a1, a1, a2
-        andi    a2, a2, 0x3
-        li      t4, -1
-        beqz    t2, reverse_byte_copy
-
-reverse_word_copy:
-        addi    a1, a1, -4
-        addi    t2, t2, -1
-        lw      t3, 0(a1)
-        addi    a0, a0, -4
-        sw      t3, 0(a0)
-        bnez    t2, reverse_word_copy
-        beqz    a2, exit_memcpy
-
-reverse_byte_copy:
-        addi    a0, a0, -1
-        addi    a1, a1, -1
-
-byte_copy:
-        lb      t3, 0(a1)
-        addi    a2, a2, -1
-        sb      t3, 0(a0)
-        add     a1, a1, t4
-        add     a0, a0, t4
-        bnez    a2, byte_copy
-
-exit_memcpy:
-        move a0, t0
-        move a1, t1
-        ret
-END(__memmove)
+SYM_FUNC_START(__memmove)
+SYM_FUNC_START_WEAK(memmove)
+	/*
+	 * Returns
+	 *   a0 - dest
+	 *
+	 * Parameters
+	 *   a0 - Inclusive first byte of dest
+	 *   a1 - Inclusive first byte of src
+	 *   a2 - Length of copy n
+	 *
+	 * Because the return matches the parameter register a0,
+	 * we will not clobber or modify that register.
+	 *
+	 * Note: This currently only works on little-endian.
+	 * To port to big-endian, reverse the direction of shifts
+	 * in the 2 misaligned fixup copy loops.
+	 */
+
+	/* Return if nothing to do */
+	beq a0, a1, 20f /* Return from memmove */
+	beqz a2, 20f /* Return from memmove */
+
+	/*
+	 * Register Uses
+	 *      Forward Copy: a3 - Index counter of src
+	 *      Reverse Copy: a4 - Index counter of src
+	 *      Forward Copy: t3 - Index counter of dest
+	 *      Reverse Copy: t4 - Index counter of dest
+	 *   Both Copy Modes: t5 - Inclusive first multibyte/aligned of dest
+	 *   Both Copy Modes: t6 - Non-Inclusive last multibyte/aligned of dest
+	 *   Both Copy Modes: t0 - Link / Temporary for load-store
+	 *   Both Copy Modes: t1 - Temporary for load-store
+	 *   Both Copy Modes: t2 - Temporary for load-store
+	 *   Both Copy Modes: a5 - dest to src alignment offset
+	 *   Both Copy Modes: a6 - Shift amount
+	 *   Both Copy Modes: a7 - Inverse Shift amount
+	 */
+
+	/*
+	 * Solve for some register values now.
+	 * Byte copy does not need t5 or t6.
+	 */
+	add  t3, a0, zero
+	add  t4, a0, a2
+	add  a3, a1, zero
+	add  a4, a1, a2
+
+	/*
+	 * Byte copy if copying less than SZREG bytes.
+	 * This can cause problems with the bulk copy
+	 * implementation and is small enough not
+	 * to bother.
+	 */
+	andi t0, a2, -SZREG
+	beqz t0, 21f
+
+	/*
+	 * Now solve for t5 and t6.
+	 */
+	andi t5, t3, -SZREG
+	andi t6, t4, -SZREG
+	/*
+	 * If dest(Register t3) rounded down to the nearest naturally
+	 * aligned SZREG address, does not equal dest, then add SZREG
+	 * to find the low-bound of SZREG alignment in the dest memory
+	 * region.  Note that this could overshoot the dest memory
+	 * region if n is less than SZREG.  This is one reason why
+	 * we always byte copy if n is less than SZREG.
+	 * Otherwise, dest is already naturally aligned to SZREG.
+	 */
+	beq  t5, t3, 1f
+		addi t5, t5, SZREG
+	1:
+
+	/*
+	 * If the dest and src are co-aligned to SZREG, then there is
+	 * no need for the full rigmarole of a full misaligned fixup copy.
+	 * Instead, do a simpler co-aligned copy.
+	 */
+	xor  t0, a0, a1
+	andi t1, t0, (SZREG - 1)
+	beqz t1, 26f
+	/* Fall through to misaligned fixup copy */
+
+1: /* Misaligned fixup copy */
+	bltu a1, a0, 4f /* Misaligned fixup copy: Reverse */
+
+3: /* Misaligned fixup copy: Forward */
+	jal  t0, 24f /* Byte copy until aligned: Forward */
+
+	andi a5, a3, (SZREG - 1) /* Find the alignment offset of src (a3) */
+	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
+
+	sub  a3, a3, a5 /* Align the src pointer */
+
+	/*
+	 * Compute The Inverse Shift
+	 * a7 = XLEN - a6 = XLEN + -a6
+	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
+	 * Add that to XLEN.  XLEN = SZREG * 8.
+	 */
+	not  a7, a6
+	addi a7, a7, (SZREG * 8 + 1)
+
+	/*
+	 * Fix Misalignment Copy Loop.
+	 * load_val1 = load_ptr[0];
+	 * while (store_ptr != store_ptr_end) {
+	 *   load_val0 = load_val1;
+	 *   load_val1 = load_ptr[1];
+	 *   *store_ptr = (load_val0 >> {a6}) | (load_val1 << {a7});
+	 *   load_ptr++;
+	 *   store_ptr++;
+	 * }
+	 */
+	REG_L t0, 0x000(a3)
+	1:
+	beq   t3, t6, 2f
+	mv    t1, t0
+	REG_L t0, SZREG(a3)
+	srl   t1, t1, a6
+	sll   t2, t0, a7
+	or    t1, t1, t2
+	REG_S t1, 0x000(t3)
+	addi  a3, a3, SZREG
+	addi  t3, t3, SZREG
+	j 1b
+	2:
+
+	add  a3, a3, a5 /* Restore the src pointer */
+	j 22f /* Byte copy: Forward */ /* Copy any remaining bytes */
+
+4: /* Misaligned fixup copy: Reverse */
+	jal  t0, 25f /* Byte copy until aligned: Reverse */
+
+	andi a5, a4, (SZREG - 1) /* Find the alignment offset of src (a4) */
+	slli a6, a5, 3 /* Multiply by 8 to convert that to bits to shift */
+
+	sub  a4, a4, a5 /* Align the src pointer */
+
+	/*
+	 * Compute The Inverse Shift
+	 * a7 = XLEN - a6 = XLEN + -a6
+	 * 2s complement negation to find the negative: -a6 = ~a6 + 1
+	 * Add that to XLEN.  XLEN = SZREG * 8.
+	 */
+	not  a7, a6
+	addi a7, a7, (SZREG * 8 + 1)
+
+	/*
+	 * Fix Misalignment Copy Loop.
+	 * load_val0 = load_ptr[0];
+	 * while (store_ptr != store_ptr_end) {
+	 *   load_ptr--;
+	 *   store_ptr--;
+	 *   load_val1 = load_val0;
+	 *   load_val0 = load_ptr[0];
+	 *   *store_ptr = (load_val0 >> {a6}) | (load_val1 << {a7});
+	 * }
+	 */
+	REG_L t0, 0x000(a4)
+	1:
+	beq   t4, t5, 2f
+	addi  a4, a4, -SZREG
+	addi  t4, t4, -SZREG
+	mv    t2, t0
+	REG_L t0, 0x000(a4)
+	srl   t1, t0, a6
+	sll   t2, t2, a7
+	or    t1, t1, t2
+	REG_S t1, 0x000(t4)
+	j 1b
+	2:
+
+	add  a4, a4, a5 /* Restore the src pointer */
+	j 23f /* Byte copy: Reverse */ /* Copy any remaining bytes */
+
+/*
+ * Simple copy loops for SZREG co-aligned memory locations.
+ * These also make calls to do byte copies for any unaligned
+ * data at their terminations.
+ */
+26: /* Co-Aligned copy */
+	bltu a1, a0, 3f /* Co-Aligned copy: Reverse */
+
+2: /* Co-Aligned copy: Forward */
+	jal t0, 24f /* Byte copy until aligned: Forward */
+
+	1:
+	beq   t3, t6, 22f /* Byte copy: Forward */
+	REG_L t1, (a3)
+	REG_S t1, (t3)
+	addi  a3, a3, SZREG
+	addi  t3, t3, SZREG
+	j 1b
+
+3: /* Co-Aligned copy: Reverse */
+	jal t0, 25f /* Byte copy until aligned: Reverse */
+
+	1:
+	beq   t4, t5, 23f /* Byte copy: Reverse */
+	addi  a4, a4, -SZREG
+	addi  t4, t4, -SZREG
+	REG_L t1, (a4)
+	REG_S t1, (t4)
+	j 1b
+
+/*
+ * These are basically sub-functions within the function.  They
+ * are used to byte copy until the dest pointer is in alignment.
+ * At which point, a bulk copy method can be used by the
+ * calling code.  These work on the same registers as the bulk
+ * copy loops.  Therefore, the register values can be picked
+ * up from where they were left and we avoid code duplication
+ * without any overhead except the call in and return jumps.
+ */
+24: /* Byte copy until aligned: Forward */
+	beq  t3, t5, 1f /* Reuse the return from the other copy loop */
+	lb   t1, (a3)
+	sb   t1, (t3)
+	addi a3, a3, 1
+	addi t3, t3, 1
+	j 24b
+
+25: /* Byte copy until aligned: Reverse */
+	beq  t4, t6, 1f
+	addi a4, a4, -1
+	addi t4, t4, -1
+	lb   t1, (a4)
+	sb   t1, (t4)
+	j 25b
+	1: jalr zero, 0x0(t0) /* Return to multibyte copy loop */
+
+/*
+ * Simple byte copy loops.
+ * These will byte copy until they reach the end of data to copy.
+ * At that point, they will call to return from memmove.
+ */
+21: /* Byte copy */
+	bltu a1, a0, 23f /* Byte copy: Reverse */
+
+22: /* Byte copy: Forward */
+	beq  t3, t4, 20f /* Return from memmove */
+	lb   t1, (a3)
+	sb   t1, (t3)
+	addi a3, a3, 1
+	addi t3, t3, 1
+	j 22b
+
+23: /* Byte copy: Reverse */
+	beq  t4, t3, 20f /* Return from memmove */
+	addi a4, a4, -1
+	addi t4, t4, -1
+	lb   t1, (a4)
+	sb   t1, (t4)
+	j 23b
+
+20: /* Return from memmove */
+	ret
+
+SYM_FUNC_END(memmove)
+SYM_FUNC_END(__memmove)
-- 
2.34.1


* RE: [PATCH v4] riscv: Fixed misaligned memory access.  Fixed pointer comparison.
  2022-01-29  2:24 [PATCH v4] riscv: Fixed misaligned memory access. Fixed pointer comparison Michael T. Kloos
@ 2022-01-29 13:41 ` David Laight
  2022-01-29 21:07   ` Michael T. Kloos
From: David Laight @ 2022-01-29 13:41 UTC (permalink / raw)
  To: 'michael@michaelkloos.com',
	Palmer Dabbelt, Paul Walmsley, Albert Ou
  Cc: linux-riscv, linux-kernel

From: michael@michaelkloos.com
...
> [v4]
> 
> I could not resist implementing the optimization I mentioned in
> my v3 notes.  I have implemented the roll over of data by cpu
> register in the misaligned fixup copy loops.  Now, only one load
> from memory is required per iteration of the loop.

I nearly commented...

...
> +	/*
> +	 * Fix Misalignment Copy Loop.
> +	 * load_val1 = load_ptr[0];
> +	 * while (store_ptr != store_ptr_end) {
> +	 *   load_val0 = load_val1;
> +	 *   load_val1 = load_ptr[1];
> +	 *   *store_ptr = (load_val0 >> {a6}) | (load_val1 << {a7});
> +	 *   load_ptr++;
> +	 *   store_ptr++;
> +	 * }
> +	 */
> +	REG_L t0, 0x000(a3)
> +	1:
> +	beq   t3, t6, 2f
> +	mv    t1, t0
> +	REG_L t0, SZREG(a3)
> +	srl   t1, t1, a6
> +	sll   t2, t0, a7
> +	or    t1, t1, t2
> +	REG_S t1, 0x000(t3)
> +	addi  a3, a3, SZREG
> +	addi  t3, t3, SZREG
> +	j 1b

No point jumping to a conditional branch that jumps back.
Make this a:
	bne	t3, t6, 1b
and move 1: down one instruction.
(Or is the 'beq' at the top even possible - there is likely to
be an earlier test for zero length copies.)
> +	2:

I also suspect it is worth unrolling the loop once.
You lose the 'mv t1, t0' and one 'addi' for each word transferred.

I think someone mentioned that there is a delay of a few clocks before
the data from the memory read (REG_L) is actually available.
On an in-order cpu this is likely to be a full pipeline stall.
So move the 'addi' up between the 'REG_L' and 'sll' instructions.
(The offset will need to be -SZREG to match.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


* Re: [PATCH v4] riscv: Fixed misaligned memory access. Fixed pointer comparison.
  2022-01-29 13:41 ` David Laight
@ 2022-01-29 21:07   ` Michael T. Kloos
From: Michael T. Kloos @ 2022-01-29 21:07 UTC (permalink / raw)
  To: David Laight, Palmer Dabbelt, Paul Walmsley, Albert Ou
  Cc: linux-riscv, linux-kernel

Thank you for your feedback, David.

> No point jumping to a conditional branch that jumps back.
> Make this a:
> 	bne	t3, t6, 1b
> and move 1: down one instruction.
> (Or is the 'beq' at the top even possible - there is likely to
> be an earlier test for zero length copies.)
There are 2 checks for length.  One returns if zero.  The other 
limits the copy to byte-copy if it is below SZREG.  It is possible 
for the 'beq' to be true on the first attempt.

This would happen with something like "memmove(0x02, 0x13, 5)" on 
32-bit.  At no point would there be the ability to do a full 
register width store.  So, misaligned copy would be jumped to 
because they are not co-aligned and size is greater than or equal 
to SZREG.  The misaligned copy would first call the byte-copy 
sub-function to bring dest into natural SZREG alignment.  From 
there, the misaligned copy loop would start, but it would 
immediately break on the 'beq' being true because there are not 
SZREG bytes left to copy.  So it would jump to byte copy which 
would again try to copy any remaining unaligned bytes (if any 
exist).

One alternative is to make the previous copy length check force 
byte-copy for any size less than 2 * SZREG, instead of just SZREG.  
This would guarantee that the loop would always be able to run at 
least once before breaking.  If you think I should make this change, 
let me know.
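
In C terms that alternative gate would be something like the
following (hypothetical helper; the real check is the andi/beqz pair
near the top of the assembly):

#include <stddef.h>

/* Force the byte-copy path for short copies.  Raising the threshold
 * from SZREG to 2 * SZREG guarantees the bulk copy loop completes at
 * least one full iteration before its exit test. */
static inline int force_byte_copy(size_t n)
{
	return n < 2 * sizeof(unsigned long);	/* current code: n < SZREG */
}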

I also gave a lot of thought when implementing the register value 
rollover so that I would not accidentally page fault in the kernel.  
I needed to make sure that those first loads, since they occur 
before those conditional checks, would not try to load from 
anywhere outside of the page boundaries.  Given that the code 
pathway would not have been taken unless the copy length is 
greater than zero and the low bits of dest are not equal to the low 
bits of src (they are not co-aligned), we can say that bytes would 
have to have been loaded from within that SZREG-aligned bound 
anyway.  And since pages are aligned at minimum to 0x1000 
boundaries, I was confident that I would be within bounds even if 
right up against the edge.
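
Written out as a purely illustrative check (always true as long as
SZREG is a power of 2 no larger than the page size):

#include <stdint.h>

/* The naturally aligned SZREG-sized word containing any byte of src
 * lies entirely inside the 0x1000-aligned page containing that byte,
 * because the page size is a multiple of SZREG. */
static inline int aligned_word_in_same_page(uintptr_t src_byte)
{
	uintptr_t word = src_byte & ~(uintptr_t)(sizeof(unsigned long) - 1);
	uintptr_t page = src_byte & ~(uintptr_t)0xfff;

	return word >= page &&
	       word + sizeof(unsigned long) - 1 <= page + 0xfff;
}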

I will make the change from j to bne instructions in the applicable 
loops.


> I also suspect it is worth unrolling the loop once.
> You lose the 'mv t1, t0' and one 'addi' for each word transferred.
I'll explore this possibility, but to my previous point, I need 
to be careful about bounds checking.  Until I try to write it, I 
can't be sure if it will make sense.  I'll see what I can come up 
with.  No promises.

> I think someone mentioned that there is a delay of a few clocks before
> the data from the memory read (REG_L) is actually available.
> On an in-order cpu this is likely to be a full pipeline stall.
> So move the 'addi' up between the 'REG_L' and 'sll' instructions.
> (The offset will need to be -SZREG to match.)
I try to keep loads as far ahead of their uses as is reasonable to 
avoid this.  I'll make this change in the next version of the patch.
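
Roughly, in C terms, the reshaped forward loop becomes bottom-tested,
something like the sketch below (not the final assembly; same made-up
names and assumptions as the sketch in the patch description, with
the early pointer advance standing in for scheduling the addi between
the load and its use):

#include <stddef.h>
#include <stdint.h>

static void fixup_copy_fwd_bottom_test(unsigned long *dst,
				       unsigned long *dst_end,
				       const unsigned char *src)
{
	size_t off = (uintptr_t)src & (sizeof(unsigned long) - 1);
	unsigned int sh  = off * 8;
	unsigned int ish = sizeof(unsigned long) * 8 - sh;
	const unsigned long *s =
		(const unsigned long *)((uintptr_t)src - off);
	unsigned long cur;

	if (dst == dst_end)
		return;

	cur = s[0];
	do {
		unsigned long prev = cur;
		cur = s[1];
		s++;			/* advanced early, away from the use of cur */
		*dst++ = (prev >> sh) | (cur << ish);
	} while (dst != dst_end);	/* single conditional branch per iteration */
}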

	Michael
