From: Akira Tsukamoto <akira.tsukamoto@gmail.com>
To: David Laight <David.Laight@aculab.com>
Cc: Matteo Croce <mcroce@linux.microsoft.com>,
	Bin Meng <bmeng.cn@gmail.com>,
	Emil Renner Berthing <kernel@esmil.dk>,
	Gary Guo <gary@garyguo.net>,
	"linux-riscv@lists.infradead.org"
	<linux-riscv@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Palmer Dabbelt <palmer@dabbelt.com>,
	Albert Ou <aou@eecs.berkeley.edu>,
	Atish Patra <atish.patra@wdc.com>,
	Drew Fustini <drew@beagleboard.org>
Subject: Re: [PATCH 1/3] riscv: optimized memcpy
Date: Wed, 16 Jun 2021 19:48:22 +0900
Message-ID: <CACuRN0OThmL5yAAzGv9r6LjR8Z7q4-FJs4LpU50xWNDtyXQyYw@mail.gmail.com>
In-Reply-To: <db7a011867a742528beb6ec17b692842@AcuMS.aculab.com>

On Wed, Jun 16, 2021 at 5:24 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Matteo Croce
> > Sent: 16 June 2021 03:02
> ...
> > > > That's a good idea, but if you read the replies to Gary's original
> > > > patch
> > > > https://lore.kernel.org/linux-riscv/20210216225555.4976-1-gary@garyguo.net/
> > > > .. Gary, Palmer, and David would all prefer a C-based version.
> > > > This is one attempt at providing that.
> > >
> > > Yep, I prefer C as well :)
> > >
> > > But if you check commit 04091d6, the assembly version was introduced
> > > for KASAN. So if we are to change it back to C, please make sure KASAN
> > > is not broken.
> > >
> ...
> > Leaving out the first memcpy/memset of every test, which is always slower
> > (maybe because of a cache miss?), the current implementation copies 260 MB/s
> > when the low-order bits match, and 114 MB/s otherwise.
> > Memset is stable at 278 MB/s.
> >
> > Gary's implementation is much faster: it still copies 260 MB/s when the
> > buffers are equally aligned, and 230 MB/s otherwise. Memset is the same as
> > the current one.
>
> Any idea what the attainable performance is for the CPU you are using?
> Since both memset and memcpy are running at much the same speed
> I suspect it is all limited by the writes.
>
> 272 MB/s is only 34M writes/sec.
> That seems horribly slow for a modern CPU.
> So is this actually limited by cache write-backs to physical memory?
>
> You might want to do some tests (userspace is fine) where you
> check much smaller lengths that definitely sit within the data cache.
>
> It is also worth checking how much overhead there is for
> short copies - they are almost certainly more common than
> you might expect.
> This is one problem with excessive loop unrolling - the 'special
> cases' for the ends of the buffer start having a big effect
> on small copies.
>
> For CPUs that support misaligned memory accesses, one 'trick'
> for transfers longer than a 'word' is to do a (probably misaligned)
> transfer of the last word of the buffer first, followed by the
> transfer of the rest of the buffer (overlapping a few bytes at the end).
> This saves on conditionals and temporary values.
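
For illustration (not code from the patch; the helper name is made up, and it
assumes n >= 8 on a CPU where misaligned word accesses are cheap), the trick
David describes might look roughly like this in C:

	#include <stdint.h>
	#include <string.h>

	/*
	 * Store the possibly-misaligned last word of the buffer first,
	 * then copy the rest in whole words; the final iteration simply
	 * overlaps a few bytes already written at the end, so there is
	 * no byte-by-byte handling of the buffer tail.
	 */
	static void copy_overlap_tail(char *dst, const char *src, size_t n)
	{
		uint64_t w;
		size_t off;

		memcpy(&w, src + n - 8, 8);	/* misaligned load of the tail */
		memcpy(dst + n - 8, &w, 8);	/* tail written first */

		for (off = 0; off < n - 8; off += 8) {
			memcpy(&w, src + off, 8);
			memcpy(dst + off, &w, 8);
		}
	}

The memcpy(&w, ..., 8) idiom is just the portable-C way to spell a single
word load or store; compilers lower it to one instruction.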

I am fine with Matteo's memcpy.

The two culprits seen in the `perf top -Ue task-clock` output during
TCP and UDP network runs are

> Overhead  Shared O  Symbol
>  42.22%  [kernel]  [k] memcpy
>  35.00%  [kernel]  [k] __asm_copy_to_user

so we really need to optimize both memcpy and __asm_copy_to_user.

The main reason for the speedup in memcpy is that

> Gary's assembly version of memcpy improves performance by not using
> unaligned accesses across 64-bit boundaries: it reads at aligned offsets
> and shifts the data into place instead, because every misaligned access
> traps and switches to OpenSBI in M-mode. The main speedup comes from
> avoiding that S-mode (kernel) to M-mode (OpenSBI) switching.

which corresponds to this code:

Gary's:
+	/* Calculate shifts */
+	slli	t3, a3, 3
+	sub	t4, x0, t3	/* negate is okay as shift will only look at LSBs */
+
+	/* Load the initial value and align a1 */
+	andi	a1, a1, ~(SZREG-1)
+	REG_L	a5, 0(a1)
+
+	addi	t0, t0, -(SZREG-1)
+	/* At least one iteration will be executed here, no check */
+1:
+	srl	a4, a5, t3
+	REG_L	a5, SZREG(a1)
+	addi	a1, a1, SZREG
+	sll	a2, a5, t4
+	or	a2, a2, a4
+	REG_S	a2, 0(a0)
+	addi	a0, a0, SZREG
+	bltu	a0, t0, 1b

and Matteo's port to C:

+#pragma GCC unroll 8
+	for (next = s.ulong[0]; count >= bytes_long + mask; count -= bytes_long) {
+		last = next;
+		next = s.ulong[1];
+
+		d.ulong[0] = last >> (distance * 8) |
+			     next << ((bytes_long - distance) * 8);
+
+		d.ulong++;
+		s.ulong++;
+	}
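
For readers without the full patch context, here is a self-contained C sketch
of the technique both excerpts implement (a hypothetical function, not the
patch itself; it assumes a 64-bit little-endian machine, a word-aligned dst,
n a non-zero multiple of 8, and src misaligned by 1..7 bytes, since the fully
aligned case takes a separate plain word-copy path):

	#include <stdint.h>
	#include <stddef.h>

	static void copy_shifted(uint64_t *dst, const char *src, size_t n)
	{
		size_t distance = (uintptr_t)src & 7;	/* 1..7 by assumption */
		const uint64_t *s = (const uint64_t *)(src - distance);
		uint64_t last, next = *s++;
		size_t i;

		for (i = 0; i < n / 8; i++) {
			last = next;
			/*
			 * Aligned load only, so it never traps to M-mode.
			 * The final load reads a few bytes past src + n but
			 * stays inside the aligned word holding the last
			 * source byte.
			 */
			next = *s++;
			/* Stitch one output word from two aligned input words. */
			dst[i] = last >> (distance * 8) |
				 next << ((8 - distance) * 8);
		}
	}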

I believe this is reasonable and good enough to go upstream.

Akira


>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>

Thread overview: 55+ messages

2021-06-15  2:38 [PATCH 0/3] riscv: optimized mem* functions Matteo Croce
2021-06-15  2:38 ` [PATCH 1/3] riscv: optimized memcpy Matteo Croce
2021-06-15  8:57   ` David Laight
2021-06-15 13:08     ` Bin Meng
2021-06-15 13:18       ` David Laight
2021-06-15 13:28         ` Bin Meng
2021-06-15 16:12           ` Emil Renner Berthing
2021-06-16  0:33             ` Bin Meng
2021-06-16  2:01               ` Matteo Croce
2021-06-16  8:24                 ` David Laight
2021-06-16 10:48                   ` Akira Tsukamoto [this message]
2021-06-16 19:06                   ` Matteo Croce
2021-06-15 13:44         ` Matteo Croce
2021-06-16 11:46   ` Guo Ren
2021-06-16 18:52     ` Matteo Croce
2021-06-17 21:30       ` David Laight
2021-06-17 21:48         ` Matteo Croce
2021-06-18  0:32           ` Matteo Croce
2021-06-18  1:05             ` Matteo Croce
2021-06-18  8:32               ` David Laight
2021-06-15  2:38 ` [PATCH 2/3] riscv: optimized memmove Matteo Croce
2021-06-15  2:38 ` [PATCH 3/3] riscv: optimized memset Matteo Croce
2021-06-15  2:43 ` [PATCH 0/3] riscv: optimized mem* functions Bin Meng
2024-01-28 11:10 [PATCH 0/3] riscv: optimize memcpy/memmove/memset Jisheng Zhang
2024-01-28 11:10 ` [PATCH 1/3] riscv: optimized memcpy Jisheng Zhang
2024-01-28 12:35   ` David Laight
2024-01-30 12:11   ` Nick Kossifidis
2024-01-30 22:44   ` kernel test robot
2024-01-31  0:19   ` kernel test robot
2024-01-31  0:19   ` kernel test robot