From: Akira Tsukamoto <akira.tsukamoto@gmail.com>
To: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>,
	Albert Ou <aou@eecs.berkeley.edu>, Gary Guo <gary@garyguo.net>,
	Nick Hu <nickhu@andestech.com>, Nylon Chen <nylon7@andestech.com>,
	linux-riscv@lists.infradead.org,
	Linux kernel mailing list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 0/1] riscv: better network performance with memcpy, uaccess
Date: Sat, 5 Jun 2021 17:02:44 +0900	[thread overview]
Message-ID: <CACuRN0MV4zNj1rBTnppoSudy98aOj2Pj6Ld1+D8mz0fn8kxGtg@mail.gmail.com> (raw)
In-Reply-To: <mhng-a3a53753-73e5-4676-93d3-33c4b8760283@palmerdabbelt-glaptop>

On Sat, Jun 5, 2021 at 1:19 AM Palmer Dabbelt <palmer@dabbelt.com> wrote:
>
> On Fri, 04 Jun 2021 02:53:33 PDT (-0700), akira.tsukamoto@gmail.com wrote:
> > I am adding a cover letter to explain the history and details, since the
> > improvement comes from combining this with Gary's memcpy patch [1].
> >
> > A comparison of iperf3 benchmark results with Gary's memcpy patch and my
> > uaccess optimization patch applied. All results are from the same base
> > kernel, the same rootfs and the same BeagleV beta board.
> >
> > First (left) column : BeagleV 5.13-rc4 kernel [2]
> > Second column       : same kernel + Palmer's memcpy in C + my uaccess patch [3]
> > Third column        : same kernel + Gary's memcpy + my uaccess patch [4]
> >
> > --- TCP recv ---
> > 686 Mbits/sec  |  700 Mbits/sec  |  904 Mbits/sec
> > 683 Mbits/sec  |  701 Mbits/sec  |  898 Mbits/sec
> > 695 Mbits/sec  |  702 Mbits/sec  |  905 Mbits/sec
> >
> > --- TCP send ---
> > 383 Mbits/sec  |  390 Mbits/sec  |  393 Mbits/sec
> > 384 Mbits/sec  |  393 Mbits/sec  |  392 Mbits/sec
> >
> > --- UDP send ---
> > 307 Mbits/sec  |  358 Mbits/sec  |  402 Mbits/sec
> > 307 Mbits/sec  |  359 Mbits/sec  |  402 Mbits/sec
> >
> > --- UDP recv ---
> > 630 Mbits/sec  |  799 Mbits/sec  |  875 Mbits/sec
> > 730 Mbits/sec  |  796 Mbits/sec  |  873 Mbits/sec
> >
> >
> > The uaccess patch reduces pipeline stalls caused by read-after-write (RAW)
> > hazards by unrolling the loads and stores.
> > The main reason for keeping assembly inside uaccess.S is that
> > __asm_copy_to/from_user() must handle page faults manually inside the
> > functions.
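
For illustration, a minimal C sketch of the unrolling idea (this is not the
actual uaccess.S assembly, which additionally attaches an exception-table
fixup to every user-space access):

#include <stddef.h>
#include <stdint.h>

/*
 * 4-way unrolled word copy: all four loads are issued before the dependent
 * stores, so an in-order pipeline has time to complete each load before its
 * value is consumed, instead of stalling on every load/store pair.
 */
static void copy_words_unrolled(uint64_t *dst, const uint64_t *src,
				size_t nwords)
{
	size_t i;

	for (i = 0; i + 4 <= nwords; i += 4) {
		uint64_t a = src[i + 0];
		uint64_t b = src[i + 1];
		uint64_t c = src[i + 2];
		uint64_t d = src[i + 3];

		dst[i + 0] = a;
		dst[i + 1] = b;
		dst[i + 2] = c;
		dst[i + 3] = d;
	}
	for (; i < nwords; i++)		/* remaining tail words */
		dst[i] = src[i];
}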
> >
> > The result above combines Gary's memcpy, which speeds things up by
> > reducing the S-mode and M-mode switching, with my uaccess change, which
> > reduces pipeline stalls when user space makes syscalls with large amounts
> > of data.
> >
> > We had a discussion with Palmer about improving network performance on the
> > BeagleV beta board.
> >
> > Palmer suggested using C-based string routines that check for unaligned
> > addresses, use an 8-byte aligned copy when both src and dest are aligned,
> > and otherwise fall back to the current copy function.
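
Roughly, such a routine might look like the following C sketch (hypothetical
function name, not the actual proposal):

#include <stddef.h>
#include <stdint.h>

/*
 * Copy 8 bytes at a time only when both src and dest are 8-byte aligned;
 * otherwise, and for the tail, fall back to the plain byte copy.
 */
static void *memcpy_aligned_sketch(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	if ((((uintptr_t)d | (uintptr_t)s) & 7) == 0) {
		while (n >= 8) {
			*(uint64_t *)d = *(const uint64_t *)s;
			d += 8;
			s += 8;
			n -= 8;
		}
	}
	while (n--)
		*d++ = *s++;

	return dest;
}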
> >
> > Gary's assembly version of memcpy improves performance by never making
> > unaligned accesses across a 64-bit boundary; instead it reads with aligned
> > accesses at an offset and shifts the data into place. This matters because
> > every misaligned access is trapped and handed to OpenSBI in M-mode, so the
> > main speedup comes from avoiding the S-mode (kernel) / M-mode (OpenSBI)
> > switching.
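
The core of that technique, as a rough C sketch (little-endian, assuming the
source is misaligned by 0 < off < 8 bytes, and ignoring the boundary handling
the real assembly has to do):

#include <stddef.h>
#include <stdint.h>

static void copy_shifted_sketch(uint64_t *dst, const unsigned char *src,
				size_t nwords)
{
	size_t off = (uintptr_t)src & 7;			/* 1..7 */
	const uint64_t *p = (const uint64_t *)(src - off);	/* aligned base */
	unsigned int lo = 8 * off, hi = 64 - lo;
	uint64_t cur = *p++;
	size_t i;

	for (i = 0; i < nwords; i++) {
		uint64_t next = *p++;		/* aligned load only */

		/* rebuild one destination word from two aligned reads */
		dst[i] = (cur >> lo) | (next << hi);
		cur = next;
	}
}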
> >
> > Processing network packets requires many unaligned accesses for the packet
> > headers, and the header format cannot be redesigned to be aligned.
> > In addition, user applications pass large packet buffers to send/recv() and
> > sendto()/recvfrom() so that fewer calls are needed to read and write the
> > data, which is where these optimizations help.
>
> Makes sense.  I'm still not opposed to moving to a C version, but it'd
> need to be a fairly complicated one.  I think having a fast C memcpy
> would likely benefit a handful of architectures, as everything we're
> talking about is an algorithmic improvement that can be expressed in C.
>
> Given that the simple memcpy doesn't perform well for your workload, I'm
> fine taking the assembly version.

Thanks for merging them.

I agree that having a fast C memcpy would benefit many architectures.
I will prepare patches for lib/string.c that extend your memcpy and send them
after I finish my other priorities. The current functions in lib/string.c use
a byte-at-a-time copy, while most Linux-capable CPUs have long since moved to
64 bits.
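
For reference, the generic fallback in lib/string.c is essentially a
byte-at-a-time loop along these lines (simplified):

#include <stddef.h>

void *memcpy(void *dest, const void *src, size_t count)
{
	char *tmp = dest;
	const char *s = src;

	while (count--)
		*tmp++ = *s++;
	return dest;
}

A word-based version along the lines of the sketch above should help any
architecture that still falls back to this.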

Akira

>
> Thanks!
>
> >
> > Akira
> >
> > [1] https://lkml.org/lkml/2021/2/16/778
> > [2] https://github.com/mcd500/linux-jh7100/tree/starlight-sdimproved
> > [3] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-palmer-string
> > [4] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-gary
> >
> > Akira Tsukamoto (1):
> >   riscv: prevent pipeline stall in __asm_to/copy_from_user
> >
> >  arch/riscv/lib/uaccess.S | 106 +++++++++++++++++++++++++++------------
> >  1 file changed, 73 insertions(+), 33 deletions(-)


Thread overview: 18+ messages
2021-06-04  9:53 [PATCH 0/1] riscv: better network performance with memcpy, uaccess Akira Tsukamoto
2021-06-04  9:56 ` [PATCH 1/1] riscv: prevent pipeline stall in __asm_to/copy_from_user Akira Tsukamoto
2021-06-08 11:31   ` David Laight
2021-06-12  4:05     ` Palmer Dabbelt
2021-06-12 12:17       ` David Laight
2021-06-16 10:24         ` Akira Tsukamoto
2021-06-16 10:08       ` Akira Tsukamoto
2021-06-04 16:19 ` [PATCH 0/1] riscv: better network performance with memcpy, uaccess Palmer Dabbelt
2021-06-05  8:02   ` Akira Tsukamoto [this message]
