Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench

From: Akira Tsukamoto <akira.tsukamoto@gmail.com>
To: Ben Dooks <ben.dooks@codethink.co.uk>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Palmer Dabbelt <palmer@dabbelt.com>,
	Albert Ou <aou@eecs.berkeley.edu>,
	linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org
Subject: Re: [PATCH v2 0/5] riscv: improving uaccess with logs from network bench
Date: Mon, 21 Jun 2021 01:36:49 +0900
Message-ID: <88ba6862-b43b-bcab-9485-1339cc765f47@gmail.com>
In-Reply-To: <542310bc-840d-d5c9-a7b3-40f58504e7b5@codethink.co.uk>

On 6/20/21 19:02, Ben Dooks wrote:
> On 19/06/2021 12:21, Akira Tsukamoto wrote:
>> Optimizing copy_to_user and copy_from_user.
>>
>> I rewrote the functions in v2, heavily influenced by Gary's memcpy
>> function [1].
>> The functions must be written in assembler to handle page faults manually
>> inside the function.
>>
>> The changes reduce the cpu usage percentage and give some improvement
>> in network speed for UDP packets.
>> Only copy_user is patched; memcpy is the original.
>>
>> All results are from the same base kernel, same rootfs and same
>> BeagleV beta board.
> 
> Is there a git tree for these to try them out?

Sure, please try.

The kernel without the patches is the starlight branch, and
the kernel with these patches is the starlight-ua-new-up branch, at
https://github.com/mcd500/linux-jh7100

The starlight branch is maintained by Esmil and is where the main
development is happening.

The rootfs for the beaglev is uploaded below.
https://github.com/mcd500/Fedora_on_StarFive#custome-fedora-image

To reproduce the results
(please substitute your own ip addresses):

The commands I used for iperf3:

--- TCP recv ---
** on PC side, using default mtu 1500
$ iperf3 -c 192.168.1.112
** on riscv beaglev side, using default mtu 1500
[root@fedora-starfive ~]# iperf3 -s

--- TCP send ---
** on PC side, using default mtu 1500
$ iperf3 -s
** on riscv beaglev side, using default mtu 1500
[root@fedora-starfive ~]# iperf3 -c 192.168.1.153

--- UDP send ---
** on PC side first, changing mtu size from 1500 to 9000
$ sudo ifconfig eth0 down
$ sudo ifconfig eth0 mtu 9000 up
$ iperf3 -s
** on riscv beaglev side, leaving the mtu size unchanged
[root@fedora-starfive ~]# iperf3 -u -b 1000M --length 50000 -c 192.168.1.153

--- UDP recv ---
** on PC side first, changing mtu size to 9000
$ sudo ifconfig eth0 down
$ sudo ifconfig eth0 mtu 9000 up
$ iperf3 -u -b 1000M --length 6500 -c 192.168.1.112
** on riscv beaglev side, changing mtu size to 9000 too
[root@fedora-starfive ~]# sudo ifconfig eth0 down
[root@fedora-starfive ~]# sudo ifconfig eth0 mtu 9000 up
[root@fedora-starfive ~]# iperf3 -s

The perf command, run after logging in with ssh:
$ sudo perf top -Ue task-clock

> 
>> Comparison by "perf top -Ue task-clock" while running iperf3.
>>
>> --- TCP recv ---
>>   * Before
>>    40.40%  [kernel]  [k] memcpy
>>    33.09%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    50.35%  [kernel]  [k] memcpy
>>    13.76%  [kernel]  [k] __asm_copy_to_user
>>
>> --- TCP send ---
>>   * Before
>>    19.96%  [kernel]  [k] memcpy
>>     9.84%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    14.27%  [kernel]  [k] memcpy
>>     7.37%  [kernel]  [k] __asm_copy_to_user
>>
>> --- UDP send ---
>>   * Before
>>    25.18%  [kernel]  [k] memcpy
>>    22.50%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    28.90%  [kernel]  [k] memcpy
>>     9.49%  [kernel]  [k] __asm_copy_to_user
>>
>> --- UDP recv ---
>>   * Before
>>    44.45%  [kernel]  [k] memcpy
>>    31.04%  [kernel]  [k] __asm_copy_to_user
>>   * After
>>    55.62%  [kernel]  [k] memcpy
>>    11.22%  [kernel]  [k] __asm_copy_to_user
> 
> What's the memcpy figure in the above?
> Could you explain the figures please?

It is the output of "perf top -Ue task-clock"
while running the iperf3 commands described above.
It shows which functions account for the most cpu usage
inside the kernel while iperf3 is running.

The two biggest culprits were memcpy and __asm_copy_to_user,
both showing high cpu usage, which is why I listed the two.

This discussion initially started with Gary's memcpy patch
on this list. I will write more details below.

> 
>> Processing network packets requires a lot of unaligned access for the packet
>> header, and the header format cannot be redesigned to be
>> aligned.
> 
> Isn't there an option to allow padding of network packets
> in the skbuff to make the fields aligned for architectures
> which do not have efficient unaligned loads (looking at you
> arm32). Has this been looked at?

I am testing on the 64bit risc-v beaglev beta board.
My understanding of skbuff is that it is for aligning data
handled inside the kernel. It would help if memcpy and
__asm_copy_to_user were not taking such a huge percentage.

This patch is against copy_to_user and copy_from_user, which are
used purely for copying between kernel space and user space.

Looking at the overhead in perf, the cpu usage of copy_to_user
increases because the user app (iperf3 here) uses the socket API
with a large packet size. I also used to use the maximum buffer size
the mtu allows to reduce the number of recvfrom() and sendto() calls
in UDP programming. Most network programmers probably
do the same.
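
To illustrate the point about buffer size, here is a minimal sketch in C.
It is not taken from iperf3 or from the patches. The helper name
transfer_in_chunks() is hypothetical, and it assumes Linux and uses an
AF_UNIX datagram socketpair instead of real UDP so that it runs without a
network. Each send() still copies the buffer from user space into the
kernel and each recv() copies it back out, so a larger chunk means fewer
system calls but a longer copy per call.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: push `total` bytes through a datagram socket pair
 * in pieces of at most `chunk` bytes. Returns the number of send()/recv()
 * pairs that were needed, or -1 on error.
 */
long transfer_in_chunks(size_t total, size_t chunk)
{
    int fds[2];
    long calls = 0;
    char *buf;

    if (chunk == 0 || socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) != 0)
        return -1;

    buf = malloc(chunk);
    if (buf == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    memset(buf, 0x5a, chunk);

    while (total > 0) {
        size_t len = total < chunk ? total : chunk;

        /* One user-to-kernel copy, then one kernel-to-user copy. */
        if (send(fds[0], buf, len, 0) != (ssize_t)len ||
            recv(fds[1], buf, len, 0) != (ssize_t)len) {
            calls = -1;
            break;
        }
        total -= len;
        calls++;
    }

    free(buf);
    close(fds[0]);
    close(fds[1]);
    return calls;
}
```

Moving 65536 bytes this way takes 64 call pairs with a 1024 byte chunk and
8 call pairs with an 8192 byte chunk, which is why the time spent in the
user copy functions stands out when applications use large buffers.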

> 
>> And user applications call system calls with a large buffer for send/recv()
>> and sendto/recvfrom() to make fewer function calls as an optimization.
>>
>> v1 -> v2:
>> - Added shift copy
>> - Separated patches for readability of changes in assembler
>> - Using perf results
>>
>> [1] https://lkml.org/lkml/2021/2/16/778
>>
>> Akira Tsukamoto (5):
>>    riscv: __asm_to/copy_from_user: delete existing code
>>    riscv: __asm_to/copy_from_user: Adding byte copy first
>>    riscv: __asm_to/copy_from_user: Copy until dst is aligned address
>>    riscv: __asm_to/copy_from_user: Bulk copy while shifting misaligned
>>      data
>>    riscv: __asm_to/copy_from_user: Bulk copy when both src dst are
>>      aligned
>>
>>   arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
>>   1 file changed, 146 insertions(+), 35 deletions(-)
> 
> I'm concerned that delete and then re-add is either going to make
> the series un-bisectable or leave a point where the kernel is very
> broken?

I completely agree and understand. The only reason I split the patches
is because of the comments in the other thread. It definitely breaks
bisecting. Once the content of this series is understood and agreed on,
I will re-spin it as one patch; otherwise the kernel will not boot when
only some of the individual patches are applied.

Gary's memcpy patch was posted a while ago, and even though it has the
best result in the benchmark, it was not merged.

When we were measuring network performance on the beaglev beta board,
Gary's memcpy did give a huge improvement.

The patch was not easy to review and understand, but it really helps
the network performance.

From reading the discussion on his patch, I felt the first priority
is to make the cause of the cpu usage and the speed results understood.

Please read the discussion among Gary, Palmer, Matteo, others and me on
the list.

Matteo is rewriting Gary's patch in C, which is better for
maintainability and takes advantage of the compiler's
optimizations.
The user copy functions are written in assembler or inline assembler,
as I wrote.
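
For readers following the series, the copy stages named in the patch
titles (byte copy until dst is aligned, bulk copy when both src and dst
are aligned, bulk copy while shifting misaligned data) can be sketched
in plain C. This is only an illustration under assumptions and not the
code in arch/riscv/lib/uaccess.S. The name sketch_copy() is hypothetical,
it assumes a little-endian machine with 64bit words, and it leaves out
the page fault handling that is the reason the real functions are in
assembler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD sizeof(uint64_t)

/* Hypothetical illustration of the copy stages; little-endian only. */
void *sketch_copy(void *dst_v, const void *src_v, size_t n)
{
    unsigned char *dst = dst_v;
    const unsigned char *src = src_v;
    size_t off;

    /* Stage 1: byte copy until dst reaches a word aligned address. */
    while (n > 0 && ((uintptr_t)dst % WORD) != 0) {
        *dst++ = *src++;
        n--;
    }

    off = (uintptr_t)src % WORD;
    if (off == 0) {
        /* Stage 2a: src and dst are both aligned, copy whole words.
         * The fixed-size memcpy calls stand in for word load/store. */
        while (n >= WORD) {
            uint64_t w;
            memcpy(&w, src, WORD);
            memcpy(dst, &w, WORD);
            src += WORD;
            dst += WORD;
            n -= WORD;
        }
    } else if (n >= 2 * WORD) {
        /* Stage 2b: src is misaligned. Load aligned words from src and
         * merge two neighbours with shifts to form each dst word. */
        unsigned shift = 8 * (unsigned)off;        /* 8..56 bits */
        const unsigned char *asrc = src + (WORD - off);
        uint64_t prev = 0;
        size_t i;

        /* Place the first WORD - off bytes where an aligned load
         * would have put them, without reading before src. */
        for (i = 0; i < WORD - off; i++)
            prev |= (uint64_t)src[i] << (8 * (off + i));

        while (n >= 2 * WORD) {
            uint64_t cur, out;
            memcpy(&cur, asrc, WORD);              /* aligned load */
            out = (prev >> shift) | (cur << (64 - shift));
            memcpy(dst, &out, WORD);               /* aligned store */
            prev = cur;
            asrc += WORD;
            src += WORD;
            dst += WORD;
            n -= WORD;
        }
    }

    /* Stage 3: copy the remaining tail one byte at a time. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
    return dst_v;
}
```

The shift path does one aligned load and one aligned store per word
instead of eight byte accesses, which is the idea behind the shift copy
added in v2.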

I just want to help make it better, so once a consensus is
reached, I will merge them into one patch.
I am also fine if somebody else comes up with better results.

My attempts at similar patches date back a long time, to 2002.
https://linux-kernel.vger.kernel.narkive.com/zU6OFlI6/cft-patch-2-5-47-athlon-druon-much-faster-copy-user-function
http://lkml.iu.edu/hypermail/linux/kernel/0211.2/0928.html

Akira