From: Philippe Gerum <rpm@xenomai.org>
To: Jan Kiszka <jan.kiszka@siemens.com>
Cc: xenomai@xenomai.org
Subject: Re: [PATCH v2 5/7] drivers/net: wrap csum_partial_copy_nocheck()
Date: Tue, 04 May 2021 16:48:32 +0200
Message-ID: <87wnsesgfz.fsf@xenomai.org>
In-Reply-To: <87h7k3617i.fsf@xenomai.org>


Philippe Gerum <rpm@xenomai.org> writes:

> Philippe Gerum <rpm@xenomai.org> writes:
>
>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>
>>> On 16.04.21 18:48, Philippe Gerum wrote:
>>>> 
>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>> 
>>>>> On 15.04.21 09:54, Philippe Gerum wrote:
>>>>>>
>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>
>>>>>>> On 15.04.21 09:21, Philippe Gerum wrote:
>>>>>>>>
>>>>>>>> Jan Kiszka <jan.kiszka@siemens.com> writes:
>>>>>>>>
>>>>>>>>> On 27.03.21 11:19, Philippe Gerum wrote:
>>>>>>>>>> From: Philippe Gerum <rpm@xenomai.org>
>>>>>>>>>>
>>>>>>>>>> Since v5.9-rc1, csum_partial_copy_nocheck() forces a zero seed as its
>>>>>>>>>> last argument to csum_partial(). According to #cc44c17baf7f3, passing
>>>>>>>>>> a non-zero value would not even yield the proper result on some
>>>>>>>>>> architectures.
>>>>>>>>>>
>>>>>>>>>> Nevertheless, the current ICMP code does expect a non-zero csum seed
>>>>>>>>>> to be used in the next computation, so let's make net_csum_copy() wrap
>>>>>>>>>> csum_partial_copy_nocheck() on pre-5.9 kernels, and open code it for
>>>>>>>>>> later kernels so that we still feed csum_partial() with the user-given
>>>>>>>>>> csum. We still expect the x86, ARM and arm64 implementations of
>>>>>>>>>> csum_partial() to do the right thing.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If that issue only affects the ICMP code path, why not change only
>>>>>>>>> that, leaving rtskb_copy_and_csum_bits with the benefit of doing copy
>>>>>>>>> and csum in one step?
>>>>>>>>>
>>>>>>>>
>>>>>>>> As a result of #cc44c17baf7f3, the kernel no longer provides a common
>>>>>>>> helper folding the copy and checksum operations, so I see no way to
>>>>>>>> keep rtskb_copy_and_csum_bits() as is. Did you find an all-in-one
>>>>>>>> replacement for csum_partial_copy_nocheck() which would allow a csum
>>>>>>>> value to be fed in?
>>>>>>>>
>>>>>>>
>>>>>>> rtskb_copy_and_csum_dev does not need that.
>>>>>>>
>>>>>>
>>>>>> You made rtskb_copy_and_csum_bits() part of the exported API. So how do
>>>>>> you want to deal with this?
>>>>>>
>>>>>
>>>>> That is an internal API, so we don't care.
>>>>>
>>>>> But even if we converted rtskb_copy_and_csum_bits to work as it used to
>>>>> (with a csum != 0), there would be no reason to make
>>>>> rtskb_copy_and_csum_dev pay that price.
>>>>>
>>>> 
>>>> Well, there may be a reason to challenge the idea that a folded
>>>> copy_and_csum is better for a real-time system than a split memcpy+csum
>>>> in the first place. Out of curiosity, I ran a few benchmarks lately on a
>>>> few SoCs regarding this, and it turned out that optimizing the data copy
>>>> to get the buffer quickly in place before checksumming the result may
>>>> well yield much better results with respect to jitter than what
>>>> csum_and_copy currently achieves on these SoCs.
>>>> 
>>>> Inline csum_and_copy did perform slightly better on average (a couple of
>>>> hundred nanosecs at best) but with pathological jitter in the worst
>>>> case at times. On the contrary, the split memcpy+csum method did not
>>>> exhibit such jitter during these tests, not once.
>>>> 
>>>> - four SoCs tested (2 x x86, armv7, armv8a)
>>>> - test code ran in kernel space (real-time task context,
>>>>   out-of-band/primary context)
>>>> - csum_partial_copy_nocheck() vs memcpy()+csum_partial()
>>>> - 3 buffer sizes tested (32, 1024, 1500 bytes), 3 runs each
>>>> - all buffers (src & dst) aligned on L1_CACHE_BYTES
>>>> - each run performed 1,000,000 iterations of a given checksum loop, no
>>>>   pause in between.
>>>> - no concurrent load on the board
>>>> - all results in nanoseconds
>>>> 
>>>> The worst results obtained are presented here for each SoC.
>>>> 
>>>> x86[1]
>>>> ------
>>>> 
>>>> vendor_id	: GenuineIntel
>>>> cpu family	: 6
>>>> model		: 92
>>>> model name	: Intel(R) Atom(TM) Processor E3940 @ 1.60GHz
>>>> stepping	: 9
>>>> cpu MHz		: 1593.600
>>>> cache size	: 1024 KB
>>>> cpuid level	: 21
>>>> wp		: yes
>>>> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm 3dnowprefetch cpuid_fault cat_l2 pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust smep erms mpx rdt_a rdseed smap clflushopt intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts
>>>> vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=68, max=653, avg=70
>>>> CSUM_COPY 1024b: min=248, max=373, avg=251
>>>> CSUM_COPY 1500b: min=344, max=3123, avg=350   <=================
>>>> COPY+CSUM 32b: min=101, max=790, avg=103
>>>> COPY+CSUM 1024b: min=297, max=397, avg=300
>>>> COPY+CSUM 1500b: min=402, max=490, avg=405
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=68, max=1420, avg=70
>>>> CSUM_COPY 1024b: min=248, max=29706, avg=251   <=================
>>>> CSUM_COPY 1500b: min=344, max=792, avg=350
>>>> COPY+CSUM 32b: min=101, max=872, avg=103
>>>> COPY+CSUM 1024b: min=297, max=776, avg=300
>>>> COPY+CSUM 1500b: min=402, max=853, avg=405
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=68, max=661, avg=70
>>>> CSUM_COPY 1024b: min=248, max=1714, avg=251
>>>> CSUM_COPY 1500b: min=344, max=33937, avg=350   <=================
>>>> COPY+CSUM 32b: min=101, max=610, avg=103
>>>> COPY+CSUM 1024b: min=297, max=605, avg=300
>>>> COPY+CSUM 1500b: min=402, max=712, avg=405
>>>> 
>>>> x86[2]
>>>> ------
>>>> 
>>>> vendor_id       : GenuineIntel
>>>> cpu family      : 6
>>>> model           : 23
>>>> model name      : Intel(R) Core(TM)2 Duo CPU     E7200  @ 2.53GHz
>>>> stepping        : 6
>>>> microcode       : 0x60c
>>>> cpu MHz         : 1627.113
>>>> cache size      : 3072 KB
>>>> cpuid level     : 10
>>>> wp              : yes
>>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm pti dtherm
>>>> 
>>>> CSUM_COPY 32b: min=558, max=31010, avg=674     <=================
>>>> CSUM_COPY 1024b: min=558, max=2794, avg=745
>>>> CSUM_COPY 1500b: min=558, max=2794, avg=841
>>>> COPY+CSUM 32b: min=558, max=2794, avg=671
>>>> COPY+CSUM 1024b: min=558, max=2794, avg=778
>>>> COPY+CSUM 1500b: min=838, max=2794, avg=865
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=59, max=532, avg=62
>>>> CSUM_COPY 1024b: min=198, max=270, avg=201
>>>> CSUM_COPY 1500b: min=288, max=34921, avg=289   <=================
>>>> COPY+CSUM 32b: min=53, max=326, avg=56
>>>> COPY+CSUM 1024b: min=228, max=461, avg=232
>>>> COPY+CSUM 1500b: min=311, max=341, avg=317
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=59, max=382, avg=62
>>>> CSUM_COPY 1024b: min=198, max=383, avg=201
>>>> CSUM_COPY 1500b: min=285, max=21235, avg=289   <=================
>>>> COPY+CSUM 32b: min=52, max=300, avg=56
>>>> COPY+CSUM 1024b: min=228, max=348, avg=232
>>>> COPY+CSUM 1500b: min=311, max=409, avg=317
>>>> 
>>>> Cortex A9 quad-core 1.2 GHz (iMX6qp)
>>>> ------------------------------------
>>>> 
>>>> model name	: ARMv7 Processor rev 10 (v7l)
>>>> Features	: half thumb fastmult vfp edsp neon vfpv3 tls vfpd32 
>>>> CPU implementer	: 0x41
>>>> CPU architecture: 7
>>>> CPU variant	: 0x2
>>>> CPU part	: 0xc09
>>>> CPU revision	: 10
>>>> 
>>>> CSUM_COPY 32b: min=333, max=1334, avg=440
>>>> CSUM_COPY 1024b: min=1000, max=2666, avg=1060
>>>> CSUM_COPY 1500b: min=1333, max=45333, avg=1357   <=================
>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>> COPY+CSUM 1024b: min=1000, max=2333, avg=1324
>>>> COPY+CSUM 1500b: min=1666, max=2334, avg=1713
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=333, max=1334, avg=439
>>>> CSUM_COPY 1024b: min=1000, max=46000, avg=1060   <=================
>>>> CSUM_COPY 1500b: min=1333, max=5000, avg=1351
>>>> COPY+CSUM 32b: min=333, max=1334, avg=476
>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1324
>>>> COPY+CSUM 1500b: min=1666, max=2667, avg=1713
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=333, max=1666, avg=454
>>>> CSUM_COPY 1024b: min=1000, max=2000, avg=1060
>>>> CSUM_COPY 1500b: min=1333, max=45000, avg=1348   <=================
>>>> COPY+CSUM 32b: min=333, max=1334, avg=454
>>>> COPY+CSUM 1024b: min=1000, max=2334, avg=1317
>>>> COPY+CSUM 1500b: min=1666, max=6000, avg=1712
>>>> 
>>>> Cortex A55 quad-core 2 GHz (Odroid C4)
>>>> --------------------------------------
>>>> 
>>>> Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
>>>> CPU implementer : 0x41
>>>> CPU architecture: 8
>>>> CPU variant     : 0x1
>>>> CPU part        : 0xd05
>>>> CPU revision    : 0
>>>> 
>>>> 
>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>> CSUM_COPY 1024b: min=625, max=41916, avg=673   <=================
>>>> CSUM_COPY 1500b: min=875, max=3875, avg=923
>>>> COPY+CSUM 32b: min=125, max=458, avg=140
>>>> COPY+CSUM 1024b: min=625, max=1166, avg=666
>>>> COPY+CSUM 1500b: min=875, max=1167, avg=913
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=125, max=1292, avg=139
>>>> CSUM_COPY 1024b: min=541, max=48333, avg=555
>>>> CSUM_COPY 1500b: min=708, max=3458, avg=740
>>>> COPY+CSUM 32b: min=125, max=292, avg=136
>>>> COPY+CSUM 1024b: min=541, max=750, avg=556
>>>> COPY+CSUM 1500b: min=708, max=834, avg=740
>>>> 
>>>> ==
>>>> 
>>>> CSUM_COPY 32b: min=125, max=833, avg=140
>>>> CSUM_COPY 1024b: min=666, max=55667, avg=673   <=================
>>>> CSUM_COPY 1500b: min=875, max=4208, avg=913
>>>> COPY+CSUM 32b: min=125, max=375, avg=140
>>>> COPY+CSUM 1024b: min=666, max=916, avg=673
>>>> COPY+CSUM 1500b: min=875, max=1042, avg=913
>>>> 
>>>> ============
>>>> 
>>>> A few additional observations from looking at the implementation:
>>>> 
>>>> For memcpy, legacy x86[2] uses movsq, finishing with movsb to complete
>>>> buffers of unaligned length. Current x86[1] uses ERMS-optimized movsb
>>>> which is faster.
>>>> 
>>>> arm32/armv7 optimizes memcpy by loading up to 8 words in a single
>>>> instruction. csum_and_copy loads/stores at best 4 words at a time,
>>>> only when src and dst are 32bit aligned (which matches the test case).
>>>> 
>>>> arm64/armv8a uses load/store pair instructions to copy memory
>>>> blocks. It does not have asm-optimized csum_and_copy support, so it
>>>> uses the generic C version.
>>>> 
>>>> Differences in prefetching and speculative behavior might also
>>>> explain some of the gap between the approaches.
>>>> 
>>>> I would be interested in any converging / diverging results testing the
>>>> same combo with a different test code, because from my standpoint,
>>>> things do not seem as obvious as they are supposed to be at the moment.
>>>> 
>>>
>>> If copy+csum is not using any recent memcpy optimizations, that is an
>>> argument for at least equivalent performance.
>>>
>>
>> You mean the folded version, i.e. copy_and_csum? If so, I can't see any
>> way for that one to optimize via fast string operations.
>>
>>> But I don't yet get where the huge jitter should be coming from. Was
>>> the measurement loop preemptible? In that case I would expect a split
>>
>> Out-of-band stage, so only preemptible by Xenomai timer ticks, which
>> means only the host tick emulation at this point since there were no
>> outstanding Xenomai timers started yet when running the loops. Pretty
>> slim chance to see these latency spots consistently reproduced, and only
>> for the folded csum_copy version.
>>
>>> copy followed by another loop to csum should give much worse results as
>>> it needs the cache to stay warm - while copy-csum only touches the data
>>> once.
>>>
>>
>> Conversely, if the copy is much faster, the odds of being preempted may
>> increase, yielding better results overall.
>
> False alarm. Preemption was the issue, caused by the top half of the host
> tick handling in primary mode. The latest clock event scheduled by the kernel
> managed to enter the pipeline at a random time, but always within the
> execution window of the all-in-one csum_and_copy code. Although this
> event was deferred and not immediately passed to the in-band context,
> the time spent dealing with it was enough to show up in the results.
>
>> This said, I'm unsure this is
>> related to preemption anyway; this looks like the fingerprints of minor
>> faults with PTEs. Why this would only happen in the folded version is
>> still a mystery to me at the moment.
>
> It was not, actually: no minor faults occurred.
>
> The results are now consistent: both implementations are comparable
> performance-wise as the optimized memcpy tends to offset the advantage
> of calculating the checksum on the fly, saving a read access. armv8
> benefits more from the former, since it does not have an optimized
> csum_and_copy but uses the generic C version instead.
>
> == x86[1]
>
> CSUM_COPY 32b: min=68, max=640, avg=70
> CSUM_COPY 1024b: min=247, max=773, avg=252
> CSUM_COPY 1500b: min=343, max=832, avg=350
> COPY+CSUM 32b: min=100, max=651, avg=131
> COPY+CSUM 1024b: min=296, max=752, avg=298
> COPY+CSUM 1500b: min=397, max=845, avg=400
>
> == x86[2]
>
> CSUM_COPY 32b: min=63, max=267, avg=66
> CSUM_COPY 1024b: min=198, max=300, avg=201
> CSUM_COPY 1500b: min=288, max=611, avg=291
> COPY+CSUM 32b: min=56, max=360, avg=56
> COPY+CSUM 1024b: min=228, max=420, avg=231
> COPY+CSUM 1500b: min=307, max=337, avg=318
>
> == armv7 (imx6qp)
>
> CSUM_COPY 32b: min=333, max=1334, avg=439
> CSUM_COPY 1024b: min=1000, max=2000, avg=1045
> CSUM_COPY 1500b: min=1000, max=2334, avg=1325
> COPY+CSUM 32b: min=333, max=1334, avg=454
> COPY+CSUM 1024b: min=1333, max=2334, avg=1347
> COPY+CSUM 1500b: min=1666, max=2667, avg=1734
>
> == armv8a (C4)
>
> CSUM_COPY 32b: min=125, max=792, avg=130
> CSUM_COPY 1024b: min=500, max=1125, avg=550
> CSUM_COPY 1500b: min=708, max=1833, avg=726
> COPY+CSUM 32b: min=125, max=292, avg=130
> COPY+CSUM 1024b: min=541, max=708, avg=550
> COPY+CSUM 1500b: min=708, max=875, avg=730

Last round of results on this issue, now measuring csum_copy vs
csum+copy performance in idle vs busy scenarios. Busy means a
hackbench + dd loop streaming 128M in the background from /dev/zero to
/dev/null, in order to badly thrash the D-caches while the test
runs. All figures are in nanoseconds.

iMX6QP (Cortex A9)
------------------

=== idle

CSUM_COPY 32b: min=333, max=1333, avg=439
CSUM_COPY 1024b: min=1000, max=2000, avg=1045
CSUM_COPY 1500b: min=1333, max=2000, avg=1333
COPY+CSUM 32b: min=333, max=1333, avg=443
COPY+CSUM 1024b: min=1000, max=2334, avg=1345
COPY+CSUM 1500b: min=1666, max=2667, avg=1737

=== busy

CSUM_COPY 32b: min=333, max=4333, avg=466
CSUM_COPY 1024b: min=1000, max=5000, avg=1088
CSUM_COPY 1500b: min=1333, max=5667, avg=1393
COPY+CSUM 32b: min=333, max=1334, avg=454
COPY+CSUM 1024b: min=1000, max=2000, avg=1341
COPY+CSUM 1500b: min=1666, max=2666, avg=1745

C4 (Cortex A55)
---------------

=== idle

CSUM_COPY 32b: min=125, max=791, avg=130
CSUM_COPY 1024b: min=541, max=834, avg=550
CSUM_COPY 1500b: min=708, max=1875, avg=740
COPY+CSUM 32b: min=125, max=167, avg=133
COPY+CSUM 1024b: min=541, max=625, avg=553
COPY+CSUM 1500b: min=708, max=750, avg=730

=== busy

CSUM_COPY 32b: min=125, max=792, avg=133
CSUM_COPY 1024b: min=500, max=2000, avg=552
CSUM_COPY 1500b: min=708, max=1542, avg=744
COPY+CSUM 32b: min=125, max=375, avg=133
COPY+CSUM 1024b: min=500, max=709, avg=553
COPY+CSUM 1500b: min=708, max=916, avg=743

x86 (atom x5)
-------------

=== idle

CSUM_COPY 32b: min=67, max=590, avg=70
CSUM_COPY 1024b: min=245, max=385, avg=251
CSUM_COPY 1500b: min=343, max=521, avg=350
COPY+CSUM 32b: min=101, max=679, avg=117
COPY+CSUM 1024b: min=296, max=379, avg=298
COPY+CSUM 1500b: min=399, max=502, avg=404

=== busy

CSUM_COPY 32b: min=65, max=709, avg=71
CSUM_COPY 1024b: min=243, max=702, avg=252
CSUM_COPY 1500b: min=340, max=1055, avg=351
COPY+CSUM 32b: min=100, max=665, avg=120
COPY+CSUM 1024b: min=295, max=669, avg=298
COPY+CSUM 1500b: min=399, max=686, avg=403

As expected from the code, arm64, which has no folded csum_copy
implementation, benefits the most from the split copy+csum path. All
architectures seem to benefit from the optimized memcpy under load when
it comes to worst-case execution time. x86 is less prone to jitter
under cache thrashing than the others, as usual, but even there, the
max. figures for csum+copy in the busy case look pretty much on par
with the csum_copy version.

-- 
Philippe.


