netdev.vger.kernel.org archive mirror
* TCP checksum not offloaded during GSO
@ 2020-02-04  5:55 Yadu Kishore
  2020-02-05 21:43 ` Willem de Bruijn
  2020-02-06 12:13 ` TCP checksum not offloaded during GSO David Laight
  0 siblings, 2 replies; 28+ messages in thread
From: Yadu Kishore @ 2020-02-04  5:55 UTC (permalink / raw)
  To: netdev

Hi,

I'm working on enhancing a driver for a Network Controller that
supports checksum offloads, so I'm offloading TCP/UDP checksum
computation in the network driver using NETIF_F_HW_CSUM, on Linux
kernel version 4.19.23, aarch64, for the HiKey Android platform. The
Network Controller does not support scatter-gather (SG) DMA, hence
I'm not enabling the NETIF_F_SG feature.
I see that GSO for TCP is enabled by default in kernel 4.19.23.
When running iperf TCP traffic I observed that the TCP checksum is
not offloaded for the majority of the TCP packets: most of the skbs
received in the output path in the driver have skb->ip_summed set to
CHECKSUM_NONE.
The checksum is offloaded only for the initial TCP connection
establishment packets. For UDP I do not observe this problem.
It appears that a decision was taken not to offload the TCP checksum
(during GSO) if the network driver does not support SG:

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 02c638a643ea..9c065ac72e87 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3098,8 +3098,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 		if (nskb->len == len + doffset)
 			goto perform_csum_check;
 
-		if (!sg && !nskb->remcsum_offload) {
-			nskb->ip_summed = CHECKSUM_NONE;
+		if (!sg) {
+			if (!nskb->remcsum_offload)
+				nskb->ip_summed = CHECKSUM_NONE;
 			SKB_GSO_CB(nskb)->csum =
 				skb_copy_and_csum_bits(head_skb, offset,
 						       skb_put(nskb, len),

The above is a code snippet from the actual commit:

commit 7fbeffed77c130ecf64e8a2f7f9d6d63a9d60a19
Author: Alexander Duyck <aduyck@mirantis.com>
Date:   Fri Feb 5 15:27:43 2016 -0800

    net: Update remote checksum segmentation to support use of GSO checksum

    This patch addresses two main issues.

    First in the case of remote checksum offload we were avoiding dealing with
    scatter-gather issues.  As a result it would be possible to assemble a
    series of frames that used frags instead of being linearized as they should
    have if remote checksum offload was enabled.

    Second I have updated the code so that we now let GSO take care of doing
    the checksum on the data itself and drop the special case that was added
    for remote checksum offload.

    Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

However, even before the above commit, the original condition already
meant that if the network driver does not support SG, the checksum
would be computed by the kernel itself (in the GSO case).
We would like to know the reason for this design decision (to NOT
offload checksum computation to the HW in the GSO case).
Our intention is to find a way to offload checksum computation to the
HW even when GSO is active (for TCP).

Thanks,
Yadu Kishore

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: TCP checksum not offloaded during GSO
  2020-02-04  5:55 TCP checksum not offloaded during GSO Yadu Kishore
@ 2020-02-05 21:43 ` Willem de Bruijn
  2020-02-21  5:13   ` [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming Yadu Kishore
  2020-02-06 12:13 ` TCP checksum not offloaded during GSO David Laight
  1 sibling, 1 reply; 28+ messages in thread
From: Willem de Bruijn @ 2020-02-05 21:43 UTC (permalink / raw)
  To: Yadu Kishore; +Cc: Network Development

On Tue, Feb 4, 2020 at 12:55 AM Yadu Kishore <kyk.segfault@gmail.com> wrote:
>
> Hi,
>
> I'm working on enhancing a driver for a Network Controller that
> supports checksum offloads, so I'm offloading TCP/UDP checksum
> computation in the network driver using NETIF_F_HW_CSUM, on Linux
> kernel version 4.19.23, aarch64, for the HiKey Android platform. The
> Network Controller does not support scatter-gather (SG) DMA, hence
> I'm not enabling the NETIF_F_SG feature.
> I see that GSO for TCP is enabled by default in kernel 4.19.23.
> When running iperf TCP traffic I observed that the TCP checksum is
> not offloaded for the majority of the TCP packets: most of the skbs
> received in the output path in the driver have skb->ip_summed set to
> CHECKSUM_NONE.
> The checksum is offloaded only for the initial TCP connection
> establishment packets. For UDP I do not observe this problem.
> It appears that a decision was taken not to offload the TCP checksum
> (during GSO) if the network driver does not support SG:
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 02c638a643ea..9c065ac72e87 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3098,8 +3098,9 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>  		if (nskb->len == len + doffset)
>  			goto perform_csum_check;
>
> -		if (!sg && !nskb->remcsum_offload) {
> -			nskb->ip_summed = CHECKSUM_NONE;
> +		if (!sg) {
> +			if (!nskb->remcsum_offload)
> +				nskb->ip_summed = CHECKSUM_NONE;
>  			SKB_GSO_CB(nskb)->csum =
>  				skb_copy_and_csum_bits(head_skb, offset,
>  						       skb_put(nskb, len),
>
> The above is a code snippet from the actual commit:
>
> commit 7fbeffed77c130ecf64e8a2f7f9d6d63a9d60a19

This behavior goes back to the original introduction of gso and
skb_segment, commit f4c50d990dcf ("[NET]: Add software TSOv4").

Without scatter-gather, the data has to be copied from skb linear to
nskb linear. This code is probably the way that it is because if it
has to copy anyway, it might as well perform the copy-and-checksum
optimization.
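
For illustration, a minimal sketch of that copy-and-checksum idea
(hypothetical helper, not the kernel's csum_and_copy code): the buffer
is traversed once, and the ones'-complement adds ride along with the
copy, so the checksum comes nearly for free on top of the mandatory
copy.

	/* Hypothetical sketch: copy len bytes and accumulate the Internet
	 * checksum in the same pass.  Assumes len is even and ignores
	 * alignment; the real kernel helpers handle both. */
	static unsigned int copy_and_csum(unsigned char *dst,
					  const unsigned char *src,
					  int len, unsigned int sum)
	{
		int i;

		for (i = 0; i < len; i += 2) {
			dst[i] = src[i];
			dst[i + 1] = src[i + 1];
			/* 16-bit word in network byte order */
			sum += (src[i] << 8) | src[i + 1];
		}
		while (sum >> 16)	/* fold carries back into 16 bits */
			sum = (sum & 0xffff) + (sum >> 16);
		return sum;
	}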

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: TCP checksum not offloaded during GSO
  2020-02-04  5:55 TCP checksum not offloaded during GSO Yadu Kishore
  2020-02-05 21:43 ` Willem de Bruijn
@ 2020-02-06 12:13 ` David Laight
  1 sibling, 0 replies; 28+ messages in thread
From: David Laight @ 2020-02-06 12:13 UTC (permalink / raw)
  To: 'Yadu Kishore', netdev

From: Yadu Kishore
> Sent: 04 February 2020 05:55
> I'm working on enhancing a driver for a Network Controller that
> supports "Checksum Offloads".
...
> The Network Controller does not support scatter-gather (SG) DMA.

Is it one of the really broken ones that also requires the
tx and rx buffers be 4 byte aligned?

Get a different Network Controller; the one you have isn't
'fit for purpose' :-)

	David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-02-05 21:43 ` Willem de Bruijn
@ 2020-02-21  5:13   ` Yadu Kishore
  2020-02-23  2:41     ` Willem de Bruijn
  0 siblings, 1 reply; 28+ messages in thread
From: Yadu Kishore @ 2020-02-21  5:13 UTC (permalink / raw)
  To: netdev; +Cc: davem, Yadu Kishore

Problem:
TCP checksum in the output path is not being offloaded during GSO
in the following case:
The network driver does not support scatter-gather but supports
checksum offload with NETIF_F_HW_CSUM.

Cause:
skb_segment calls skb_copy_and_csum_bits if the network driver
does not announce NETIF_F_SG. It does not check if the driver
supports NETIF_F_HW_CSUM.
So for devices which want to offload the checksum but do not support
SG, there is currently no way to do so when GSO is enabled.

Solution:
In skb_segment, check whether the network controller offloads the
checksum, and if so call skb_copy_bits instead of
skb_copy_and_csum_bits.

Testing:
Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
in the network driver. Observed the TCP checksum offload is not happening
since the skbs received by the driver in the output path have
skb->ip_summed set to CHECKSUM_NONE.

With the patch, ran iperf TCP traffic and observed that the TCP
checksum is being offloaded, with skb->ip_summed set to
CHECKSUM_PARTIAL.
Also tested the patch with NETIF_F_HW_CSUM disabled in the driver,
to cover the newly introduced if-else code path in skb_segment.

In-Reply-To: CABGOaVTY6BrzJTYEtVXwawzP7-D8sb1KASDWFk15v0QFaJVbUg@mail.gmail.com
Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>
---
 net/core/skbuff.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1365a55..82a5b53 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3926,14 +3926,22 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			goto perform_csum_check;
 
 		if (!sg) {
-			if (!nskb->remcsum_offload)
-				nskb->ip_summed = CHECKSUM_NONE;
-			SKB_GSO_CB(nskb)->csum =
-				skb_copy_and_csum_bits(head_skb, offset,
-						       skb_put(nskb, len),
-						       len, 0);
-			SKB_GSO_CB(nskb)->csum_start =
-				skb_headroom(nskb) + doffset;
+			if (!csum) {
+				if (!nskb->remcsum_offload)
+					nskb->ip_summed = CHECKSUM_NONE;
+				SKB_GSO_CB(nskb)->csum =
+					skb_copy_and_csum_bits(head_skb, offset,
+							       skb_put(nskb,
+								       len),
+							       len, 0);
+				SKB_GSO_CB(nskb)->csum_start =
+					skb_headroom(nskb) + doffset;
+			} else {
+				nskb->ip_summed = CHECKSUM_PARTIAL;
+				skb_copy_bits(head_skb, offset,
+					      skb_put(nskb, len),
+					      len);
+			}
 			continue;
 		}
 
-- 
2.7.4
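
For context, the csum flag tested in the hunk above is derived earlier
in skb_segment from the device features, along these lines (from
net/core/skbuff.c of this era):

	sg = !!(features & NETIF_F_SG);
	csum = !!can_checksum_protocol(features, proto);

so the new else branch is taken exactly when the device advertises
checksum offload for this protocol but not scatter-gather.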


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-02-21  5:13   ` [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming Yadu Kishore
@ 2020-02-23  2:41     ` Willem de Bruijn
  2020-02-28  5:24       ` Yadu Kishore
  0 siblings, 1 reply; 28+ messages in thread
From: Willem de Bruijn @ 2020-02-23  2:41 UTC (permalink / raw)
  To: Yadu Kishore; +Cc: Network Development, David Miller

On Thu, Feb 20, 2020 at 11:14 PM Yadu Kishore <kyk.segfault@gmail.com> wrote:
>
> Problem:
> TCP checksum in the output path is not being offloaded during GSO
> in the following case:
> The network driver does not support scatter-gather but supports
> checksum offload with NETIF_F_HW_CSUM.
>
> Cause:
> skb_segment calls skb_copy_and_csum_bits if the network driver
> does not announce NETIF_F_SG. It does not check if the driver
> supports NETIF_F_HW_CSUM.
> So for devices which want to offload the checksum but do not support
> SG, there is currently no way to do so when GSO is enabled.
>
> Solution:
> In skb_segment, check whether the network controller offloads the
> checksum, and if so call skb_copy_bits instead of
> skb_copy_and_csum_bits.
>
> Testing:
> Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
> in the network driver. Observed the TCP checksum offload is not happening
> since the skbs received by the driver in the output path have
> skb->ip_summed set to CHECKSUM_NONE.
>
> With the patch, ran iperf TCP traffic and observed that the TCP
> checksum is being offloaded, with skb->ip_summed set to
> CHECKSUM_PARTIAL.

Did you measure a cycle efficiency improvement? As discussed in the
referred email thread, the kernel uses checksum_and_copy because it is
generally not significantly more expensive than copy alone.

skb_segment already is a very complex function. New code needs to
offer a tangible benefit.

> Also tested the patch with NETIF_F_HW_CSUM disabled in the driver,
> to cover the newly introduced if-else code path in skb_segment.
>
> In-Reply-To: CABGOaVTY6BrzJTYEtVXwawzP7-D8sb1KASDWFk15v0QFaJVbUg@mail.gmail.com

This does not seem to be a commonly used tag. And indeed differs from
the actual In-Reply-To in the email headers. Perhaps

Link: https://lore.kernel.org/netdev/CABGOaVTY6BrzJTYEtVXwawzP7-D8sb1KASDWFk15v0QFaJVbUg@mail.gmail.com/

> Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>
> ---
>  net/core/skbuff.c | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1365a55..82a5b53 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3926,14 +3926,22 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>                         goto perform_csum_check;
>
>                 if (!sg) {
> -                       if (!nskb->remcsum_offload)
> -                               nskb->ip_summed = CHECKSUM_NONE;
> -                       SKB_GSO_CB(nskb)->csum =
> -                               skb_copy_and_csum_bits(head_skb, offset,
> -                                                      skb_put(nskb, len),
> -                                                      len, 0);
> -                       SKB_GSO_CB(nskb)->csum_start =
> -                               skb_headroom(nskb) + doffset;
> +                       if (!csum) {
> +                               if (!nskb->remcsum_offload)
> +                                       nskb->ip_summed = CHECKSUM_NONE;
> +                               SKB_GSO_CB(nskb)->csum =
> +                                       skb_copy_and_csum_bits(head_skb, offset,
> +                                                              skb_put(nskb,
> +                                                                      len),
> +                                                              len, 0);
> +                               SKB_GSO_CB(nskb)->csum_start =
> +                                       skb_headroom(nskb) + doffset;
> +                       } else {
> +                               nskb->ip_summed = CHECKSUM_PARTIAL;

Is this not already handled by __copy_skb_header above? If ip_summed
has to be initialized, so have csum_start and csum_offset. That call
should have initialized all three.



> +                               skb_copy_bits(head_skb, offset,
> +                                             skb_put(nskb, len),
> +                                             len);
> +                       }
>                         continue;
>                 }
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-02-23  2:41     ` Willem de Bruijn
@ 2020-02-28  5:24       ` Yadu Kishore
  2020-02-28 14:30         ` Willem de Bruijn
  0 siblings, 1 reply; 28+ messages in thread
From: Yadu Kishore @ 2020-02-28  5:24 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: Network Development, David Miller

> Did you measure a cycle efficiency improvement? As discussed in the
> referred email thread, the kernel uses checksum_and_copy because it is
> generally not significantly more expensive than copy alone
> skb_segment already is a very complex function. New code needs to
> offer a tangible benefit.

I ran iperf TCP Tx traffic of 1000 megabytes and captured the cpu cycle
utilization using perf:
"perf record -e cycles -a iperf \
-c 192.168.2.53 -p 5002 -fm -n 1048576000 -i 2  -l 8k -w 8m"

I see that the following are the top consumers of cpu cycles:

Function                  %cpu cycles
========                  ===========
skb_mac_gso_segment          0.02
inet_gso_segment             0.26
tcp4_gso_segment             0.02
tcp_gso_segment              0.19
skb_segment                  0.52
skb_copy_and_csum_bits       0.64
do_csum                      7.25
memcpy                       3.71
__alloc_skb                  0.91
======================    ===========
SUM                         13.52

The measurement was done on an arm64 hikey960 platform running
Android with Linux kernel version 4.19.23.
7.25% of the cpu cycles are spent computing the checksum, out of the
13.52% total for the functions above, which means more than half of
those cycles are spent on the checksum.
Hence the attempt to offload the checksum in the GSO case as well.

> Is this not already handled by __copy_skb_header above? If ip_summed
> has to be initialized, so have csum_start and csum_offset. That call
> should have initialized all three.

Thanks, I will look into why, even though __copy_skb_header is being
called, I am still seeing skb->ip_summed set to CHECKSUM_NONE in the
network driver.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-02-28  5:24       ` Yadu Kishore
@ 2020-02-28 14:30         ` Willem de Bruijn
  2020-02-28 20:01           ` David Miller
  0 siblings, 1 reply; 28+ messages in thread
From: Willem de Bruijn @ 2020-02-28 14:30 UTC (permalink / raw)
  To: Yadu Kishore; +Cc: Network Development, David Miller

On Fri, Feb 28, 2020 at 12:25 AM Yadu Kishore <kyk.segfault@gmail.com> wrote:
>
> > Did you measure a cycle efficiency improvement? As discussed in the
> > referred email thread, the kernel uses checksum_and_copy because it is
> > generally not significantly more expensive than copy alone
> > skb_segment already is a very complex function. New code needs to
> > offer a tangible benefit.
>
> I ran iperf TCP Tx traffic of 1000 megabytes and captured the cpu cycle
> utilization using perf:
> "perf record -e cycles -a iperf \
> -c 192.168.2.53 -p 5002 -fm -n 1048576000 -i 2  -l 8k -w 8m"
>
> I see that the following are the top consumers of cpu cycles:
>
> Function                  %cpu cycles
> ========                  ===========
> skb_mac_gso_segment          0.02
> inet_gso_segment             0.26
> tcp4_gso_segment             0.02
> tcp_gso_segment              0.19
> skb_segment                  0.52
> skb_copy_and_csum_bits       0.64
> do_csum                      7.25
> memcpy                       3.71
> __alloc_skb                  0.91
> ======================    ===========
> SUM                         13.52
>
> The measurement was done on an arm64 hikey960 platform running
> Android with Linux kernel version 4.19.23.
> 7.25% of the cpu cycles are spent computing the checksum, out of the
> 13.52% total for the functions above, which means more than half of
> those cycles are spent on the checksum.
> Hence the attempt to offload the checksum in the GSO case as well.

Can you contrast this against a run with your changes? The thought is
that the majority of this cost is due to the memory loads and stores, not
the arithmetic ops to compute the checksum. When enabling checksum
offload, the same stalls will occur, but will simply be attributed to
memcpy instead of to do_csum. A:B comparisons of absolute (-n) cycle
counts are usually very noisy, but it's worth a shot.


> > Is this not already handled by __copy_skb_header above? If ip_summed
> > has to be initialized, so have csum_start and csum_offset. That call
> > should have initialized all three.
>
> Thanks, I will look into why, even though __copy_skb_header is being
> called, I am still seeing skb->ip_summed set to CHECKSUM_NONE in the
> network driver.

Thanks.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-02-28 14:30         ` Willem de Bruijn
@ 2020-02-28 20:01           ` David Miller
  2020-03-02  6:51             ` [PATCH v2] " Yadu Kishore
  0 siblings, 1 reply; 28+ messages in thread
From: David Miller @ 2020-02-28 20:01 UTC (permalink / raw)
  To: willemdebruijn.kernel; +Cc: kyk.segfault, netdev

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Fri, 28 Feb 2020 09:30:56 -0500

> Can you contrast this against a run with your changes? The thought is
> that the majority of this cost is due to the memory loads and stores, not
> the arithmetic ops to compute the checksum. When enabling checksum
> offload, the same stalls will occur, but will simply be attributed to
> memcpy instead of to do_csum.

Agreed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-02-28 20:01           ` David Miller
@ 2020-03-02  6:51             ` Yadu Kishore
  2020-03-02  8:42               ` Yadu Kishore
  0 siblings, 1 reply; 28+ messages in thread
From: Yadu Kishore @ 2020-03-02  6:51 UTC (permalink / raw)
  To: davem; +Cc: willemdebruijn.kernel, netdev, Yadu Kishore

> > Can you contrast this against a run with your changes? The thought is
> > that the majority of this cost is due to the memory loads and stores, not
> > the arithmetic ops to compute the checksum. When enabling checksum
> > offload, the same stalls will occur, but will simply be attributed to
> > memcpy instead of to do_csum.

> Agreed.

Below is the data from perf with and without the patch for the same
TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)

Without patch :-
============
[Function = %cpu cycles]
skb_mac_gso_segment = 0.05
inet_gso_segment = 0.26
tcp4_gso_segment = 0.05
tcp_gso_segment = 0.17
skb_segment = 0.55
skb_copy_and_csum_bits = 0.60
do_csum = 7.43
memcpy = 3.81
__alloc_skb = 0.93
==================
SUM = 13.85


With patch :-
============
[Function = %cpu cycles]
skb_mac_gso_segment = 0.05
inet_gso_segment = 0.34
tcp4_gso_segment = 0.06
tcp_gso_segment = 0.26
skb_segment = 0.55
skb_copy_and_csum_bits = 0.00
do_csum = 0.04
memcpy = 4.29
__alloc_skb = 0.73
==================
SUM = 6.32

So, with the patch, from the above data we can see
that the percentage of cpu cycles spent in do_csum
has come down from 7.43% to 0.04%.

> > > Is this not already handled by __copy_skb_header above? If ip_summed
> > > has to be initialized, so have csum_start and csum_offset. That call
> > > should have initialized all three.

> > Thanks, I will look into why even though __copy_skb_header is being
> > called, I am still
> > seeing skb->ip_summed set to CHECKSUM_NONE in the network driver.

> Thanks.

My mistake. I had removed the 'skb->ip_summed = CHECKSUM_PARTIAL' line
from the patch but had forgotten to enable NETIF_F_HW_CSUM in the network
driver. Hence I was still seeing skb->ip_summed set to CHECKSUM_NONE.
After re-enabling NETIF_F_HW_CSUM in the driver, I now see that
skb->ip_summed is being set correctly to CHECKSUM_PARTIAL.
So, as suggested, __copy_skb_header is indeed taking care of this,
and hence there is no need to explicitly set 'ip_summed' in the patch.
Below is V2 of the patch with the changes.


Problem:
TCP checksum in the output path is not being offloaded during GSO
in the following case:
The network driver does not support scatter-gather but supports
checksum offload with NETIF_F_HW_CSUM.

Cause:
skb_segment calls skb_copy_and_csum_bits if the network driver
does not announce NETIF_F_SG. It does not check if the driver
supports NETIF_F_HW_CSUM.
So for devices which want to offload the checksum but do not support
SG, there is currently no way to do so when GSO is enabled.

Solution:
In skb_segment, check whether the network controller offloads the
checksum, and if so call skb_copy_bits instead of
skb_copy_and_csum_bits.

Testing:
Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
in the network driver. Observed the TCP checksum offload is not happening
since the skbs received by the driver in the output path have
skb->ip_summed set to CHECKSUM_NONE.

With the patch, ran iperf TCP traffic and observed that the TCP
checksum is being offloaded, with skb->ip_summed set to
CHECKSUM_PARTIAL.
Also tested the patch with NETIF_F_HW_CSUM disabled in the driver,
to cover the newly introduced if-else code path in skb_segment.

Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>
---
 net/core/skbuff.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1365a55..eca72bc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3926,14 +3926,21 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			goto perform_csum_check;
 
 		if (!sg) {
-			if (!nskb->remcsum_offload)
-				nskb->ip_summed = CHECKSUM_NONE;
-			SKB_GSO_CB(nskb)->csum =
-				skb_copy_and_csum_bits(head_skb, offset,
-						       skb_put(nskb, len),
-						       len, 0);
-			SKB_GSO_CB(nskb)->csum_start =
-				skb_headroom(nskb) + doffset;
+			if (!csum) {
+				if (!nskb->remcsum_offload)
+					nskb->ip_summed = CHECKSUM_NONE;
+				SKB_GSO_CB(nskb)->csum =
+					skb_copy_and_csum_bits(head_skb, offset,
+							       skb_put(nskb,
+								       len),
+							       len, 0);
+				SKB_GSO_CB(nskb)->csum_start =
+					skb_headroom(nskb) + doffset;
+			} else {
+				skb_copy_bits(head_skb, offset,
+					      skb_put(nskb, len),
+					      len);
+			}
 			continue;
 		}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-02  6:51             ` [PATCH v2] " Yadu Kishore
@ 2020-03-02  8:42               ` Yadu Kishore
  2020-03-02 15:19                 ` David Laight
  0 siblings, 1 reply; 28+ messages in thread
From: Yadu Kishore @ 2020-03-02  8:42 UTC (permalink / raw)
  To: David Miller; +Cc: Willem de Bruijn, Network Development

Hi all,

A small correction to the data I sent earlier:

Without patch :-
 ============
[Function = %cpu cycles]
skb_mac_gso_segment = 0.05
inet_gso_segment = 0.26
tcp4_gso_segment = 0.05
tcp_gso_segment = 0.17
skb_segment = 0.55
skb_copy_and_csum_bits = 0.60
do_csum = 7.43
memcpy = 3.81
__alloc_skb = 0.93
==================
SUM = 13.85

With patch :-
============
[Function = %cpu cycles]
skb_mac_gso_segment = 0.05
inet_gso_segment = 0.34
tcp4_gso_segment = 0.06
tcp_gso_segment = 0.26
skb_segment = 0.55
** skb_copy_bits = 0.62 **  <-- corrected
do_csum = 0.04
memcpy = 4.29
__alloc_skb = 0.73
==================
** SUM = 6.94 ** <-- corrected

Thanks,
Yadu Kishore

On Mon, Mar 2, 2020 at 12:22 PM Yadu Kishore <kyk.segfault@gmail.com> wrote:
>
> > > Can you contrast this against a run with your changes? The thought is
> > > that the majority of this cost is due to the memory loads and stores, not
> > > the arithmetic ops to compute the checksum. When enabling checksum
> > > offload, the same stalls will occur, but will simply be attributed to
> > > memcpy instead of to do_csum.
>
> > Agreed.
>
> Below is the data from perf with and without the patch for the same
> TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)
>
> Without patch :-
> ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.26
> tcp4_gso_segment = 0.05
> tcp_gso_segment = 0.17
> skb_segment = 0.55
> skb_copy_and_csum_bits = 0.60
> do_csum = 7.43
> memcpy = 3.81
> __alloc_skb = 0.93
> ==================
> SUM = 13.85
>
>
> With patch :-
> ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.34
> tcp4_gso_segment = 0.06
> tcp_gso_segment = 0.26
> skb_segment = 0.55
> skb_copy_and_csum_bits = 0.00
> do_csum = 0.04
> memcpy = 4.29
> __alloc_skb = 0.73
> ==================
> SUM = 6.32
>
> So, with the patch, from the above data we can see
> that the percentage of cpu cycles spent in do_csum
> has come down from 7.43% to 0.04%.
>
> > > > Is this not already handled by __copy_skb_header above? If ip_summed
> > > > has to be initialized, so have csum_start and csum_offset. That call
> > > > should have initialized all three.
>
> > > Thanks, I will look into why even though __copy_skb_header is being
> > > called, I am still
> > > seeing skb->ip_summed set to CHECKSUM_NONE in the network driver.
>
> > Thanks.
>
> My mistake. I had removed the 'skb->ip_summed = CHECKSUM_PARTIAL' line
> from the patch but had forgotten to enable NETIF_F_HW_CSUM in the network
> driver. Hence I was still seeing skb->ip_summed set to CHECKSUM_NONE.
> After re-enabling NETIF_F_HW_CSUM in the driver, I now see that
> skb->ip_summed is being set correctly to CHECKSUM_PARTIAL.
> So, as suggested, __copy_skb_header is indeed taking care of this,
> and hence there is no need to explicitly set 'ip_summed' in the patch.
> Below is V2 of the patch with the changes.
>
>
> Problem:
> TCP checksum in the output path is not being offloaded during GSO
> in the following case:
> The network driver does not support scatter-gather but supports
> checksum offload with NETIF_F_HW_CSUM.
>
> Cause:
> skb_segment calls skb_copy_and_csum_bits if the network driver
> does not announce NETIF_F_SG. It does not check if the driver
> supports NETIF_F_HW_CSUM.
> So for devices which want to offload the checksum but do not support
> SG, there is currently no way to do so when GSO is enabled.
>
> Solution:
> In skb_segment, check whether the network controller offloads the
> checksum, and if so call skb_copy_bits instead of
> skb_copy_and_csum_bits.
>
> Testing:
> Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
> in the network driver. Observed the TCP checksum offload is not happening
> since the skbs received by the driver in the output path have
> skb->ip_summed set to CHECKSUM_NONE.
>
> With the patch, ran iperf TCP traffic and observed that the TCP
> checksum is being offloaded, with skb->ip_summed set to
> CHECKSUM_PARTIAL.
> Also tested the patch with NETIF_F_HW_CSUM disabled in the driver,
> to cover the newly introduced if-else code path in skb_segment.
>
> Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
> Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>
> ---
>  net/core/skbuff.c | 23 +++++++++++++++--------
>  1 file changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1365a55..eca72bc 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3926,14 +3926,21 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>                         goto perform_csum_check;
>
>                 if (!sg) {
> -                       if (!nskb->remcsum_offload)
> -                               nskb->ip_summed = CHECKSUM_NONE;
> -                       SKB_GSO_CB(nskb)->csum =
> -                               skb_copy_and_csum_bits(head_skb, offset,
> -                                                      skb_put(nskb, len),
> -                                                      len, 0);
> -                       SKB_GSO_CB(nskb)->csum_start =
> -                               skb_headroom(nskb) + doffset;
> +                       if (!csum) {
> +                               if (!nskb->remcsum_offload)
> +                                       nskb->ip_summed = CHECKSUM_NONE;
> +                               SKB_GSO_CB(nskb)->csum =
> +                                       skb_copy_and_csum_bits(head_skb, offset,
> +                                                              skb_put(nskb,
> +                                                                      len),
> +                                                              len, 0);
> +                               SKB_GSO_CB(nskb)->csum_start =
> +                                       skb_headroom(nskb) + doffset;
> +                       } else {
> +                               skb_copy_bits(head_skb, offset,
> +                                             skb_put(nskb, len),
> +                                             len);
> +                       }
>                         continue;
>                 }
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-02  8:42               ` Yadu Kishore
@ 2020-03-02 15:19                 ` David Laight
  2020-03-03  9:15                   ` Yadu Kishore
  0 siblings, 1 reply; 28+ messages in thread
From: David Laight @ 2020-03-02 15:19 UTC (permalink / raw)
  To: 'Yadu Kishore', David Miller
  Cc: Willem de Bruijn, Network Development

From: Yadu Kishore
> Sent: 02 March 2020 08:43
> 
> On Mon, Mar 2, 2020 at 12:22 PM Yadu Kishore <kyk.segfault@gmail.com> wrote:
> >
> > > > Can you contrast this against a run with your changes? The thought is
> > > > that the majority of this cost is due to the memory loads and stores, not
> > > > the arithmetic ops to compute the checksum. When enabling checksum
> > > > offload, the same stalls will occur, but will simply be attributed to
> > > > memcpy instead of to do_csum.
> >
> > > Agreed.
> >
> > Below is the data from perf with and without the patch for the same
> > TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)
> >
> A small correction to the data I sent earlier:
> 
> Without patch :-
>  ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.26
> tcp4_gso_segment = 0.05
> tcp_gso_segment = 0.17
> skb_segment = 0.55
> skb_copy_and_csum_bits = 0.60
> do_csum = 7.43
> memcpy = 3.81
> __alloc_skb = 0.93
> ==================
> SUM = 13.85
> 
> With patch :-
> ============
> [Function = %cpu cycles]
> skb_mac_gso_segment = 0.05
> inet_gso_segment = 0.34
> tcp4_gso_segment = 0.06
> tcp_gso_segment = 0.26
> skb_segment = 0.55
> ** skb_copy_bits = 0.62 **  <-- corrected
> do_csum = 0.04
> memcpy = 4.29
> __alloc_skb = 0.73
> ==================
> ** SUM = 6.94 ** <-- corrected

I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
(My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
could be a lot faster.)
However, what stands out is the big difference in the times for do_csum().

The fastest loop I can find for a non-copying checksum is:
+       /*
+        * Align the byte count to a multiple of 16 then
+        * add 64 bit words to alternating registers.
+        * Finally reduce to 64 bits.
+        */
+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
+           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+           : [buff] "r" (buff)
+           : "memory" );

(I need to resubmit that patch)

Which is twice as fast as the one in csum-partial_64.c for
most cpu types (especially pre-Haswell ones).

IIRC that does 7 bytes/clock on my Ivy Bridge and
8 bytes/clock on Haswell.
Nothing prior to sandy bridge can beat adding 32bit words
to a 64bit register!
The faffing with 'len_tmp' made a marginal difference.

You can't go faster than 1 read/clock without using ad[oc]x.
The loop above does work (both carry and overflow are preserved)
but it needs further unrolling (and more edge code) and only
runs on a few cpus.

	David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-02 15:19                 ` David Laight
@ 2020-03-03  9:15                   ` Yadu Kishore
  2020-03-03  9:56                     ` David Laight
  0 siblings, 1 reply; 28+ messages in thread
From: Yadu Kishore @ 2020-03-03  9:15 UTC (permalink / raw)
  To: David Laight; +Cc: David Miller, Willem de Bruijn, Network Development

Hi David,

Thanks for the response.

> I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
> (My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
> could be a lot faster.)
> However the big difference in the times for do_csum().
>
> The fastest loop I can find for a non-copying checksum is:
> Which is twice as fast as the one in csum-partial_64.c for
> most cpu types (especially pre-Haswell ones).

The perf data I presented was collected on an arm64 platform (hikey960) where
the do_csum implementation that is called is not in assembly but in C
(lib/checksum.c).
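
For reference, the generic C version has to propagate the carry
manually, along these lines (a simplified sketch of the lib/checksum.c
inner loop, not verbatim):

	unsigned int result = 0, carry = 0;

	do {
		unsigned int w = *(unsigned int *)buff;

		buff += 4;
		result += carry;
		result += w;
		carry = (w > result);	/* did the 32-bit add wrap? */
	} while (--count);
	result += carry;
	result = (result & 0xffff) + (result >> 16);

The compiler is unlikely to turn this into an add-with-carry chain,
which is part of why do_csum features so prominently in the profile
above.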

Thanks,
Yadu Kishore

On Mon, Mar 2, 2020 at 8:49 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Yadu Kishore
> > Sent: 02 March 2020 08:43
> >
> > On Mon, Mar 2, 2020 at 12:22 PM Yadu Kishore <kyk.segfault@gmail.com> wrote:
> > >
> > > > > Can you contrast this against a run with your changes? The thought is
> > > > > that the majority of this cost is due to the memory loads and stores, not
> > > > > the arithmetic ops to compute the checksum. When enabling checksum
> > > > > offload, the same stalls will occur, but will simply be attributed to
> > > > > memcpy instead of to do_csum.
> > >
> > > > Agreed.
> > >
> > > Below is the data from perf with and without the patch for the same
> > > TCP Tx iperf run: (network driver has NETIF_F_HW_CSUM enabled)
> > >
> > A small correction to the data I sent earlier:
> >
> > Without patch :-
> >  ============
> > [Function = %cpu cycles]
> > skb_mac_gso_segment = 0.05
> > inet_gso_segment = 0.26
> > tcp4_gso_segment = 0.05
> > tcp_gso_segment = 0.17
> > skb_segment = 0.55
> > skb_copy_and_csum_bits = 0.60
> > do_csum = 7.43
> > memcpy = 3.81
> > __alloc_skb = 0.93
> > ==================
> > SUM = 13.85
> >
> > With patch :-
> > ============
> > [Function = %cpu cycles]
> > skb_mac_gso_segment = 0.05
> > inet_gso_segment = 0.34
> > tcp4_gso_segment = 0.06
> > tcp_gso_segment = 0.26
> > skb_segment = 0.55
> > ** skb_copy_bits = 0.62 **  <-- corrected
> > do_csum = 0.04
> > memcpy = 4.29
> > __alloc_skb = 0.73
> > ==================
> > ** SUM = 6.94 ** <-- corrected
>
> I was writing a reply about how 'horrid' the asm in csum_partial_copy_generic is.
> (My guess is no better than 3 clocks for 16 bytes - and a 'rep movs' copy
> could be a lot faster.)
> However, what stands out is the big difference in the times for do_csum().
>
> The fastest loop I can find for a non-copying checksum is:
> +       /*
> +        * Align the byte count to a multiple of 16 then
> +        * add 64 bit words to alternating registers.
> +        * Finally reduce to 64 bits.
> +        */
> +       asm(    "       bt    $4, %[len]\n"
> +               "       jnc   10f\n"
> +               "       add   (%[buff], %[len]), %[sum_0]\n"
> +               "       adc   8(%[buff], %[len]), %[sum_1]\n"
> +               "       lea   16(%[len]), %[len]\n"
> +               "10:    jecxz 20f\n"
> +               "       adc   (%[buff], %[len]), %[sum_0]\n"
> +               "       adc   8(%[buff], %[len]), %[sum_1]\n"
> +               "       lea   32(%[len]), %[len_tmp]\n"
> +               "       adc   16(%[buff], %[len]), %[sum_0]\n"
> +               "       adc   24(%[buff], %[len]), %[sum_1]\n"
> +               "       mov   %[len_tmp], %[len]\n"
> +               "       jmp   10b\n"
> +               "20:    adc   %[sum_0], %[sum]\n"
> +               "       adc   %[sum_1], %[sum]\n"
> +               "       adc   $0, %[sum]\n"
> +           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
> +               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
> +           : [buff] "r" (buff)
> +           : "memory" );
>
> (I need to resubmit that patch)
>
> Which is twice as fast as the one in csum-partial_64.c for
> most cpu types (especially pre-Haswell ones).
>
> IIRC that does 7 bytes/clock on my Ivy Bridge and
> 8 bytes/clock on Haswell.
> Nothing prior to sandy bridge can beat adding 32bit words
> to a 64bit register!
> The faffing with 'len_tmp' made a marginal difference.
>
> You can't go faster than 1 read/clock without using ad[oc]x.
> The loop above does work (both carry and overflow are preserved)
> but it needs further unrolling (and more edge code) and only
> runs on a few cpus.
>
>         David
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-03  9:15                   ` Yadu Kishore
@ 2020-03-03  9:56                     ` David Laight
  2020-03-05  6:32                       ` Yadu Kishore
  0 siblings, 1 reply; 28+ messages in thread
From: David Laight @ 2020-03-03  9:56 UTC (permalink / raw)
  To: 'Yadu Kishore'
  Cc: David Miller, Willem de Bruijn, Network Development

From: Yadu Kishore
> Sent: 03 March 2020 09:15
...
> The perf data I presented was collected on an arm64 platform (hikey960) where
> the do_csum implementation that is called is not in assembly but in C
> (lib/checksum.c).

It is a long time since I've written any arm assembler, but an
asm checksum loop ought to be faster than a C one because using
'add with carry' ought to be a gain.
(Unlike mips style instruction sets without a carry flag.)

However, what is more interesting is that do_csum() is being
called at all.
It implies that a large data block is being checksummed 'in situ'
whereas the expectation is that 'linearising' the skb requires
all the data be copied - so the checksum would be done during the
copy.

Additionally, unless 'load + store' and 'load + store + adc' can be
executed in the same number of clocks (without excessive loop
unrolling), doing the checksum in the copy loop isn't 'free'.

For x86 (including old intel cpu where adc is 2 clocks)
the 'checksum in copy' isn't free.

Clearly, if you have to do a copy and a software checksum
it is very likely that doing them together is faster.
(Although a fast 'rep movs' copy and an ad[co]x (or AVX2?)
checksum may be faster on very recent Intel cpu for large
enough buffers.)

	David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-03  9:56                     ` David Laight
@ 2020-03-05  6:32                       ` Yadu Kishore
  2020-03-05 16:06                         ` Willem de Bruijn
  0 siblings, 1 reply; 28+ messages in thread
From: Yadu Kishore @ 2020-03-05  6:32 UTC (permalink / raw)
  To: David Laight, Willem de Bruijn, David Miller; +Cc: Network Development

Hi all,

Though there is scope to optimise the checksum code (from C to asm) for such
architectures, it is not the intent of this patchset.
The intent here is only to enable offloading checksum during GSO.

The perf data I presented shows that ~7.4% of the CPU is spent doing checksum
in the GSO path for architectures where the checksum code is not implemented in
assembly (arm64).
If the network controller hardware supports checksumming, then I feel
that it is worthwhile to offload this even during GSO for such architectures
and save that 7.25% of host cpu cycles.

Thanks,
Yadu Kishore

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-05  6:32                       ` Yadu Kishore
@ 2020-03-05 16:06                         ` Willem de Bruijn
  2020-03-05 17:00                           ` David Laight
  0 siblings, 1 reply; 28+ messages in thread
From: Willem de Bruijn @ 2020-03-05 16:06 UTC (permalink / raw)
  To: Yadu Kishore
  Cc: David Laight, Willem de Bruijn, David Miller, Network Development

On Thu, Mar 5, 2020 at 1:33 AM Yadu Kishore <kyk.segfault@gmail.com> wrote:
>
> Hi all,
>
> Though there is scope to optimise the checksum code (from C to asm) for such
> architectures, it is not the intent of this patchset.
> The intent here is only to enable offloading checksum during GSO.
>
> The perf data I presented shows that ~7.4% of the CPU is spent doing checksum
> in the GSO path for architectures where the checksum code is not implemented in
> assembly (arm64).
> If the network controller hardware supports checksumming, then I feel
> that it is worthwhile to offload this even during GSO for such architectures
> and save that 7.25% of host cpu cycles.

Yes, given the discussion I have no objections. The change to
skb_segment in v2 looks fine.

Thanks for sharing the in-depth analysis, David. I expected that ~300
cycles per memory access would always dwarf the arithmetic cost.
Perhaps the back-of-the-envelope estimate is that 300/64 ~= 5 cyc/B,
which is on the order of the ~2 cyc/B for the checksum arithmetic, so
the operation is not entirely insignificant. Or, more likely, the
memory access cost is simply markedly lower here because the data is
still warm in a cache.

It seems do_csum is called because csum_partial_copy executes the
two operations independently:

__wsum
csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
{
        memcpy(dst, src, len);
        return csum_partial(dst, len, sum);
}

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-05 16:06                         ` Willem de Bruijn
@ 2020-03-05 17:00                           ` David Laight
  2020-03-05 17:19                             ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: David Laight @ 2020-03-05 17:00 UTC (permalink / raw)
  To: 'Willem de Bruijn', Yadu Kishore
  Cc: David Miller, Network Development

From: Willem de Bruijn
> Sent: 05 March 2020 16:07
..
> It seems do_csum is called because csum_partial_copy executes the
> two operations independently:
> 
> __wsum
> csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
> {
>         memcpy(dst, src, len);
>         return csum_partial(dst, len, sum);
> }

And do_csum() is superbly horrid.
Not the least because it is 32bit on 64bit systems.

A better inner loop (even on 32bit) would be:
	u64 result = 0;	/* or better, the old 'sum' the caller wants to add in */
	...
	do {
		result += *(u32 *)buff;
		buff += 4;
	} while (buff < end);

(That is as fast as the x86 'adc' loop on intel x86 cpus
prior to Haswell!)
Tweaking the above might cause gcc to generate code that
executes one iteration per clock on some superscaler cpus.
Adding alternate words to different registers may be
beneficial on cpus that can do two memory reads in 1 clock.

Then reduce from 64bits down to 16. Maybe with:
	if (odd)
		result <<= 8;
	result = (u32)result + (result >> 32);
	result32 = (u16)result + (result >> 16);
	result32 = (u16)result32 + (result32 >> 16);
	result32 = (u16)result32 + (result32 >> 16);
I think you need 4 reduces.
Modulo 0xffff might generate faster code, but the result domain
is then 0..0xfffe not 1..0xffff.
Which is actually better because it is correct when inverted.

If reducing with adds, it is best to initialise the csum to 1.
Then add 1 after inverting.
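
Spelled out as a self-contained sketch (hypothetical helper, ignoring
the odd-address byte swap, with the four reduces noted above):

	static unsigned short fold_csum64(unsigned long long result)
	{
		unsigned int result32;

		result = (result & 0xffffffff) + (result >> 32);  /* <= 33 bits */
		result32 = (unsigned short)result + (unsigned int)(result >> 16);
		result32 = (unsigned short)result32 + (result32 >> 16);
		result32 = (unsigned short)result32 + (result32 >> 16);
		return result32;	/* now in 0..0xffff */
	}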

	David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-05 17:00                           ` David Laight
@ 2020-03-05 17:19                             ` Eric Dumazet
  2020-03-06 17:12                               ` David Laight
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-03-05 17:19 UTC (permalink / raw)
  To: David Laight, 'Willem de Bruijn', Yadu Kishore
  Cc: David Miller, Network Development



On 3/5/20 9:00 AM, David Laight wrote:
> From: Willem de Bruijn
>> Sent: 05 March 2020 16:07
> ..
>> It seems do_csum is called because csum_partial_copy executes the
>> two operations independently:
>>
>> __wsum
>> csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
>> {
>>         memcpy(dst, src, len);
>>         return csum_partial(dst, len, sum);
>> }
> 
> And do_csum() is superbly horrid.
> Not the least because it is 32bit on 64bit systems.

There are many versions; which one is discussed here?

At least the current one seems to be 64bit optimized.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5777eaed566a1d63e344d3dd8f2b5e33be20643e


> 
> A better inner loop (even on 32bit) would be:
> 	u64 result = 0;	/* or better, the old 'sum' the caller wants to add in */
> 	...
> 	do {
> 		result += *(u32 *)buff;
> 		buff += 4;
> 	} while (buff < end);
> 
> (That is as fast as the x86 'adc' loop on intel x86 cpus
> prior to Haswell!)
> Tweaking the above might cause gcc to generate code that
> executes one iteration per clock on some superscaler cpus.
> Adding alternate words to different registers may be
> beneficial on cpus that can do two memory reads in 1 clock.
> 
> Then reduce from 64bits down to 16. Maybe with:
> 	if (odd)
> 		result <<= 8;
> 	result = (u32)result + (result >> 32);
> 	result32 = (u16)result + (result >> 16);
> 	result32 = (u16)result32 + (result32 >> 16);
> 	result32 = (u16)result32 + (result32 >> 16);
> I think you need 4 reduces.
> Modulo 0xffff might generate faster code, but the result domain
> is then 0..0xfffe not 1..0xffff.
> Which is actually better because it is correct when inverted.
> 
> If reducing with adds, it is best to initialise the csum to 1.
> Then add 1 after inverting.
> 
> 	David
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-05 17:19                             ` Eric Dumazet
@ 2020-03-06 17:12                               ` David Laight
  2020-03-13  6:36                                 ` Yadu Kishore
  2020-03-22 19:53                                 ` [PATCH v2] " Tom Herbert
  0 siblings, 2 replies; 28+ messages in thread
From: David Laight @ 2020-03-06 17:12 UTC (permalink / raw)
  To: 'Eric Dumazet', 'Willem de Bruijn', Yadu Kishore
  Cc: David Miller, Network Development

From: Eric Dumazet
> Sent: 05 March 2020 17:20
> 
> On 3/5/20 9:00 AM, David Laight wrote:
> > From: Willem de Bruijn
> >> Sent: 05 March 2020 16:07
> > ..
> >> It seems do_csum is called because csum_partial_copy executes the
> >> two operations independently:
> >>
> >> __wsum
> >> csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
> >> {
> >>         memcpy(dst, src, len);
> >>         return csum_partial(dst, len, sum);
> >> }
> >
> > And do_csum() is superbly horrid.
> > Not the least because it is 32bit on 64bit systems.
> 
> There are many versions; which one is discussed here?
> 
> At least the current one seems to be 64bit optimized.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5777eaed566a1d63e344d3dd8f2b5e33be20643e

I was looking at the generic one in $(SRC)/lib/checksum.c.

FWIW I suspect the fastest code on pre sandy bridge 64bit intel cpus
(where adc is 2 clocks) is to do a normal 'add', shift the carries
into a 64bit register and do a software 'popcnt' every 512 bytes.
That may run at 8 bytes/clock + the popcnt.

	David


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-06 17:12                               ` David Laight
@ 2020-03-13  6:36                                 ` Yadu Kishore
  2020-03-13 13:57                                   ` Willem de Bruijn
  2020-03-13 18:05                                   ` David Miller
  2020-03-22 19:53                                 ` [PATCH v2] " Tom Herbert
  1 sibling, 2 replies; 28+ messages in thread
From: Yadu Kishore @ 2020-03-13  6:36 UTC (permalink / raw)
  To: Willem de Bruijn, David Miller
  Cc: Eric Dumazet, Network Development, David Laight

> Yes, given the discussion I have no objections. The change to
> skb_segment in v2 looks fine.

I'm assuming that the changes in patch V2 are ok to be accepted and merged
to the kernel. Please let me know if there is anything else that is pending
from my side with respect to the patch.

Thanks,
Yadu Kishore

On Fri, Mar 6, 2020 at 10:42 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 05 March 2020 17:20
> >
> > On 3/5/20 9:00 AM, David Laight wrote:
> > > From: Willem de Bruijn
> > >> Sent: 05 March 2020 16:07
> > > ..
> > >> It seems do_csum is called because csum_partial_copy executes the
> > >> two operations independently:
> > >>
> > >> __wsum
> > >> csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
> > >> {
> > >>         memcpy(dst, src, len);
> > >>         return csum_partial(dst, len, sum);
> > >> }
> > >
> > > And do_csum() is superbly horrid.
> > > Not the least because it is 32bit on 64bit systems.
> >
> > There are many versions, which one is discussed here ?
> >
> > At least the current one seems to be 64bit optimized.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5777eaed566a1d63e344d3dd8f2b5e33be20643e
>
> I was looking at the generic one in $(SRC)/lib/checksum.c.
>
> FWIW I suspect the fastest code on pre sandy bridge 64bit intel cpus
> (where adc is 2 clocks) is to do a normal 'add', shift the carries
> into a 64bit register and do a software 'popcnt' every 512 bytes.
> That may run at 8 bytes/clock + the popcnt.
>
>         David
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-13  6:36                                 ` Yadu Kishore
@ 2020-03-13 13:57                                   ` Willem de Bruijn
  2020-03-13 14:04                                     ` David Laight
  2020-03-13 18:05                                   ` David Miller
  1 sibling, 1 reply; 28+ messages in thread
From: Willem de Bruijn @ 2020-03-13 13:57 UTC (permalink / raw)
  To: Yadu Kishore
  Cc: David Miller, Eric Dumazet, Network Development, David Laight

On Fri, Mar 13, 2020 at 2:36 AM Yadu Kishore <kyk.segfault@gmail.com> wrote:
>
> > Yes, given the discussion I have no objections. The change to
> > skb_segment in v2 look fine.
>
> I'm assuming that the changes in patch V2 are ok to be accepted and merged
> to the kernel. Please let me know if there is anything else that is pending
> from my side with respect to the patch.

I think you can rebase and submit against net-next.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-13 13:57                                   ` Willem de Bruijn
@ 2020-03-13 14:04                                     ` David Laight
  0 siblings, 0 replies; 28+ messages in thread
From: David Laight @ 2020-03-13 14:04 UTC (permalink / raw)
  To: 'Willem de Bruijn', Yadu Kishore
  Cc: David Miller, Eric Dumazet, Network Development

> On Fri, Mar 13, 2020 at 2:36 AM Yadu Kishore <kyk.segfault@gmail.com> wrote:
> >
> > > Yes, given the discussion I have no objections. The change to
> > > skb_segment in v2 look fine.
> >
> > I'm assuming that the changes in patch V2 are ok to be accepted and merged
> > to the kernel. Please let me know if there is anything else that is pending
> > from my side with respect to the patch.
> 
> I think you can rebase and submit against net-next.

It's also worth mentioning (probably in 0/n) that
this is likely to be a gain even on x86 where the
checksum is calculated during the copy loop.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-13  6:36                                 ` Yadu Kishore
  2020-03-13 13:57                                   ` Willem de Bruijn
@ 2020-03-13 18:05                                   ` David Miller
  2020-03-17  8:38                                     ` [PATCH v3] " Yadu Kishore
  1 sibling, 1 reply; 28+ messages in thread
From: David Miller @ 2020-03-13 18:05 UTC (permalink / raw)
  To: kyk.segfault; +Cc: willemdebruijn.kernel, eric.dumazet, netdev, David.Laight

From: Yadu Kishore <kyk.segfault@gmail.com>
Date: Fri, 13 Mar 2020 12:06:34 +0530

>> Yes, given the discussion I have no objections. The change to
>> skb_segment in v2 looks fine.
> 
> I'm assuming that the changes in patch V2 are ok to be accepted and merged
> to the kernel. Please let me know if there is anything else that is pending
> from my side with respect to the patch.

If your patch isn't active in the networking development patchwork instance,
it is not pending to be applied and you must resend it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v3] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-13 18:05                                   ` David Miller
@ 2020-03-17  8:38                                     ` Yadu Kishore
  2020-03-22  3:06                                       ` David Miller
                                                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Yadu Kishore @ 2020-03-17  8:38 UTC (permalink / raw)
  To: davem
  Cc: David.Laight, eric.dumazet, kyk.segfault, netdev, willemdebruijn.kernel


> I think you can rebase and submit against net-next.

> If your patch isn't active in the networking development patchwork instance,
> it is not pending to be applied and you must resend it.

Rebasing the patch on net-next and resending it.

Problem:
TCP checksum in the output path is not being offloaded during GSO
in the following case:
The network driver does not support scatter-gather but supports
checksum offload with NETIF_F_HW_CSUM.

Cause:
skb_segment calls skb_copy_and_csum_bits if the network driver
does not announce NETIF_F_SG. It does not check if the driver
supports NETIF_F_HW_CSUM.
So for devices which might want to offload checksum but do not support SG
there is currently no way to do so if GSO is enabled.

Solution:
In skb_segment check if the network controller does checksum and if so
call skb_copy_bits instead of skb_copy_and_csum_bits.

Testing:
Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
in the network driver. Observed the TCP checksum offload is not happening
since the skbs received by the driver in the output path have
skb->ip_summed set to CHECKSUM_NONE.

With the patch ran iperf TCP traffic and observed that TCP checksum
is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL.
Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver
to cover the newly introduced if-else code path in skb_segment.

Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>
---
 net/core/skbuff.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e1101a4..621b447 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3926,14 +3926,21 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 			goto perform_csum_check;
 
 		if (!sg) {
-			if (!nskb->remcsum_offload)
-				nskb->ip_summed = CHECKSUM_NONE;
-			SKB_GSO_CB(nskb)->csum =
-				skb_copy_and_csum_bits(head_skb, offset,
-						       skb_put(nskb, len),
-						       len, 0);
-			SKB_GSO_CB(nskb)->csum_start =
-				skb_headroom(nskb) + doffset;
+			if (!csum) {
+				if (!nskb->remcsum_offload)
+					nskb->ip_summed = CHECKSUM_NONE;
+				SKB_GSO_CB(nskb)->csum =
+					skb_copy_and_csum_bits(head_skb, offset,
+							       skb_put(nskb,
+								       len),
+							       len, 0);
+				SKB_GSO_CB(nskb)->csum_start =
+					skb_headroom(nskb) + doffset;
+			} else {
+				skb_copy_bits(head_skb, offset,
+					      skb_put(nskb, len),
+					      len);
+			}
 			continue;
 		}
 
-- 
2.7.4
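
For context, an illustrative sketch (not part of the patch) of how a
driver advertises this feature combination; the function name is
hypothetical:

	#include <linux/netdevice.h>

	/* Probe-time setup for a controller that can checksum but
	 * cannot do scatter-gather DMA.
	 */
	static void example_set_features(struct net_device *dev)
	{
		dev->features |= NETIF_F_HW_CSUM; /* hw fills in L4 csums */
		/* NETIF_F_SG deliberately left clear: no S/G support */
		dev->hw_features |= dev->features;
	}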


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-17  8:38                                     ` [PATCH v3] " Yadu Kishore
@ 2020-03-22  3:06                                       ` David Miller
  2020-03-22 14:40                                       ` Willem de Bruijn
  2020-03-23 20:00                                       ` David Miller
  2 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2020-03-22  3:06 UTC (permalink / raw)
  To: kyk.segfault; +Cc: David.Laight, eric.dumazet, netdev, willemdebruijn.kernel

From: Yadu Kishore <kyk.segfault@gmail.com>
Date: Tue, 17 Mar 2020 14:08:38 +0530

> 
>> I think you can rebase and submit against net-next.
> 
>> If your patch isn't active in the networking development patchwork instance,
>> it is not pending to be applied and you must resend it.
> 
> Rebasing the patch on net-next and resending it.
> 
> Problem:
> TCP checksum in the output path is not being offloaded during GSO
> in the following case:
> The network driver does not support scatter-gather but supports
> checksum offload with NETIF_F_HW_CSUM.
> 
> Cause:
> skb_segment calls skb_copy_and_csum_bits if the network driver
> does not announce NETIF_F_SG. It does not check if the driver
> supports NETIF_F_HW_CSUM.
> So for devices which might want to offload checksum but do not support SG
> there is currently no way to do so if GSO is enabled.
> 
> Solution:
> In skb_segment check if the network controller does checksum and if so
> call skb_copy_bits instead of skb_copy_and_csum_bits.
> 
> Testing:
> Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
> in the network driver. Observed the TCP checksum offload is not happening
> since the skbs received by the driver in the output path have
> skb->ip_summed set to CHECKSUM_NONE.
> 
> With the patch ran iperf TCP traffic and observed that TCP checksum
> is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL.
> Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver
> to cover the newly introduced if-else code path in skb_segment.
> 
> Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
> Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>

Willem, please review and ACK/NACK.

Thank you.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-17  8:38                                     ` [PATCH v3] " Yadu Kishore
  2020-03-22  3:06                                       ` David Miller
@ 2020-03-22 14:40                                       ` Willem de Bruijn
  2020-03-23 20:00                                       ` David Miller
  2 siblings, 0 replies; 28+ messages in thread
From: Willem de Bruijn @ 2020-03-22 14:40 UTC (permalink / raw)
  To: Yadu Kishore
  Cc: David Miller, David Laight, Eric Dumazet, Network Development,
	Willem de Bruijn, Alexander Duyck, Tom Herbert

On Tue, Mar 17, 2020 at 4:39 AM Yadu Kishore <kyk.segfault@gmail.com> wrote:
>
>
> > I think you can rebase and submit against net-next.
>
> > If your patch isn't active in the networking development patchwork instance,
> > it is not pending to be applied and you must resend it.
>
> Rebasing the patch on net-next and resending it.
>
> Problem:
> TCP checksum in the output path is not being offloaded during GSO
> in the following case:
> The network driver does not support scatter-gather but supports
> checksum offload with NETIF_F_HW_CSUM.
>
> Cause:
> skb_segment calls skb_copy_and_csum_bits if the network driver
> does not announce NETIF_F_SG. It does not check if the driver
> supports NETIF_F_HW_CSUM.
> So for devices which might want to offload checksum but do not support SG
> there is currently no way to do so if GSO is enabled.
>
> Solution:
> In skb_segment check if the network controller does checksum and if so
> call skb_copy_bits instead of skb_copy_and_csum_bits.
>
> Testing:
> Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
> in the network driver. Observed the TCP checksum offload is not happening
> since the skbs received by the driver in the output path have
> skb->ip_summed set to CHECKSUM_NONE.
>
> With the patch ran iperf TCP traffic and observed that TCP checksum
> is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL.
> Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver
> to cover the newly introduced if-else code path in skb_segment.
>
> Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
> Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>

Acked-by: Willem de Bruijn <willemb@google.com>

Given that there are multiple microarchitectures on which this is
apparently a win, it makes sense (even if we should also optimize
the arch-specific csum code).

The tricky part here is not the basic operation, but how this behaves
with the skb_gso_cb infra for handling multiple checksums, some of
which may be computed in software, such as with remote checksum
offload.

In the case of CHECKSUM_PARTIAL the csum is not computed. Innermost
segmentation protocols (tcp/udp) will call gso_reset_checksum on
return from skb_segment to set up these skb_gso_cb fields. So this
looks correct to me.
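
For reference, gso_reset_checksum is roughly this small helper
(paraphrased from memory - check your tree for the exact definition):

	static inline void gso_reset_checksum(struct sk_buff *skb, __wsum res)
	{
		SKB_GSO_CB(skb)->csum = res;
		SKB_GSO_CB(skb)->csum_start =
			skb_checksum_start(skb) - skb->head;
	}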

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-06 17:12                               ` David Laight
  2020-03-13  6:36                                 ` Yadu Kishore
@ 2020-03-22 19:53                                 ` Tom Herbert
  2020-03-23  9:15                                   ` David Laight
  1 sibling, 1 reply; 28+ messages in thread
From: Tom Herbert @ 2020-03-22 19:53 UTC (permalink / raw)
  To: David Laight
  Cc: Eric Dumazet, Willem de Bruijn, Yadu Kishore, David Miller,
	Network Development

On Fri, Mar 6, 2020 at 9:12 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Eric Dumazet
> > Sent: 05 March 2020 17:20
> >
> > On 3/5/20 9:00 AM, David Laight wrote:
> > > From: Willem de Bruijn
> > >> Sent: 05 March 2020 16:07
> > > ..
> > >> It seems do_csum is called because csum_partial_copy executes the
> > >> two operations independently:
> > >>
> > >> __wsum
> > >> csum_partial_copy(const void *src, void *dst, int len, __wsum sum)
> > >> {
> > >>         memcpy(dst, src, len);
> > >>         return csum_partial(dst, len, sum);
> > >> }
> > >
> > > And do_csum() is superbly horrid.
> > > Not the least because it is 32bit on 64bit systems.
> >
> > There are many versions, which one is discussed here ?
> >
> > At least the current one seems to be 64bit optimized.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5777eaed566a1d63e344d3dd8f2b5e33be20643e
>
> I was looking at the generic one in $(SRC)/lib/checksum.c.
>
> FWIW I suspect the fastest code on pre-Sandy Bridge 64-bit Intel CPUs
> (where adc is 2 clocks) is to do a normal 'add', shift the carries
> into a 64-bit register and do a software 'popcnt' every 512 bytes.
> That may run at 8 bytes/clock + the popcnt.

A while back I proposed an optimized x86 checksum function using
unrolled addq: https://lwn.net/Articles/679137/. It also tries to
optimize small checksums, such as over a header when doing
skb_postpull_rcsum.

Tom

>
>         David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: [PATCH v2] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-22 19:53                                 ` [PATCH v2] " Tom Herbert
@ 2020-03-23  9:15                                   ` David Laight
  0 siblings, 0 replies; 28+ messages in thread
From: David Laight @ 2020-03-23  9:15 UTC (permalink / raw)
  To: 'Tom Herbert'
  Cc: Eric Dumazet, Willem de Bruijn, Yadu Kishore, David Miller,
	Network Development

[-- Attachment #1: Type: text/plain, Size: 3052 bytes --]

From: Tom Herbert
> Sent: 22 March 2020 19:54
...
> A while back I proposed an optimized x86 checksum function using
> unrolled addq: https://lwn.net/Articles/679137/. It also tries to
> optimize small checksums, such as over a header when doing
> skb_postpull_rcsum.

Look for the patch I submitted a few weeks back.
Attached; the loop body is:

+       asm(    "       bt    $4, %[len]\n"
+               "       jnc   10f\n"
+               "       add   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   16(%[len]), %[len]\n"
+               "10:    jecxz 20f\n"
+               "       adc   (%[buff], %[len]), %[sum_0]\n"
+               "       adc   8(%[buff], %[len]), %[sum_1]\n"
+               "       lea   32(%[len]), %[len_tmp]\n"
+               "       adc   16(%[buff], %[len]), %[sum_0]\n"
+               "       adc   24(%[buff], %[len]), %[sum_1]\n"
+               "       mov   %[len_tmp], %[len]\n"
+               "       jmp   10b\n"
+               "20:    adc   %[sum_0], %[sum]\n"
+               "       adc   %[sum_1], %[sum]\n"
+               "       adc   $0, %[sum]\n"
+           : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+               [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+           : [buff] "r" (buff)
+           : "memory" );

You don't need a heavily unrolled loop - just a loop that doesn't
have dependencies on parts of the flags register
(in particular, don't require 'dec' to preserve carry).
You also need to adc to alternating destinations for some chips.

With the fast 'rep movs' on modern CPUs I suspect that trying
to do the copy at the same time as the checksum is a loss.

The 'adc memory' form is one instruction and 2 uops.
Fairly easy to execute at 1/clock.
IIRC on Haswell you don't need to unroll the loop at all.
(You can use dec and adc.)
A checksum+copy loop is load+adc+store, which is 3 instructions
and 5 uops - that starts hitting the limits on uops/clock
and retirement, so it won't run in 1 clock and the loop
overhead will always be extra.
Whereas a 'rep movs' is likely to move 1 cache line/clock.
ISTR 'rep movsq' is faster than 8 bytes/clock on my Ivy Bridge.

On Haswell you can approach 16 bytes/clock using ad[oc]x.
But it is painful, and I'm not sure that all CPUs that support
the instructions actually execute them quickly.

If you can't use adc (e.g. in C or on MIPS) there are two
obvious solutions; a rough C sketch of option 1 follows below:

1) Add 32bit words to a 64bit register.
   On x86 this is as fast as adc on everything prior to Sandy Bridge!

2) Add 64bit words to a 64bit reg and (word >> 32) to a second.
   Fixup for the missing carry (check that the high bit of the main
   sum matches the relevant bit of the shifted sum).
   This could be faster than the horrid arm64 asm code.
   1 memory load, 1 add, 1 'shift + add' per 8 bytes.
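
A rough, untested sketch of option 1 (the function name is mine):

	/* Sum 32-bit words into a 64-bit accumulator: no carry flag
	 * needed, and the accumulator cannot overflow for any sane
	 * buffer length (that would take 2^32 additions).
	 */
	static u64 sum32_into_64(const u32 *buf, size_t nwords)
	{
		u64 sum = 0;

		while (nwords--)
			sum += *buf++;

		/* fold with add32_with_carry()/csum_fold() afterwards */
		return sum;
	}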

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

[-- Attachment #2: 0001-Optimise-x86-IP-checksum-code.patch --]
[-- Type: application/octet-stream, Size: 7395 bytes --]

From b232a96556c92dba2e4a856cfc2b0b0df9f4f44b Mon Sep 17 00:00:00 2001
From: David Laight <david.laight@aculab.com>
Date: Tue, 3 Dec 2019 10:41:03 +0000
Subject: [PATCH] Optimise x86 IP checksum code.

Performance improvements to the amd64 IP checksum code.
Summing to alternate registers almost doubles the performance
(probably 4 to 6.2 bytes/clock) on Ivy Bridge cpu.
Loop carrying the carry flag improves Haswell from 7 to 8 bytes/clock.
Older cpu will still approach 4 bytes/clock.
All achieved with less loop unrolling - improving the performance
for small buffers that are not a multiple of the loop span.

Signed-off-by: David Laight <david.laight@aculab.com>
---
 arch/x86/lib/csum-partial_64.c | 192 +++++++++++++++++++++--------------------
 1 file changed, 98 insertions(+), 94 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index e7925d6..7f25ab2 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -10,113 +10,118 @@
 #include <linux/export.h>
 #include <asm/checksum.h>
 
-static inline unsigned short from32to16(unsigned a) 
-{
-	unsigned short b = a >> 16; 
-	asm("addw %w2,%w0\n\t"
-	    "adcw $0,%w0\n" 
-	    : "=r" (b)
-	    : "0" (b), "r" (a));
-	return b;
-}
-
 /*
  * Do a 64-bit checksum on an arbitrary memory area.
  * Returns a 32bit checksum.
  *
  * This isn't as time critical as it used to be because many NICs
  * do hardware checksumming these days.
+ *
+ * All Intel cpus prior to Ivy Bridge (maybe Sandy Bridge) have a 2 clock
+ * latency on the result of adc.
+ * This limits any adc loop to 4 bytes/clock - the same as a C loop
+ * that adds 32 bit values to a 64 bit register.
+ * On Ivy Bridge the adc result latency is still 2 clocks, but the carry
+ * latency is reduced to 1 clock.
+ * So summing to alternate registers increases the throughput to approaching
+ * 8 bytes/clock.
+ * Older cpus (eg Core 2) have a 6+ clock delay accessing any of the flags
+ * after a partial update (eg adc after inc).
+ * The strange 'jecxz' loop avoids this.
+ * The Ivy Bridge then needs the loop unrolling once more to approach
+ * 8 bytes per clock.
  * 
- * Things tried and found to not make it faster:
- * Manual Prefetching
- * Unrolling to an 128 bytes inner loop.
- * Using interleaving with more registers to break the carry chains.
+ * On 7th gen cpu using adox/adoc can get 12 bytes/clock (maybe 16?)
+ * provided the loop is unrolled further than the one below.
+ * But some other cpu that support the adx take 6 clocks for each.
+ *
+ * The 'sum' value on entry is added in, it can exceed 32 bits but
+ * must not get to 56 bits.
  */
-static unsigned do_csum(const unsigned char *buff, unsigned len)
+static unsigned do_csum(const unsigned char *buff, long len, u64 sum)
 {
-	unsigned odd, count;
-	unsigned long result = 0;
+	unsigned int src_align;
+	u64 sum_0 = 0, sum_1;
+	long len_tmp;
+	bool odd = false;
 
-	if (unlikely(len == 0))
-		return result; 
-	odd = 1 & (unsigned long) buff;
-	if (unlikely(odd)) {
-		result = *buff << 8;
-		len--;
-		buff++;
-	}
-	count = len >> 1;		/* nr of 16-bit words.. */
-	if (count) {
-		if (2 & (unsigned long) buff) {
-			result += *(unsigned short *)buff;
-			count--;
-			len -= 2;
-			buff += 2;
+	/* 64bit align the base address */
+	src_align = (unsigned long)buff & 7;
+	if (src_align) {
+		if (unlikely(src_align & 1)) {
+			sum <<= 8;
+			/* The extra flag generates better code! */
+			odd = true;
 		}
-		count >>= 1;		/* nr of 32-bit words.. */
-		if (count) {
-			unsigned long zero;
-			unsigned count64;
-			if (4 & (unsigned long) buff) {
-				result += *(unsigned int *) buff;
-				count--;
-				len -= 4;
-				buff += 4;
-			}
-			count >>= 1;	/* nr of 64-bit words.. */
+		buff -= src_align;
+		len += src_align;
+		if (likely(src_align == 4))
+			sum_0 = *(u32 *)(buff + 4);
+		else
+			/* Mask off unwanted low bytes from full 64bit word */
+			sum_0 = *(u64 *)buff & (~0ull << (src_align * 8));
+		if (unlikely(len < 8)) {
+			/* Mask off the unwanted high bytes */
+			sum += sum_0 & ~(~0ull << (len * 8));
+			goto reduce_32;
+		}
+		len -= 8;
+		buff += 8;
+	}
 
-			/* main loop using 64byte blocks */
-			zero = 0;
-			count64 = count >> 3;
-			while (count64) { 
-				asm("addq 0*8(%[src]),%[res]\n\t"
-				    "adcq 1*8(%[src]),%[res]\n\t"
-				    "adcq 2*8(%[src]),%[res]\n\t"
-				    "adcq 3*8(%[src]),%[res]\n\t"
-				    "adcq 4*8(%[src]),%[res]\n\t"
-				    "adcq 5*8(%[src]),%[res]\n\t"
-				    "adcq 6*8(%[src]),%[res]\n\t"
-				    "adcq 7*8(%[src]),%[res]\n\t"
-				    "adcq %[zero],%[res]"
-				    : [res] "=r" (result)
-				    : [src] "r" (buff), [zero] "r" (zero),
-				    "[res]" (result));
-				buff += 64;
-				count64--;
-			}
+	/* Read first 8 bytes to 16 byte align the loop below */
+	sum_1 = len & 8 ? *(u64 *)buff : 0;
 
-			/* last up to 7 8byte blocks */
-			count %= 8; 
-			while (count) { 
-				asm("addq %1,%0\n\t"
-				    "adcq %2,%0\n" 
-					    : "=r" (result)
-				    : "m" (*(unsigned long *)buff), 
-				    "r" (zero),  "0" (result));
-				--count; 
-				buff += 8;
-			}
-			result = add32_with_carry(result>>32,
-						  result&0xffffffff); 
+	/* The main loop uses negative offsets from the end of the buffer */
+	buff += len;
 
-			if (len & 4) {
-				result += *(unsigned int *) buff;
-				buff += 4;
-			}
-		}
-		if (len & 2) {
-			result += *(unsigned short *) buff;
-			buff += 2;
-		}
+	/* Add in trailing bytes to 64bit align the length */
+	if (len & 7) {
+		unsigned int tail_len = len & 7;
+		buff -= tail_len;
+		if (likely(tail_len == 4))
+			sum += *(u32 *)buff;
+		else
+			/* Mask off the unwanted high bytes */
+			sum += *(u64 *)buff & ~(~0ull << (tail_len * 8));
 	}
-	if (len & 1)
-		result += *buff;
-	result = add32_with_carry(result>>32, result & 0xffffffff); 
-	if (unlikely(odd)) { 
-		result = from32to16(result);
-		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
-	}
-	return result;
+
+	/* Align and negate len so that we need to sum [buff[len]..buf[0]) */
+	len = -(len & ~15);
+
+	/*
+	 * Align the byte count to a multiple of 16 then
+	 * add 64 bit words to alternating registers.
+	 * Finally reduce to 64 bits.
+	 */
+	asm(	"	bt    $4, %[len]\n"
+		"	jnc   10f\n"
+		"	add   (%[buff], %[len]), %[sum_0]\n"
+		"	adc   8(%[buff], %[len]), %[sum_1]\n"
+		"	lea   16(%[len]), %[len]\n"
+		"10:	jecxz 20f\n"
+		"	adc   (%[buff], %[len]), %[sum_0]\n"
+		"	adc   8(%[buff], %[len]), %[sum_1]\n"
+		"	lea   32(%[len]), %[len_tmp]\n"
+		"	adc   16(%[buff], %[len]), %[sum_0]\n"
+		"	adc   24(%[buff], %[len]), %[sum_1]\n"
+		"	mov   %[len_tmp], %[len]\n"
+		"	jmp   10b\n"
+		"20:	adc   %[sum_0], %[sum]\n"
+		"	adc   %[sum_1], %[sum]\n"
+		"	adc   $0, %[sum]\n"
+	    : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+	    	[len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+	    : [buff] "r" (buff)
+	    : "memory" );
+
+reduce_32:
+	sum = add32_with_carry(sum>>32, sum & 0xffffffff); 
+
+	if (unlikely(odd))
+		return __builtin_bswap32(sum);
+
+	return sum;
 }
 
 /*
@@ -133,8 +138,7 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
  */
 __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
-	return (__force __wsum)add32_with_carry(do_csum(buff, len),
-						(__force u32)sum);
+	return do_csum(buff, len, (__force u32)sum);
 }
 EXPORT_SYMBOL(csum_partial);
 
-- 
1.8.1.2


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v3] net: Make skb_segment not to compute checksum if network controller supports checksumming
  2020-03-17  8:38                                     ` [PATCH v3] " Yadu Kishore
  2020-03-22  3:06                                       ` David Miller
  2020-03-22 14:40                                       ` Willem de Bruijn
@ 2020-03-23 20:00                                       ` David Miller
  2 siblings, 0 replies; 28+ messages in thread
From: David Miller @ 2020-03-23 20:00 UTC (permalink / raw)
  To: kyk.segfault; +Cc: David.Laight, eric.dumazet, netdev, willemdebruijn.kernel

From: Yadu Kishore <kyk.segfault@gmail.com>
Date: Tue, 17 Mar 2020 14:08:38 +0530

> Problem:
> TCP checksum in the output path is not being offloaded during GSO
> in the following case:
> The network driver does not support scatter-gather but supports
> checksum offload with NETIF_F_HW_CSUM.
> 
> Cause:
> skb_segment calls skb_copy_and_csum_bits if the network driver
> does not announce NETIF_F_SG. It does not check if the driver
> supports NETIF_F_HW_CSUM.
> So for devices which might want to offload checksum but do not support SG
> there is currently no way to do so if GSO is enabled.
> 
> Solution:
> In skb_segment check if the network controller does checksum and if so
> call skb_copy_bits instead of skb_copy_and_csum_bits.
> 
> Testing:
> Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
> in the network driver. Observed the TCP checksum offload is not happening
> since the skbs received by the driver in the output path have
> skb->ip_summed set to CHECKSUM_NONE.
> 
> With the patch ran iperf TCP traffic and observed that TCP checksum
> is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL.
> Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver
> to cover the newly introduced if-else code path in skb_segment.
> 
> Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
> Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>

Applied to net-next.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-03-23 20:00 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-04  5:55 TCP checksum not offloaded during GSO Yadu Kishore
2020-02-05 21:43 ` Willem de Bruijn
2020-02-21  5:13   ` [PATCH] net: Make skb_segment not to compute checksum if network controller supports checksumming Yadu Kishore
2020-02-23  2:41     ` Willem de Bruijn
2020-02-28  5:24       ` Yadu Kishore
2020-02-28 14:30         ` Willem de Bruijn
2020-02-28 20:01           ` David Miller
2020-03-02  6:51             ` [PATCH v2] " Yadu Kishore
2020-03-02  8:42               ` Yadu Kishore
2020-03-02 15:19                 ` David Laight
2020-03-03  9:15                   ` Yadu Kishore
2020-03-03  9:56                     ` David Laight
2020-03-05  6:32                       ` Yadu Kishore
2020-03-05 16:06                         ` Willem de Bruijn
2020-03-05 17:00                           ` David Laight
2020-03-05 17:19                             ` Eric Dumazet
2020-03-06 17:12                               ` David Laight
2020-03-13  6:36                                 ` Yadu Kishore
2020-03-13 13:57                                   ` Willem de Bruijn
2020-03-13 14:04                                     ` David Laight
2020-03-13 18:05                                   ` David Miller
2020-03-17  8:38                                     ` [PATCH v3] " Yadu Kishore
2020-03-22  3:06                                       ` David Miller
2020-03-22 14:40                                       ` Willem de Bruijn
2020-03-23 20:00                                       ` David Miller
2020-03-22 19:53                                 ` [PATCH v2] " Tom Herbert
2020-03-23  9:15                                   ` David Laight
2020-02-06 12:13 ` TCP checksum not offloaded during GSO David Laight
