All of lore.kernel.org
 help / color / mirror / Atom feed
* Checksum offload queries
@ 2015-12-07 15:39 Edward Cree
  2015-12-07 17:29 ` Tom Herbert
  2015-12-07 19:38 ` David Miller
  0 siblings, 2 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-07 15:39 UTC (permalink / raw)
  To: netdev; +Cc: Tom Herbert

Having decided to take Dave Miller's advice to push our hardware guys in the direction of generic checksum offload, I found I wasn't quite sure exactly what's being encouraged.  After discussing the subject with a colleague, some questions crystallised.  I expect it's mostly a result of misunderstandings on my part, but here goes:

1) Receive checksums.  Given that CHECKSUM_UNNECESSARY conversion exists (and is a cheap operation), what is the advantage to the stack of using CHECKSUM_COMPLETE if the packet happens to be a protocol which CHECKSUM_UNNECESSARY conversion can handle?  As I see it, CHECKSUM_UNNECESSARY is strictly better as the stack is told "the first csum_level+1 checksums are good" *and* (indirectly) "here is the whole-packet checksum, which you can use to help with anything beyond csum_level+1".  Is it not, then, best for a device only to use CHECKSUM_COMPLETE for protocols the conversion doesn't handle?  (I agree that having that fallback of CHECKSUM_COMPLETE is a good thing, sadly I don't think our new chip does that.  (But maybe firmware can fix it.))

2) Transmit checksums.  While many protocols permit using 0 in the outer checksum, it doesn't seem prudent to assume all will.  Besides, many NICs will still have IP and TCP/UDP checksum offload hardware, if only to support less enlightened operating systems; why not use it?  Would it not be better for a device to have both NETIF_F_HW_CSUM *and* NETIF_F_IP[|V6]_CSUM, and be smart enough to fill in IP checksum, TCP/UDP checksum and one encapsulated checksum of your choice (i.e. whatever csum_start and friends asked for)?  (Again, I agree that having a NETIF_F_IP_CSUM device do specific magic for a list of specific encapsulation protocols is unsatisfactory.  Sadly, guess what our new chip does!  (But maybe firmware can fix it.))

3) Related to the above, what does a NETIF_F_HW_CSUM device do when transmitting an unencapsulated packet (let's say it's UDP) currently?  Will it simply get no checksum offload at all?  Will csum_start point at the regular UDP checksum (and the stack will do the IP header checksum)?  Again, a device that does both HW_ and IP_CSUM could cope with this (do the IP and UDP checksums as per NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though that would require more checksum flags (there's no way for CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore csum_start and friends).

4) Where, precisely, should I tell our hardware guys to stuff the protocol-specific encapsulated checksum offloads they're so proud of having added to our new chip? ;)

--
Edward Cree, not speaking for Solarflare Communications

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-07 15:39 Checksum offload queries Edward Cree
@ 2015-12-07 17:29 ` Tom Herbert
  2015-12-07 17:52   ` Tom Herbert
  2015-12-08 16:03   ` Edward Cree
  2015-12-07 19:38 ` David Miller
  1 sibling, 2 replies; 38+ messages in thread
From: Tom Herbert @ 2015-12-07 17:29 UTC (permalink / raw)
  To: Edward Cree; +Cc: netdev

On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote:
> Having decided to take Dave Miller's advice to push our hardware guys in the direction of generic checksum offload, I found I wasn't quite sure exactly what's being encouraged.  After discussing the subject with a colleague, some questions crystallised.  I expect it's mostly a result of misunderstandings on my part, but here goes:
>

Hi Edward, thanks for looking into this.

> 1) Receive checksums.  Given that CHECKSUM_UNNECESSARY conversion exists (and is a cheap operation), what is the advantage to the stack of using CHECKSUM_COMPLETE if the packet happens to be a protocol which CHECKSUM_UNNECESSARY conversion can handle?  As I see it, CHECKSUM_UNNECESSARY is strictly better as the stack is told "the first csum_level+1 checksums are good" *and* (indirectly) "here is the whole-packet checksum, which you can use to help with anything beyond csum_level+1".  Is it not, then, best for a device only to use CHECKSUM_COMPLETE for protocols the conversion doesn't handle?  (I agree that having that fallback of CHECKSUM_COMPLETE is a good thing, sadly I don't think our new chip does that.  (But maybe firmware can fix it.))
>
Checksum unnecessary conversion works great but it's applicability is
limited. This only helps in encapsulation when the UDP checksum can be
enabled, but due to restrictions of other devices we may need to
communicate with (e.g. Cisco switches) it might not be usable. Also,
checksum conversion is not relevant to many other protocols we want to
run like GRE, IPIP, SIT, MPLS/IP, etc., and does not help with IPv6
extension headers. CHECKSUM_COMPLETE is the really closet thing to a
universal solution.

> 2) Transmit checksums.  While many protocols permit using 0 in the outer checksum, it doesn't seem prudent to assume all will.  Besides, many NICs will still have IP and TCP/UDP checksum offload hardware, if only to support less enlightened operating systems; why not use it?  Would it not be better for a device to have both NETIF_F_HW_CSUM *and* NETIF_F_IP[|V6]_CSUM, and be smart enough to fill in IP checksum, TCP/UDP checksum and one encapsulated checksum of your choice (i.e. whatever csum_start and friends asked for)?  (Again, I agree that having a NETIF_F_IP_CSUM device do specific magic for a list of specific encapsulation protocols is unsatisfactory.  Sadly, guess what our new chip does!  (But maybe firmware can fix it.))
>
It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all
cases of checksum offload and any combination of protocol layering.
NETIF_F_IP[V6]_CSUM is limited and requires a lot of logic in both
driver and HW to implement correctly. I can't help you with these less
enlightening operating systems, maybe if they see the advantages of a
protocol generic offload model they'll "get with the program" as Dave
might say? In any case I do not believe we should be at all
constrained in building Linux interfaces or capabilities by the design
decisions made in other operating systems.

> 3) Related to the above, what does a NETIF_F_HW_CSUM device do when transmitting an unencapsulated packet (let's say it's UDP) currently?  Will it simply get no checksum offload at all?  Will csum_start point at the regular UDP checksum (and the stack will do the IP header checksum)?  Again, a device that does both HW_ and IP_CSUM could cope with this (do the IP and UDP checksums as per NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though that would require more checksum flags (there's no way for CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore csum_start and friends).
>
The NETIF_F_*_CSUM flags only describe the capabilities of the device,
the interface between the stack and the driver to offload the checksum
of a particular packet is solely the csum_start and csum_offset fields
in the skb (non-GSO case). It is up to the driver to decide whether
the particular checksum can actually be offloaded, but in any case it
must set the correct checksum (through skb_checksum_help if
necessary).

> 4) Where, precisely, should I tell our hardware guys to stuff the protocol-specific encapsulated checksum offloads they're so proud of having added to our new chip? ;)

Tell them that if the support generic checksum you're marketing guys
will be able to list at least fifteen protocols in the product specs
that can be checksum offloaded to the device instead of just one or
two like the competition ;-)

Thanks,
Tom
.
>
> --
> Edward Cree, not speaking for Solarflare Communications
> The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-07 17:29 ` Tom Herbert
@ 2015-12-07 17:52   ` Tom Herbert
  2015-12-08 16:03   ` Edward Cree
  1 sibling, 0 replies; 38+ messages in thread
From: Tom Herbert @ 2015-12-07 17:52 UTC (permalink / raw)
  To: Edward Cree; +Cc: netdev

On Mon, Dec 7, 2015 at 9:29 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote:
>> Having decided to take Dave Miller's advice to push our hardware guys in the direction of generic checksum offload, I found I wasn't quite sure exactly what's being encouraged.  After discussing the subject with a colleague, some questions crystallised.  I expect it's mostly a result of misunderstandings on my part, but here goes:
>>
>
> Hi Edward, thanks for looking into this.
>
>> 1) Receive checksums.  Given that CHECKSUM_UNNECESSARY conversion exists (and is a cheap operation), what is the advantage to the stack of using CHECKSUM_COMPLETE if the packet happens to be a protocol which CHECKSUM_UNNECESSARY conversion can handle?  As I see it, CHECKSUM_UNNECESSARY is strictly better as the stack is told "the first csum_level+1 checksums are good" *and* (indirectly) "here is the whole-packet checksum, which you can use to help with anything beyond csum_level+1".  Is it not, then, best for a device only to use CHECKSUM_COMPLETE for protocols the conversion doesn't handle?  (I agree that having that fallback of CHECKSUM_COMPLETE is a good thing, sadly I don't think our new chip does that.  (But maybe firmware can fix it.))
>>
> Checksum unnecessary conversion works great but it's applicability is
> limited. This only helps in encapsulation when the UDP checksum can be
> enabled, but due to restrictions of other devices we may need to
> communicate with (e.g. Cisco switches) it might not be usable. Also,
> checksum conversion is not relevant to many other protocols we want to
> run like GRE, IPIP, SIT, MPLS/IP, etc., and does not help with IPv6
> extension headers. CHECKSUM_COMPLETE is the really closet thing to a
> universal solution.
>
>> 2) Transmit checksums.  While many protocols permit using 0 in the outer checksum, it doesn't seem prudent to assume all will.  Besides, many NICs will still have IP and TCP/UDP checksum offload hardware, if only to support less enlightened operating systems; why not use it?  Would it not be better for a device to have both NETIF_F_HW_CSUM *and* NETIF_F_IP[|V6]_CSUM, and be smart enough to fill in IP checksum, TCP/UDP checksum and one encapsulated checksum of your choice (i.e. whatever csum_start and friends asked for)?  (Again, I agree that having a NETIF_F_IP_CSUM device do specific magic for a list of specific encapsulation protocols is unsatisfactory.  Sadly, guess what our new chip does!  (But maybe firmware can fix it.))
>>
> It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all
> cases of checksum offload and any combination of protocol layering.
> NETIF_F_IP[V6]_CSUM is limited and requires a lot of logic in both
> driver and HW to implement correctly. I can't help you with these less
> enlightening operating systems, maybe if they see the advantages of a
> protocol generic offload model they'll "get with the program" as Dave
> might say? In any case I do not believe we should be at all
> constrained in building Linux interfaces or capabilities by the design
> decisions made in other operating systems.
>
>> 3) Related to the above, what does a NETIF_F_HW_CSUM device do when transmitting an unencapsulated packet (let's say it's UDP) currently?  Will it simply get no checksum offload at all?  Will csum_start point at the regular UDP checksum (and the stack will do the IP header checksum)?  Again, a device that does both HW_ and IP_CSUM could cope with this (do the IP and UDP checksums as per NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though that would require more checksum flags (there's no way for CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore csum_start and friends).
>>
> The NETIF_F_*_CSUM flags only describe the capabilities of the device,
> the interface between the stack and the driver to offload the checksum
> of a particular packet is solely the csum_start and csum_offset fields
> in the skb (non-GSO case). It is up to the driver to decide whether
> the particular checksum can actually be offloaded, but in any case it
> must set the correct checksum (through skb_checksum_help if
> necessary).
>
>> 4) Where, precisely, should I tell our hardware guys to stuff the protocol-specific encapsulated checksum offloads they're so proud of having added to our new chip? ;)
>
> Tell them that if the support generic checksum you're marketing guys
> will be able to list at least fifteen protocols in the product specs
> that can be checksum offloaded to the device instead of just one or
> two like the competition ;-)
>

Here is a list of protocols I came up with that you could claim are
supported for checksum offload with protocol generic mechanisms:

VXLAN
VXLAN-GPE
GUE
Geneve
GRE (over IP or UDP)
MPLS (over IP or UDP)
IPIP
SIT
L2TP (over IP or UDP)
GTP
NVGRE
DCCP
LISP
IPv6 extension headers
PPPoE
ICMP

You mostly can double this list if you want to by list each protocol
individually as being supported for both IPv4 and IPv6.

> Thanks,
> Tom
> .
>>
>> --
>> Edward Cree, not speaking for Solarflare Communications
>> The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-07 15:39 Checksum offload queries Edward Cree
  2015-12-07 17:29 ` Tom Herbert
@ 2015-12-07 19:38 ` David Miller
  2015-12-08 14:42   ` Edward Cree
  1 sibling, 1 reply; 38+ messages in thread
From: David Miller @ 2015-12-07 19:38 UTC (permalink / raw)
  To: ecree; +Cc: netdev, tom

From: Edward Cree <ecree@solarflare.com>
Date: Mon, 7 Dec 2015 15:39:52 +0000

> 1) Receive checksums.  Given that CHECKSUM_UNNECESSARY conversion
> exists (and is a cheap operation), what is the advantage to the
> stack of using CHECKSUM_COMPLETE if the packet happens to be a
> protocol which CHECKSUM_UNNECESSARY conversion can handle?  As I see
> it, CHECKSUM_UNNECESSARY is strictly better as the stack is told
> "the first csum_level+1 checksums are good" *and* (indirectly) "here
> is the whole-packet checksum, which you can use to help with
> anything beyond csum_level+1".  Is it not, then, best for a device
> only to use CHECKSUM_COMPLETE for protocols the conversion doesn't
> handle?  (I agree that having that fallback of CHECKSUM_COMPLETE is
> a good thing, sadly I don't think our new chip does that.  (But
> maybe firmware can fix it.))

No, it is better to universally provide the 1's complement sum for
all receive packets.  This allows the stack more flexibility in
checksum handling.

> 3) Related to the above, what does a NETIF_F_HW_CSUM device do when
> transmitting an unencapsulated packet (let's say it's UDP)
> currently?  Will it simply get no checksum offload at all?  Will
> csum_start point at the regular UDP checksum (and the stack will do
> the IP header checksum)?  Again, a device that does both HW_ and
> IP_CSUM could cope with this (do the IP and UDP checksums as per
> NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though
> that would require more checksum flags (there's no way for
> CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore
> csum_start and friends).

The stack will have skb->csum_start point to the UDP header's checksum
field for unencapsulated packets, and it has done this for decades.

Sun Microsystems had NETIF_F_HW_CSUM supporting NICs nearly two
decades ago, and this is what NETIF_F_HW_CSUM was designed for.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-07 19:38 ` David Miller
@ 2015-12-08 14:42   ` Edward Cree
  2015-12-08 17:04     ` Tom Herbert
  2015-12-08 17:06     ` David Miller
  0 siblings, 2 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-08 14:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Tom Herbert

On 07/12/15 19:38, David Miller wrote:
> No, it is better to universally provide the 1's complement sum for
> all receive packets.  This allows the stack more flexibility in
> checksum handling.
I'm afraid I still don't see it.  If a device can both provide the 1's complement sum _and_ validate some of the checksums in the packet, that should be strictly better than just providing the 1's complement sum - the stack has at least as much information, and less work to do.  And while there is no general way at present for a driver to tell the stack it has done both (and in my opinion there should be such a way), it _is_ possible in the specific case of a UDP packet with the checksum filled in, thanks to CHECKSUM_UNNECESSARY conversion.  So why shouldn't a device (that otherwise gives the full ones complement sum with CHECKSUM_COMPLETE) use CHECKSUM_UNNECESSARY in this specific case?  Is there a flaw in my logic, or is it just that this would be a hack and the Right Thing is to change 
 the interface to let a driver report both pieces of information *directly*?  Or am I wrong for some other reason?
>> 3) Related to the above, what does a NETIF_F_HW_CSUM device do when
>> transmitting an unencapsulated packet
> The stack will have skb->csum_start point to the UDP header's checksum
> field for unencapsulated packets, and it has done this for decades.
>
> Sun Microsystems had NETIF_F_HW_CSUM supporting NICs nearly two
> decades ago, and this is what NETIF_F_HW_CSUM was designed for.
Thanks, that makes more sense now.  Though, does that mean that there's no way in this case to offload the IP header checksum?  (Of course, it's generally much less work than the payload checksum, and it goes away in v6.)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-07 17:29 ` Tom Herbert
  2015-12-07 17:52   ` Tom Herbert
@ 2015-12-08 16:03   ` Edward Cree
  2015-12-08 16:43     ` Tom Herbert
  2015-12-08 17:09     ` David Miller
  1 sibling, 2 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-08 16:03 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev

On 07/12/15 17:29, Tom Herbert wrote:
> On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote:
>> 2) Transmit checksums.
> It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all
> cases of checksum offload and any combination of protocol layering.
... until a protocol combination comes along that doesn't allow all but the innermost checksum to be 0.  NETIF_F_HW_CSUM is limited to one checksum per packet.
Though I suppose if you only care about offloading payload checksums, and are happy to checksum headers in software, you _could_ maybe have the stack (somehow) say "once you've filled in the checksum we specified with start/offset, add each of these constants (in ones complement) and fill into these corresponding locations".  That way, you can compute e.g. the difference between the inner and outer UDP checksums (for the example of UDP over VXLAN) without needing to know the checksum payload.  But it's not clear how you could pass these instructions to the driver without making struct sk_buff bigger (even replacing start/offset with a list head adds a few bytes, plus you have to allocate and free the list elements).  And walking a list might slow down the hardware too much for it to be wor
 thwhile.
> NETIF_F_IP[V6]_CSUM is limited and requires a lot of logic in both
> driver and HW to implement correctly.
But the HW already has the logic, and practically speaking it probably will for a long time to come, and NETIF_F_HW_CSUM _doesn't_ cover everything, only the innermost checksum.  Is there really no way to have both?  Or are outer checksums Officially Not Important?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 16:03   ` Edward Cree
@ 2015-12-08 16:43     ` Tom Herbert
  2015-12-08 18:03       ` Edward Cree
  2015-12-08 17:09     ` David Miller
  1 sibling, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-08 16:43 UTC (permalink / raw)
  To: Edward Cree; +Cc: netdev

On Tue, Dec 8, 2015 at 8:03 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 07/12/15 17:29, Tom Herbert wrote:
>> On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote:
>>> 2) Transmit checksums.
>> It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all
>> cases of checksum offload and any combination of protocol layering.
> ... until a protocol combination comes along that doesn't allow all but the innermost checksum to be 0.  NETIF_F_HW_CSUM is limited to one checksum per packet.

So is NETIF_F_IP[V6]_CSUM, again the interface into the driver to
offload a checksum is csum_start, csum_offset. Only one checksum can
be offloaded at a time. We have no way to tell the driver that there
are two checksums in a packet to be offloaded. Good news is that we
probably don't need that. Extensible encapsulation protocols can
implement Remote Checksum Offload, and this is never an issue on the
receive side since CHECKSUM_COMPLETE can be used to validate and
arbitrary number of checksums.

Tom

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 14:42   ` Edward Cree
@ 2015-12-08 17:04     ` Tom Herbert
  2015-12-09  1:56       ` Thomas Graf
  2015-12-08 17:06     ` David Miller
  1 sibling, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-08 17:04 UTC (permalink / raw)
  To: Edward Cree; +Cc: David Miller, Linux Kernel Network Developers

On Tue, Dec 8, 2015 at 6:42 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 07/12/15 19:38, David Miller wrote:
>> No, it is better to universally provide the 1's complement sum for
>> all receive packets.  This allows the stack more flexibility in
>> checksum handling.
> I'm afraid I still don't see it.  If a device can both provide the 1's complement sum _and_ validate some of the checksums in the packet, that should be strictly better than just providing the 1's complement sum - the stack has at least as much information, and less work to do.  And while there is no general way at present for a driver to tell the stack it has done both (and in my opinion there should be such a way), it _is_ possible in the specific case of a UDP packet with the checksum filled in, thanks to CHECKSUM_UNNECESSARY conversion.  So why shouldn't a device (that otherwise gives the full ones complement sum with CHECKSUM_COMPLETE) use CHECKSUM_UNNECESSARY in this specific case?  Is there a flaw in my logic, or is it just that this would be a hack and the Right Thing is to chang
 e the interface to let a driver report both pieces of information *directly*?  Or am I wrong for some other reason?

The overhead of the stack to process CHECKSUM_COMPLETE is negligible,
we need to pull the checksum in protocol headers as they are process
which means the data is in cache any way (also IPv4 headers don't
needed to be pulled). In my testing I see CPU utilization csum_partial
drop from >6% to <0.5% in comparing computing csum on host and
CHECKSUM_COMPLETE.

There are other reasons why CHECKSUM_COMPLETE is preferable:

- CHECKSUM_COMPLETE  is more robust. We have no way to validate that
the device is actually correct in CHECKSUM_UNNECESSARY. For instance,
how do we know that there isn't some failure in the device where
everything is being marked as good even if it's not. With
CHECKSUM_COMPLETE it is the host that actually makes the decision of
whether the checksum is correct it is highly unlikely that failing
checksum calculation on the device won't be detected. HW failures and
bugs are real concern.
-  CHECKSUM_UNNECESSARY does not report bad checksums. There is a
csum_bad flag in the sk_buff that could be set if the driver detects a
bad checksum in the packet, but no drivers seem to be setting that
currently. So for any packets with bad checksums the stack will need
to compute the checksum itself, so this potentially becomes the basis
of a DDOS attack. CHECKSUM_COMPLETE does not have this problem, we get
the checksum of the packet rather the checksum is correct or not.

Tom

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 14:42   ` Edward Cree
  2015-12-08 17:04     ` Tom Herbert
@ 2015-12-08 17:06     ` David Miller
  2015-12-09 12:14       ` Edward Cree
  1 sibling, 1 reply; 38+ messages in thread
From: David Miller @ 2015-12-08 17:06 UTC (permalink / raw)
  To: ecree; +Cc: netdev, tom

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 8 Dec 2015 14:42:19 +0000

> I'm afraid I still don't see it.

You assume that nothing in the stack mangles the packet and would
benefit from having the 1's complement to work with in order to
assist with checksum adjustments etc.

So we want 1's complement, all the time.

All of the headers get touched anyways as each layer of the stack
demuxes the packet.  Therefore there is _ZERO_ cost to use 1's
complement to validate checksums in input.

It's always superior.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 16:03   ` Edward Cree
  2015-12-08 16:43     ` Tom Herbert
@ 2015-12-08 17:09     ` David Miller
  2015-12-08 17:24       ` Edward Cree
  1 sibling, 1 reply; 38+ messages in thread
From: David Miller @ 2015-12-08 17:09 UTC (permalink / raw)
  To: ecree; +Cc: tom, netdev

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 8 Dec 2015 16:03:18 +0000

> NETIF_F_HW_CSUM is limited to one checksum per packet.

This is false.

As we demux encapsulation layers, we re-adjust the provided 1's
complement sum provided and use that to validate the next layer.

This works for arbitrarily nested encapsulations, and therefore
can support mutliple checksums per packet.

Tom has also created a scheme, called remote checksum offload,
that deals with the corner cases that are actually not able
to be handled fully by NETIF_F_HW_CSUM.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 17:09     ` David Miller
@ 2015-12-08 17:24       ` Edward Cree
  2015-12-08 17:28         ` David Miller
  0 siblings, 1 reply; 38+ messages in thread
From: Edward Cree @ 2015-12-08 17:24 UTC (permalink / raw)
  To: David Miller; +Cc: tom, netdev

On 08/12/15 17:09, David Miller wrote:
> From: Edward Cree <ecree@solarflare.com>
> Date: Tue, 8 Dec 2015 16:03:18 +0000
>
>> NETIF_F_HW_CSUM is limited to one checksum per packet.
> This is false.
>
> As we demux encapsulation layers, we re-adjust the provided 1's
> complement sum provided and use that to validate the next layer.
The context here was transmit, not receive; the stack muxes encapsulation layers and then asks the hardware to fill in one checksum, it doesn't have any way to fill in 'adjusted' checksums in other layers.  The following paragraph (which you snipped) was discussing ways to get this kind of adjusted checksumming on transmit, by feeding the adjustments to the hardware.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 17:24       ` Edward Cree
@ 2015-12-08 17:28         ` David Miller
  0 siblings, 0 replies; 38+ messages in thread
From: David Miller @ 2015-12-08 17:28 UTC (permalink / raw)
  To: ecree; +Cc: tom, netdev

From: Edward Cree <ecree@solarflare.com>
Date: Tue, 8 Dec 2015 17:24:55 +0000

> The context here was transmit, not receive; the stack muxes
> encapsulation layers and then asks the hardware to fill in one
> checksum, it doesn't have any way to fill in 'adjusted' checksums in
> other layers.  The following paragraph (which you snipped) was
> discussing ways to get this kind of adjusted checksumming on
> transmit, by feeding the adjustments to the hardware.

[ Please properly line-wrap your emails, it looks like one very long
  line to everyone on the list. ]

See Tom's reply re: TX.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 16:43     ` Tom Herbert
@ 2015-12-08 18:03       ` Edward Cree
  0 siblings, 0 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-08 18:03 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev

On 08/12/15 16:43, Tom Herbert wrote:
> On Tue, Dec 8, 2015 at 8:03 AM, Edward Cree <ecree@solarflare.com> wrote:
>> On 07/12/15 17:29, Tom Herbert wrote:
>>> On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote:
>>>> 2) Transmit checksums.
>>> It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all
>>> cases of checksum offload and any combination of protocol layering.
>> ... until a protocol combination comes along that doesn't allow all but the innermost checksum to be 0.  NETIF_F_HW_CSUM is limited to one checksum per packet.
> So is NETIF_F_IP[V6]_CSUM, again the interface into the driver to
> offload a checksum is csum_start, csum_offset.
Well, we don't use those values, and I'd be surprised if other NETIF_F_IP[V6]_CSUM drivers did.  The reason why our hardware is restricted to checksumming IP, rather than being able to checksum everything, is that it works out the start and offset itself (and only knows how to do that for IP).  Our transmit descriptors don't include csum_start and csum_offset.  That may mean we're Doing It Wrong, but I don't see why anyone Doing It Right would need to be NETIF_F_IP[V6]_CSUM in the first place, rather than NETIF_F_HW_CSUM.
> Only one checksum can
> be offloaded at a time. We have no way to tell the driver that there
> are two checksums in a packet to be offloaded.
But again, hardware that figures it out for itself doesn't _need_ to be told.  For instance, for a VXLAN-encapsulated TCP packet, our new chip can fill in the IP header checksum, the (outer) UDP checksum, and the (inner) TCP checksum.  (Our previous chip would do the first two, but not the inner checksum.)  I had assumed that NETIF_F_IP_CSUM included IP header checksum offload; now I'm wondering if the stack is already filling that in and our hardware is overwriting it with the same value.  From a quick look at our driver, it looks like that is indeed happening; I guess I should fix that.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 17:04     ` Tom Herbert
@ 2015-12-09  1:56       ` Thomas Graf
  2015-12-09 16:08         ` Tom Herbert
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Graf @ 2015-12-09  1:56 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers

On 12/08/15 at 09:04am, Tom Herbert wrote:
> There are other reasons why CHECKSUM_COMPLETE is preferable:
> 
> - CHECKSUM_COMPLETE  is more robust. We have no way to validate that
> the device is actually correct in CHECKSUM_UNNECESSARY. For instance,
> how do we know that there isn't some failure in the device where
> everything is being marked as good even if it's not. With
> CHECKSUM_COMPLETE it is the host that actually makes the decision of
> whether the checksum is correct it is highly unlikely that failing
> checksum calculation on the device won't be detected. HW failures and
> bugs are real concern.
> -  CHECKSUM_UNNECESSARY does not report bad checksums. There is a
> csum_bad flag in the sk_buff that could be set if the driver detects a
> bad checksum in the packet, but no drivers seem to be setting that
> currently. So for any packets with bad checksums the stack will need
> to compute the checksum itself, so this potentially becomes the basis
> of a DDOS attack. CHECKSUM_COMPLETE does not have this problem, we get
> the checksum of the packet rather the checksum is correct or not.

If I understood Edward correctly, his proposal would be for the
card to provide both, the csum as for CHECKSUM_COMPLETE plus the
validation yes/no hint. It would be up to the kernel to decide
whether to validate itself or trust the card.

I'm all in favour CHECKSUM_COMPLETE as the only way to go but 
we should be aware that it depends on the penetration of RCO in
hardware VTEPs.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-08 17:06     ` David Miller
@ 2015-12-09 12:14       ` Edward Cree
  2015-12-09 16:01         ` Tom Herbert
  0 siblings, 1 reply; 38+ messages in thread
From: Edward Cree @ 2015-12-09 12:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tom

On 08/12/15 17:06, David Miller wrote:
> All of the headers get touched anyways as each layer of the stack > demuxes the packet. Therefore there is _ZERO_ cost to use 1's > complement to validate checksums in input. It's always superior.

Ah, I understand now, it's touching the memory that's expensive, not adding
up the numbers. Thanks for being patient with me :)

So that just leaves the question of offloading multiple _TX_ checksums.
Remote checksum offload would deal with this but the receiver might not
support it, meaning that after decapsulation the packet will have a zero
checksum (and still has chances to get corrupted); thus it's not necessarily
safe to do unless you control both endpoints and know you have RCO
everywhere.
Convincing hardware designers to go the HW_CSUM way and only fill in the
inner checksum, when their current approach can fill in both inner and outer
checksums (though admittedly only for the protocols the hardware knows
about), might be difficult.

Remember that NICs are going to have packet parsing for those protocols
anyway, because they need it for things like Windows-spec RSS, so in the
hardware designer's mind, ossified offloads are "free".

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 12:14       ` Edward Cree
@ 2015-12-09 16:01         ` Tom Herbert
  2015-12-09 17:28           ` Edward Cree
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-09 16:01 UTC (permalink / raw)
  To: Edward Cree; +Cc: David Miller, Linux Kernel Network Developers

On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 08/12/15 17:06, David Miller wrote:
>> All of the headers get touched anyways as each layer of the stack > demuxes the packet. Therefore there is _ZERO_ cost to use 1's > complement to validate checksums in input. It's always superior.
>
> Ah, I understand now, it's touching the memory that's expensive, not adding
> up the numbers. Thanks for being patient with me :)
>
> So that just leaves the question of offloading multiple _TX_ checksums.
> Remote checksum offload would deal with this but the receiver might not
> support it, meaning that after decapsulation the packet will have a zero
> checksum (and still has chances to get corrupted); thus it's not necessarily
> safe to do unless you control both endpoints and know you have RCO
> everywhere.
> Convincing hardware designers to go the HW_CSUM way and only fill in the
> inner checksum, when their current approach can fill in both inner and outer
> checksums (though admittedly only for the protocols the hardware knows
> about), might be difficult.
>
But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM describe
capabilities._not_ the interface. The interface currently allows only
one checksum to be offloaded at time, if we want to be able to offload
two checksums then the interface needs to be changed-- probably
something like defining a new capability like NETIF_F_HW_2CSUMS,
adding another csum_start,csum_offset pair into the sk_buff. The stack
will need to be modified also wherever CHECKSUM_PARTIAL is handled. If
your device is trying do offload more than one checksum on its own
accord without being asked to do so by the stack it is doing the wrong
thing!

> Remember that NICs are going to have packet parsing for those protocols
> anyway, because they need it for things like Windows-spec RSS, so in the
> hardware designer's mind, ossified offloads are "free".

> The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited.

Please stop adding this disclaimer to your messages, it is not
appropriate for the list.

Tom

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09  1:56       ` Thomas Graf
@ 2015-12-09 16:08         ` Tom Herbert
  2015-12-09 22:29           ` Thomas Graf
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-09 16:08 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers

On Tue, Dec 8, 2015 at 5:56 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/08/15 at 09:04am, Tom Herbert wrote:
>> There are other reasons why CHECKSUM_COMPLETE is preferable:
>>
>> - CHECKSUM_COMPLETE  is more robust. We have no way to validate that
>> the device is actually correct in CHECKSUM_UNNECESSARY. For instance,
>> how do we know that there isn't some failure in the device where
>> everything is being marked as good even if it's not. With
>> CHECKSUM_COMPLETE it is the host that actually makes the decision of
>> whether the checksum is correct it is highly unlikely that failing
>> checksum calculation on the device won't be detected. HW failures and
>> bugs are real concern.
>> -  CHECKSUM_UNNECESSARY does not report bad checksums. There is a
>> csum_bad flag in the sk_buff that could be set if the driver detects a
>> bad checksum in the packet, but no drivers seem to be setting that
>> currently. So for any packets with bad checksums the stack will need
>> to compute the checksum itself, so this potentially becomes the basis
>> of a DDOS attack. CHECKSUM_COMPLETE does not have this problem, we get
>> the checksum of the packet rather the checksum is correct or not.
>
> If I understood Edward correctly, his proposal would be for the
> card to provide both, the csum as for CHECKSUM_COMPLETE plus the
> validation yes/no hint. It would be up to the kernel to decide
> whether to validate itself or trust the card.
>
> I'm all in favour CHECKSUM_COMPLETE as the only way to go but
> we should be aware that it depends on the penetration of RCO in
> hardware VTEPs.

Thomas, I don't understand what you are saying here. CHECKSUM_COMPLETE
is an interface for drivers providing the computed checksum of a
packet on receive, how is this dependent on any use case or any other
mechanisms?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 16:01         ` Tom Herbert
@ 2015-12-09 17:28           ` Edward Cree
  2015-12-09 17:31             ` David Laight
                               ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-09 17:28 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: David Miller

On 09/12/15 16:01, Tom Herbert wrote:
> On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> > wrote: >> Convincing hardware designers to go the HW_CSUM way and only fill >> in the inner checksum, when their current approach can fill in >> both inner and outer checksums (though admittedly only for the >> protocols the hardware knows about), might be difficult. >> > But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM describe > capabilities._not_ the interface. The interface currently allows only > one checksum to be offloaded at time, if we want to be able to > offload two checksums then the interface needs to be changed-- > probably something like defining a new capability like > NETIF_F_HW_2CSUMS, adding another csum_start,csum_offset pair into > the sk_buff.
Which only pushes the problem onto when someone wants to nest
encapsulations.  (I heard you like tunnels, so I put a tunnel in your
tunnel so you can encapsulate while you encapsulate.)
Or to put it another way, 2 isn't a number; the only numbers are 0, 1
and infinity ;)
Perhaps in practice 2 csums would be enough, for now.  But isn't the
whole point of the brave new world of generic checksums that it should
be future-proof?

> The stack will need to be modified also wherever CHECKSUM_PARTIAL is > handled.
Naturally.

> If your device is trying do offload more than one checksum on its own > accord without being asked to do so by the stack it is doing the > wrong thing!
>From the stack's perspective: yes, it is doing the wrong thing.  (I've
been discussing with colleagues today how we could change that, and I
think we can, but it involves having _three_ hardware TXQs per kernel
queue, instead of the two we have now...)
But from the outside perspective, the system as a whole isn't doing
anything bad - the packet going on the network is valid and just
happens to have both inner and outer checksums filled in.  Is there a
good reason _why_ the stack forbids a device to do this?  (Sure, it's
not necessary, and makes the hardware more complex.  But the hardware's
already been made, and it's not a *completely* useless thing to do...)

> Please stop adding this disclaimer to your messages, it is not > appropriate for the list.
Actually, the copy that goes the list doesn't have the disclaimer.  But
thanks to a combination of crappy email server and corporate politics,
it still sticks it on any CCs.  If it were up to me (it's not) we
wouldn't add that disclaimer to anything ever.
I'm now attempting (for the nth time) to convince our mgmt to get rid
of the disclaimer, but I don't hold out much hope :(

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: Checksum offload queries
  2015-12-09 17:28           ` Edward Cree
@ 2015-12-09 17:31             ` David Laight
  2015-12-09 18:00             ` Tom Herbert
  2015-12-11 23:50             ` Checksum offload queries Tom Herbert
  2 siblings, 0 replies; 38+ messages in thread
From: David Laight @ 2015-12-09 17:31 UTC (permalink / raw)
  To: 'Edward Cree', Linux Kernel Network Developers; +Cc: David Miller

From: Edward Cree
> Sent: 09 December 2015 17:29

You also need to stop (probably outluck?) from deleting newlines and
flowing the text.

The message below is completely unreadable.

	David


> On 09/12/15 16:01, Tom Herbert wrote:
> > On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> > wrote: >> Convincing hardware
> designers to go the HW_CSUM way and only fill >> in the inner checksum, when their current approach
> can fill in >> both inner and outer checksums (though admittedly only for the >> protocols the
> hardware knows about), might be difficult. >> > But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM
> describe > capabilities._not_ the interface. The interface currently allows only > one checksum to be
> offloaded at time, if we want to be able to > offload two checksums then the interface needs to be
> changed-- > probably something like defining a new capability like > NETIF_F_HW_2CSUMS, adding another
> csum_start,csum_offset pair into > the sk_buff.
> Which only pushes the problem onto when someone wants to nest
> encapsulations.  (I heard you like tunnels, so I put a tunnel in your
> tunnel so you can encapsulate while you encapsulate.)
> Or to put it another way, 2 isn't a number; the only numbers are 0, 1
> and infinity ;)
> Perhaps in practice 2 csums would be enough, for now.  But isn't the
> whole point of the brave new world of generic checksums that it should
> be future-proof?
> 
> > The stack will need to be modified also wherever CHECKSUM_PARTIAL is > handled.
> Naturally.
> 
> > If your device is trying do offload more than one checksum on its own > accord without being asked
> to do so by the stack it is doing the > wrong thing!
> From the stack's perspective: yes, it is doing the wrong thing.  (I've
> been discussing with colleagues today how we could change that, and I
> think we can, but it involves having _three_ hardware TXQs per kernel
> queue, instead of the two we have now...)
> But from the outside perspective, the system as a whole isn't doing
> anything bad - the packet going on the network is valid and just
> happens to have both inner and outer checksums filled in.  Is there a
> good reason _why_ the stack forbids a device to do this?  (Sure, it's
> not necessary, and makes the hardware more complex.  But the hardware's
> already been made, and it's not a *completely* useless thing to do...)
> 
> > Please stop adding this disclaimer to your messages, it is not > appropriate for the list.
> Actually, the copy that goes the list doesn't have the disclaimer.  But
> thanks to a combination of crappy email server and corporate politics,
> it still sticks it on any CCs.  If it were up to me (it's not) we
> wouldn't add that disclaimer to anything ever.
> I'm now attempting (for the nth time) to convince our mgmt to get rid
> of the disclaimer, but I don't hold out much hope :(
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 17:28           ` Edward Cree
  2015-12-09 17:31             ` David Laight
@ 2015-12-09 18:00             ` Tom Herbert
  2015-12-09 22:21               ` Thomas Graf
  2015-12-10 15:49               ` Edward Cree
  2015-12-11 23:50             ` Checksum offload queries Tom Herbert
  2 siblings, 2 replies; 38+ messages in thread
From: Tom Herbert @ 2015-12-09 18:00 UTC (permalink / raw)
  To: Edward Cree; +Cc: Linux Kernel Network Developers, David Miller

On Wed, Dec 9, 2015 at 9:28 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 09/12/15 16:01, Tom Herbert wrote:
>> On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> > wrote: >> Convincing hardware designers to go the HW_CSUM way and only fill >> in the inner checksum, when their current approach can fill in >> both inner and outer checksums (though admittedly only for the >> protocols the hardware knows about), might be difficult. >> > But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM describe > capabilities._not_ the interface. The interface currently allows only > one checksum to be offloaded at time, if we want to be able to > offload two checksums then the interface needs to be changed-- > probably something like defining a new capability like > NETIF_F_HW_2CSUMS, adding another csum_start,csum_offset pair into > the sk_buff.
> Which only pushes the problem onto when someone wants to nest
> encapsulations.  (I heard you like tunnels, so I put a tunnel in your
> tunnel so you can encapsulate while you encapsulate.)
> Or to put it another way, 2 isn't a number; the only numbers are 0, 1
> and infinity ;)
> Perhaps in practice 2 csums would be enough, for now.  But isn't the
> whole point of the brave new world of generic checksums that it should
> be future-proof?
>
If there is a need then we can add an arbitrary number. But no one has
proven there is a need, however we do have a real need for checksum
offload outside of the narrow uses of  NETIF_F_IP[V6]_CSUM.

>> The stack will need to be modified also wherever CHECKSUM_PARTIAL is > handled.
> Naturally.
>
>> If your device is trying do offload more than one checksum on its own > accord without being asked to do so by the stack it is doing the > wrong thing!
> From the stack's perspective: yes, it is doing the wrong thing.  (I've
> been discussing with colleagues today how we could change that, and I
> think we can, but it involves having _three_ hardware TXQs per kernel
> queue, instead of the two we have now...)
> But from the outside perspective, the system as a whole isn't doing
> anything bad - the packet going on the network is valid and just
> happens to have both inner and outer checksums filled in.  Is there a
> good reason _why_ the stack forbids a device to do this?  (Sure, it's
> not necessary, and makes the hardware more complex.  But the hardware's
> already been made, and it's not a *completely* useless thing to do...)
>
That is not at all true. If the stack has set up VXLAN RCO and the
device decides to set the inner checksum itself then the checksum will
be bad. The checksum interface is very specific please read it
carefully (sk_buff.h), if the driver/device thinks it is smarter than
the stack and tries to do set its own rules on how checksum offload
works then things will eventually break miserably.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 18:00             ` Tom Herbert
@ 2015-12-09 22:21               ` Thomas Graf
  2015-12-09 22:42                 ` Tom Herbert
  2015-12-10 15:49               ` Edward Cree
  1 sibling, 1 reply; 38+ messages in thread
From: Thomas Graf @ 2015-12-09 22:21 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller

On 12/09/15 at 10:00am, Tom Herbert wrote:
> On Wed, Dec 9, 2015 at 9:28 AM, Edward Cree <ecree@solarflare.com> wrote:
> > Which only pushes the problem onto when someone wants to nest
> > encapsulations.  (I heard you like tunnels, so I put a tunnel in your
> > tunnel so you can encapsulate while you encapsulate.)
> > Or to put it another way, 2 isn't a number; the only numbers are 0, 1
> > and infinity ;)
> > Perhaps in practice 2 csums would be enough, for now.  But isn't the
> > whole point of the brave new world of generic checksums that it should
> > be future-proof?
> >
> If there is a need then we can add an arbitrary number. But no one has
> proven there is a need, however we do have a real need for checksum
> offload outside of the narrow uses of  NETIF_F_IP[V6]_CSUM.

Need may be a strong word here but people have started doing nested
tunneling by running container orchestration tools which use VXLAN
to isolate containers inside of OpenStack virtual infrastructure which
also creates virtual networks.

I'm not saying it's sane or desirable but we will start seeing nested
tunnels in the wild :-(

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 16:08         ` Tom Herbert
@ 2015-12-09 22:29           ` Thomas Graf
  2015-12-09 22:51             ` Tom Herbert
  0 siblings, 1 reply; 38+ messages in thread
From: Thomas Graf @ 2015-12-09 22:29 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers

On 12/09/15 at 08:08am, Tom Herbert wrote:
> On Tue, Dec 8, 2015 at 5:56 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > If I understood Edward correctly, his proposal would be for the
> > card to provide both, the csum as for CHECKSUM_COMPLETE plus the
> > validation yes/no hint. It would be up to the kernel to decide
> > whether to validate itself or trust the card.
> >
> > I'm all in favour CHECKSUM_COMPLETE as the only way to go but
> > we should be aware that it depends on the penetration of RCO in
> > hardware VTEPs.
> 
> Thomas, I don't understand what you are saying here. CHECKSUM_COMPLETE
> is an interface for drivers providing the computed checksum of a
> packet on receive, how is this dependent on any use case or any other
> mechanisms?

I'm referring to the overall change which comes with a pure
CHECSUM_COMPLETE model which would forbid a NIC to do automatic nested
checksum filling even if untold. RCO solves that to some extent but
only if RCO is widely supported. I'm not aware of anybody relying on
the performance of such hardware yet so I doubt we would create a
performance regression.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 22:21               ` Thomas Graf
@ 2015-12-09 22:42                 ` Tom Herbert
  2015-12-09 22:44                   ` Thomas Graf
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-09 22:42 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller

On Wed, Dec 9, 2015 at 2:21 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/09/15 at 10:00am, Tom Herbert wrote:
>> On Wed, Dec 9, 2015 at 9:28 AM, Edward Cree <ecree@solarflare.com> wrote:
>> > Which only pushes the problem onto when someone wants to nest
>> > encapsulations.  (I heard you like tunnels, so I put a tunnel in your
>> > tunnel so you can encapsulate while you encapsulate.)
>> > Or to put it another way, 2 isn't a number; the only numbers are 0, 1
>> > and infinity ;)
>> > Perhaps in practice 2 csums would be enough, for now.  But isn't the
>> > whole point of the brave new world of generic checksums that it should
>> > be future-proof?
>> >
>> If there is a need then we can add an arbitrary number. But no one has
>> proven there is a need, however we do have a real need for checksum
>> offload outside of the narrow uses of  NETIF_F_IP[V6]_CSUM.
>
> Need may be a strong word here but people have started doing nested
> tunneling by running container orchestration tools which use VXLAN
> to isolate containers inside of OpenStack virtual infrastructure which
> also creates virtual networks.
>
> I'm not saying it's sane or desirable but we will start seeing nested
> tunnels in the wild :-(

csum_start and csum_offset together occupy 32 bits. As demonstrated in
VXLAN RCO we can compress csum_start/csum_offset down to 8 bits which
means if necessary we could get up to four pairs in an sk_buff without
increasing its size. If you need more that four checksums to be
offloaded in single packet then I doubt getting checksum offload to
work is going to be your biggest problem.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 22:42                 ` Tom Herbert
@ 2015-12-09 22:44                   ` Thomas Graf
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Graf @ 2015-12-09 22:44 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller

On 12/09/15 at 02:42pm, Tom Herbert wrote:
> csum_start and csum_offset together occupy 32 bits. As demonstrated in
> VXLAN RCO we can compress csum_start/csum_offset down to 8 bits which
> means if necessary we could get up to four pairs in an sk_buff without
> increasing its size. If you need more that four checksums to be
> offloaded in single packet then I doubt getting checksum offload to
> work is going to be your biggest problem.

Fair point. I wouldn't suggest even creating it without huge demand
first. There should be as many incentives as possible to move or stay
away from such horrible architectures.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 22:29           ` Thomas Graf
@ 2015-12-09 22:51             ` Tom Herbert
  2015-12-09 23:13               ` Thomas Graf
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-09 22:51 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers

On Wed, Dec 9, 2015 at 2:29 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/09/15 at 08:08am, Tom Herbert wrote:
>> On Tue, Dec 8, 2015 at 5:56 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> > If I understood Edward correctly, his proposal would be for the
>> > card to provide both, the csum as for CHECKSUM_COMPLETE plus the
>> > validation yes/no hint. It would be up to the kernel to decide
>> > whether to validate itself or trust the card.
>> >
>> > I'm all in favour CHECKSUM_COMPLETE as the only way to go but
>> > we should be aware that it depends on the penetration of RCO in
>> > hardware VTEPs.
>>
>> Thomas, I don't understand what you are saying here. CHECKSUM_COMPLETE
>> is an interface for drivers providing the computed checksum of a
>> packet on receive, how is this dependent on any use case or any other
>> mechanisms?
>
> I'm referring to the overall change which comes with a pure
> CHECSUM_COMPLETE model which would forbid a NIC to do automatic nested
> checksum filling even if untold. RCO solves that to some extent but
> only if RCO is widely supported. I'm not aware of anybody relying on
> the performance of such hardware yet so I doubt we would create a
> performance regression.
>
I'm sorry, I still don't understand your point. What is "automatic
nested  checksum filling" and how does this relate to RX (e.g.
CHECSUM_COMPLETE).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 22:51             ` Tom Herbert
@ 2015-12-09 23:13               ` Thomas Graf
  0 siblings, 0 replies; 38+ messages in thread
From: Thomas Graf @ 2015-12-09 23:13 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers

On 12/09/15 at 02:51pm, Tom Herbert wrote:
> I'm sorry, I still don't understand your point. What is "automatic
> nested  checksum filling" and how does this relate to RX (e.g.
> CHECSUM_COMPLETE).

Too much compression ;-)

My understanding of the thread was that the desired state is no
checksum validation on RX and no automatic checksum offload on TX
unless explicitly instructed via csum offset. This is what I would
call a CHECKSUM_COMPLETE card (no protocol parsing).

As opposed to a CHECKSUM_UNNECESSARY card which does protocol parsing.
Some do nested checksum offload on TX as we know even if only
instructed to do one of the checksums.

So let's assume everybody goes CHECKSUM_COMPLETE and we have at most
a single level of checksum offload on TX. (As I understand, a
requirement to not break RCO anyway.) RCO would resolve the possible
software checksum performance bottleneck in the scenario of outer and
inner header checksum requirements. While I agree that this is the
case, we need to have support in hardware VTEPs for this in order for
it to be usable. (excluding those which require a checksum 0 anyway)

Multiple nested tunnels would be another beast outside of the scope of
RCO but as I stated in the other email, I don't think we should
proactively solve that.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 18:00             ` Tom Herbert
  2015-12-09 22:21               ` Thomas Graf
@ 2015-12-10 15:49               ` Edward Cree
  2015-12-10 16:26                 ` Tom Herbert
  1 sibling, 1 reply; 38+ messages in thread
From: Edward Cree @ 2015-12-10 15:49 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, David Miller

On 09/12/15 18:00, Tom Herbert wrote:
> That is not at all true. If the stack has set up VXLAN RCO and the > device decides to set the inner checksum itself then the checksum > will be bad. The checksum interface is very specific please read it > carefully (sk_buff.h), if the driver/device thinks it is smarter > than the stack and tries to do set its own rules on how checksum > offload works then things will eventually break miserably.

Ok, I've passed that on to the guy working on this bit of the driver.

It looks like the best way to support the capabilities of NICs like the
sfc 8000 series (which can fill in two checksums but uses packet parsing
to figure out what to do, rather than using csum start/offset) is:

(core / stack)
* add NETIF_F_HW_2CSUMS (or whatever name)
* squeeze a second csum start/offset pair into the skb (as you mention,
 we can do this without size increase)
* Modify (Tx) CHECKSUM_PARTIAL generation to use both csum pairs.
 Presumably by creating CHECKSUM_PARTIAL_2CSUMS to indicate that the
 second csum pair has been filled in as well.

(sfc driver)
* declare 2CSUMS support
* on getting an skb for xmit, check whether the csum pairs match what our
 eeevil packet parsing hardware will do.  If so, send it with appropriate
 csum offload settings (we can enable/disable inner & outer offload
 independently, with TX Option descriptors).  Any csum pair that doesn't
 match, we call skb_checksum_help to do it in software, and tell the
 hardware not to do that one.

Optionally, we could also create NETIF_F_IP[V6]_2CSUMS in the stack and
have our driver advertise that instead, but since there has to be a
fallback to skb_checksum_help in the driver anyway, there doesn't seem
to be much point.

Does that seem reasonable?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-10 15:49               ` Edward Cree
@ 2015-12-10 16:26                 ` Tom Herbert
  2015-12-10 20:28                   ` Edward Cree
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-10 16:26 UTC (permalink / raw)
  To: Edward Cree; +Cc: Linux Kernel Network Developers, David Miller

On Thu, Dec 10, 2015 at 7:49 AM, Edward Cree <ecree@solarflare.com> wrote:
> On 09/12/15 18:00, Tom Herbert wrote:
>> That is not at all true. If the stack has set up VXLAN RCO and the > device decides to set the inner checksum itself then the checksum > will be bad. The checksum interface is very specific please read it > carefully (sk_buff.h), if the driver/device thinks it is smarter > than the stack and tries to do set its own rules on how checksum > offload works then things will eventually break miserably.
>
> Ok, I've passed that on to the guy working on this bit of the driver.
>
> It looks like the best way to support the capabilities of NICs like the
> sfc 8000 series (which can fill in two checksums but uses packet parsing
> to figure out what to do, rather than using csum start/offset) is:
>
> (core / stack)
> * add NETIF_F_HW_2CSUMS (or whatever name)
> * squeeze a second csum start/offset pair into the skb (as you mention,
>  we can do this without size increase)
> * Modify (Tx) CHECKSUM_PARTIAL generation to use both csum pairs.
>  Presumably by creating CHECKSUM_PARTIAL_2CSUMS to indicate that the
>  second csum pair has been filled in as well.
>
> (sfc driver)
> * declare 2CSUMS support
> * on getting an skb for xmit, check whether the csum pairs match what our
>  eeevil packet parsing hardware will do.  If so, send it with appropriate
>  csum offload settings (we can enable/disable inner & outer offload
>  independently, with TX Option descriptors).  Any csum pair that doesn't
>  match, we call skb_checksum_help to do it in software, and tell the
>  hardware not to do that one.
>

> Does that seem reasonable?

It sounds like potentially interesting work. You'll probably want my
patches that provider helper functions that allow a driver to verify
that it can offload a checksum. We'll have to update those also to
allow two checksums.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-10 16:26                 ` Tom Herbert
@ 2015-12-10 20:28                   ` Edward Cree
  2015-12-10 21:02                     ` Rustad, Mark D
  2015-12-14 15:11                     ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree
  0 siblings, 2 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-10 20:28 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, David Miller

On 10/12/15 16:26, Tom Herbert wrote:
> It sounds like potentially interesting work. You'll probably want my patches that provider helper functions that allow a driver to verify that it can offload a checksum. We'll have to update those also to allow two checksums. 
I have just realised something startling.  Assuming the inner protocol uses the ones complement checksum in the way IP, UDP and TCP do, the outer checksum can be computed *without looking at the payload*.  Why?  Because the ones complement sum of (say) a correctly checksummed UDP datagram is simply the complement of the ones complement sum of the pseudo header.  Similarly, the ones complement sum of a correctly checksummed IP header is zero.
Therefore, the outer checksum depends _only_ on the inner and outer pseudo headers and the encapsulation headers.  For example, with UDP encapsulated in VXLAN, we have the following packet structure:
ETH IP UDP VXLAN inner-ETH inner-IP inner-UDP PAYLOAD
and the outer checksum equals
~([outer_pseudo] + [UDP] + [VXLAN] + [inner-ETH] + ~[inner_pseudo])
where [] denotes summation, and all addition is ones complement.
This can easily be computed in software, especially as the stack already has ~[inner_pseudo]: it's stored in the inner checksum field to help inner checksum offload.

Have I made a mistake in my ones-complement maths, or is outer checksum offload as unnecessary as IP header checksum offload?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-10 20:28                   ` Edward Cree
@ 2015-12-10 21:02                     ` Rustad, Mark D
  2015-12-14 15:11                     ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree
  1 sibling, 0 replies; 38+ messages in thread
From: Rustad, Mark D @ 2015-12-10 21:02 UTC (permalink / raw)
  To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers, David Miller

[-- Attachment #1: Type: text/plain, Size: 1774 bytes --]

Edward Cree <ecree@solarflare.com> wrote:

> I have just realised something startling.  Assuming the inner protocol uses the ones complement checksum in the way IP, UDP and TCP do, the outer checksum can be computed *without looking at the payload*.  Why?  Because the ones complement sum of (say) a correctly checksummed UDP datagram is simply the complement of the ones complement sum of the pseudo header.  Similarly, the ones complement sum of a correctly checksummed IP header is zero.
> Therefore, the outer checksum depends _only_ on the inner and outer pseudo headers and the encapsulation headers.  For example, with UDP encapsulated in VXLAN, we have the following packet structure:
> ETH IP UDP VXLAN inner-ETH inner-IP inner-UDP PAYLOAD
> and the outer checksum equals
> ~([outer_pseudo] + [UDP] + [VXLAN] + [inner-ETH] + ~[inner_pseudo])
> where [] denotes summation, and all addition is ones complement.
> This can easily be computed in software, especially as the stack already has ~[inner_pseudo]: it's stored in the inner checksum field to help inner checksum offload.
> 
> Have I made a mistake in my ones-complement maths, or is outer checksum offload as unnecessary as IP header checksum offload?

I agree with the overall observation, in that the outer checksum can be derived from the inner one. I think that the inner-ip header needs to be added (after subtracting out the inner_pseudo as you indicate above), because the entire raw inner IP header needs to be included in the outer checksum. I haven't thought this all through in detail yet. It would be really nice to have a function that implemented something like this. Could one be structured to handle most encapsulations?

--
Mark Rustad, Networking Division, Intel Corporation

[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 841 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-09 17:28           ` Edward Cree
  2015-12-09 17:31             ` David Laight
  2015-12-09 18:00             ` Tom Herbert
@ 2015-12-11 23:50             ` Tom Herbert
  2015-12-12 16:41               ` Sowmini Varadhan
  2 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-11 23:50 UTC (permalink / raw)
  To: Edward Cree; +Cc: Linux Kernel Network Developers, David Miller

>> If your device is trying do offload more than one checksum on its own > accord without being asked to do so by the stack it is doing the > wrong thing!
> From the stack's perspective: yes, it is doing the wrong thing.  (I've
> been discussing with colleagues today how we could change that, and I
> think we can, but it involves having _three_ hardware TXQs per kernel
> queue, instead of the two we have now...)
> But from the outside perspective, the system as a whole isn't doing
> anything bad - the packet going on the network is valid and just
> happens to have both inner and outer checksums filled in.  Is there a
> good reason _why_ the stack forbids a device to do this?  (Sure, it's
> not necessary, and makes the hardware more complex.  But the hardware's
> already been made, and it's not a *completely* useless thing to do...)
>
By the way, I am pretty sure that any driver that supports checksum
offload of encapsulation is going to break at some point with multiple
layers of encapsulation. For example, suppose send a UDP packet in
VXLAN in VXLAN, where packet looks like:
IP|UDP|VXLAN|Eth|IP|UDP|VXLAN|Eth|IP|UDP. I believe the kernel can set
up checksum offload for any of the three UDP checksums. If the outer
checksum is offloaded, then skb->encapsulation is not set so the
driver will know that a plain UDP/IP checksum is being offload-- no
problem. For the other two, skb->encapsulation will be set so driver
will know that one of the encapsulated checksums is being offloaded,
but it cannot tell which one without comparing against csum_start.
Inner headers don't help here since they would likely point to the
outermost encapsulation headers even when checksum being offloaded is
the innermost one. If the driver guesses about which checksum is being
offloaded and its wrong then packet is sent with a bad checksum
(possibly two bad checksums).

Unless I'm reading this incorrectly, it is a real bug. I hope to the
send base helper functions for this next week (the NETIF_F_IP[V6]_CSUM
sunsetting patches). Any drivers that are offloading encapsulated
checksums and don't check csum_start really should be subsequently
fixed.

Tom

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-11 23:50             ` Checksum offload queries Tom Herbert
@ 2015-12-12 16:41               ` Sowmini Varadhan
  2015-12-12 17:24                 ` Tom Herbert
  0 siblings, 1 reply; 38+ messages in thread
From: Sowmini Varadhan @ 2015-12-12 16:41 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller

On (12/11/15 15:50), Tom Herbert wrote:
> layers of encapsulation. For example, suppose send a UDP packet in
> VXLAN in VXLAN, where packet looks like:
> IP|UDP|VXLAN|Eth|IP|UDP|VXLAN|Eth|IP|UDP. I believe the kernel can set

Without detracting from the fact that you have probably found a bug
for checksumming here..

if you have multiple encapsulations above (vxlan over vxlan?)
I think lot of other things like pmtu may also be broken?  (Each
encaps layer lowers the actual application mtu till the thing starts 
to get absurd)

--Sowmini

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Checksum offload queries
  2015-12-12 16:41               ` Sowmini Varadhan
@ 2015-12-12 17:24                 ` Tom Herbert
  0 siblings, 0 replies; 38+ messages in thread
From: Tom Herbert @ 2015-12-12 17:24 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: Edward Cree, Linux Kernel Network Developers, David Miller

On Sat, Dec 12, 2015 at 8:41 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (12/11/15 15:50), Tom Herbert wrote:
>> layers of encapsulation. For example, suppose send a UDP packet in
>> VXLAN in VXLAN, where packet looks like:
>> IP|UDP|VXLAN|Eth|IP|UDP|VXLAN|Eth|IP|UDP. I believe the kernel can set
>
> Without detracting from the fact that you have probably found a bug
> for checksumming here..
>
> if you have multiple encapsulations above (vxlan over vxlan?)
> I think lot of other things like pmtu may also be broken?  (Each
> encaps layer lowers the actual application mtu till the thing starts
> to get absurd)
>
PMTU is going to be a headache with any encapsulation, multiple layers
just makes the headache worse. This is one of the reasons why ILA is
better for network virtualization-- no PMTU issue since there is no
encapsulation being done.

> --Sowmini
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC PATCH net-next 0/2] Local checksum offload for VXLAN
  2015-12-10 20:28                   ` Edward Cree
  2015-12-10 21:02                     ` Rustad, Mark D
@ 2015-12-14 15:11                     ` Edward Cree
  2015-12-14 15:13                       ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree
  2015-12-14 15:13                       ` [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices Edward Cree
  1 sibling, 2 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-14 15:11 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, David Miller

When the inner packet checksum is offloaded, the outer UDP checksum is easy
 to calculate as it doesn't depend on the payload (because the inner checksum
 cancels out everything from the inner packet except the pseudo header).
Thus, transmit checksums for VXLAN (and in principle other encapsulations,
 but I haven't enabled it for / tested with those) can be offloaded on any
 device supporting NETIF_F_HW_CSUM.  Only the innermost checksum has to be
 offloaded, the rest are filled in by the stack.
Tested by hacking a driver to report NETIF_F_HW_CSUM, call skb_checksum_help
 before transmitting a packet, and not actually offload anything to the hw.
 I did it that way because I don't have any hw that can actually offload the
 inner checksum; but I should be able to get hold of some soon.

Edward Cree (2):
  net: udp: local checksum offload for encapsulation
  net: vxlan: enable local checksum offload on HW_CSUM devices

 drivers/net/vxlan.c |  5 ++++-
 net/ipv4/udp.c      | 31 +++++++++++++++++++++++++++----
 2 files changed, 31 insertions(+), 5 deletions(-)

-- 
2.4.3

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/2] net: udp: local checksum offload for encapsulation
  2015-12-14 15:11                     ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree
@ 2015-12-14 15:13                       ` Edward Cree
  2015-12-14 17:16                         ` Tom Herbert
  2015-12-14 15:13                       ` [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices Edward Cree
  1 sibling, 1 reply; 38+ messages in thread
From: Edward Cree @ 2015-12-14 15:13 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, David Miller

The arithmetic properties of the ones-complement checksum mean that a
 correctly checksummed inner packet, including its checksum, has a ones
 complement sum depending only on whatever value was used to initialise
 the checksum field before checksumming (in the case of TCP and UDP,
 this is the ones complement sum of the pseudo header, complemented).
Consequently, if we are going to offload the inner checksum with
 CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
 packed data not covered by the inner checksum, and the initial value of
 the inner checksum field.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 net/ipv4/udp.c | 31 +++++++++++++++++++++++++++----
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0c7b0e6..07d679e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -767,12 +767,35 @@ void udp_set_csum(bool nocheck, struct sk_buff *skb,
 {
 	struct udphdr *uh = udp_hdr(skb);
 
-	if (nocheck)
+	if (nocheck) {
 		uh->check = 0;
-	else if (skb_is_gso(skb))
+	} else if (skb_is_gso(skb)) {
 		uh->check = ~udp_v4_check(len, saddr, daddr, 0);
-	else if (skb_dst(skb) && skb_dst(skb)->dev &&
-		 (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) {
+	} else if (skb->ip_summed == CHECKSUM_PARTIAL &&
+		   skb_dst(skb) && skb_dst(skb)->dev &&
+		   (skb_dst(skb)->dev->features & NETIF_F_HW_CSUM)) {
+		/* Everything from csum_start onwards will be
+		 * checksummed and will thus have a sum of whatever
+		 * we previously put in the checksum field (eg. sum
+		 * of pseudo-header)
+		 */
+		__wsum csum;
+
+		/* Fill in our pseudo-header checksum */
+		uh->check = ~udp_v4_check(len, saddr, daddr, 0);
+		/* Start with complement of inner pseudo-header checksum */
+		csum = ~skb_checksum(skb, skb_checksum_start_offset(skb) + skb->csum_offset,
+				     2, 0);
+		/* Add in checksum of our headers (incl. pseudo-header
+		 * checksum filled in above)
+		 */
+		csum = skb_checksum(skb, 0, skb_checksum_start_offset(skb), csum);
+		/* The result is the outer checksum */
+		uh->check = csum_fold(csum);
+		if (uh->check == 0)
+			uh->check = CSUM_MANGLED_0;
+	} else if (skb_dst(skb) && skb_dst(skb)->dev &&
+		   (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) {
 
 		BUG_ON(skb->ip_summed == CHECKSUM_PARTIAL);
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices
  2015-12-14 15:11                     ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree
  2015-12-14 15:13                       ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree
@ 2015-12-14 15:13                       ` Edward Cree
  1 sibling, 0 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-14 15:13 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, David Miller

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/vxlan.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 6369a57..c1660d6 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1785,6 +1785,9 @@ static int vxlan_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *sk
 	bool udp_sum = !!(vxflags & VXLAN_F_UDP_CSUM);
 	int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
 	u16 hdrlen = sizeof(struct vxlanhdr);
+	/* Is device able to do the inner checksum? */
+	bool inner_csum = skb_dst(skb) && skb_dst(skb)->dev &&
+				(skb_dst(skb)->dev->features & NETIF_F_HW_CSUM);
 
 	if ((vxflags & VXLAN_F_REMCSUM_TX) &&
 	    skb->ip_summed == CHECKSUM_PARTIAL) {
@@ -1814,7 +1817,7 @@ static int vxlan_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *sk
 	if (WARN_ON(!skb))
 		return -ENOMEM;
 
-	skb = iptunnel_handle_offloads(skb, udp_sum, type);
+	skb = iptunnel_handle_offloads(skb, udp_sum && !inner_csum, type);
 	if (IS_ERR(skb))
 		return PTR_ERR(skb);
 
-- 
2.4.3

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/2] net: udp: local checksum offload for encapsulation
  2015-12-14 15:13                       ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree
@ 2015-12-14 17:16                         ` Tom Herbert
  2015-12-15 18:07                           ` Edward Cree
  0 siblings, 1 reply; 38+ messages in thread
From: Tom Herbert @ 2015-12-14 17:16 UTC (permalink / raw)
  To: Edward Cree; +Cc: netdev, David Miller

On Mon, Dec 14, 2015 at 7:13 AM, Edward Cree <ecree@solarflare.com> wrote:
> The arithmetic properties of the ones-complement checksum mean that a
>  correctly checksummed inner packet, including its checksum, has a ones
>  complement sum depending only on whatever value was used to initialise
>  the checksum field before checksumming (in the case of TCP and UDP,
>  this is the ones complement sum of the pseudo header, complemented).
> Consequently, if we are going to offload the inner checksum with
>  CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
>  packed data not covered by the inner checksum, and the initial value of
>  the inner checksum field.
>
> Signed-off-by: Edward Cree <ecree@solarflare.com>
> ---
>  net/ipv4/udp.c | 31 +++++++++++++++++++++++++++----
>  1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 0c7b0e6..07d679e 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -767,12 +767,35 @@ void udp_set_csum(bool nocheck, struct sk_buff *skb,
>  {
>         struct udphdr *uh = udp_hdr(skb);
>
> -       if (nocheck)
> +       if (nocheck) {
>                 uh->check = 0;
> -       else if (skb_is_gso(skb))
> +       } else if (skb_is_gso(skb)) {
>                 uh->check = ~udp_v4_check(len, saddr, daddr, 0);
> -       else if (skb_dst(skb) && skb_dst(skb)->dev &&
> -                (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) {
> +       } else if (skb->ip_summed == CHECKSUM_PARTIAL &&
> +                  skb_dst(skb) && skb_dst(skb)->dev &&
> +                  (skb_dst(skb)->dev->features & NETIF_F_HW_CSUM)) {
> +               /* Everything from csum_start onwards will be
> +                * checksummed and will thus have a sum of whatever
> +                * we previously put in the checksum field (eg. sum
> +                * of pseudo-header)
> +                */
> +               __wsum csum;
> +
> +               /* Fill in our pseudo-header checksum */
> +               uh->check = ~udp_v4_check(len, saddr, daddr, 0);
> +               /* Start with complement of inner pseudo-header checksum */
> +               csum = ~skb_checksum(skb, skb_checksum_start_offset(skb) + skb->csum_offset,
> +                                    2, 0);
> +               /* Add in checksum of our headers (incl. pseudo-header
> +                * checksum filled in above)
> +                */
> +               csum = skb_checksum(skb, 0, skb_checksum_start_offset(skb), csum);
> +               /* The result is the outer checksum */
> +               uh->check = csum_fold(csum);
> +               if (uh->check == 0)
> +                       uh->check = CSUM_MANGLED_0;
> +       } else if (skb_dst(skb) && skb_dst(skb)->dev &&
> +                  (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) {
>
It's clever, but I'm not sure this saves much. The outer checksum
could still be offloaded to the device without the extra work. Where
this technique would be nice is if the device doesn't support checksum
offload at all, then we would definitely avoid doing multiple
checksums. That's going to be harder since we won't see
CHECKSUM_PARTIAL in that case for the inner checksum, but it would get
us to the principle that we only ever calculate the packet checksum
once or zero times.
.

>                 BUG_ON(skb->ip_summed == CHECKSUM_PARTIAL);
>
> --
> 2.4.3
>
>
> The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/2] net: udp: local checksum offload for encapsulation
  2015-12-14 17:16                         ` Tom Herbert
@ 2015-12-15 18:07                           ` Edward Cree
  0 siblings, 0 replies; 38+ messages in thread
From: Edward Cree @ 2015-12-15 18:07 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev, David Miller

On 14/12/15 17:16, Tom Herbert wrote:
> It's clever, but I'm not sure this saves much. The outer checksum
> could still be offloaded to the device without the extra work.
The main thing it saves you is having to fit a second csum_start/offset pair
 into (a) the SKB and (b) the TX descriptor.  Also, it will (at least in
 principle) support arbitrarily deep nesting of encapsulation.  (Whether
 that's a remotely sensible thing to do is another matter!)

Also, local checksum offload probably helps by simplifying TSO.
I've just been talking to our firmware team and we concluded that if the
 outer checksum (of the superpacket) is filled in by the stack, datapath
 firmware on the sfc 8000 series could fold appropriate corrections (to
 account for changes in outer length and the length fields in both pseudo
 headers) into the outer checksum when performing TSO, which may (we
 haven't benchmarked it yet!) improve performance, compared to computing the
 whole outer checksum from scratch.
> Where
> this technique would be nice is if the device doesn't support checksum
> offload at all, then we would definitely avoid doing multiple
> checksums. That's going to be harder since we won't see
> CHECKSUM_PARTIAL in that case for the inner checksum, but it would get
> us to the principle that we only ever calculate the packet checksum
> once or zero times.
Presumably in that case the inner checksum would be marked as
 CHECKSUM_UNNECESSARY, and the code that fills in the inner checksum could
 also fill in skb->csum_start and skb->csum_offset.  Then in udp_set_csum()
 we could just treat it exactly the same as CHECKSUM_PARTIAL.  Then we could
 update start and offset to refer to the checksum we've just filled in, so
 that any nested encap doesn't have to re-checksum our header.
I have no objections to that, but I don't think I grok the rest of the stack
 deeply enough to implement it.

In the meantime, it seems worthwhile getting the CHECKSUM_PARTIAL version
 in, and working any bugs out of it; do you have any objections to my patch
 series as it stands or should I repost without RFC tags?  I notice it's
 marked "Changes Requested" in patchwork but it's not entirely clear to me
 what changes are wanted.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2015-12-15 18:07 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-07 15:39 Checksum offload queries Edward Cree
2015-12-07 17:29 ` Tom Herbert
2015-12-07 17:52   ` Tom Herbert
2015-12-08 16:03   ` Edward Cree
2015-12-08 16:43     ` Tom Herbert
2015-12-08 18:03       ` Edward Cree
2015-12-08 17:09     ` David Miller
2015-12-08 17:24       ` Edward Cree
2015-12-08 17:28         ` David Miller
2015-12-07 19:38 ` David Miller
2015-12-08 14:42   ` Edward Cree
2015-12-08 17:04     ` Tom Herbert
2015-12-09  1:56       ` Thomas Graf
2015-12-09 16:08         ` Tom Herbert
2015-12-09 22:29           ` Thomas Graf
2015-12-09 22:51             ` Tom Herbert
2015-12-09 23:13               ` Thomas Graf
2015-12-08 17:06     ` David Miller
2015-12-09 12:14       ` Edward Cree
2015-12-09 16:01         ` Tom Herbert
2015-12-09 17:28           ` Edward Cree
2015-12-09 17:31             ` David Laight
2015-12-09 18:00             ` Tom Herbert
2015-12-09 22:21               ` Thomas Graf
2015-12-09 22:42                 ` Tom Herbert
2015-12-09 22:44                   ` Thomas Graf
2015-12-10 15:49               ` Edward Cree
2015-12-10 16:26                 ` Tom Herbert
2015-12-10 20:28                   ` Edward Cree
2015-12-10 21:02                     ` Rustad, Mark D
2015-12-14 15:11                     ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree
2015-12-14 15:13                       ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree
2015-12-14 17:16                         ` Tom Herbert
2015-12-15 18:07                           ` Edward Cree
2015-12-14 15:13                       ` [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices Edward Cree
2015-12-11 23:50             ` Checksum offload queries Tom Herbert
2015-12-12 16:41               ` Sowmini Varadhan
2015-12-12 17:24                 ` Tom Herbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.