* Checksum offload queries @ 2015-12-07 15:39 Edward Cree 2015-12-07 17:29 ` Tom Herbert 2015-12-07 19:38 ` David Miller 0 siblings, 2 replies; 38+ messages in thread From: Edward Cree @ 2015-12-07 15:39 UTC (permalink / raw) To: netdev; +Cc: Tom Herbert Having decided to take Dave Miller's advice to push our hardware guys in the direction of generic checksum offload, I found I wasn't quite sure exactly what's being encouraged. After discussing the subject with a colleague, some questions crystallised. I expect it's mostly a result of misunderstandings on my part, but here goes: 1) Receive checksums. Given that CHECKSUM_UNNECESSARY conversion exists (and is a cheap operation), what is the advantage to the stack of using CHECKSUM_COMPLETE if the packet happens to be a protocol which CHECKSUM_UNNECESSARY conversion can handle? As I see it, CHECKSUM_UNNECESSARY is strictly better as the stack is told "the first csum_level+1 checksums are good" *and* (indirectly) "here is the whole-packet checksum, which you can use to help with anything beyond csum_level+1". Is it not, then, best for a device only to use CHECKSUM_COMPLETE for protocols the conversion doesn't handle? (I agree that having that fallback of CHECKSUM_COMPLETE is a good thing, sadly I don't think our new chip does that. (But maybe firmware can fix it.)) 2) Transmit checksums. While many protocols permit using 0 in the outer checksum, it doesn't seem prudent to assume all will. Besides, many NICs will still have IP and TCP/UDP checksum offload hardware, if only to support less enlightened operating systems; why not use it? Would it not be better for a device to have both NETIF_F_HW_CSUM *and* NETIF_F_IP[|V6]_CSUM, and be smart enough to fill in IP checksum, TCP/UDP checksum and one encapsulated checksum of your choice (i.e. whatever csum_start and friends asked for)? (Again, I agree that having a NETIF_F_IP_CSUM device do specific magic for a list of specific encapsulation protocols is unsatisfactory. Sadly, guess what our new chip does! (But maybe firmware can fix it.)) 3) Related to the above, what does a NETIF_F_HW_CSUM device do when transmitting an unencapsulated packet (let's say it's UDP) currently? Will it simply get no checksum offload at all? Will csum_start point at the regular UDP checksum (and the stack will do the IP header checksum)? Again, a device that does both HW_ and IP_CSUM could cope with this (do the IP and UDP checksums as per NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though that would require more checksum flags (there's no way for CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore csum_start and friends). 4) Where, precisely, should I tell our hardware guys to stuff the protocol-specific encapsulated checksum offloads they're so proud of having added to our new chip? ;) -- Edward Cree, not speaking for Solarflare Communications ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-07 15:39 Checksum offload queries Edward Cree @ 2015-12-07 17:29 ` Tom Herbert 2015-12-07 17:52 ` Tom Herbert 2015-12-08 16:03 ` Edward Cree 2015-12-07 19:38 ` David Miller 1 sibling, 2 replies; 38+ messages in thread From: Tom Herbert @ 2015-12-07 17:29 UTC (permalink / raw) To: Edward Cree; +Cc: netdev On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote: > Having decided to take Dave Miller's advice to push our hardware guys in the direction of generic checksum offload, I found I wasn't quite sure exactly what's being encouraged. After discussing the subject with a colleague, some questions crystallised. I expect it's mostly a result of misunderstandings on my part, but here goes: > Hi Edward, thanks for looking into this. > 1) Receive checksums. Given that CHECKSUM_UNNECESSARY conversion exists (and is a cheap operation), what is the advantage to the stack of using CHECKSUM_COMPLETE if the packet happens to be a protocol which CHECKSUM_UNNECESSARY conversion can handle? As I see it, CHECKSUM_UNNECESSARY is strictly better as the stack is told "the first csum_level+1 checksums are good" *and* (indirectly) "here is the whole-packet checksum, which you can use to help with anything beyond csum_level+1". Is it not, then, best for a device only to use CHECKSUM_COMPLETE for protocols the conversion doesn't handle? (I agree that having that fallback of CHECKSUM_COMPLETE is a good thing, sadly I don't think our new chip does that. (But maybe firmware can fix it.)) > Checksum unnecessary conversion works great but it's applicability is limited. This only helps in encapsulation when the UDP checksum can be enabled, but due to restrictions of other devices we may need to communicate with (e.g. Cisco switches) it might not be usable. Also, checksum conversion is not relevant to many other protocols we want to run like GRE, IPIP, SIT, MPLS/IP, etc., and does not help with IPv6 extension headers. CHECKSUM_COMPLETE is the really closet thing to a universal solution. > 2) Transmit checksums. While many protocols permit using 0 in the outer checksum, it doesn't seem prudent to assume all will. Besides, many NICs will still have IP and TCP/UDP checksum offload hardware, if only to support less enlightened operating systems; why not use it? Would it not be better for a device to have both NETIF_F_HW_CSUM *and* NETIF_F_IP[|V6]_CSUM, and be smart enough to fill in IP checksum, TCP/UDP checksum and one encapsulated checksum of your choice (i.e. whatever csum_start and friends asked for)? (Again, I agree that having a NETIF_F_IP_CSUM device do specific magic for a list of specific encapsulation protocols is unsatisfactory. Sadly, guess what our new chip does! (But maybe firmware can fix it.)) > It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all cases of checksum offload and any combination of protocol layering. NETIF_F_IP[V6]_CSUM is limited and requires a lot of logic in both driver and HW to implement correctly. I can't help you with these less enlightening operating systems, maybe if they see the advantages of a protocol generic offload model they'll "get with the program" as Dave might say? In any case I do not believe we should be at all constrained in building Linux interfaces or capabilities by the design decisions made in other operating systems. > 3) Related to the above, what does a NETIF_F_HW_CSUM device do when transmitting an unencapsulated packet (let's say it's UDP) currently? Will it simply get no checksum offload at all? Will csum_start point at the regular UDP checksum (and the stack will do the IP header checksum)? Again, a device that does both HW_ and IP_CSUM could cope with this (do the IP and UDP checksums as per NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though that would require more checksum flags (there's no way for CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore csum_start and friends). > The NETIF_F_*_CSUM flags only describe the capabilities of the device, the interface between the stack and the driver to offload the checksum of a particular packet is solely the csum_start and csum_offset fields in the skb (non-GSO case). It is up to the driver to decide whether the particular checksum can actually be offloaded, but in any case it must set the correct checksum (through skb_checksum_help if necessary). > 4) Where, precisely, should I tell our hardware guys to stuff the protocol-specific encapsulated checksum offloads they're so proud of having added to our new chip? ;) Tell them that if the support generic checksum you're marketing guys will be able to list at least fifteen protocols in the product specs that can be checksum offloaded to the device instead of just one or two like the competition ;-) Thanks, Tom . > > -- > Edward Cree, not speaking for Solarflare Communications > The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-07 17:29 ` Tom Herbert @ 2015-12-07 17:52 ` Tom Herbert 2015-12-08 16:03 ` Edward Cree 1 sibling, 0 replies; 38+ messages in thread From: Tom Herbert @ 2015-12-07 17:52 UTC (permalink / raw) To: Edward Cree; +Cc: netdev On Mon, Dec 7, 2015 at 9:29 AM, Tom Herbert <tom@herbertland.com> wrote: > On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote: >> Having decided to take Dave Miller's advice to push our hardware guys in the direction of generic checksum offload, I found I wasn't quite sure exactly what's being encouraged. After discussing the subject with a colleague, some questions crystallised. I expect it's mostly a result of misunderstandings on my part, but here goes: >> > > Hi Edward, thanks for looking into this. > >> 1) Receive checksums. Given that CHECKSUM_UNNECESSARY conversion exists (and is a cheap operation), what is the advantage to the stack of using CHECKSUM_COMPLETE if the packet happens to be a protocol which CHECKSUM_UNNECESSARY conversion can handle? As I see it, CHECKSUM_UNNECESSARY is strictly better as the stack is told "the first csum_level+1 checksums are good" *and* (indirectly) "here is the whole-packet checksum, which you can use to help with anything beyond csum_level+1". Is it not, then, best for a device only to use CHECKSUM_COMPLETE for protocols the conversion doesn't handle? (I agree that having that fallback of CHECKSUM_COMPLETE is a good thing, sadly I don't think our new chip does that. (But maybe firmware can fix it.)) >> > Checksum unnecessary conversion works great but it's applicability is > limited. This only helps in encapsulation when the UDP checksum can be > enabled, but due to restrictions of other devices we may need to > communicate with (e.g. Cisco switches) it might not be usable. Also, > checksum conversion is not relevant to many other protocols we want to > run like GRE, IPIP, SIT, MPLS/IP, etc., and does not help with IPv6 > extension headers. CHECKSUM_COMPLETE is the really closet thing to a > universal solution. > >> 2) Transmit checksums. While many protocols permit using 0 in the outer checksum, it doesn't seem prudent to assume all will. Besides, many NICs will still have IP and TCP/UDP checksum offload hardware, if only to support less enlightened operating systems; why not use it? Would it not be better for a device to have both NETIF_F_HW_CSUM *and* NETIF_F_IP[|V6]_CSUM, and be smart enough to fill in IP checksum, TCP/UDP checksum and one encapsulated checksum of your choice (i.e. whatever csum_start and friends asked for)? (Again, I agree that having a NETIF_F_IP_CSUM device do specific magic for a list of specific encapsulation protocols is unsatisfactory. Sadly, guess what our new chip does! (But maybe firmware can fix it.)) >> > It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all > cases of checksum offload and any combination of protocol layering. > NETIF_F_IP[V6]_CSUM is limited and requires a lot of logic in both > driver and HW to implement correctly. I can't help you with these less > enlightening operating systems, maybe if they see the advantages of a > protocol generic offload model they'll "get with the program" as Dave > might say? In any case I do not believe we should be at all > constrained in building Linux interfaces or capabilities by the design > decisions made in other operating systems. > >> 3) Related to the above, what does a NETIF_F_HW_CSUM device do when transmitting an unencapsulated packet (let's say it's UDP) currently? Will it simply get no checksum offload at all? Will csum_start point at the regular UDP checksum (and the stack will do the IP header checksum)? Again, a device that does both HW_ and IP_CSUM could cope with this (do the IP and UDP checksums as per NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though that would require more checksum flags (there's no way for CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore csum_start and friends). >> > The NETIF_F_*_CSUM flags only describe the capabilities of the device, > the interface between the stack and the driver to offload the checksum > of a particular packet is solely the csum_start and csum_offset fields > in the skb (non-GSO case). It is up to the driver to decide whether > the particular checksum can actually be offloaded, but in any case it > must set the correct checksum (through skb_checksum_help if > necessary). > >> 4) Where, precisely, should I tell our hardware guys to stuff the protocol-specific encapsulated checksum offloads they're so proud of having added to our new chip? ;) > > Tell them that if the support generic checksum you're marketing guys > will be able to list at least fifteen protocols in the product specs > that can be checksum offloaded to the device instead of just one or > two like the competition ;-) > Here is a list of protocols I came up with that you could claim are supported for checksum offload with protocol generic mechanisms: VXLAN VXLAN-GPE GUE Geneve GRE (over IP or UDP) MPLS (over IP or UDP) IPIP SIT L2TP (over IP or UDP) GTP NVGRE DCCP LISP IPv6 extension headers PPPoE ICMP You mostly can double this list if you want to by list each protocol individually as being supported for both IPv4 and IPv6. > Thanks, > Tom > . >> >> -- >> Edward Cree, not speaking for Solarflare Communications >> The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-07 17:29 ` Tom Herbert 2015-12-07 17:52 ` Tom Herbert @ 2015-12-08 16:03 ` Edward Cree 2015-12-08 16:43 ` Tom Herbert 2015-12-08 17:09 ` David Miller 1 sibling, 2 replies; 38+ messages in thread From: Edward Cree @ 2015-12-08 16:03 UTC (permalink / raw) To: Tom Herbert; +Cc: netdev On 07/12/15 17:29, Tom Herbert wrote: > On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote: >> 2) Transmit checksums. > It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all > cases of checksum offload and any combination of protocol layering. ... until a protocol combination comes along that doesn't allow all but the innermost checksum to be 0. NETIF_F_HW_CSUM is limited to one checksum per packet. Though I suppose if you only care about offloading payload checksums, and are happy to checksum headers in software, you _could_ maybe have the stack (somehow) say "once you've filled in the checksum we specified with start/offset, add each of these constants (in ones complement) and fill into these corresponding locations". That way, you can compute e.g. the difference between the inner and outer UDP checksums (for the example of UDP over VXLAN) without needing to know the checksum payload. But it's not clear how you could pass these instructions to the driver without making struct sk_buff bigger (even replacing start/offset with a list head adds a few bytes, plus you have to allocate and free the list elements). And walking a list might slow down the hardware too much for it to be wor thwhile. > NETIF_F_IP[V6]_CSUM is limited and requires a lot of logic in both > driver and HW to implement correctly. But the HW already has the logic, and practically speaking it probably will for a long time to come, and NETIF_F_HW_CSUM _doesn't_ cover everything, only the innermost checksum. Is there really no way to have both? Or are outer checksums Officially Not Important? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 16:03 ` Edward Cree @ 2015-12-08 16:43 ` Tom Herbert 2015-12-08 18:03 ` Edward Cree 2015-12-08 17:09 ` David Miller 1 sibling, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-08 16:43 UTC (permalink / raw) To: Edward Cree; +Cc: netdev On Tue, Dec 8, 2015 at 8:03 AM, Edward Cree <ecree@solarflare.com> wrote: > On 07/12/15 17:29, Tom Herbert wrote: >> On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote: >>> 2) Transmit checksums. >> It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all >> cases of checksum offload and any combination of protocol layering. > ... until a protocol combination comes along that doesn't allow all but the innermost checksum to be 0. NETIF_F_HW_CSUM is limited to one checksum per packet. So is NETIF_F_IP[V6]_CSUM, again the interface into the driver to offload a checksum is csum_start, csum_offset. Only one checksum can be offloaded at a time. We have no way to tell the driver that there are two checksums in a packet to be offloaded. Good news is that we probably don't need that. Extensible encapsulation protocols can implement Remote Checksum Offload, and this is never an issue on the receive side since CHECKSUM_COMPLETE can be used to validate and arbitrary number of checksums. Tom ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 16:43 ` Tom Herbert @ 2015-12-08 18:03 ` Edward Cree 0 siblings, 0 replies; 38+ messages in thread From: Edward Cree @ 2015-12-08 18:03 UTC (permalink / raw) To: Tom Herbert; +Cc: netdev On 08/12/15 16:43, Tom Herbert wrote: > On Tue, Dec 8, 2015 at 8:03 AM, Edward Cree <ecree@solarflare.com> wrote: >> On 07/12/15 17:29, Tom Herbert wrote: >>> On Mon, Dec 7, 2015 at 7:39 AM, Edward Cree <ecree@solarflare.com> wrote: >>>> 2) Transmit checksums. >>> It's analogous to CHECKSUM_COMPLETE, NETIF_F_HW_CSUM works for all >>> cases of checksum offload and any combination of protocol layering. >> ... until a protocol combination comes along that doesn't allow all but the innermost checksum to be 0. NETIF_F_HW_CSUM is limited to one checksum per packet. > So is NETIF_F_IP[V6]_CSUM, again the interface into the driver to > offload a checksum is csum_start, csum_offset. Well, we don't use those values, and I'd be surprised if other NETIF_F_IP[V6]_CSUM drivers did. The reason why our hardware is restricted to checksumming IP, rather than being able to checksum everything, is that it works out the start and offset itself (and only knows how to do that for IP). Our transmit descriptors don't include csum_start and csum_offset. That may mean we're Doing It Wrong, but I don't see why anyone Doing It Right would need to be NETIF_F_IP[V6]_CSUM in the first place, rather than NETIF_F_HW_CSUM. > Only one checksum can > be offloaded at a time. We have no way to tell the driver that there > are two checksums in a packet to be offloaded. But again, hardware that figures it out for itself doesn't _need_ to be told. For instance, for a VXLAN-encapsulated TCP packet, our new chip can fill in the IP header checksum, the (outer) UDP checksum, and the (inner) TCP checksum. (Our previous chip would do the first two, but not the inner checksum.) I had assumed that NETIF_F_IP_CSUM included IP header checksum offload; now I'm wondering if the stack is already filling that in and our hardware is overwriting it with the same value. From a quick look at our driver, it looks like that is indeed happening; I guess I should fix that. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 16:03 ` Edward Cree 2015-12-08 16:43 ` Tom Herbert @ 2015-12-08 17:09 ` David Miller 2015-12-08 17:24 ` Edward Cree 1 sibling, 1 reply; 38+ messages in thread From: David Miller @ 2015-12-08 17:09 UTC (permalink / raw) To: ecree; +Cc: tom, netdev From: Edward Cree <ecree@solarflare.com> Date: Tue, 8 Dec 2015 16:03:18 +0000 > NETIF_F_HW_CSUM is limited to one checksum per packet. This is false. As we demux encapsulation layers, we re-adjust the provided 1's complement sum provided and use that to validate the next layer. This works for arbitrarily nested encapsulations, and therefore can support mutliple checksums per packet. Tom has also created a scheme, called remote checksum offload, that deals with the corner cases that are actually not able to be handled fully by NETIF_F_HW_CSUM. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 17:09 ` David Miller @ 2015-12-08 17:24 ` Edward Cree 2015-12-08 17:28 ` David Miller 0 siblings, 1 reply; 38+ messages in thread From: Edward Cree @ 2015-12-08 17:24 UTC (permalink / raw) To: David Miller; +Cc: tom, netdev On 08/12/15 17:09, David Miller wrote: > From: Edward Cree <ecree@solarflare.com> > Date: Tue, 8 Dec 2015 16:03:18 +0000 > >> NETIF_F_HW_CSUM is limited to one checksum per packet. > This is false. > > As we demux encapsulation layers, we re-adjust the provided 1's > complement sum provided and use that to validate the next layer. The context here was transmit, not receive; the stack muxes encapsulation layers and then asks the hardware to fill in one checksum, it doesn't have any way to fill in 'adjusted' checksums in other layers. The following paragraph (which you snipped) was discussing ways to get this kind of adjusted checksumming on transmit, by feeding the adjustments to the hardware. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 17:24 ` Edward Cree @ 2015-12-08 17:28 ` David Miller 0 siblings, 0 replies; 38+ messages in thread From: David Miller @ 2015-12-08 17:28 UTC (permalink / raw) To: ecree; +Cc: tom, netdev From: Edward Cree <ecree@solarflare.com> Date: Tue, 8 Dec 2015 17:24:55 +0000 > The context here was transmit, not receive; the stack muxes > encapsulation layers and then asks the hardware to fill in one > checksum, it doesn't have any way to fill in 'adjusted' checksums in > other layers. The following paragraph (which you snipped) was > discussing ways to get this kind of adjusted checksumming on > transmit, by feeding the adjustments to the hardware. [ Please properly line-wrap your emails, it looks like one very long line to everyone on the list. ] See Tom's reply re: TX. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-07 15:39 Checksum offload queries Edward Cree 2015-12-07 17:29 ` Tom Herbert @ 2015-12-07 19:38 ` David Miller 2015-12-08 14:42 ` Edward Cree 1 sibling, 1 reply; 38+ messages in thread From: David Miller @ 2015-12-07 19:38 UTC (permalink / raw) To: ecree; +Cc: netdev, tom From: Edward Cree <ecree@solarflare.com> Date: Mon, 7 Dec 2015 15:39:52 +0000 > 1) Receive checksums. Given that CHECKSUM_UNNECESSARY conversion > exists (and is a cheap operation), what is the advantage to the > stack of using CHECKSUM_COMPLETE if the packet happens to be a > protocol which CHECKSUM_UNNECESSARY conversion can handle? As I see > it, CHECKSUM_UNNECESSARY is strictly better as the stack is told > "the first csum_level+1 checksums are good" *and* (indirectly) "here > is the whole-packet checksum, which you can use to help with > anything beyond csum_level+1". Is it not, then, best for a device > only to use CHECKSUM_COMPLETE for protocols the conversion doesn't > handle? (I agree that having that fallback of CHECKSUM_COMPLETE is > a good thing, sadly I don't think our new chip does that. (But > maybe firmware can fix it.)) No, it is better to universally provide the 1's complement sum for all receive packets. This allows the stack more flexibility in checksum handling. > 3) Related to the above, what does a NETIF_F_HW_CSUM device do when > transmitting an unencapsulated packet (let's say it's UDP) > currently? Will it simply get no checksum offload at all? Will > csum_start point at the regular UDP checksum (and the stack will do > the IP header checksum)? Again, a device that does both HW_ and > IP_CSUM could cope with this (do the IP and UDP checksums as per > NETIF_F_IP_CSUM, and just don't ask for a 'generic' HW_CSUM), though > that would require more checksum flags (there's no way for > CHECKSUM_PARTIAL to say "do your IP-specific stuff but ignore > csum_start and friends). The stack will have skb->csum_start point to the UDP header's checksum field for unencapsulated packets, and it has done this for decades. Sun Microsystems had NETIF_F_HW_CSUM supporting NICs nearly two decades ago, and this is what NETIF_F_HW_CSUM was designed for. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-07 19:38 ` David Miller @ 2015-12-08 14:42 ` Edward Cree 2015-12-08 17:04 ` Tom Herbert 2015-12-08 17:06 ` David Miller 0 siblings, 2 replies; 38+ messages in thread From: Edward Cree @ 2015-12-08 14:42 UTC (permalink / raw) To: David Miller; +Cc: netdev, Tom Herbert On 07/12/15 19:38, David Miller wrote: > No, it is better to universally provide the 1's complement sum for > all receive packets. This allows the stack more flexibility in > checksum handling. I'm afraid I still don't see it. If a device can both provide the 1's complement sum _and_ validate some of the checksums in the packet, that should be strictly better than just providing the 1's complement sum - the stack has at least as much information, and less work to do. And while there is no general way at present for a driver to tell the stack it has done both (and in my opinion there should be such a way), it _is_ possible in the specific case of a UDP packet with the checksum filled in, thanks to CHECKSUM_UNNECESSARY conversion. So why shouldn't a device (that otherwise gives the full ones complement sum with CHECKSUM_COMPLETE) use CHECKSUM_UNNECESSARY in this specific case? Is there a flaw in my logic, or is it just that this would be a hack and the Right Thing is to change the interface to let a driver report both pieces of information *directly*? Or am I wrong for some other reason? >> 3) Related to the above, what does a NETIF_F_HW_CSUM device do when >> transmitting an unencapsulated packet > The stack will have skb->csum_start point to the UDP header's checksum > field for unencapsulated packets, and it has done this for decades. > > Sun Microsystems had NETIF_F_HW_CSUM supporting NICs nearly two > decades ago, and this is what NETIF_F_HW_CSUM was designed for. Thanks, that makes more sense now. Though, does that mean that there's no way in this case to offload the IP header checksum? (Of course, it's generally much less work than the payload checksum, and it goes away in v6.) ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 14:42 ` Edward Cree @ 2015-12-08 17:04 ` Tom Herbert 2015-12-09 1:56 ` Thomas Graf 2015-12-08 17:06 ` David Miller 1 sibling, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-08 17:04 UTC (permalink / raw) To: Edward Cree; +Cc: David Miller, Linux Kernel Network Developers On Tue, Dec 8, 2015 at 6:42 AM, Edward Cree <ecree@solarflare.com> wrote: > On 07/12/15 19:38, David Miller wrote: >> No, it is better to universally provide the 1's complement sum for >> all receive packets. This allows the stack more flexibility in >> checksum handling. > I'm afraid I still don't see it. If a device can both provide the 1's complement sum _and_ validate some of the checksums in the packet, that should be strictly better than just providing the 1's complement sum - the stack has at least as much information, and less work to do. And while there is no general way at present for a driver to tell the stack it has done both (and in my opinion there should be such a way), it _is_ possible in the specific case of a UDP packet with the checksum filled in, thanks to CHECKSUM_UNNECESSARY conversion. So why shouldn't a device (that otherwise gives the full ones complement sum with CHECKSUM_COMPLETE) use CHECKSUM_UNNECESSARY in this specific case? Is there a flaw in my logic, or is it just that this would be a hack and the Right Thing is to chang e the interface to let a driver report both pieces of information *directly*? Or am I wrong for some other reason? The overhead of the stack to process CHECKSUM_COMPLETE is negligible, we need to pull the checksum in protocol headers as they are process which means the data is in cache any way (also IPv4 headers don't needed to be pulled). In my testing I see CPU utilization csum_partial drop from >6% to <0.5% in comparing computing csum on host and CHECKSUM_COMPLETE. There are other reasons why CHECKSUM_COMPLETE is preferable: - CHECKSUM_COMPLETE is more robust. We have no way to validate that the device is actually correct in CHECKSUM_UNNECESSARY. For instance, how do we know that there isn't some failure in the device where everything is being marked as good even if it's not. With CHECKSUM_COMPLETE it is the host that actually makes the decision of whether the checksum is correct it is highly unlikely that failing checksum calculation on the device won't be detected. HW failures and bugs are real concern. - CHECKSUM_UNNECESSARY does not report bad checksums. There is a csum_bad flag in the sk_buff that could be set if the driver detects a bad checksum in the packet, but no drivers seem to be setting that currently. So for any packets with bad checksums the stack will need to compute the checksum itself, so this potentially becomes the basis of a DDOS attack. CHECKSUM_COMPLETE does not have this problem, we get the checksum of the packet rather the checksum is correct or not. Tom ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 17:04 ` Tom Herbert @ 2015-12-09 1:56 ` Thomas Graf 2015-12-09 16:08 ` Tom Herbert 0 siblings, 1 reply; 38+ messages in thread From: Thomas Graf @ 2015-12-09 1:56 UTC (permalink / raw) To: Tom Herbert; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers On 12/08/15 at 09:04am, Tom Herbert wrote: > There are other reasons why CHECKSUM_COMPLETE is preferable: > > - CHECKSUM_COMPLETE is more robust. We have no way to validate that > the device is actually correct in CHECKSUM_UNNECESSARY. For instance, > how do we know that there isn't some failure in the device where > everything is being marked as good even if it's not. With > CHECKSUM_COMPLETE it is the host that actually makes the decision of > whether the checksum is correct it is highly unlikely that failing > checksum calculation on the device won't be detected. HW failures and > bugs are real concern. > - CHECKSUM_UNNECESSARY does not report bad checksums. There is a > csum_bad flag in the sk_buff that could be set if the driver detects a > bad checksum in the packet, but no drivers seem to be setting that > currently. So for any packets with bad checksums the stack will need > to compute the checksum itself, so this potentially becomes the basis > of a DDOS attack. CHECKSUM_COMPLETE does not have this problem, we get > the checksum of the packet rather the checksum is correct or not. If I understood Edward correctly, his proposal would be for the card to provide both, the csum as for CHECKSUM_COMPLETE plus the validation yes/no hint. It would be up to the kernel to decide whether to validate itself or trust the card. I'm all in favour CHECKSUM_COMPLETE as the only way to go but we should be aware that it depends on the penetration of RCO in hardware VTEPs. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 1:56 ` Thomas Graf @ 2015-12-09 16:08 ` Tom Herbert 2015-12-09 22:29 ` Thomas Graf 0 siblings, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-09 16:08 UTC (permalink / raw) To: Thomas Graf; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers On Tue, Dec 8, 2015 at 5:56 PM, Thomas Graf <tgraf@suug.ch> wrote: > On 12/08/15 at 09:04am, Tom Herbert wrote: >> There are other reasons why CHECKSUM_COMPLETE is preferable: >> >> - CHECKSUM_COMPLETE is more robust. We have no way to validate that >> the device is actually correct in CHECKSUM_UNNECESSARY. For instance, >> how do we know that there isn't some failure in the device where >> everything is being marked as good even if it's not. With >> CHECKSUM_COMPLETE it is the host that actually makes the decision of >> whether the checksum is correct it is highly unlikely that failing >> checksum calculation on the device won't be detected. HW failures and >> bugs are real concern. >> - CHECKSUM_UNNECESSARY does not report bad checksums. There is a >> csum_bad flag in the sk_buff that could be set if the driver detects a >> bad checksum in the packet, but no drivers seem to be setting that >> currently. So for any packets with bad checksums the stack will need >> to compute the checksum itself, so this potentially becomes the basis >> of a DDOS attack. CHECKSUM_COMPLETE does not have this problem, we get >> the checksum of the packet rather the checksum is correct or not. > > If I understood Edward correctly, his proposal would be for the > card to provide both, the csum as for CHECKSUM_COMPLETE plus the > validation yes/no hint. It would be up to the kernel to decide > whether to validate itself or trust the card. > > I'm all in favour CHECKSUM_COMPLETE as the only way to go but > we should be aware that it depends on the penetration of RCO in > hardware VTEPs. Thomas, I don't understand what you are saying here. CHECKSUM_COMPLETE is an interface for drivers providing the computed checksum of a packet on receive, how is this dependent on any use case or any other mechanisms? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 16:08 ` Tom Herbert @ 2015-12-09 22:29 ` Thomas Graf 2015-12-09 22:51 ` Tom Herbert 0 siblings, 1 reply; 38+ messages in thread From: Thomas Graf @ 2015-12-09 22:29 UTC (permalink / raw) To: Tom Herbert; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers On 12/09/15 at 08:08am, Tom Herbert wrote: > On Tue, Dec 8, 2015 at 5:56 PM, Thomas Graf <tgraf@suug.ch> wrote: > > If I understood Edward correctly, his proposal would be for the > > card to provide both, the csum as for CHECKSUM_COMPLETE plus the > > validation yes/no hint. It would be up to the kernel to decide > > whether to validate itself or trust the card. > > > > I'm all in favour CHECKSUM_COMPLETE as the only way to go but > > we should be aware that it depends on the penetration of RCO in > > hardware VTEPs. > > Thomas, I don't understand what you are saying here. CHECKSUM_COMPLETE > is an interface for drivers providing the computed checksum of a > packet on receive, how is this dependent on any use case or any other > mechanisms? I'm referring to the overall change which comes with a pure CHECSUM_COMPLETE model which would forbid a NIC to do automatic nested checksum filling even if untold. RCO solves that to some extent but only if RCO is widely supported. I'm not aware of anybody relying on the performance of such hardware yet so I doubt we would create a performance regression. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 22:29 ` Thomas Graf @ 2015-12-09 22:51 ` Tom Herbert 2015-12-09 23:13 ` Thomas Graf 0 siblings, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-09 22:51 UTC (permalink / raw) To: Thomas Graf; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers On Wed, Dec 9, 2015 at 2:29 PM, Thomas Graf <tgraf@suug.ch> wrote: > On 12/09/15 at 08:08am, Tom Herbert wrote: >> On Tue, Dec 8, 2015 at 5:56 PM, Thomas Graf <tgraf@suug.ch> wrote: >> > If I understood Edward correctly, his proposal would be for the >> > card to provide both, the csum as for CHECKSUM_COMPLETE plus the >> > validation yes/no hint. It would be up to the kernel to decide >> > whether to validate itself or trust the card. >> > >> > I'm all in favour CHECKSUM_COMPLETE as the only way to go but >> > we should be aware that it depends on the penetration of RCO in >> > hardware VTEPs. >> >> Thomas, I don't understand what you are saying here. CHECKSUM_COMPLETE >> is an interface for drivers providing the computed checksum of a >> packet on receive, how is this dependent on any use case or any other >> mechanisms? > > I'm referring to the overall change which comes with a pure > CHECSUM_COMPLETE model which would forbid a NIC to do automatic nested > checksum filling even if untold. RCO solves that to some extent but > only if RCO is widely supported. I'm not aware of anybody relying on > the performance of such hardware yet so I doubt we would create a > performance regression. > I'm sorry, I still don't understand your point. What is "automatic nested checksum filling" and how does this relate to RX (e.g. CHECSUM_COMPLETE). ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 22:51 ` Tom Herbert @ 2015-12-09 23:13 ` Thomas Graf 0 siblings, 0 replies; 38+ messages in thread From: Thomas Graf @ 2015-12-09 23:13 UTC (permalink / raw) To: Tom Herbert; +Cc: Edward Cree, David Miller, Linux Kernel Network Developers On 12/09/15 at 02:51pm, Tom Herbert wrote: > I'm sorry, I still don't understand your point. What is "automatic > nested checksum filling" and how does this relate to RX (e.g. > CHECSUM_COMPLETE). Too much compression ;-) My understanding of the thread was that the desired state is no checksum validation on RX and no automatic checksum offload on TX unless explicitly instructed via csum offset. This is what I would call a CHECKSUM_COMPLETE card (no protocol parsing). As opposed to a CHECKSUM_UNNECESSARY card which does protocol parsing. Some do nested checksum offload on TX as we know even if only instructed to do one of the checksums. So let's assume everybody goes CHECKSUM_COMPLETE and we have at most a single level of checksum offload on TX. (As I understand, a requirement to not break RCO anyway.) RCO would resolve the possible software checksum performance bottleneck in the scenario of outer and inner header checksum requirements. While I agree that this is the case, we need to have support in hardware VTEPs for this in order for it to be usable. (excluding those which require a checksum 0 anyway) Multiple nested tunnels would be another beast outside of the scope of RCO but as I stated in the other email, I don't think we should proactively solve that. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 14:42 ` Edward Cree 2015-12-08 17:04 ` Tom Herbert @ 2015-12-08 17:06 ` David Miller 2015-12-09 12:14 ` Edward Cree 1 sibling, 1 reply; 38+ messages in thread From: David Miller @ 2015-12-08 17:06 UTC (permalink / raw) To: ecree; +Cc: netdev, tom From: Edward Cree <ecree@solarflare.com> Date: Tue, 8 Dec 2015 14:42:19 +0000 > I'm afraid I still don't see it. You assume that nothing in the stack mangles the packet and would benefit from having the 1's complement to work with in order to assist with checksum adjustments etc. So we want 1's complement, all the time. All of the headers get touched anyways as each layer of the stack demuxes the packet. Therefore there is _ZERO_ cost to use 1's complement to validate checksums in input. It's always superior. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-08 17:06 ` David Miller @ 2015-12-09 12:14 ` Edward Cree 2015-12-09 16:01 ` Tom Herbert 0 siblings, 1 reply; 38+ messages in thread From: Edward Cree @ 2015-12-09 12:14 UTC (permalink / raw) To: David Miller; +Cc: netdev, tom On 08/12/15 17:06, David Miller wrote: > All of the headers get touched anyways as each layer of the stack > demuxes the packet. Therefore there is _ZERO_ cost to use 1's > complement to validate checksums in input. It's always superior. Ah, I understand now, it's touching the memory that's expensive, not adding up the numbers. Thanks for being patient with me :) So that just leaves the question of offloading multiple _TX_ checksums. Remote checksum offload would deal with this but the receiver might not support it, meaning that after decapsulation the packet will have a zero checksum (and still has chances to get corrupted); thus it's not necessarily safe to do unless you control both endpoints and know you have RCO everywhere. Convincing hardware designers to go the HW_CSUM way and only fill in the inner checksum, when their current approach can fill in both inner and outer checksums (though admittedly only for the protocols the hardware knows about), might be difficult. Remember that NICs are going to have packet parsing for those protocols anyway, because they need it for things like Windows-spec RSS, so in the hardware designer's mind, ossified offloads are "free". ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 12:14 ` Edward Cree @ 2015-12-09 16:01 ` Tom Herbert 2015-12-09 17:28 ` Edward Cree 0 siblings, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-09 16:01 UTC (permalink / raw) To: Edward Cree; +Cc: David Miller, Linux Kernel Network Developers On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> wrote: > On 08/12/15 17:06, David Miller wrote: >> All of the headers get touched anyways as each layer of the stack > demuxes the packet. Therefore there is _ZERO_ cost to use 1's > complement to validate checksums in input. It's always superior. > > Ah, I understand now, it's touching the memory that's expensive, not adding > up the numbers. Thanks for being patient with me :) > > So that just leaves the question of offloading multiple _TX_ checksums. > Remote checksum offload would deal with this but the receiver might not > support it, meaning that after decapsulation the packet will have a zero > checksum (and still has chances to get corrupted); thus it's not necessarily > safe to do unless you control both endpoints and know you have RCO > everywhere. > Convincing hardware designers to go the HW_CSUM way and only fill in the > inner checksum, when their current approach can fill in both inner and outer > checksums (though admittedly only for the protocols the hardware knows > about), might be difficult. > But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM describe capabilities._not_ the interface. The interface currently allows only one checksum to be offloaded at time, if we want to be able to offload two checksums then the interface needs to be changed-- probably something like defining a new capability like NETIF_F_HW_2CSUMS, adding another csum_start,csum_offset pair into the sk_buff. The stack will need to be modified also wherever CHECKSUM_PARTIAL is handled. If your device is trying do offload more than one checksum on its own accord without being asked to do so by the stack it is doing the wrong thing! > Remember that NICs are going to have packet parsing for those protocols > anyway, because they need it for things like Windows-spec RSS, so in the > hardware designer's mind, ossified offloads are "free". > The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited. Please stop adding this disclaimer to your messages, it is not appropriate for the list. Tom ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 16:01 ` Tom Herbert @ 2015-12-09 17:28 ` Edward Cree 2015-12-09 17:31 ` David Laight ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Edward Cree @ 2015-12-09 17:28 UTC (permalink / raw) To: Linux Kernel Network Developers; +Cc: David Miller On 09/12/15 16:01, Tom Herbert wrote: > On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> > wrote: >> Convincing hardware designers to go the HW_CSUM way and only fill >> in the inner checksum, when their current approach can fill in >> both inner and outer checksums (though admittedly only for the >> protocols the hardware knows about), might be difficult. >> > But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM describe > capabilities._not_ the interface. The interface currently allows only > one checksum to be offloaded at time, if we want to be able to > offload two checksums then the interface needs to be changed-- > probably something like defining a new capability like > NETIF_F_HW_2CSUMS, adding another csum_start,csum_offset pair into > the sk_buff. Which only pushes the problem onto when someone wants to nest encapsulations. (I heard you like tunnels, so I put a tunnel in your tunnel so you can encapsulate while you encapsulate.) Or to put it another way, 2 isn't a number; the only numbers are 0, 1 and infinity ;) Perhaps in practice 2 csums would be enough, for now. But isn't the whole point of the brave new world of generic checksums that it should be future-proof? > The stack will need to be modified also wherever CHECKSUM_PARTIAL is > handled. Naturally. > If your device is trying do offload more than one checksum on its own > accord without being asked to do so by the stack it is doing the > wrong thing! >From the stack's perspective: yes, it is doing the wrong thing. (I've been discussing with colleagues today how we could change that, and I think we can, but it involves having _three_ hardware TXQs per kernel queue, instead of the two we have now...) But from the outside perspective, the system as a whole isn't doing anything bad - the packet going on the network is valid and just happens to have both inner and outer checksums filled in. Is there a good reason _why_ the stack forbids a device to do this? (Sure, it's not necessary, and makes the hardware more complex. But the hardware's already been made, and it's not a *completely* useless thing to do...) > Please stop adding this disclaimer to your messages, it is not > appropriate for the list. Actually, the copy that goes the list doesn't have the disclaimer. But thanks to a combination of crappy email server and corporate politics, it still sticks it on any CCs. If it were up to me (it's not) we wouldn't add that disclaimer to anything ever. I'm now attempting (for the nth time) to convince our mgmt to get rid of the disclaimer, but I don't hold out much hope :( ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: Checksum offload queries 2015-12-09 17:28 ` Edward Cree @ 2015-12-09 17:31 ` David Laight 2015-12-09 18:00 ` Tom Herbert 2015-12-11 23:50 ` Checksum offload queries Tom Herbert 2 siblings, 0 replies; 38+ messages in thread From: David Laight @ 2015-12-09 17:31 UTC (permalink / raw) To: 'Edward Cree', Linux Kernel Network Developers; +Cc: David Miller From: Edward Cree > Sent: 09 December 2015 17:29 You also need to stop (probably outluck?) from deleting newlines and flowing the text. The message below is completely unreadable. David > On 09/12/15 16:01, Tom Herbert wrote: > > On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> > wrote: >> Convincing hardware > designers to go the HW_CSUM way and only fill >> in the inner checksum, when their current approach > can fill in >> both inner and outer checksums (though admittedly only for the >> protocols the > hardware knows about), might be difficult. >> > But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM > describe > capabilities._not_ the interface. The interface currently allows only > one checksum to be > offloaded at time, if we want to be able to > offload two checksums then the interface needs to be > changed-- > probably something like defining a new capability like > NETIF_F_HW_2CSUMS, adding another > csum_start,csum_offset pair into > the sk_buff. > Which only pushes the problem onto when someone wants to nest > encapsulations. (I heard you like tunnels, so I put a tunnel in your > tunnel so you can encapsulate while you encapsulate.) > Or to put it another way, 2 isn't a number; the only numbers are 0, 1 > and infinity ;) > Perhaps in practice 2 csums would be enough, for now. But isn't the > whole point of the brave new world of generic checksums that it should > be future-proof? > > > The stack will need to be modified also wherever CHECKSUM_PARTIAL is > handled. > Naturally. > > > If your device is trying do offload more than one checksum on its own > accord without being asked > to do so by the stack it is doing the > wrong thing! > From the stack's perspective: yes, it is doing the wrong thing. (I've > been discussing with colleagues today how we could change that, and I > think we can, but it involves having _three_ hardware TXQs per kernel > queue, instead of the two we have now...) > But from the outside perspective, the system as a whole isn't doing > anything bad - the packet going on the network is valid and just > happens to have both inner and outer checksums filled in. Is there a > good reason _why_ the stack forbids a device to do this? (Sure, it's > not necessary, and makes the hardware more complex. But the hardware's > already been made, and it's not a *completely* useless thing to do...) > > > Please stop adding this disclaimer to your messages, it is not > appropriate for the list. > Actually, the copy that goes the list doesn't have the disclaimer. But > thanks to a combination of crappy email server and corporate politics, > it still sticks it on any CCs. If it were up to me (it's not) we > wouldn't add that disclaimer to anything ever. > I'm now attempting (for the nth time) to convince our mgmt to get rid > of the disclaimer, but I don't hold out much hope :( > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 17:28 ` Edward Cree 2015-12-09 17:31 ` David Laight @ 2015-12-09 18:00 ` Tom Herbert 2015-12-09 22:21 ` Thomas Graf 2015-12-10 15:49 ` Edward Cree 2015-12-11 23:50 ` Checksum offload queries Tom Herbert 2 siblings, 2 replies; 38+ messages in thread From: Tom Herbert @ 2015-12-09 18:00 UTC (permalink / raw) To: Edward Cree; +Cc: Linux Kernel Network Developers, David Miller On Wed, Dec 9, 2015 at 9:28 AM, Edward Cree <ecree@solarflare.com> wrote: > On 09/12/15 16:01, Tom Herbert wrote: >> On Wed, Dec 9, 2015 at 4:14 AM, Edward Cree <ecree@solarflare.com> > wrote: >> Convincing hardware designers to go the HW_CSUM way and only fill >> in the inner checksum, when their current approach can fill in >> both inner and outer checksums (though admittedly only for the >> protocols the hardware knows about), might be difficult. >> > But again, NETIF_F_IP[V6]_CSUM and NETIF_F_HW_CSUM describe > capabilities._not_ the interface. The interface currently allows only > one checksum to be offloaded at time, if we want to be able to > offload two checksums then the interface needs to be changed-- > probably something like defining a new capability like > NETIF_F_HW_2CSUMS, adding another csum_start,csum_offset pair into > the sk_buff. > Which only pushes the problem onto when someone wants to nest > encapsulations. (I heard you like tunnels, so I put a tunnel in your > tunnel so you can encapsulate while you encapsulate.) > Or to put it another way, 2 isn't a number; the only numbers are 0, 1 > and infinity ;) > Perhaps in practice 2 csums would be enough, for now. But isn't the > whole point of the brave new world of generic checksums that it should > be future-proof? > If there is a need then we can add an arbitrary number. But no one has proven there is a need, however we do have a real need for checksum offload outside of the narrow uses of NETIF_F_IP[V6]_CSUM. >> The stack will need to be modified also wherever CHECKSUM_PARTIAL is > handled. > Naturally. > >> If your device is trying do offload more than one checksum on its own > accord without being asked to do so by the stack it is doing the > wrong thing! > From the stack's perspective: yes, it is doing the wrong thing. (I've > been discussing with colleagues today how we could change that, and I > think we can, but it involves having _three_ hardware TXQs per kernel > queue, instead of the two we have now...) > But from the outside perspective, the system as a whole isn't doing > anything bad - the packet going on the network is valid and just > happens to have both inner and outer checksums filled in. Is there a > good reason _why_ the stack forbids a device to do this? (Sure, it's > not necessary, and makes the hardware more complex. But the hardware's > already been made, and it's not a *completely* useless thing to do...) > That is not at all true. If the stack has set up VXLAN RCO and the device decides to set the inner checksum itself then the checksum will be bad. The checksum interface is very specific please read it carefully (sk_buff.h), if the driver/device thinks it is smarter than the stack and tries to do set its own rules on how checksum offload works then things will eventually break miserably. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 18:00 ` Tom Herbert @ 2015-12-09 22:21 ` Thomas Graf 2015-12-09 22:42 ` Tom Herbert 2015-12-10 15:49 ` Edward Cree 1 sibling, 1 reply; 38+ messages in thread From: Thomas Graf @ 2015-12-09 22:21 UTC (permalink / raw) To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller On 12/09/15 at 10:00am, Tom Herbert wrote: > On Wed, Dec 9, 2015 at 9:28 AM, Edward Cree <ecree@solarflare.com> wrote: > > Which only pushes the problem onto when someone wants to nest > > encapsulations. (I heard you like tunnels, so I put a tunnel in your > > tunnel so you can encapsulate while you encapsulate.) > > Or to put it another way, 2 isn't a number; the only numbers are 0, 1 > > and infinity ;) > > Perhaps in practice 2 csums would be enough, for now. But isn't the > > whole point of the brave new world of generic checksums that it should > > be future-proof? > > > If there is a need then we can add an arbitrary number. But no one has > proven there is a need, however we do have a real need for checksum > offload outside of the narrow uses of NETIF_F_IP[V6]_CSUM. Need may be a strong word here but people have started doing nested tunneling by running container orchestration tools which use VXLAN to isolate containers inside of OpenStack virtual infrastructure which also creates virtual networks. I'm not saying it's sane or desirable but we will start seeing nested tunnels in the wild :-( ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 22:21 ` Thomas Graf @ 2015-12-09 22:42 ` Tom Herbert 2015-12-09 22:44 ` Thomas Graf 0 siblings, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-09 22:42 UTC (permalink / raw) To: Thomas Graf; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller On Wed, Dec 9, 2015 at 2:21 PM, Thomas Graf <tgraf@suug.ch> wrote: > On 12/09/15 at 10:00am, Tom Herbert wrote: >> On Wed, Dec 9, 2015 at 9:28 AM, Edward Cree <ecree@solarflare.com> wrote: >> > Which only pushes the problem onto when someone wants to nest >> > encapsulations. (I heard you like tunnels, so I put a tunnel in your >> > tunnel so you can encapsulate while you encapsulate.) >> > Or to put it another way, 2 isn't a number; the only numbers are 0, 1 >> > and infinity ;) >> > Perhaps in practice 2 csums would be enough, for now. But isn't the >> > whole point of the brave new world of generic checksums that it should >> > be future-proof? >> > >> If there is a need then we can add an arbitrary number. But no one has >> proven there is a need, however we do have a real need for checksum >> offload outside of the narrow uses of NETIF_F_IP[V6]_CSUM. > > Need may be a strong word here but people have started doing nested > tunneling by running container orchestration tools which use VXLAN > to isolate containers inside of OpenStack virtual infrastructure which > also creates virtual networks. > > I'm not saying it's sane or desirable but we will start seeing nested > tunnels in the wild :-( csum_start and csum_offset together occupy 32 bits. As demonstrated in VXLAN RCO we can compress csum_start/csum_offset down to 8 bits which means if necessary we could get up to four pairs in an sk_buff without increasing its size. If you need more that four checksums to be offloaded in single packet then I doubt getting checksum offload to work is going to be your biggest problem. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 22:42 ` Tom Herbert @ 2015-12-09 22:44 ` Thomas Graf 0 siblings, 0 replies; 38+ messages in thread From: Thomas Graf @ 2015-12-09 22:44 UTC (permalink / raw) To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller On 12/09/15 at 02:42pm, Tom Herbert wrote: > csum_start and csum_offset together occupy 32 bits. As demonstrated in > VXLAN RCO we can compress csum_start/csum_offset down to 8 bits which > means if necessary we could get up to four pairs in an sk_buff without > increasing its size. If you need more that four checksums to be > offloaded in single packet then I doubt getting checksum offload to > work is going to be your biggest problem. Fair point. I wouldn't suggest even creating it without huge demand first. There should be as many incentives as possible to move or stay away from such horrible architectures. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 18:00 ` Tom Herbert 2015-12-09 22:21 ` Thomas Graf @ 2015-12-10 15:49 ` Edward Cree 2015-12-10 16:26 ` Tom Herbert 1 sibling, 1 reply; 38+ messages in thread From: Edward Cree @ 2015-12-10 15:49 UTC (permalink / raw) To: Tom Herbert; +Cc: Linux Kernel Network Developers, David Miller On 09/12/15 18:00, Tom Herbert wrote: > That is not at all true. If the stack has set up VXLAN RCO and the > device decides to set the inner checksum itself then the checksum > will be bad. The checksum interface is very specific please read it > carefully (sk_buff.h), if the driver/device thinks it is smarter > than the stack and tries to do set its own rules on how checksum > offload works then things will eventually break miserably. Ok, I've passed that on to the guy working on this bit of the driver. It looks like the best way to support the capabilities of NICs like the sfc 8000 series (which can fill in two checksums but uses packet parsing to figure out what to do, rather than using csum start/offset) is: (core / stack) * add NETIF_F_HW_2CSUMS (or whatever name) * squeeze a second csum start/offset pair into the skb (as you mention, we can do this without size increase) * Modify (Tx) CHECKSUM_PARTIAL generation to use both csum pairs. Presumably by creating CHECKSUM_PARTIAL_2CSUMS to indicate that the second csum pair has been filled in as well. (sfc driver) * declare 2CSUMS support * on getting an skb for xmit, check whether the csum pairs match what our eeevil packet parsing hardware will do. If so, send it with appropriate csum offload settings (we can enable/disable inner & outer offload independently, with TX Option descriptors). Any csum pair that doesn't match, we call skb_checksum_help to do it in software, and tell the hardware not to do that one. Optionally, we could also create NETIF_F_IP[V6]_2CSUMS in the stack and have our driver advertise that instead, but since there has to be a fallback to skb_checksum_help in the driver anyway, there doesn't seem to be much point. Does that seem reasonable? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-10 15:49 ` Edward Cree @ 2015-12-10 16:26 ` Tom Herbert 2015-12-10 20:28 ` Edward Cree 0 siblings, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-10 16:26 UTC (permalink / raw) To: Edward Cree; +Cc: Linux Kernel Network Developers, David Miller On Thu, Dec 10, 2015 at 7:49 AM, Edward Cree <ecree@solarflare.com> wrote: > On 09/12/15 18:00, Tom Herbert wrote: >> That is not at all true. If the stack has set up VXLAN RCO and the > device decides to set the inner checksum itself then the checksum > will be bad. The checksum interface is very specific please read it > carefully (sk_buff.h), if the driver/device thinks it is smarter > than the stack and tries to do set its own rules on how checksum > offload works then things will eventually break miserably. > > Ok, I've passed that on to the guy working on this bit of the driver. > > It looks like the best way to support the capabilities of NICs like the > sfc 8000 series (which can fill in two checksums but uses packet parsing > to figure out what to do, rather than using csum start/offset) is: > > (core / stack) > * add NETIF_F_HW_2CSUMS (or whatever name) > * squeeze a second csum start/offset pair into the skb (as you mention, > we can do this without size increase) > * Modify (Tx) CHECKSUM_PARTIAL generation to use both csum pairs. > Presumably by creating CHECKSUM_PARTIAL_2CSUMS to indicate that the > second csum pair has been filled in as well. > > (sfc driver) > * declare 2CSUMS support > * on getting an skb for xmit, check whether the csum pairs match what our > eeevil packet parsing hardware will do. If so, send it with appropriate > csum offload settings (we can enable/disable inner & outer offload > independently, with TX Option descriptors). Any csum pair that doesn't > match, we call skb_checksum_help to do it in software, and tell the > hardware not to do that one. > > Does that seem reasonable? It sounds like potentially interesting work. You'll probably want my patches that provider helper functions that allow a driver to verify that it can offload a checksum. We'll have to update those also to allow two checksums. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-10 16:26 ` Tom Herbert @ 2015-12-10 20:28 ` Edward Cree 2015-12-10 21:02 ` Rustad, Mark D 2015-12-14 15:11 ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree 0 siblings, 2 replies; 38+ messages in thread From: Edward Cree @ 2015-12-10 20:28 UTC (permalink / raw) To: Tom Herbert; +Cc: Linux Kernel Network Developers, David Miller On 10/12/15 16:26, Tom Herbert wrote: > It sounds like potentially interesting work. You'll probably want my patches that provider helper functions that allow a driver to verify that it can offload a checksum. We'll have to update those also to allow two checksums. I have just realised something startling. Assuming the inner protocol uses the ones complement checksum in the way IP, UDP and TCP do, the outer checksum can be computed *without looking at the payload*. Why? Because the ones complement sum of (say) a correctly checksummed UDP datagram is simply the complement of the ones complement sum of the pseudo header. Similarly, the ones complement sum of a correctly checksummed IP header is zero. Therefore, the outer checksum depends _only_ on the inner and outer pseudo headers and the encapsulation headers. For example, with UDP encapsulated in VXLAN, we have the following packet structure: ETH IP UDP VXLAN inner-ETH inner-IP inner-UDP PAYLOAD and the outer checksum equals ~([outer_pseudo] + [UDP] + [VXLAN] + [inner-ETH] + ~[inner_pseudo]) where [] denotes summation, and all addition is ones complement. This can easily be computed in software, especially as the stack already has ~[inner_pseudo]: it's stored in the inner checksum field to help inner checksum offload. Have I made a mistake in my ones-complement maths, or is outer checksum offload as unnecessary as IP header checksum offload? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-10 20:28 ` Edward Cree @ 2015-12-10 21:02 ` Rustad, Mark D 2015-12-14 15:11 ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree 1 sibling, 0 replies; 38+ messages in thread From: Rustad, Mark D @ 2015-12-10 21:02 UTC (permalink / raw) To: Edward Cree; +Cc: Tom Herbert, Linux Kernel Network Developers, David Miller [-- Attachment #1: Type: text/plain, Size: 1774 bytes --] Edward Cree <ecree@solarflare.com> wrote: > I have just realised something startling. Assuming the inner protocol uses the ones complement checksum in the way IP, UDP and TCP do, the outer checksum can be computed *without looking at the payload*. Why? Because the ones complement sum of (say) a correctly checksummed UDP datagram is simply the complement of the ones complement sum of the pseudo header. Similarly, the ones complement sum of a correctly checksummed IP header is zero. > Therefore, the outer checksum depends _only_ on the inner and outer pseudo headers and the encapsulation headers. For example, with UDP encapsulated in VXLAN, we have the following packet structure: > ETH IP UDP VXLAN inner-ETH inner-IP inner-UDP PAYLOAD > and the outer checksum equals > ~([outer_pseudo] + [UDP] + [VXLAN] + [inner-ETH] + ~[inner_pseudo]) > where [] denotes summation, and all addition is ones complement. > This can easily be computed in software, especially as the stack already has ~[inner_pseudo]: it's stored in the inner checksum field to help inner checksum offload. > > Have I made a mistake in my ones-complement maths, or is outer checksum offload as unnecessary as IP header checksum offload? I agree with the overall observation, in that the outer checksum can be derived from the inner one. I think that the inner-ip header needs to be added (after subtracting out the inner_pseudo as you indicate above), because the entire raw inner IP header needs to be included in the outer checksum. I haven't thought this all through in detail yet. It would be really nice to have a function that implemented something like this. Could one be structured to handle most encapsulations? -- Mark Rustad, Networking Division, Intel Corporation [-- Attachment #2: Message signed with OpenPGP using GPGMail --] [-- Type: application/pgp-signature, Size: 841 bytes --] ^ permalink raw reply [flat|nested] 38+ messages in thread
* [RFC PATCH net-next 0/2] Local checksum offload for VXLAN 2015-12-10 20:28 ` Edward Cree 2015-12-10 21:02 ` Rustad, Mark D @ 2015-12-14 15:11 ` Edward Cree 2015-12-14 15:13 ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree 2015-12-14 15:13 ` [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices Edward Cree 1 sibling, 2 replies; 38+ messages in thread From: Edward Cree @ 2015-12-14 15:11 UTC (permalink / raw) To: Tom Herbert; +Cc: netdev, David Miller When the inner packet checksum is offloaded, the outer UDP checksum is easy to calculate as it doesn't depend on the payload (because the inner checksum cancels out everything from the inner packet except the pseudo header). Thus, transmit checksums for VXLAN (and in principle other encapsulations, but I haven't enabled it for / tested with those) can be offloaded on any device supporting NETIF_F_HW_CSUM. Only the innermost checksum has to be offloaded, the rest are filled in by the stack. Tested by hacking a driver to report NETIF_F_HW_CSUM, call skb_checksum_help before transmitting a packet, and not actually offload anything to the hw. I did it that way because I don't have any hw that can actually offload the inner checksum; but I should be able to get hold of some soon. Edward Cree (2): net: udp: local checksum offload for encapsulation net: vxlan: enable local checksum offload on HW_CSUM devices drivers/net/vxlan.c | 5 ++++- net/ipv4/udp.c | 31 +++++++++++++++++++++++++++---- 2 files changed, 31 insertions(+), 5 deletions(-) -- 2.4.3 ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 1/2] net: udp: local checksum offload for encapsulation 2015-12-14 15:11 ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree @ 2015-12-14 15:13 ` Edward Cree 2015-12-14 17:16 ` Tom Herbert 2015-12-14 15:13 ` [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices Edward Cree 1 sibling, 1 reply; 38+ messages in thread From: Edward Cree @ 2015-12-14 15:13 UTC (permalink / raw) To: Tom Herbert; +Cc: netdev, David Miller The arithmetic properties of the ones-complement checksum mean that a correctly checksummed inner packet, including its checksum, has a ones complement sum depending only on whatever value was used to initialise the checksum field before checksumming (in the case of TCP and UDP, this is the ones complement sum of the pseudo header, complemented). Consequently, if we are going to offload the inner checksum with CHECKSUM_PARTIAL, we can compute the outer checksum based only on the packed data not covered by the inner checksum, and the initial value of the inner checksum field. Signed-off-by: Edward Cree <ecree@solarflare.com> --- net/ipv4/udp.c | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 0c7b0e6..07d679e 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -767,12 +767,35 @@ void udp_set_csum(bool nocheck, struct sk_buff *skb, { struct udphdr *uh = udp_hdr(skb); - if (nocheck) + if (nocheck) { uh->check = 0; - else if (skb_is_gso(skb)) + } else if (skb_is_gso(skb)) { uh->check = ~udp_v4_check(len, saddr, daddr, 0); - else if (skb_dst(skb) && skb_dst(skb)->dev && - (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) { + } else if (skb->ip_summed == CHECKSUM_PARTIAL && + skb_dst(skb) && skb_dst(skb)->dev && + (skb_dst(skb)->dev->features & NETIF_F_HW_CSUM)) { + /* Everything from csum_start onwards will be + * checksummed and will thus have a sum of whatever + * we previously put in the checksum field (eg. sum + * of pseudo-header) + */ + __wsum csum; + + /* Fill in our pseudo-header checksum */ + uh->check = ~udp_v4_check(len, saddr, daddr, 0); + /* Start with complement of inner pseudo-header checksum */ + csum = ~skb_checksum(skb, skb_checksum_start_offset(skb) + skb->csum_offset, + 2, 0); + /* Add in checksum of our headers (incl. pseudo-header + * checksum filled in above) + */ + csum = skb_checksum(skb, 0, skb_checksum_start_offset(skb), csum); + /* The result is the outer checksum */ + uh->check = csum_fold(csum); + if (uh->check == 0) + uh->check = CSUM_MANGLED_0; + } else if (skb_dst(skb) && skb_dst(skb)->dev && + (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) { BUG_ON(skb->ip_summed == CHECKSUM_PARTIAL); -- 2.4.3 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 1/2] net: udp: local checksum offload for encapsulation 2015-12-14 15:13 ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree @ 2015-12-14 17:16 ` Tom Herbert 2015-12-15 18:07 ` Edward Cree 0 siblings, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-14 17:16 UTC (permalink / raw) To: Edward Cree; +Cc: netdev, David Miller On Mon, Dec 14, 2015 at 7:13 AM, Edward Cree <ecree@solarflare.com> wrote: > The arithmetic properties of the ones-complement checksum mean that a > correctly checksummed inner packet, including its checksum, has a ones > complement sum depending only on whatever value was used to initialise > the checksum field before checksumming (in the case of TCP and UDP, > this is the ones complement sum of the pseudo header, complemented). > Consequently, if we are going to offload the inner checksum with > CHECKSUM_PARTIAL, we can compute the outer checksum based only on the > packed data not covered by the inner checksum, and the initial value of > the inner checksum field. > > Signed-off-by: Edward Cree <ecree@solarflare.com> > --- > net/ipv4/udp.c | 31 +++++++++++++++++++++++++++---- > 1 file changed, 27 insertions(+), 4 deletions(-) > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c > index 0c7b0e6..07d679e 100644 > --- a/net/ipv4/udp.c > +++ b/net/ipv4/udp.c > @@ -767,12 +767,35 @@ void udp_set_csum(bool nocheck, struct sk_buff *skb, > { > struct udphdr *uh = udp_hdr(skb); > > - if (nocheck) > + if (nocheck) { > uh->check = 0; > - else if (skb_is_gso(skb)) > + } else if (skb_is_gso(skb)) { > uh->check = ~udp_v4_check(len, saddr, daddr, 0); > - else if (skb_dst(skb) && skb_dst(skb)->dev && > - (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) { > + } else if (skb->ip_summed == CHECKSUM_PARTIAL && > + skb_dst(skb) && skb_dst(skb)->dev && > + (skb_dst(skb)->dev->features & NETIF_F_HW_CSUM)) { > + /* Everything from csum_start onwards will be > + * checksummed and will thus have a sum of whatever > + * we previously put in the checksum field (eg. sum > + * of pseudo-header) > + */ > + __wsum csum; > + > + /* Fill in our pseudo-header checksum */ > + uh->check = ~udp_v4_check(len, saddr, daddr, 0); > + /* Start with complement of inner pseudo-header checksum */ > + csum = ~skb_checksum(skb, skb_checksum_start_offset(skb) + skb->csum_offset, > + 2, 0); > + /* Add in checksum of our headers (incl. pseudo-header > + * checksum filled in above) > + */ > + csum = skb_checksum(skb, 0, skb_checksum_start_offset(skb), csum); > + /* The result is the outer checksum */ > + uh->check = csum_fold(csum); > + if (uh->check == 0) > + uh->check = CSUM_MANGLED_0; > + } else if (skb_dst(skb) && skb_dst(skb)->dev && > + (skb_dst(skb)->dev->features & NETIF_F_V4_CSUM)) { > It's clever, but I'm not sure this saves much. The outer checksum could still be offloaded to the device without the extra work. Where this technique would be nice is if the device doesn't support checksum offload at all, then we would definitely avoid doing multiple checksums. That's going to be harder since we won't see CHECKSUM_PARTIAL in that case for the inner checksum, but it would get us to the principle that we only ever calculate the packet checksum once or zero times. . > BUG_ON(skb->ip_summed == CHECKSUM_PARTIAL); > > -- > 2.4.3 > > > The information contained in this message is confidential and is intended for the addressee(s) only. If you have received this message in error, please notify the sender immediately and delete the message. Unless you are an addressee (or authorized to receive for an addressee), you may not use, copy or disclose to anyone this message or any information contained in this message. The unauthorized use, disclosure, copying or alteration of this message is strictly prohibited. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/2] net: udp: local checksum offload for encapsulation 2015-12-14 17:16 ` Tom Herbert @ 2015-12-15 18:07 ` Edward Cree 0 siblings, 0 replies; 38+ messages in thread From: Edward Cree @ 2015-12-15 18:07 UTC (permalink / raw) To: Tom Herbert; +Cc: netdev, David Miller On 14/12/15 17:16, Tom Herbert wrote: > It's clever, but I'm not sure this saves much. The outer checksum > could still be offloaded to the device without the extra work. The main thing it saves you is having to fit a second csum_start/offset pair into (a) the SKB and (b) the TX descriptor. Also, it will (at least in principle) support arbitrarily deep nesting of encapsulation. (Whether that's a remotely sensible thing to do is another matter!) Also, local checksum offload probably helps by simplifying TSO. I've just been talking to our firmware team and we concluded that if the outer checksum (of the superpacket) is filled in by the stack, datapath firmware on the sfc 8000 series could fold appropriate corrections (to account for changes in outer length and the length fields in both pseudo headers) into the outer checksum when performing TSO, which may (we haven't benchmarked it yet!) improve performance, compared to computing the whole outer checksum from scratch. > Where > this technique would be nice is if the device doesn't support checksum > offload at all, then we would definitely avoid doing multiple > checksums. That's going to be harder since we won't see > CHECKSUM_PARTIAL in that case for the inner checksum, but it would get > us to the principle that we only ever calculate the packet checksum > once or zero times. Presumably in that case the inner checksum would be marked as CHECKSUM_UNNECESSARY, and the code that fills in the inner checksum could also fill in skb->csum_start and skb->csum_offset. Then in udp_set_csum() we could just treat it exactly the same as CHECKSUM_PARTIAL. Then we could update start and offset to refer to the checksum we've just filled in, so that any nested encap doesn't have to re-checksum our header. I have no objections to that, but I don't think I grok the rest of the stack deeply enough to implement it. In the meantime, it seems worthwhile getting the CHECKSUM_PARTIAL version in, and working any bugs out of it; do you have any objections to my patch series as it stands or should I repost without RFC tags? I notice it's marked "Changes Requested" in patchwork but it's not entirely clear to me what changes are wanted. ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices 2015-12-14 15:11 ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree 2015-12-14 15:13 ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree @ 2015-12-14 15:13 ` Edward Cree 1 sibling, 0 replies; 38+ messages in thread From: Edward Cree @ 2015-12-14 15:13 UTC (permalink / raw) To: Tom Herbert; +Cc: netdev, David Miller Signed-off-by: Edward Cree <ecree@solarflare.com> --- drivers/net/vxlan.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c index 6369a57..c1660d6 100644 --- a/drivers/net/vxlan.c +++ b/drivers/net/vxlan.c @@ -1785,6 +1785,9 @@ static int vxlan_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *sk bool udp_sum = !!(vxflags & VXLAN_F_UDP_CSUM); int type = udp_sum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL; u16 hdrlen = sizeof(struct vxlanhdr); + /* Is device able to do the inner checksum? */ + bool inner_csum = skb_dst(skb) && skb_dst(skb)->dev && + (skb_dst(skb)->dev->features & NETIF_F_HW_CSUM); if ((vxflags & VXLAN_F_REMCSUM_TX) && skb->ip_summed == CHECKSUM_PARTIAL) { @@ -1814,7 +1817,7 @@ static int vxlan_xmit_skb(struct rtable *rt, struct sock *sk, struct sk_buff *sk if (WARN_ON(!skb)) return -ENOMEM; - skb = iptunnel_handle_offloads(skb, udp_sum, type); + skb = iptunnel_handle_offloads(skb, udp_sum && !inner_csum, type); if (IS_ERR(skb)) return PTR_ERR(skb); -- 2.4.3 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-09 17:28 ` Edward Cree 2015-12-09 17:31 ` David Laight 2015-12-09 18:00 ` Tom Herbert @ 2015-12-11 23:50 ` Tom Herbert 2015-12-12 16:41 ` Sowmini Varadhan 2 siblings, 1 reply; 38+ messages in thread From: Tom Herbert @ 2015-12-11 23:50 UTC (permalink / raw) To: Edward Cree; +Cc: Linux Kernel Network Developers, David Miller >> If your device is trying do offload more than one checksum on its own > accord without being asked to do so by the stack it is doing the > wrong thing! > From the stack's perspective: yes, it is doing the wrong thing. (I've > been discussing with colleagues today how we could change that, and I > think we can, but it involves having _three_ hardware TXQs per kernel > queue, instead of the two we have now...) > But from the outside perspective, the system as a whole isn't doing > anything bad - the packet going on the network is valid and just > happens to have both inner and outer checksums filled in. Is there a > good reason _why_ the stack forbids a device to do this? (Sure, it's > not necessary, and makes the hardware more complex. But the hardware's > already been made, and it's not a *completely* useless thing to do...) > By the way, I am pretty sure that any driver that supports checksum offload of encapsulation is going to break at some point with multiple layers of encapsulation. For example, suppose send a UDP packet in VXLAN in VXLAN, where packet looks like: IP|UDP|VXLAN|Eth|IP|UDP|VXLAN|Eth|IP|UDP. I believe the kernel can set up checksum offload for any of the three UDP checksums. If the outer checksum is offloaded, then skb->encapsulation is not set so the driver will know that a plain UDP/IP checksum is being offload-- no problem. For the other two, skb->encapsulation will be set so driver will know that one of the encapsulated checksums is being offloaded, but it cannot tell which one without comparing against csum_start. Inner headers don't help here since they would likely point to the outermost encapsulation headers even when checksum being offloaded is the innermost one. If the driver guesses about which checksum is being offloaded and its wrong then packet is sent with a bad checksum (possibly two bad checksums). Unless I'm reading this incorrectly, it is a real bug. I hope to the send base helper functions for this next week (the NETIF_F_IP[V6]_CSUM sunsetting patches). Any drivers that are offloading encapsulated checksums and don't check csum_start really should be subsequently fixed. Tom ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-11 23:50 ` Checksum offload queries Tom Herbert @ 2015-12-12 16:41 ` Sowmini Varadhan 2015-12-12 17:24 ` Tom Herbert 0 siblings, 1 reply; 38+ messages in thread From: Sowmini Varadhan @ 2015-12-12 16:41 UTC (permalink / raw) To: Tom Herbert; +Cc: Edward Cree, Linux Kernel Network Developers, David Miller On (12/11/15 15:50), Tom Herbert wrote: > layers of encapsulation. For example, suppose send a UDP packet in > VXLAN in VXLAN, where packet looks like: > IP|UDP|VXLAN|Eth|IP|UDP|VXLAN|Eth|IP|UDP. I believe the kernel can set Without detracting from the fact that you have probably found a bug for checksumming here.. if you have multiple encapsulations above (vxlan over vxlan?) I think lot of other things like pmtu may also be broken? (Each encaps layer lowers the actual application mtu till the thing starts to get absurd) --Sowmini ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: Checksum offload queries 2015-12-12 16:41 ` Sowmini Varadhan @ 2015-12-12 17:24 ` Tom Herbert 0 siblings, 0 replies; 38+ messages in thread From: Tom Herbert @ 2015-12-12 17:24 UTC (permalink / raw) To: Sowmini Varadhan Cc: Edward Cree, Linux Kernel Network Developers, David Miller On Sat, Dec 12, 2015 at 8:41 AM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote: > On (12/11/15 15:50), Tom Herbert wrote: >> layers of encapsulation. For example, suppose send a UDP packet in >> VXLAN in VXLAN, where packet looks like: >> IP|UDP|VXLAN|Eth|IP|UDP|VXLAN|Eth|IP|UDP. I believe the kernel can set > > Without detracting from the fact that you have probably found a bug > for checksumming here.. > > if you have multiple encapsulations above (vxlan over vxlan?) > I think lot of other things like pmtu may also be broken? (Each > encaps layer lowers the actual application mtu till the thing starts > to get absurd) > PMTU is going to be a headache with any encapsulation, multiple layers just makes the headache worse. This is one of the reasons why ILA is better for network virtualization-- no PMTU issue since there is no encapsulation being done. > --Sowmini > ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2015-12-15 18:07 UTC | newest] Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-12-07 15:39 Checksum offload queries Edward Cree 2015-12-07 17:29 ` Tom Herbert 2015-12-07 17:52 ` Tom Herbert 2015-12-08 16:03 ` Edward Cree 2015-12-08 16:43 ` Tom Herbert 2015-12-08 18:03 ` Edward Cree 2015-12-08 17:09 ` David Miller 2015-12-08 17:24 ` Edward Cree 2015-12-08 17:28 ` David Miller 2015-12-07 19:38 ` David Miller 2015-12-08 14:42 ` Edward Cree 2015-12-08 17:04 ` Tom Herbert 2015-12-09 1:56 ` Thomas Graf 2015-12-09 16:08 ` Tom Herbert 2015-12-09 22:29 ` Thomas Graf 2015-12-09 22:51 ` Tom Herbert 2015-12-09 23:13 ` Thomas Graf 2015-12-08 17:06 ` David Miller 2015-12-09 12:14 ` Edward Cree 2015-12-09 16:01 ` Tom Herbert 2015-12-09 17:28 ` Edward Cree 2015-12-09 17:31 ` David Laight 2015-12-09 18:00 ` Tom Herbert 2015-12-09 22:21 ` Thomas Graf 2015-12-09 22:42 ` Tom Herbert 2015-12-09 22:44 ` Thomas Graf 2015-12-10 15:49 ` Edward Cree 2015-12-10 16:26 ` Tom Herbert 2015-12-10 20:28 ` Edward Cree 2015-12-10 21:02 ` Rustad, Mark D 2015-12-14 15:11 ` [RFC PATCH net-next 0/2] Local checksum offload for VXLAN Edward Cree 2015-12-14 15:13 ` [PATCH 1/2] net: udp: local checksum offload for encapsulation Edward Cree 2015-12-14 17:16 ` Tom Herbert 2015-12-15 18:07 ` Edward Cree 2015-12-14 15:13 ` [PATCH 2/2] net: vxlan: enable local checksum offload on HW_CSUM devices Edward Cree 2015-12-11 23:50 ` Checksum offload queries Tom Herbert 2015-12-12 16:41 ` Sowmini Varadhan 2015-12-12 17:24 ` Tom Herbert
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.