* Checksum behaviour of bpf_redirected packets
@ 2020-05-04 16:11 Lorenz Bauer
  2020-05-06  1:28 ` Alexei Starovoitov
  0 siblings, 1 reply; 17+ messages in thread
From: Lorenz Bauer @ 2020-05-04 16:11 UTC (permalink / raw)
  To: bpf; +Cc: kernel-team

In our TC classifier cls_redirect [1], we use the following sequence of
helper calls to decapsulate a GUE (basically IP + UDP + custom header)
encapsulated packet:

  skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
  bpf_redirect(skb->ifindex, BPF_F_INGRESS)

It seems like some checksums of the inner headers are not validated in this
case. For example, a TCP SYN packet with an invalid TCP checksum is still
accepted by the network stack and elicits a SYN-ACK.

Is this known but undocumented behaviour, or a bug? In either case, is there
a workaround I'm not aware of?

1: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/test_cls_redirect.c#n370
-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-04 16:11 Checksum behaviour of bpf_redirected packets Lorenz Bauer
@ 2020-05-06  1:28 ` Alexei Starovoitov
  2020-05-06 16:24   ` Lorenz Bauer
  0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2020-05-06  1:28 UTC (permalink / raw)
  To: Lorenz Bauer; +Cc: bpf, kernel-team

On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> In our TC classifier cls_redirect [1], we use the following sequence
> of helper calls to
> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
>
>   skb_adjust_room(skb, -encap_len,
> BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
>   bpf_redirect(skb->ifindex, BPF_F_INGRESS)
>
> It seems like some checksums of the inner headers are not validated in
> this case.
> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
> network stack and elicits a SYN ACK.
>
> Is this known but undocumented behaviour or a bug? In either case, is
> there a work
> around I'm not aware of?

I thought inner and outer csums are covered by different flags, and the
driver is supposed to set the right one depending on the level of in-hw
checking it did.


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-06  1:28 ` Alexei Starovoitov
@ 2020-05-06 16:24   ` Lorenz Bauer
  2020-05-06 17:26     ` Jakub Kicinski
  2020-05-06 21:55     ` Daniel Borkmann
  0 siblings, 2 replies; 17+ messages in thread
From: Lorenz Bauer @ 2020-05-06 16:24 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann; +Cc: bpf, kernel-team

On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >
> > In our TC classifier cls_redirect [1], we use the following sequence
> > of helper calls to
> > decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
> >
> >   skb_adjust_room(skb, -encap_len,
> > BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
> >   bpf_redirect(skb->ifindex, BPF_F_INGRESS)
> >
> > It seems like some checksums of the inner headers are not validated in
> > this case.
> > For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
> > network stack and elicits a SYN ACK.
> >
> > Is this known but undocumented behaviour or a bug? In either case, is
> > there a work
> > around I'm not aware of?
>
> I thought inner and outer csums are covered by different flags and driver
> suppose to set the right one depending on level of in-hw checking it did.

I've figured out what the problem is. We receive the following packet from
the driver:

    | ETH | IP | UDP | GUE | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
checksum offloading. On this packet we run skb_adjust_room() with
BPF_ADJ_ROOM_MAC and a length of -encap_len, and get the following:

    | ETH | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

Note that ip_summed is still CHECKSUM_UNNECESSARY. After
bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
skb_checksum_init is turned into a no-op due to
CHECKSUM_UNNECESSARY.
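
To make the failure mode concrete, here is a minimal userspace sketch of it.
This is a model, not kernel code: struct model_skb, model_pop_encap() and
model_tcp_accepts() are hypothetical names invented for illustration, and the
only thing taken from the description above is that the pop leaves ip_summed
alone and the receive path skips verification for CHECKSUM_UNNECESSARY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; the names mirror the kernel's ip_summed states. */
enum ip_summed {
	CHECKSUM_NONE = 0,
	CHECKSUM_UNNECESSARY = 1,
	CHECKSUM_COMPLETE = 2,
};

struct model_skb {
	enum ip_summed ip_summed;  /* what the NIC reported about checksums */
	unsigned int csum_level;   /* extra validated checksums beyond the first */
	bool inner_tcp_csum_ok;    /* ground truth for the inner TCP checksum */
};

/* Hypothetical stand-in for the header pop: the encapsulation goes away,
 * but the checksum state is deliberately left untouched. */
static void model_pop_encap(struct model_skb *skb)
{
	assert(skb != NULL);
	/* headers would be removed here; ip_summed is not adjusted */
}

/* Hypothetical stand-in for the receive-side check: verification is
 * skipped entirely when ip_summed is CHECKSUM_UNNECESSARY. All other
 * states are simplified to "verify against the ground truth". */
static bool model_tcp_accepts(const struct model_skb *skb)
{
	assert(skb != NULL);
	if (skb->ip_summed == CHECKSUM_UNNECESSARY)
		return true;              /* trusted, never re-checked */
	return skb->inner_tcp_csum_ok;    /* verified in software */
}
```

With ip_summed left at CHECKSUM_UNNECESSARY the model accepts a packet whose
inner checksum is bad; with CHECKSUM_NONE it rejects it.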

I think this boils down to bpf_skb_generic_pop() not adjusting ip_summed
accordingly. Unfortunately I don't understand well enough how checksums
work. Daniel, it seems like you wrote the helper, could you
take a look?

Thanks!
Lorenz



* Re: Checksum behaviour of bpf_redirected packets
  2020-05-06 16:24   ` Lorenz Bauer
@ 2020-05-06 17:26     ` Jakub Kicinski
  2020-05-06 21:55     ` Daniel Borkmann
  1 sibling, 0 replies; 17+ messages in thread
From: Jakub Kicinski @ 2020-05-06 17:26 UTC (permalink / raw)
  To: Lorenz Bauer; +Cc: Alexei Starovoitov, Daniel Borkmann, bpf, kernel-team

On Wed, 6 May 2020 17:24:43 +0100 Lorenz Bauer wrote:
> On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:  
> > >
> > > In our TC classifier cls_redirect [1], we use the following sequence
> > > of helper calls to
> > > decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
> > >
> > >   skb_adjust_room(skb, -encap_len,
> > > BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
> > >   bpf_redirect(skb->ifindex, BPF_F_INGRESS)
> > >
> > > It seems like some checksums of the inner headers are not validated in
> > > this case.
> > > For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
> > > network stack and elicits a SYN ACK.
> > >
> > > Is this known but undocumented behaviour or a bug? In either case, is
> > > there a work
> > > around I'm not aware of?  
> >
> > I thought inner and outer csums are covered by different flags and driver
> > suppose to set the right one depending on level of in-hw checking it did.  
> 
> I've figured out what the problem is. We receive the following packet from
> the driver:
> 
>     | ETH | IP | UDP | GUE | IP | TCP |
>     skb->ip_summed == CHECKSUM_UNNECESSARY
> 
> ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
> checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
> and get the following:
> 
>     | ETH | IP | TCP |
>     skb->ip_summed == CHECKSUM_UNNECESSARY
> 
> Note that ip_summed is still CHECKSUM_UNNECESSARY. After
> bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
> skb_checksum_init is turned into a no-op due to
> CHECKSUM_UNNECESSARY.
> 
> I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
> accordingly. 

Sounds like we need a call to __skb_decr_checksum_unnecessary(),
but, as you indicate below, when and where to call it is challenging :S
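
For readers unfamiliar with the helper, the following is a userspace sketch
of the decrement and increment semantics as I understand them. It is a model
under stated assumptions, not the kernel source: struct model_skb and the
model_* function names are hypothetical, and MODEL_MAX_CSUM_LEVEL assumes the
kernel's limit of 3 levels.

```c
#include <assert.h>
#include <stddef.h>

enum { CHECKSUM_NONE = 0, CHECKSUM_UNNECESSARY = 1 };

/* Assumed to match the kernel's maximum csum_level. */
#define MODEL_MAX_CSUM_LEVEL 3

struct model_skb {
	int ip_summed;
	unsigned int csum_level; /* 0 means only the outermost checksum was validated */
};

/* One validated checksum has been consumed (e.g. an outer header removed).
 * At level 0 nothing validated remains, so fall back to CHECKSUM_NONE. */
static void model_decr_checksum_unnecessary(struct model_skb *skb)
{
	assert(skb != NULL);
	if (skb->ip_summed != CHECKSUM_UNNECESSARY)
		return;
	if (skb->csum_level == 0)
		skb->ip_summed = CHECKSUM_NONE;
	else
		skb->csum_level--;
}

/* The reverse direction: one more validated checksum is now in front. */
static void model_incr_checksum_unnecessary(struct model_skb *skb)
{
	assert(skb != NULL);
	if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
		if (skb->csum_level < MODEL_MAX_CSUM_LEVEL)
			skb->csum_level++;
	} else if (skb->ip_summed == CHECKSUM_NONE) {
		skb->ip_summed = CHECKSUM_UNNECESSARY;
		skb->csum_level = 0;
	}
}
```

The difficulty is visible even in the model: a generic pop does not know how
many checksum-bearing headers it removed, so it cannot tell how many times to
decrement.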

> Unfortunately I don't understand how checksums work
> sufficiently. Daniel, it seems like you wrote the helper, could you
> take a look?



* Re: Checksum behaviour of bpf_redirected packets
  2020-05-06 16:24   ` Lorenz Bauer
  2020-05-06 17:26     ` Jakub Kicinski
@ 2020-05-06 21:55     ` Daniel Borkmann
  2020-05-07 15:54       ` Lorenz Bauer
  1 sibling, 1 reply; 17+ messages in thread
From: Daniel Borkmann @ 2020-05-06 21:55 UTC (permalink / raw)
  To: Lorenz Bauer, Alexei Starovoitov; +Cc: bpf, kernel-team

On 5/6/20 6:24 PM, Lorenz Bauer wrote:
> On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>>>
>>> In our TC classifier cls_redirect [1], we use the following sequence
>>> of helper calls to
>>> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
>>>
>>>    skb_adjust_room(skb, -encap_len,
>>> BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
>>>    bpf_redirect(skb->ifindex, BPF_F_INGRESS)
>>>
>>> It seems like some checksums of the inner headers are not validated in
>>> this case.
>>> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
>>> network stack and elicits a SYN ACK.
>>>
>>> Is this known but undocumented behaviour or a bug? In either case, is
>>> there a work
>>> around I'm not aware of?
>>
>> I thought inner and outer csums are covered by different flags and driver
>> suppose to set the right one depending on level of in-hw checking it did.
> 
> I've figured out what the problem is. We receive the following packet from
> the driver:
> 
>      | ETH | IP | UDP | GUE | IP | TCP |
>      skb->ip_summed == CHECKSUM_UNNECESSARY
> 
> ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
> checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
> and get the following:
> 
>      | ETH | IP | TCP |
>      skb->ip_summed == CHECKSUM_UNNECESSARY
> 
> Note that ip_summed is still CHECKSUM_UNNECESSARY. After
> bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
> skb_checksum_init is turned into a no-op due to
> CHECKSUM_UNNECESSARY.
> 
> I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
> accordingly. Unfortunately I don't understand how checksums work
> sufficiently. Daniel, it seems like you wrote the helper, could you
> take a look?

Right, so in the skb_adjust_room() case we're not aware of protocol
specifics. We do handle the csum complete case via skb_postpull_rcsum(),
but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
skb->csum_level of the original skb prior to skb_adjust_room() call
might have been 0 (that is, covering UDP)? So if we'd add the possibility
to __skb_decr_checksum_unnecessary() via flag, then it would become
skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
same for the reverse case. Below is a quick hack (compile-tested only);
would this resolve your case ...

    skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC,
                    BPF_F_ADJ_ROOM_FIXED_GSO | BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL)
    bpf_redirect(skb->ifindex, BPF_F_INGRESS)

 From 7439724fcfff7742223198c620349a4fc89d4835 Mon Sep 17 00:00:00 2001
Message-Id: <7439724fcfff7742223198c620349a4fc89d4835.1588801971.git.daniel@iogearbox.net>
From: Daniel Borkmann <daniel@iogearbox.net>
Date: Wed, 6 May 2020 23:50:31 +0200
Subject: [PATCH bpf-next] bpf: inc/dec csum level for csum_unnecessary in skb_adjust_room

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
  include/uapi/linux/bpf.h |  2 ++
  net/core/filter.c        | 23 ++++++++++++++++++++---
  2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b3643e27e264..9877807b8f28 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3279,6 +3279,8 @@ enum {
  	BPF_F_ADJ_ROOM_ENCAP_L3_IPV6	= (1ULL << 2),
  	BPF_F_ADJ_ROOM_ENCAP_L4_GRE	= (1ULL << 3),
  	BPF_F_ADJ_ROOM_ENCAP_L4_UDP	= (1ULL << 4),
+	BPF_F_ADJ_ROOM_INC_CSUM_LEVEL	= (1ULL << 5),
+	BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL	= (1ULL << 6),
  };

  enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index dfaf5df13722..10551dabb7b5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3008,7 +3008,9 @@ static u32 bpf_skb_net_base_len(const struct sk_buff *skb)
  					 BPF_F_ADJ_ROOM_ENCAP_L4_GRE | \
  					 BPF_F_ADJ_ROOM_ENCAP_L4_UDP | \
  					 BPF_F_ADJ_ROOM_ENCAP_L2( \
-					  BPF_ADJ_ROOM_ENCAP_L2_MASK))
+					  BPF_ADJ_ROOM_ENCAP_L2_MASK) | \
+					 BPF_F_ADJ_ROOM_INC_CSUM_LEVEL | \
+					 BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL)

  static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
  			    u64 flags)
@@ -3019,6 +3021,10 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
  	unsigned int gso_type = SKB_GSO_DODGY;
  	int ret;

+	if (unlikely(flags & ~(BPF_F_ADJ_ROOM_MASK &
+			       ~(BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL))))
+		return -EINVAL;
+
  	if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) {
  		/* udp gso_size delineates datagrams, only allow if fixed */
  		if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) ||
@@ -3105,6 +3111,9 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 off, u32 len_diff,
  		shinfo->gso_segs = 0;
  	}

+	if (flags & BPF_F_ADJ_ROOM_INC_CSUM_LEVEL)
+		__skb_incr_checksum_unnecessary(skb);
+
  	return 0;
  }

@@ -3113,7 +3122,8 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
  {
  	int ret;

-	if (flags & ~BPF_F_ADJ_ROOM_FIXED_GSO)
+	if (unlikely(flags & ~(BPF_F_ADJ_ROOM_FIXED_GSO |
+			       BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL)))
  		return -EINVAL;

  	if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) {
@@ -3143,6 +3153,9 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
  		shinfo->gso_segs = 0;
  	}

+	if (flags & BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL)
+		__skb_decr_checksum_unnecessary(skb);
+
  	return 0;
  }

@@ -3163,7 +3176,11 @@ BPF_CALL_4(bpf_skb_adjust_room, struct sk_buff *, skb, s32, len_diff,
  	u32 off;
  	int ret;

-	if (unlikely(flags & ~BPF_F_ADJ_ROOM_MASK))
+	if (unlikely((flags & ~BPF_F_ADJ_ROOM_MASK) ||
+		     ((flags & (BPF_F_ADJ_ROOM_INC_CSUM_LEVEL |
+				BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL)) ==
+		      (BPF_F_ADJ_ROOM_INC_CSUM_LEVEL |
+		       BPF_F_ADJ_ROOM_DEC_CSUM_LEVEL))))
  		return -EINVAL;
  	if (unlikely(len_diff_abs > 0xfffU))
  		return -EFAULT;
-- 
2.21.0
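
The flag checks in the patch above are spread over three functions, so here
is a condensed userspace sketch of the combined rule. It is an illustration
only: model_check_flags() and the MODEL_* names are hypothetical, the
encapsulation flags are left out of the mask for brevity, and the bit
positions for the two new flags are the ones proposed in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_FIXED_GSO       (1ULL << 0)
#define MODEL_INC_CSUM_LEVEL  (1ULL << 5) /* as proposed in the patch */
#define MODEL_DEC_CSUM_LEVEL  (1ULL << 6) /* as proposed in the patch */

/* Reduced mask: the real one also contains the encapsulation flags. */
#define MODEL_MASK (MODEL_FIXED_GSO | MODEL_INC_CSUM_LEVEL | MODEL_DEC_CSUM_LEVEL)

static_assert((MODEL_INC_CSUM_LEVEL & MODEL_DEC_CSUM_LEVEL) == 0,
	      "inc and dec must be distinct bits");

/* Returns 0 if the flag combination would be accepted, -EINVAL otherwise. */
static int model_check_flags(int32_t len_diff, uint64_t flags)
{
	const uint64_t both = MODEL_INC_CSUM_LEVEL | MODEL_DEC_CSUM_LEVEL;

	/* Unknown flags, or inc and dec together, are rejected up front. */
	if ((flags & ~MODEL_MASK) || (flags & both) == both)
		return -EINVAL;

	if (len_diff < 0) {
		/* Shrink path: only fixed-GSO and dec make sense. */
		if (flags & ~(MODEL_FIXED_GSO | MODEL_DEC_CSUM_LEVEL))
			return -EINVAL;
	} else {
		/* Grow path: everything in the mask except dec. */
		if (flags & ~(MODEL_MASK & ~MODEL_DEC_CSUM_LEVEL))
			return -EINVAL;
	}
	return 0;
}
```

So a decap call may pass the dec flag, an encap call may pass the inc flag,
and any other pairing is refused.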




* Re: Checksum behaviour of bpf_redirected packets
  2020-05-06 21:55     ` Daniel Borkmann
@ 2020-05-07 15:54       ` Lorenz Bauer
  2020-05-07 16:43         ` Daniel Borkmann
  0 siblings, 1 reply; 17+ messages in thread
From: Lorenz Bauer @ 2020-05-07 15:54 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, bpf, kernel-team

On Wed, 6 May 2020 at 22:55, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 5/6/20 6:24 PM, Lorenz Bauer wrote:
> > On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >>>
> >>> In our TC classifier cls_redirect [1], we use the following sequence
> >>> of helper calls to
> >>> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
> >>>
> >>>    skb_adjust_room(skb, -encap_len,
> >>> BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
> >>>    bpf_redirect(skb->ifindex, BPF_F_INGRESS)
> >>>
> >>> It seems like some checksums of the inner headers are not validated in
> >>> this case.
> >>> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
> >>> network stack and elicits a SYN ACK.
> >>>
> >>> Is this known but undocumented behaviour or a bug? In either case, is
> >>> there a work
> >>> around I'm not aware of?
> >>
> >> I thought inner and outer csums are covered by different flags and driver
> >> suppose to set the right one depending on level of in-hw checking it did.
> >
> > I've figured out what the problem is. We receive the following packet from
> > the driver:
> >
> >      | ETH | IP | UDP | GUE | IP | TCP |
> >      skb->ip_summed == CHECKSUM_UNNECESSARY
> >
> > ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
> > checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
> > and get the following:
> >
> >      | ETH | IP | TCP |
> >      skb->ip_summed == CHECKSUM_UNNECESSARY
> >
> > Note that ip_summed is still CHECKSUM_UNNECESSARY. After
> > bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
> > skb_checksum_init is turned into a no-op due to
> > CHECKSUM_UNNECESSARY.
> >
> > I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
> > accordingly. Unfortunately I don't understand how checksums work
> > sufficiently. Daniel, it seems like you wrote the helper, could you
> > take a look?
>
> Right, so in the skb_adjust_room() case we're not aware of protocol
> specifics. We do handle the csum complete case via skb_postpull_rcsum(),
> but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
> skb->csum_level of the original skb prior to skb_adjust_room() call
> might have been 0 (that is, covering UDP)? So if we'd add the possibility
> to __skb_decr_checksum_unnecessary() via flag, then it would become
> skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
> same for the reverse case. Below is a quick hack (compile tested-only);
> would this resolve your case ...

Thanks for the patch, it indeed fixes our problem! I spent some more time
trying to understand the checksum offload stuff, here is where I am:

On NICs that don't support hardware offload, ip_summed is CHECKSUM_NONE and
everything works by default, since the rest of the stack does checksumming in
software.

On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
will adjust for the data that is being removed from the skb. The rest of the
stack will use the correct value, all is well.
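
The reason the CHECKSUM_COMPLETE case can be fixed up generically is that the
Internet checksum is a ones' complement sum, so the contribution of removed
bytes can simply be subtracted. A small worked sketch in userspace C follows;
fold16(), csum16() and csum16_sub() are hypothetical helper names, and the
code only illustrates the arithmetic, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum over an even number of bytes, big-endian words. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	assert(p != NULL && len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += (uint32_t)(p[i] << 8 | p[i + 1]);
	return fold16(sum);
}

/* Remove the contribution of pulled bytes: in ones' complement
 * arithmetic a - b is a + ~b. Note that zero has two encodings
 * (0x0000 and 0xffff), so results should be compared with that in mind. */
static uint16_t csum16_sub(uint16_t total, uint16_t pulled)
{
	return fold16((uint32_t)total + (uint16_t)~pulled);
}
```

For the bytes 45 00 00 1c ab cd 12 34, the sum over all eight bytes is 0x031e
and the sum over the first four is 0x451c; subtracting gives 0xbe01, which is
exactly the sum over the remaining four bytes.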

However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
the API of skb_adjust_room doesn't tell us whether the user intends to
remove headers or data, and how that will influence csum_level.
From my POV, skb_adjust_room currently does the wrong thing.
I think we need to fix skb_adjust_room to do the right thing by default,
rather than extending the API. We spent a lot of time on tracking this down,
so hopefully we can spare others the pain.

As Jakub alludes to, we don't know when and how often to call
__skb_decr_checksum_unnecessary(), so we should just
unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
CHECKSUM_UNNECESSARY in bpf_skb_generic_pop(). It sounds simple
enough to land as a fix via the bpf tree (which is important for our
production kernel). As a follow-up we could add the inverse of the flags you
propose via bpf-next.

What do you think?



* Re: Checksum behaviour of bpf_redirected packets
  2020-05-07 15:54       ` Lorenz Bauer
@ 2020-05-07 16:43         ` Daniel Borkmann
  2020-05-07 21:25           ` Jakub Kicinski
  2020-05-11  9:29           ` Lorenz Bauer
  0 siblings, 2 replies; 17+ messages in thread
From: Daniel Borkmann @ 2020-05-07 16:43 UTC (permalink / raw)
  To: Lorenz Bauer; +Cc: Alexei Starovoitov, bpf, kernel-team

On 5/7/20 5:54 PM, Lorenz Bauer wrote:
> On Wed, 6 May 2020 at 22:55, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 5/6/20 6:24 PM, Lorenz Bauer wrote:
>>> On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>>>>>
>>>>> In our TC classifier cls_redirect [1], we use the following sequence
>>>>> of helper calls to
>>>>> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
>>>>>
>>>>>     skb_adjust_room(skb, -encap_len,
>>>>> BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
>>>>>     bpf_redirect(skb->ifindex, BPF_F_INGRESS)
>>>>>
>>>>> It seems like some checksums of the inner headers are not validated in
>>>>> this case.
>>>>> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
>>>>> network stack and elicits a SYN ACK.
>>>>>
>>>>> Is this known but undocumented behaviour or a bug? In either case, is
>>>>> there a work
>>>>> around I'm not aware of?
>>>>
>>>> I thought inner and outer csums are covered by different flags and driver
>>>> suppose to set the right one depending on level of in-hw checking it did.
>>>
>>> I've figured out what the problem is. We receive the following packet from
>>> the driver:
>>>
>>>       | ETH | IP | UDP | GUE | IP | TCP |
>>>       skb->ip_summed == CHECKSUM_UNNECESSARY
>>>
>>> ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
>>> checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
>>> and get the following:
>>>
>>>       | ETH | IP | TCP |
>>>       skb->ip_summed == CHECKSUM_UNNECESSARY
>>>
>>> Note that ip_summed is still CHECKSUM_UNNECESSARY. After
>>> bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
>>> skb_checksum_init is turned into a no-op due to
>>> CHECKSUM_UNNECESSARY.
>>>
>>> I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
>>> accordingly. Unfortunately I don't understand how checksums work
>>> sufficiently. Daniel, it seems like you wrote the helper, could you
>>> take a look?
>>
>> Right, so in the skb_adjust_room() case we're not aware of protocol
>> specifics. We do handle the csum complete case via skb_postpull_rcsum(),
>> but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
>> skb->csum_level of the original skb prior to skb_adjust_room() call
>> might have been 0 (that is, covering UDP)? So if we'd add the possibility
>> to __skb_decr_checksum_unnecessary() via flag, then it would become
>> skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
>> same for the reverse case. Below is a quick hack (compile tested-only);
>> would this resolve your case ...
> 
> Thanks for the patch, it indeed fixes our problem! I spent some more time
> trying to understand the checksum offload stuff, here is where I am:
> 
> On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
> everything works by default since the rest of the stack does checksumming in
> software.
> 
> On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> will adjust for the data that is being removed from the skb. The rest of the
> stack will use the correct value, all is well.
> 
> However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> the API of skb_adjust_room doesn't tell us whether the user intends to
> remove headers or data, and how that will influence csum_level.
>  From my POV, skb_adjust_room currently does the wrong thing.
> I think we need to fix skb_adjust_room to do the right thing by default,
> rather than extending the API. We spent a lot of time on tracking this down,
> so hopefully we can spare others the pain.
> 
> As Jakub alludes to, we don't know when and how often to call
> __skb_decr_checksum_unnecessary so we should just
> unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> enough to land as a fix via the bpf tree (which is important for our
> production kernel). As a follow up we could add the inverse of the flags you
> propose via bpf-next.
> 
> What do you think?

My concern is that unconditionally downgrading a packet to CHECKSUM_NONE would
basically trash performance if we have to fall back to sw in the fast path;
these helpers are also used in our LB case for DSR, for example. I agree that
it sucks to expose these implementation details though. So eventually we'd end
up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
a complex-to-use helper with all its flags, where you end up looking into the
implementation to understand what it is really doing. I'm not sure if
we make anything worse, but I do see your concern. :/ (We do have the
bpf_csum_update() helper as well. I wonder whether we should split such
control into a different helper.)

Thanks,
Daniel


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-07 16:43         ` Daniel Borkmann
@ 2020-05-07 21:25           ` Jakub Kicinski
  2020-05-11  9:31             ` Lorenz Bauer
  2020-05-11  9:29           ` Lorenz Bauer
  1 sibling, 1 reply; 17+ messages in thread
From: Jakub Kicinski @ 2020-05-07 21:25 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Lorenz Bauer, Alexei Starovoitov, bpf, kernel-team

On Thu, 7 May 2020 18:43:47 +0200 Daniel Borkmann wrote:
> > Thanks for the patch, it indeed fixes our problem! I spent some more time
> > trying to understand the checksum offload stuff, here is where I am:
> > 
> > On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
> > everything works by default since the rest of the stack does checksumming in
> > software.
> > 
> > On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> > will adjust for the data that is being removed from the skb. The rest of the
> > stack will use the correct value, all is well.
> > 
> > However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> > the API of skb_adjust_room doesn't tell us whether the user intends to
> > remove headers or data, and how that will influence csum_level.
> >  From my POV, skb_adjust_room currently does the wrong thing.
> > I think we need to fix skb_adjust_room to do the right thing by default,
> > rather than extending the API. We spent a lot of time on tracking this down,
> > so hopefully we can spare others the pain.
> > 
> > As Jakub alludes to, we don't know when and how often to call
> > __skb_decr_checksum_unnecessary so we should just
> > unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> > CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> > enough to land as a fix via the bpf tree (which is important for our
> > production kernel). As a follow up we could add the inverse of the flags you
> > propose via bpf-next.
> > 
> > What do you think?  
> 
> My concern with unconditionally downgrading a packet to CHECKSUM_NONE would
> basically trash performance if we have to fallback to sw in fast-path, these
> helpers are also used in our LB case for DSR, for example. I agree that it
> sucks to expose these implementation details though. So eventually we'd end
> up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
> a complex to use helper with all its flags where you end up looking into the
> implementation detail to understand what it is really doing. I'm not sure if
> we make anything worse, but I do see your concern. :/ (We do have bpf_csum_update()
> helper as well. I wonder whether we should split such control into a different
> helper.)

Probably stating the obvious but for decap of UDP tunnels which carry
locally terminated flows - we'd probably also want the upgrade from
UNNECESSARY to COMPLETE, like we do in the kernel
(skb_checksum_try_convert()). Tricky.


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-07 16:43         ` Daniel Borkmann
  2020-05-07 21:25           ` Jakub Kicinski
@ 2020-05-11  9:29           ` Lorenz Bauer
  2020-05-12 21:25             ` Daniel Borkmann
  1 sibling, 1 reply; 17+ messages in thread
From: Lorenz Bauer @ 2020-05-11  9:29 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, bpf, kernel-team, Jakub Kicinski

On Thu, 7 May 2020 at 17:43, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 5/7/20 5:54 PM, Lorenz Bauer wrote:
> > On Wed, 6 May 2020 at 22:55, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 5/6/20 6:24 PM, Lorenz Bauer wrote:
> >>> On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com> wrote:
> >>>> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >>>>>
> >>>>> In our TC classifier cls_redirect [1], we use the following sequence
> >>>>> of helper calls to
> >>>>> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
> >>>>>
> >>>>>     skb_adjust_room(skb, -encap_len,
> >>>>> BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
> >>>>>     bpf_redirect(skb->ifindex, BPF_F_INGRESS)
> >>>>>
> >>>>> It seems like some checksums of the inner headers are not validated in
> >>>>> this case.
> >>>>> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
> >>>>> network stack and elicits a SYN ACK.
> >>>>>
> >>>>> Is this known but undocumented behaviour or a bug? In either case, is
> >>>>> there a work
> >>>>> around I'm not aware of?
> >>>>
> >>>> I thought inner and outer csums are covered by different flags and driver
> >>>> suppose to set the right one depending on level of in-hw checking it did.
> >>>
> >>> I've figured out what the problem is. We receive the following packet from
> >>> the driver:
> >>>
> >>>       | ETH | IP | UDP | GUE | IP | TCP |
> >>>       skb->ip_summed == CHECKSUM_UNNECESSARY
> >>>
> >>> ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
> >>> checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
> >>> and get the following:
> >>>
> >>>       | ETH | IP | TCP |
> >>>       skb->ip_summed == CHECKSUM_UNNECESSARY
> >>>
> >>> Note that ip_summed is still CHECKSUM_UNNECESSARY. After
> >>> bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
> >>> skb_checksum_init is turned into a no-op due to
> >>> CHECKSUM_UNNECESSARY.
> >>>
> >>> I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
> >>> accordingly. Unfortunately I don't understand how checksums work
> >>> sufficiently. Daniel, it seems like you wrote the helper, could you
> >>> take a look?
> >>
> >> Right, so in the skb_adjust_room() case we're not aware of protocol
> >> specifics. We do handle the csum complete case via skb_postpull_rcsum(),
> >> but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
> >> skb->csum_level of the original skb prior to skb_adjust_room() call
> >> might have been 0 (that is, covering UDP)? So if we'd add the possibility
> >> to __skb_decr_checksum_unnecessary() via flag, then it would become
> >> skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
> >> same for the reverse case. Below is a quick hack (compile tested-only);
> >> would this resolve your case ...
> >
> > Thanks for the patch, it indeed fixes our problem! I spent some more time
> > trying to understand the checksum offload stuff, here is where I am:
> >
> > On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
> > everything works by default since the rest of the stack does checksumming in
> > software.
> >
> > On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> > will adjust for the data that is being removed from the skb. The rest of the
> > stack will use the correct value, all is well.
> >
> > However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> > the API of skb_adjust_room doesn't tell us whether the user intends to
> > remove headers or data, and how that will influence csum_level.
> >  From my POV, skb_adjust_room currently does the wrong thing.
> > I think we need to fix skb_adjust_room to do the right thing by default,
> > rather than extending the API. We spent a lot of time on tracking this down,
> > so hopefully we can spare others the pain.
> >
> > As Jakub alludes to, we don't know when and how often to call
> > __skb_decr_checksum_unnecessary so we should just
> > unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> > CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> > enough to land as a fix via the bpf tree (which is important for our
> > production kernel). As a follow up we could add the inverse of the flags you
> > propose via bpf-next.
> >
> > What do you think?
>
> My concern with unconditionally downgrading a packet to CHECKSUM_NONE would
> basically trash performance if we have to fallback to sw in fast-path, these
> helpers are also used in our LB case for DSR, for example.

Our setup also uses DSR, so I wonder how you manage to avoid this checksum
issue. Why is Cilium not affected by this bug as well? Do you never pop
headers?

FWIW, currently the only workaround I know is to disable rx checksumming for
ALL inbound traffic via ethtool -K bla rx off, which in theory trashes
performance for all RX traffic, not just the traffic going to the load
balancer. I applied this in a couple of production data centers, and did not
see an increase in softirq, which is where I assume this would show up. My
guess is that this is because our RX << TX. Unconditionally setting
CHECKSUM_NONE would be even less visible, since we could turn rx checksumming
back on in the general case. Is there a way for you to quantify what the
impact for Cilium would be?

> I agree that it
> sucks to expose these implementation details though. So eventually we'd end
> up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
> a complex to use helper with all its flags where you end up looking into the
> implementation detail to understand what it is really doing. I'm not sure if
> we make anything worse, but I do see your concern. :/

Having those flags seems fine to me, you're right that it's already complicated.
My concern is really with the current state of the helper however: I think that
as it exists right now it's buggy wrt checksum offload, and we need a
backportable fix.

Option 1: always downgrade UNNECESSARY to NONE
- Easiest to back port
- The helper is safe by default
- Performance impact unclear
- No escape hatch for Cilium

Option 2: add a flag to force CHECKSUM_NONE
- New UAPI, can this be backported?
- The helper isn't safe by default, needs documentation
- Escape hatch for Cilium

Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
- New UAPI, can this be backported?
- The helper is safe by default
- Escape hatch for Cilium (though you'd need to detect availability of the
  flag somehow)
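
To make the difference between the three options concrete, here is a minimal
userspace sketch of the checksum bookkeeping after a header pop. It is not
kernel code: struct model_skb, model_pop() and both MODEL_F_* flags are
hypothetical names made up for illustration, and only the ip_summed values are
meant to mirror the kernel's constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirror the kernel's ip_summed constants. */
enum { CHECKSUM_NONE = 0, CHECKSUM_UNNECESSARY = 1, CHECKSUM_COMPLETE = 2 };

/* Hypothetical stand-in for the two sk_buff fields that matter here. */
struct model_skb {
	uint8_t ip_summed;
	uint8_t csum_level;
};

/* Hypothetical flags, named for illustration only. */
#define MODEL_F_FORCE_CSUM_NONE (1u << 0) /* option 2: caller opts in */
#define MODEL_F_SKIP_CSUM_RESET (1u << 1) /* option 3: caller opts out */

enum model_option { OPTION_1 = 1, OPTION_2 = 2, OPTION_3 = 3 };

/* Decide whether a header pop should discard the hardware verdict. */
bool model_should_reset(enum model_option opt, unsigned int flags)
{
	switch (opt) {
	case OPTION_1:
		return true; /* always downgrade */
	case OPTION_2:
		return (flags & MODEL_F_FORCE_CSUM_NONE) != 0;
	case OPTION_3:
		return (flags & MODEL_F_SKIP_CSUM_RESET) == 0;
	}
	return false;
}

/* Model of the checksum state update after popping headers. */
void model_pop(struct model_skb *skb, enum model_option opt,
	       unsigned int flags)
{
	assert(skb != NULL);
	if (skb->ip_summed == CHECKSUM_UNNECESSARY &&
	    model_should_reset(opt, flags)) {
		skb->ip_summed = CHECKSUM_NONE;
		skb->csum_level = 0;
	}
}
```

With no flags passed, options 1 and 3 leave the skb at CHECKSUM_NONE, while
option 2 keeps the stale CHECKSUM_UNNECESSARY. That is what I mean by "safe by
default".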

I guess there is also Option 0, add a flag but don't backport, which to
me is admitting defeat. If we were to do that we'd at least
want to document the problem. Thinking about how to do that already
makes my head spin:

- If you have a NIC that does CHECKSUM_UNNECESSARY
- And you pop network headers
- You will run into this bug
- To fix it you have to disable rx checksum offload

How do users figure out whether a NIC does UNNECESSARY vs. COMPLETE
vs. NONE? I have the luxury of only caring about two different drivers, but
what if I ship BPF (like Cilium does)? Ultimately vendors would either have
buggy programs, or would tell people to unconditionally disable rx
checksumming I believe.

From my POV, I'd prefer option 1 or 3, since I strongly believe that the
helper should be safe by default, and that the user can assert invariants
via flags to get better performance. I could live with option 2 as well since
I just have to care about a single kernel version.

> (We do have bpf_csum_update()
> helper as well. I wonder whether we should split such control into a different
> helper.)

I'm not sure what you mean, maybe you can elaborate a little?

Lorenz

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-07 21:25           ` Jakub Kicinski
@ 2020-05-11  9:31             ` Lorenz Bauer
  0 siblings, 0 replies; 17+ messages in thread
From: Lorenz Bauer @ 2020-05-11  9:31 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: Daniel Borkmann, Alexei Starovoitov, bpf, kernel-team

On Thu, 7 May 2020 at 22:25, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 7 May 2020 18:43:47 +0200 Daniel Borkmann wrote:
> > > Thanks for the patch, it indeed fixes our problem! I spent some more time
> > > trying to understand the checksum offload stuff, here is where I am:
> > >
> > > On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
> > > everything works by default since the rest of the stack does checksumming in
> > > software.
> > >
> > > On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> > > will adjust for the data that is being removed from the skb. The rest of the
> > > stack will use the correct value, all is well.
> > >
> > > However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> > > the API of skb_adjust_room doesn't tell us whether the user intends to
> > > remove headers or data, and how that will influence csum_level.
> > >  From my POV, skb_adjust_room currently does the wrong thing.
> > > I think we need to fix skb_adjust_room to do the right thing by default,
> > > rather than extending the API. We spent a lot of time on tracking this down,
> > > so hopefully we can spare others the pain.
> > >
> > > As Jakub alludes to, we don't know when and how often to call
> > > __skb_decr_checksum_unnecessary so we should just
> > > unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> > > CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> > > enough to land as a fix via the bpf tree (which is important for our
> > > production kernel). As a follow up we could add the inverse of the flags you
> > > propose via bpf-next.
> > >
> > > What do you think?
> >
> > My concern with unconditionally downgrading a packet to CHECKSUM_NONE would
> > basically trash performance if we have to fallback to sw in fast-path, these
> > helpers are also used in our LB case for DSR, for example. I agree that it
> > sucks to expose these implementation details though. So eventually we'd end
> > up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
> > a complex to use helper with all its flags where you end up looking into the
> > implementation detail to understand what it is really doing. I'm not sure if
> > we make anything worse, but I do see your concern. :/ (We do have bpf_csum_update()
> > helper as well. I wonder whether we should split such control into a different
> > helper.)
>
> Probably stating the obvious but for decap of UDP tunnels which carry
> locally terminated flows - we'd probably also want the upgrade from
> UNNECESSARY to COMPLETE, like we do in the kernel
> (skb_checksum_try_convert()). Tricky.

I guess this is an argument in the direction that bpf_skb_adjust_room is too
low-level an API?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-11  9:29           ` Lorenz Bauer
@ 2020-05-12 21:25             ` Daniel Borkmann
  2020-05-13 14:14               ` Lorenz Bauer
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Borkmann @ 2020-05-12 21:25 UTC (permalink / raw)
  To: Lorenz Bauer; +Cc: Alexei Starovoitov, bpf, kernel-team, Jakub Kicinski

On 5/11/20 11:29 AM, Lorenz Bauer wrote:
> On Thu, 7 May 2020 at 17:43, Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 5/7/20 5:54 PM, Lorenz Bauer wrote:
>>> On Wed, 6 May 2020 at 22:55, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 5/6/20 6:24 PM, Lorenz Bauer wrote:
>>>>> On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
>>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>>>>>>>
>>>>>>> In our TC classifier cls_redirect [1], we use the following sequence
>>>>>>> of helper calls to
>>>>>>> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
>>>>>>>
>>>>>>>      skb_adjust_room(skb, -encap_len,
>>>>>>> BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
>>>>>>>      bpf_redirect(skb->ifindex, BPF_F_INGRESS)
>>>>>>>
>>>>>>> It seems like some checksums of the inner headers are not validated in
>>>>>>> this case.
>>>>>>> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
>>>>>>> network stack and elicits a SYN ACK.
>>>>>>>
>>>>>>> Is this known but undocumented behaviour or a bug? In either case, is
>>>>>>> there a work
>>>>>>> around I'm not aware of?
>>>>>>
>>>>>> I thought inner and outer csums are covered by different flags and the driver
>>>>>> is supposed to set the right one depending on the level of in-hw checking it did.
>>>>>
>>>>> I've figured out what the problem is. We receive the following packet from
>>>>> the driver:
>>>>>
>>>>>        | ETH | IP | UDP | GUE | IP | TCP |
>>>>>        skb->ip_summed == CHECKSUM_UNNECESSARY
>>>>>
>>>>> ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
>>>>> checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
>>>>> and get the following:
>>>>>
>>>>>        | ETH | IP | TCP |
>>>>>        skb->ip_summed == CHECKSUM_UNNECESSARY
>>>>>
>>>>> Note that ip_summed is still CHECKSUM_UNNECESSARY. After
>>>>> bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
>>>>> skb_checksum_init is turned into a no-op due to
>>>>> CHECKSUM_UNNECESSARY.
>>>>>
>>>>> I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
>>>>> accordingly. Unfortunately I don't understand how checksums work
>>>>> sufficiently. Daniel, it seems like you wrote the helper, could you
>>>>> take a look?
>>>>
>>>> Right, so in the skb_adjust_room() case we're not aware of protocol
>>>> specifics. We do handle the csum complete case via skb_postpull_rcsum(),
>>>> but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
>>>> skb->csum_level of the original skb prior to skb_adjust_room() call
>>>> might have been 0 (that is, covering UDP)? So if we'd add the possibility
>>>> to __skb_decr_checksum_unnecessary() via flag, then it would become
>>>> skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
>>>> same for the reverse case. Below is a quick hack (compile tested-only);
>>>> would this resolve your case ...
>>>
>>> Thanks for the patch, it indeed fixes our problem! I spent some more time
>>> trying to understand the checksum offload stuff, here is where I am:
>>>
>>> On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
>>> everything works by default since the rest of the stack does checksumming in
>>> software.
>>>
>>> On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
>>> will adjust for the data that is being removed from the skb. The rest of the
>>> stack will use the correct value, all is well.
>>>
>>> However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
>>> the API of skb_adjust_room doesn't tell us whether the user intends to
>>> remove headers or data, and how that will influence csum_level.
>>>   From my POV, skb_adjust_room currently does the wrong thing.
>>> I think we need to fix skb_adjust_room to do the right thing by default,
>>> rather than extending the API. We spent a lot of time on tracking this down,
>>> so hopefully we can spare others the pain.
>>>
>>> As Jakub alludes to, we don't know when and how often to call
>>> __skb_decr_checksum_unnecessary so we should just
>>> unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
>>> CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
>>> enough to land as a fix via the bpf tree (which is important for our
>>> production kernel). As a follow up we could add the inverse of the flags you
>>> propose via bpf-next.
>>>
>>> What do you think?
>>
>> My concern with unconditionally downgrading a packet to CHECKSUM_NONE would
>> basically trash performance if we have to fallback to sw in fast-path, these
>> helpers are also used in our LB case for DSR, for example.
> 
> Our setup also uses DSR, so I wonder how you manage to avoid this
> checksum issue.
> Why is Cilium not affected by this bug as well? You never pop headers?

We have different modes in our LB on how to apply DSR: pure DSR and hybrid. In
pure DSR, DSR is used for TCP and UDP, and in hybrid we use DSR for TCP and SNAT
for UDP (under the assumption that the main workload is on TCP anyway). For the
proto under DSR we basically use ctx_adjust_room(ctx, 8, BPF_ADJ_ROOM_NET, 0)
for IPv4 and similar ctx_adjust_room() for IPv6 (just of different size). Meaning
we push/pop an IP option for these cases w/ svc IP/port (for TCP under DSR only
in the SYN, but not subsequent packets). Now in the example of CHECKSUM_UNNECESSARY
("skb->csum_level indicates the number of consecutive checksums found in the packet
minus one that have been verified as CHECKSUM_UNNECESSARY. For instance if a device
receives an IPv6->UDP->GRE->IPv4->TCP packet and a device is able to verify the
checksums for UDP (possibly zero), GRE (checksum flag is set) and TCP, skb->csum_level
would be set to two") the IP hdr does not account for it, which might also explain
why we haven't seen it on our side so far.
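
To make the csum_level bookkeeping concrete, here is a small userspace model.
It is only a sketch: struct model_skb and both function names are made up for
illustration, and the decrement logic follows my reading of
__skb_decr_checksum_unnecessary(), it is not the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirror the kernel's ip_summed constants. */
enum { CHECKSUM_NONE = 0, CHECKSUM_UNNECESSARY = 1 };

/* Hypothetical stand-in for the two sk_buff fields that matter here. */
struct model_skb {
	uint8_t ip_summed;
	uint8_t csum_level;
};

/* Record that hardware verified n_verified (>= 1) consecutive checksums. */
void model_set_verified(struct model_skb *skb, unsigned int n_verified)
{
	assert(skb != NULL && n_verified >= 1);
	skb->ip_summed = CHECKSUM_UNNECESSARY;
	skb->csum_level = (uint8_t)(n_verified - 1); /* count minus one */
}

/* Consume one verified checksum, e.g. when one checksummed layer is gone. */
void model_decr_checksum_unnecessary(struct model_skb *skb)
{
	if (skb->ip_summed != CHECKSUM_UNNECESSARY)
		return;
	if (skb->csum_level == 0)
		skb->ip_summed = CHECKSUM_NONE; /* nothing verified is left */
	else
		skb->csum_level--;
}
```

For the IPv6->UDP->GRE->IPv4->TCP example the model starts at csum_level 2 and
needs three decrements to reach CHECKSUM_NONE. With csum_level 0 covering only
an outer UDP header, a single decrement already ends at CHECKSUM_NONE. The IP
header checksum never enters the count, so pushing or popping an IP option
leaves it unchanged.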

> FWIW, currently the only work around I know is to disable rx
> checksumming for ALL
> inbound traffic via ethtool -K bla rx off. Which in theory trashes
> performance for all
> RX traffic, not just the one going to the load balancer. I applied this in a
> couple of production data centers, and did not see an increase in
> softirq, which is
> where I assume this would show up. My guess is that this is because our
> RX << TX. Unconditionally setting CHECKSUM_NONE would be even less visible,
> since we could turn rx checksumming back on in the general case.
> Is there a way for you to quantify what the impact for Cilium would be?
> 
>> I agree that it
>> sucks to expose these implementation details though. So eventually we'd end
>> up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
>> a complex to use helper with all its flags where you end up looking into the
>> implementation detail to understand what it is really doing. I'm not sure if
>> we make anything worse, but I do see your concern. :/
> 
> Having those flags seems fine to me, you're right that it's already complicated.
> My concern is really with the current state of the helper however: I think that
> as it exists right now it's buggy wrt checksum offload, and we need a
> backportable fix.
> 
> Option 1: always downgrade UNNECESSARY to NONE
> - Easiest to back port
> - The helper is safe by default
> - Performance impact unclear
> - No escape hatch for Cilium
> 
> Option 2: add a flag to force CHECKSUM_NONE
> - New UAPI, can this be backported?
> - The helper isn't safe by default, needs documentation
> - Escape hatch for Cilium
> 
> Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
> - New UAPI, can this be backported?
> - The helper is safe by default
> - Escape hatch for Cilium (though you'd need to detect availability of the
>    flag somehow)

This seems most reasonable to me; I can try and cook a proposal for tomorrow as
a potential fix. Even if we add a flag, this is still backportable to stable, as
long as the overall patch doesn't get too complex and the backport itself stays
compatible uapi-wise to latest kernels (we've done that before). I happen to
have two ixgbe NICs on some of my test machines which seem to be setting
CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.

> I guess there is also Option 0, add a flag but don't backport, which to
> me is admitting defeat. If we were to do that we'd at least
> want to document the problem. Thinking about how to do that already
> makes my head spin:
> 
> - If you have a NIC that does CHECKSUM_UNNECESSARY
> - And you pop network headers
> - You will run into this bug
> - To fix it you have to disable rx checksum offload
> 
> How do users figure out whether a NIC does UNNECESSARY vs. COMPLETE
> vs. NONE? I have the luxury of only caring about two different drivers, but
> what if I ship BPF (like Cilium does)? Ultimately vendors would either have
> buggy programs, or would tell people to unconditionally disable rx
> checksumming I believe.
> 
>  From my POV, I'd prefer option 1 or 3, since I strongly believe that the
> helper should be safe by default, and that the user can assert invariants
> via flags to get better performance. I could live with option 2 as well since
> I just have to care about a single kernel version.
> 
>> (We do have bpf_csum_update()
>> helper as well. I wonder whether we should split such control into a different
>> helper.)
> 
> I'm not sure what you mean, maybe you can elaborate a little?

Meaning, a different helper to control these settings, e.g. bpf_csum_adjust(skb,
{inc/dec/..}) which would then fall into option 2 category though.

Thanks,
Daniel


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-12 21:25             ` Daniel Borkmann
@ 2020-05-13 14:14               ` Lorenz Bauer
  2020-06-01 17:48                 ` Alan Maguire
  0 siblings, 1 reply; 17+ messages in thread
From: Lorenz Bauer @ 2020-05-13 14:14 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, bpf, kernel-team, Jakub Kicinski

On Tue, 12 May 2020 at 22:25, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 5/11/20 11:29 AM, Lorenz Bauer wrote:
> > On Thu, 7 May 2020 at 17:43, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 5/7/20 5:54 PM, Lorenz Bauer wrote:
> >>> On Wed, 6 May 2020 at 22:55, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>> On 5/6/20 6:24 PM, Lorenz Bauer wrote:
> >>>>> On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
> >>>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >>>>>>>
> >>>>>>> In our TC classifier cls_redirect [1], we use the following sequence
> >>>>>>> of helper calls to
> >>>>>>> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
> >>>>>>>
> >>>>>>>      skb_adjust_room(skb, -encap_len,
> >>>>>>> BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
> >>>>>>>      bpf_redirect(skb->ifindex, BPF_F_INGRESS)
> >>>>>>>
> >>>>>>> It seems like some checksums of the inner headers are not validated in
> >>>>>>> this case.
> >>>>>>> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
> >>>>>>> network stack and elicits a SYN ACK.
> >>>>>>>
> >>>>>>> Is this known but undocumented behaviour or a bug? In either case, is
> >>>>>>> there a work
> >>>>>>> around I'm not aware of?
> >>>>>>
> >>>>>> I thought inner and outer csums are covered by different flags and the driver
> >>>>>> is supposed to set the right one depending on the level of in-hw checking it did.
> >>>>>
> >>>>> I've figured out what the problem is. We receive the following packet from
> >>>>> the driver:
> >>>>>
> >>>>>        | ETH | IP | UDP | GUE | IP | TCP |
> >>>>>        skb->ip_summed == CHECKSUM_UNNECESSARY
> >>>>>
> >>>>> ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
> >>>>> checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
> >>>>> and get the following:
> >>>>>
> >>>>>        | ETH | IP | TCP |
> >>>>>        skb->ip_summed == CHECKSUM_UNNECESSARY
> >>>>>
> >>>>> Note that ip_summed is still CHECKSUM_UNNECESSARY. After
> >>>>> bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
> >>>>> skb_checksum_init is turned into a no-op due to
> >>>>> CHECKSUM_UNNECESSARY.
> >>>>>
> >>>>> I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
> >>>>> accordingly. Unfortunately I don't understand how checksums work
> >>>>> sufficiently. Daniel, it seems like you wrote the helper, could you
> >>>>> take a look?
> >>>>
> >>>> Right, so in the skb_adjust_room() case we're not aware of protocol
> >>>> specifics. We do handle the csum complete case via skb_postpull_rcsum(),
> >>>> but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
> >>>> skb->csum_level of the original skb prior to skb_adjust_room() call
> >>>> might have been 0 (that is, covering UDP)? So if we'd add the possibility
> >>>> to __skb_decr_checksum_unnecessary() via flag, then it would become
> >>>> skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
> >>>> same for the reverse case. Below is a quick hack (compile tested-only);
> >>>> would this resolve your case ...
> >>>
> >>> Thanks for the patch, it indeed fixes our problem! I spent some more time
> >>> trying to understand the checksum offload stuff, here is where I am:
> >>>
> >>> On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
> >>> everything works by default since the rest of the stack does checksumming in
> >>> software.
> >>>
> >>> On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> >>> will adjust for the data that is being removed from the skb. The rest of the
> >>> stack will use the correct value, all is well.
> >>>
> >>> However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> >>> the API of skb_adjust_room doesn't tell us whether the user intends to
> >>> remove headers or data, and how that will influence csum_level.
> >>>   From my POV, skb_adjust_room currently does the wrong thing.
> >>> I think we need to fix skb_adjust_room to do the right thing by default,
> >>> rather than extending the API. We spent a lot of time on tracking this down,
> >>> so hopefully we can spare others the pain.
> >>>
> >>> As Jakub alludes to, we don't know when and how often to call
> >>> __skb_decr_checksum_unnecessary so we should just
> >>> unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> >>> CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> >>> enough to land as a fix via the bpf tree (which is important for our
> >>> production kernel). As a follow up we could add the inverse of the flags you
> >>> propose via bpf-next.
> >>>
> >>> What do you think?
> >>
> >> My concern with unconditionally downgrading a packet to CHECKSUM_NONE would
> >> basically trash performance if we have to fallback to sw in fast-path, these
> >> helpers are also used in our LB case for DSR, for example.
> >
> > Our setup also uses DSR, so I wonder how you manage to avoid this
> > checksum issue.
> > Why is Cilium not affected by this bug as well? You never pop headers?
>
> We have different modes in our LB on how to apply DSR: pure DSR and hybrid. In
> pure DSR, DSR is used for TCP and UDP, and in hybrid we use DSR for TCP and SNAT
> for UDP (under the assumption that the main workload is on TCP anyway). For the
> proto under DSR we basically use ctx_adjust_room(ctx, 8, BPF_ADJ_ROOM_NET, 0)
> for IPv4 and similar ctx_adjust_room() for IPv6 (just of different size). Meaning
> we push/pop an IP option for these cases w/ svc IP/port (for TCP under DSR only
> in the SYN, but not subsequent packets). Now in the example of CHECKSUM_UNNECESSARY
> ("skb->csum_level indicates the number of consecutive checksums found in the packet
> minus one that have been verified as CHECKSUM_UNNECESSARY. For instance if a device
> receives an IPv6->UDP->GRE->IPv4->TCP packet and a device is able to verify the
> checksums for UDP (possibly zero), GRE (checksum flag is set) and TCP, skb->csum_level
> would be set to two") the IP hdr does not account for it, which might also explain
> why we haven't seen it on our side so far.

I can see two explanations for this. First, IP receive processing ignores RX
checksum offload from what I can tell: it always calls ip_fast_csum (from
ip_rcv_core). Since the TCP pseudo header doesn't include options, things work.
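
A rough userspace sketch of that first point, assuming the plain RFC 1071
algorithm rather than the kernel's optimised ip_fast_csum (all function names
below are made up for illustration): the IPv4 header checksum covers the
options, while the pseudo header that feeds into the TCP/UDP checksum is built
only from the addresses, the protocol and the L4 length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 ones'-complement sum of big-endian 16-bit words, folded. */
uint16_t csum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8; /* pad odd byte with zero */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Header length in bytes, taken from the IHL field; includes any options. */
size_t ipv4_hlen(const uint8_t *hdr)
{
	return (size_t)(hdr[0] & 0x0f) * 4;
}

/* Recompute and store the header checksum (bytes 10 and 11). */
void ipv4_csum_set(uint8_t *hdr)
{
	uint16_t csum;

	hdr[10] = hdr[11] = 0;
	csum = (uint16_t)~csum16(hdr, ipv4_hlen(hdr));
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xff);
}

/* A valid header sums to 0xffff when the checksum field is included. */
int ipv4_csum_ok(const uint8_t *hdr)
{
	return csum16(hdr, ipv4_hlen(hdr)) == 0xffff;
}

/* Pseudo header sum for L4: addresses, protocol and L4 length only. */
uint16_t pseudo_hdr_sum(const uint8_t *hdr, uint16_t l4_len)
{
	uint8_t ph[12];
	size_t i;

	for (i = 0; i < 8; i++)
		ph[i] = hdr[12 + i];	/* source and destination address */
	ph[8] = 0;
	ph[9] = hdr[9];			/* protocol */
	ph[10] = (uint8_t)(l4_len >> 8);
	ph[11] = (uint8_t)(l4_len & 0xff);
	return csum16(ph, sizeof(ph));
}
```

Corrupting an option byte makes ipv4_csum_ok() fail, but pseudo_hdr_sum() stays
the same with or without options, so adding or removing an IP option does not
invalidate an L4 checksum the hardware already verified.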

Second, I think you never modify the packet in the RX path if you let it
continue up the stack. Looking at tail_nodeport_ipv4_dsr, I think that even
unconditionally dropping to CHECKSUM_NONE will not make a difference to you:
you add the header, do a fib_lookup and then redirect into the device egress.
Checksum offloading can't even kick in at this point. Maybe option 1 is
feasible after all?

>
> > FWIW, currently the only work around I know is to disable rx
> > checksumming for ALL
> > inbound traffic via ethtool -K bla rx off. Which in theory trashes
> > performance for all
> > RX traffic, not just the one going to the load balancer. I applied this in a
> > couple of production data centers, and did not see an increase in
> > softirq, which is
> > where I assume this would show up. My guess is that this is because our
> > RX << TX. Unconditionally setting CHECKSUM_NONE would be even less visible,
> > since we could turn rx checksumming back on in the general case.
> > Is there a way for you to quantify what the impact for Cilium would be?
> >
> >> I agree that it
> >> sucks to expose these implementation details though. So eventually we'd end
> >> up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
> >> a complex to use helper with all its flags where you end up looking into the
> >> implementation detail to understand what it is really doing. I'm not sure if
> >> we make anything worse, but I do see your concern. :/
> >
> > Having those flags seems fine to me, you're right that it's already complicated.
> > My concern is really with the current state of the helper however: I think that
> > as it exists right now it's buggy wrt checksum offload, and we need a
> > backportable fix.
> >
> > Option 1: always downgrade UNNECESSARY to NONE
> > - Easiest to back port
> > - The helper is safe by default
> > - Performance impact unclear
> > - No escape hatch for Cilium
> >
> > Option 2: add a flag to force CHECKSUM_NONE
> > - New UAPI, can this be backported?
> > - The helper isn't safe by default, needs documentation
> > - Escape hatch for Cilium
> >
> > Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
> > - New UAPI, can this be backported?
> > - The helper is safe by default
> > - Escape hatch for Cilium (though you'd need to detect availability of the
> >    flag somehow)
>
> This seems most reasonable to me; I can try and cook a proposal for tomorrow as
> potential fix. Even if we add a flag, this is still backportable to stable (as
> long as the overall patch doesn't get too complex and the backport itself stays
> compatible uapi-wise to latest kernels. We've done that before.). I happen to
> have two ixgbe NICs on some of my test machines which seem to be setting the
> CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.

Great! I'm happy to test, of course.

>
> > I guess there is also Option 0, add a flag but don't backport, which to
> > me is admitting defeat. If we were to do that we'd at least
> > want to document the problem. Thinking about how to do that already
> > makes my head spin:
> >
> > - If you have a NIC that does CHECKSUM_UNNECESSARY
> > - And you pop network headers
> > - You will run into this bug
> > - To fix it you have to disable rx checksum offload
> >
> > How do users figure out whether a NIC does UNNECESSARY vs. COMPLETE
> > vs. NONE? I have the luxury of only caring about two different drivers, but
> > what if I ship BPF (like Cilium does)? Ultimately vendors would either have
> > buggy programs, or would tell people to unconditionally disable rx
> > checksumming I believe.
> >
> >  From my POV, I'd prefer option 1 or 3, since I strongly believe that the
> > helper should be safe by default, and that the user can assert invariants
> > via flags to get better performance. I could live with option 2 as well since
> > I just have to care about a single kernel version.
> >
> >> (We do have bpf_csum_update()
> >> helper as well. I wonder whether we should split such control into a different
> >> helper.)
> >
> > I'm not sure what you mean, maybe you can elaborate a little?
>
> Meaning, a different helper to control these settings, e.g. bpf_csum_adjust(skb,
> {inc/dec/..}) which would then fall into option 2 category though.

Ah, okay. I think option 3 and this aren't mutually exclusive: backport an
opt-out flag and add a new helper to bpf-next. Maybe we can even come up with
something that hides the checksum mess.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com


* Re: Checksum behaviour of bpf_redirected packets
  2020-05-13 14:14               ` Lorenz Bauer
@ 2020-06-01 17:48                 ` Alan Maguire
  2020-06-01 20:13                   ` Daniel Borkmann
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Maguire @ 2020-06-01 17:48 UTC (permalink / raw)
  To: Lorenz Bauer
  Cc: Daniel Borkmann, Alexei Starovoitov, bpf, kernel-team, Jakub Kicinski



On Wed, 13 May 2020, Lorenz Bauer wrote:

> > > Option 1: always downgrade UNNECESSARY to NONE
> > > - Easiest to back port
> > > - The helper is safe by default
> > > - Performance impact unclear
> > > - No escape hatch for Cilium
> > >
> > > Option 2: add a flag to force CHECKSUM_NONE
> > > - New UAPI, can this be backported?
> > > - The helper isn't safe by default, needs documentation
> > > - Escape hatch for Cilium
> > >
> > > Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
> > > - New UAPI, can this be backported?
> > > - The helper is safe by default
> > > - Escape hatch for Cilium (though you'd need to detect availability of the
> > >    flag somehow)
> >
> > This seems most reasonable to me; I can try and cook a proposal for tomorrow as
> > potential fix. Even if we add a flag, this is still backportable to stable (as
> > long as the overall patch doesn't get too complex and the backport itself stays
> > compatible uapi-wise to latest kernels. We've done that before.). I happen to
> > have two ixgbe NICs on some of my test machines which seem to be setting the
> > CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.
> 
> Great! I'm happy to test, of course.
> 

I had a go at implementing option 3 as a few colleagues ran into this
problem. They confirmed the fix below resolved the issue. Daniel, is
this roughly what you had in mind? I can submit a patch for the bpf
tree if that's acceptable with the new flag. Do we need a few
tests though?

From 7e0b0c78530f3800e5c40aa1fe87e5db82c5fb59 Mon Sep 17 00:00:00 2001
From: Alan Maguire <alan.maguire@oracle.com>
Date: Mon, 1 Jun 2020 13:10:37 +0200
Subject: [PATCH bpf-next 1/2] bpf: fix bpf_skb_adjust_room decap for
 CHECKSUM_UNNECESSARY skbs

When hardware verifies checksums for some of the headers it
will set CHECKSUM_UNNECESSARY, and csum_level indicates the
number of consecutive checksums verified, minus one.  If we
de-encapsulate data, however, these values become invalid since
we likely just removed the checksum-validated headers.  The best
option in such cases is to revert to CHECKSUM_NONE so that all
checksums are then checked in software.  Otherwise the stack
may skip checks that were never actually performed on the
remaining headers.

Other checksum states are handled via skb_postpull_rcsum().

Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
---
 include/uapi/linux/bpf.h |  7 +++++++
 net/core/filter.c        | 15 ++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 974ca6e..03ab70c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1646,6 +1646,12 @@ struct bpf_stack_build_id {
  *		* **BPF_F_ADJ_ROOM_FIXED_GSO**: Do not adjust gso_size.
  *		  Adjusting mss in this way is not allowed for datagrams.
  *
+ *		* **BPF_F_ADJ_ROOM_SKIP_CSUM_RESET**: When shrinking skbs
+ *		  marked CHECKSUM_UNNECESSARY, avoid default behavior which
+ *		  resets to CHECKSUM_NONE.  In most cases, this flag will
+ *		  not be needed as the default behavior ensures checksums
+ *		  will be verified in software.
+ *
  *		* **BPF_F_ADJ_ROOM_ENCAP_L3_IPV4**,
  *		  **BPF_F_ADJ_ROOM_ENCAP_L3_IPV6**:
  *		  Any new space is reserved to hold a tunnel header.
@@ -3431,6 +3437,7 @@ enum {
 	BPF_F_ADJ_ROOM_ENCAP_L3_IPV6	= (1ULL << 2),
 	BPF_F_ADJ_ROOM_ENCAP_L4_GRE	= (1ULL << 3),
 	BPF_F_ADJ_ROOM_ENCAP_L4_UDP	= (1ULL << 4),
+	BPF_F_ADJ_ROOM_SKIP_CSUM_RESET	= (1ULL << 5),
 };
 
 enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index a6fc234..47c8a31 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3113,7 +3113,8 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 {
 	int ret;
 
-	if (flags & ~BPF_F_ADJ_ROOM_FIXED_GSO)
+	if (flags & ~(BPF_F_ADJ_ROOM_FIXED_GSO |
+		      BPF_F_ADJ_ROOM_SKIP_CSUM_RESET))
 		return -EINVAL;
 
 	if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) {
@@ -3143,6 +3144,18 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 		shinfo->gso_segs = 0;
 	}
 
+	/*
+	 * Decap should invalidate checksum checks done by hardware.
+	 * skb_csum_unnecessary() is not used as the other conditions
+	 * in that predicate do not need to be considered here; we only
+	 * wish to downgrade CHECKSUM_UNNECESSARY to CHECKSUM_NONE.
+	 */
+	if (unlikely(!(flags & BPF_F_ADJ_ROOM_SKIP_CSUM_RESET) &&
+		     skb->ip_summed == CHECKSUM_UNNECESSARY)) {
+		skb->ip_summed = CHECKSUM_NONE;
+		skb->csum_level = 0;
+	}
+
 	return 0;
 }
 
-- 
1.8.3.1
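
The checksum handling in the patch above can be modelled in userspace as
follows. This is only a minimal sketch, not kernel code: struct model_skb
and model_shrink_csum() are hypothetical names made up for illustration,
and the constants are local stand-ins. The two flag values follow the
patch; the CHECKSUM_* values are assumed to match the kernel's.

```c
#include <errno.h>
#include <stdint.h>

/* Local stand-ins for the kernel's ip_summed states (assumed values). */
#define CHECKSUM_NONE		0
#define CHECKSUM_UNNECESSARY	1
#define CHECKSUM_COMPLETE	2

/* Flag bits as they appear in the patch's uapi enum. */
#define BPF_F_ADJ_ROOM_FIXED_GSO	(1ULL << 0)
#define BPF_F_ADJ_ROOM_SKIP_CSUM_RESET	(1ULL << 5)

/* Hypothetical, stripped-down skb holding only the checksum state. */
struct model_skb {
	uint8_t ip_summed;
	uint8_t csum_level;
};

/*
 * Models only the flag validation and the checksum downgrade that the
 * patch adds to bpf_skb_net_shrink(); GSO handling and the actual
 * header removal are left out.
 */
static int model_shrink_csum(struct model_skb *skb, uint64_t flags)
{
	/* Reject any flag the shrink path does not know about. */
	if (flags & ~(BPF_F_ADJ_ROOM_FIXED_GSO |
		      BPF_F_ADJ_ROOM_SKIP_CSUM_RESET))
		return -EINVAL;

	/*
	 * Hardware validation covered headers that are now gone, so
	 * fall back to software verification unless the caller opted out.
	 */
	if (!(flags & BPF_F_ADJ_ROOM_SKIP_CSUM_RESET) &&
	    skb->ip_summed == CHECKSUM_UNNECESSARY) {
		skb->ip_summed = CHECKSUM_NONE;
		skb->csum_level = 0;
	}

	return 0;
}
```

With this model, an skb marked CHECKSUM_UNNECESSARY is downgraded to
CHECKSUM_NONE by default and left alone when
BPF_F_ADJ_ROOM_SKIP_CSUM_RESET is passed. Other states such as
CHECKSUM_COMPLETE are not touched here.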


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: Checksum behaviour of bpf_redirected packets
  2020-06-01 17:48                 ` Alan Maguire
@ 2020-06-01 20:13                   ` Daniel Borkmann
  2020-06-01 21:25                     ` Alan Maguire
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Borkmann @ 2020-06-01 20:13 UTC (permalink / raw)
  To: Alan Maguire, Lorenz Bauer
  Cc: Alexei Starovoitov, bpf, kernel-team, Jakub Kicinski

On 6/1/20 7:48 PM, Alan Maguire wrote:
> On Wed, 13 May 2020, Lorenz Bauer wrote:
> 
>>>> Option 1: always downgrade UNNECESSARY to NONE
>>>> - Easiest to back port
>>>> - The helper is safe by default
>>>> - Performance impact unclear
>>>> - No escape hatch for Cilium
>>>>
>>>> Option 2: add a flag to force CHECKSUM_NONE
>>>> - New UAPI, can this be backported?
>>>> - The helper isn't safe by default, needs documentation
>>>> - Escape hatch for Cilium
>>>>
>>>> Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
>>>> - New UAPI, can this be backported?
>>>> - The helper is safe by default
>>>> - Escape hatch for Cilium (though you'd need to detect availability of the
>>>>     flag somehow)
>>>
>>> This seems most reasonable to me; I can try and cook a proposal for tomorrow as
>>> potential fix. Even if we add a flag, this is still backportable to stable (as
>>> long as the overall patch doesn't get too complex and the backport itself stays
>>> compatible uapi-wise to latest kernels. We've done that before.). I happen to
>>> have two ixgbe NICs on some of my test machines which seem to be setting the
>>> CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.
>>
>> Great! I'm happy to test, of course.
> 
> I had a go at implementing option 3 as a few colleagues ran into this
> problem. They confirmed the fix below resolved the issue.  Daniel is
> this  roughly what you had in mind? I can submit a patch for the bpf
> tree if that's acceptable with the new flag. Do we need a few
> tests though?

Coded this [0] up last week which Lorenz gave a spin as well. Originally wanted to
get it out Friday night, but due to internal release stuff it got too late Fri night
and didn't want to rush it at 3am anymore, so the series as fixes is going out tomorrow
morning [today was public holiday in CH over here].

Thanks,
Daniel

   [0] https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/bpf.git/log/?h=pr/adjust-csum

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Checksum behaviour of bpf_redirected packets
  2020-06-01 20:13                   ` Daniel Borkmann
@ 2020-06-01 21:25                     ` Alan Maguire
  2020-06-02 10:13                       ` Lorenz Bauer
  0 siblings, 1 reply; 17+ messages in thread
From: Alan Maguire @ 2020-06-01 21:25 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alan Maguire, Lorenz Bauer, Alexei Starovoitov, bpf, kernel-team,
	Jakub Kicinski

On Mon, 1 Jun 2020, Daniel Borkmann wrote:

> On 6/1/20 7:48 PM, Alan Maguire wrote:
> > On Wed, 13 May 2020, Lorenz Bauer wrote:
> > 
> >>>> Option 1: always downgrade UNNECESSARY to NONE
> >>>> - Easiest to back port
> >>>> - The helper is safe by default
> >>>> - Performance impact unclear
> >>>> - No escape hatch for Cilium
> >>>>
> >>>> Option 2: add a flag to force CHECKSUM_NONE
> >>>> - New UAPI, can this be backported?
> >>>> - The helper isn't safe by default, needs documentation
> >>>> - Escape hatch for Cilium
> >>>>
> >>>> Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
> >>>> - New UAPI, can this be backported?
> >>>> - The helper is safe by default
> >>>> - Escape hatch for Cilium (though you'd need to detect availability of
> >>>> the
> >>>>     flag somehow)
> >>>
> >>> This seems most reasonable to me; I can try and cook a proposal for
> >>> tomorrow as
> >>> potential fix. Even if we add a flag, this is still backportable to stable
> >>> (as
> >>> long as the overall patch doesn't get too complex and the backport itself
> >>> stays
> >>> compatible uapi-wise to latest kernels. We've done that before.). I happen
> >>> to
> >>> have two ixgbe NICs on some of my test machines which seem to be setting
> >>> the
> >>> CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.
> >>
> >> Great! I'm happy to test, of course.
> > 
> > I had a go at implementing option 3 as a few colleagues ran into this
> > problem. They confirmed the fix below resolved the issue.  Daniel is
> > this  roughly what you had in mind? I can submit a patch for the bpf
> > tree if that's acceptable with the new flag. Do we need a few
> > tests though?
> 
> Coded this [0] up last week which Lorenz gave a spin as well. Originally
> wanted to
> get it out Friday night, but due to internal release stuff it got too late Fri
> night
> and didn't want to rush it at 3am anymore, so the series as fixes is going out
> tomorrow
> morning [today was public holiday in CH over here].
>

Looks great! Although I've only seen this issue arise
for cases where csum_level == 0, should we also
add "skb->csum_level = 0;" when we reset the
ip_summed value?

Feel free to add a

Reviewed-by: Alan Maguire <alan.maguire@oracle.com>

...for the series if needed. Thanks again!

Alan

> Thanks,
> Daniel
> 
>   [0]
> https://git.kernel.org/pub/scm/linux/kernel/git/dborkman/bpf.git/log/?h=pr/adjust-csum
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Checksum behaviour of bpf_redirected packets
  2020-06-01 21:25                     ` Alan Maguire
@ 2020-06-02 10:13                       ` Lorenz Bauer
  2020-06-02 15:01                         ` Daniel Borkmann
  0 siblings, 1 reply; 17+ messages in thread
From: Lorenz Bauer @ 2020-06-02 10:13 UTC (permalink / raw)
  To: Alan Maguire
  Cc: Daniel Borkmann, Alexei Starovoitov, bpf, kernel-team, Jakub Kicinski

On Mon, 1 Jun 2020 at 22:25, Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On Mon, 1 Jun 2020, Daniel Borkmann wrote:
>
> > On 6/1/20 7:48 PM, Alan Maguire wrote:
> > > On Wed, 13 May 2020, Lorenz Bauer wrote:
> > >
> > >>>> Option 1: always downgrade UNNECESSARY to NONE
> > >>>> - Easiest to back port
> > >>>> - The helper is safe by default
> > >>>> - Performance impact unclear
> > >>>> - No escape hatch for Cilium
> > >>>>
> > >>>> Option 2: add a flag to force CHECKSUM_NONE
> > >>>> - New UAPI, can this be backported?
> > >>>> - The helper isn't safe by default, needs documentation
> > >>>> - Escape hatch for Cilium
> > >>>>
> > >>>> Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
> > >>>> - New UAPI, can this be backported?
> > >>>> - The helper is safe by default
> > >>>> - Escape hatch for Cilium (though you'd need to detect availability of
> > >>>> the
> > >>>>     flag somehow)
> > >>>
> > >>> This seems most reasonable to me; I can try and cook a proposal for
> > >>> tomorrow as
> > >>> potential fix. Even if we add a flag, this is still backportable to stable
> > >>> (as
> > >>> long as the overall patch doesn't get too complex and the backport itself
> > >>> stays
> > >>> compatible uapi-wise to latest kernels. We've done that before.). I happen
> > >>> to
> > >>> have two ixgbe NICs on some of my test machines which seem to be setting
> > >>> the
> > >>> CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.
> > >>
> > >> Great! I'm happy to test, of course.
> > >
> > > I had a go at implementing option 3 as a few colleagues ran into this
> > > problem. They confirmed the fix below resolved the issue.  Daniel is
> > > this  roughly what you had in mind? I can submit a patch for the bpf
> > > tree if that's acceptable with the new flag. Do we need a few
> > > tests though?
> >
> > Coded this [0] up last week which Lorenz gave a spin as well. Originally
> > wanted to
> > get it out Friday night, but due to internal release stuff it got too late Fri
> > night
> > and didn't want to rush it at 3am anymore, so the series as fixes is going out
> > tomorrow
> > morning [today was public holiday in CH over here].
> >
>
> Looks great! Although I've only seen this issue arise
> for cases where csum_level == 0, should we also
> add "skb->csum_level = 0;" when we reset the
> ip_summed value?

FWIW I had the same reaction. Maybe it's worth adding after all, Daniel?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Checksum behaviour of bpf_redirected packets
  2020-06-02 10:13                       ` Lorenz Bauer
@ 2020-06-02 15:01                         ` Daniel Borkmann
  0 siblings, 0 replies; 17+ messages in thread
From: Daniel Borkmann @ 2020-06-02 15:01 UTC (permalink / raw)
  To: Lorenz Bauer, Alan Maguire
  Cc: Alexei Starovoitov, bpf, kernel-team, Jakub Kicinski

On 6/2/20 12:13 PM, Lorenz Bauer wrote:
> On Mon, 1 Jun 2020 at 22:25, Alan Maguire <alan.maguire@oracle.com> wrote:
>> On Mon, 1 Jun 2020, Daniel Borkmann wrote:
>>> On 6/1/20 7:48 PM, Alan Maguire wrote:
>>>> On Wed, 13 May 2020, Lorenz Bauer wrote:
>>>>
>>>>>>> Option 1: always downgrade UNNECESSARY to NONE
>>>>>>> - Easiest to back port
>>>>>>> - The helper is safe by default
>>>>>>> - Performance impact unclear
>>>>>>> - No escape hatch for Cilium
>>>>>>>
>>>>>>> Option 2: add a flag to force CHECKSUM_NONE
>>>>>>> - New UAPI, can this be backported?
>>>>>>> - The helper isn't safe by default, needs documentation
>>>>>>> - Escape hatch for Cilium
>>>>>>>
>>>>>>> Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
>>>>>>> - New UAPI, can this be backported?
>>>>>>> - The helper is safe by default
>>>>>>> - Escape hatch for Cilium (though you'd need to detect availability of
>>>>>>> the
>>>>>>>      flag somehow)
>>>>>>
>>>>>> This seems most reasonable to me; I can try and cook a proposal for
>>>>>> tomorrow as
>>>>>> potential fix. Even if we add a flag, this is still backportable to stable
>>>>>> (as
>>>>>> long as the overall patch doesn't get too complex and the backport itself
>>>>>> stays
>>>>>> compatible uapi-wise to latest kernels. We've done that before.). I happen
>>>>>> to
>>>>>> have two ixgbe NICs on some of my test machines which seem to be setting
>>>>>> the
>>>>>> CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.
>>>>>
>>>>> Great! I'm happy to test, of course.
>>>>
>>>> I had a go at implementing option 3 as a few colleagues ran into this
>>>> problem. They confirmed the fix below resolved the issue.  Daniel is
>>>> this  roughly what you had in mind? I can submit a patch for the bpf
>>>> tree if that's acceptable with the new flag. Do we need a few
>>>> tests though?
>>>
>>> Coded this [0] up last week which Lorenz gave a spin as well. Originally
>>> wanted to
>>> get it out Friday night, but due to internal release stuff it got too late Fri
>>> night
>>> and didn't want to rush it at 3am anymore, so the series as fixes is going out
>>> tomorrow
>>> morning [today was public holiday in CH over here].
>>
>> Looks great! Although I've only seen this issue arise
>> for cases where csum_level == 0, should we also
>> add "skb->csum_level = 0;" when we reset the
>> ip_summed value?
> 
> FWIW I had the same reaction. Maybe it's worth adding after all, Daniel?

Although not needed, but yeah, fair enough. I've added a small skb helper for it.
Series is out here now, ptal [0].

Thanks,
Daniel

   [0] https://lore.kernel.org/bpf/cover.1591108731.git.daniel@iogearbox.net/

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-06-02 15:01 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
2020-05-04 16:11 Checksum behaviour of bpf_redirected packets Lorenz Bauer
2020-05-06  1:28 ` Alexei Starovoitov
2020-05-06 16:24   ` Lorenz Bauer
2020-05-06 17:26     ` Jakub Kicinski
2020-05-06 21:55     ` Daniel Borkmann
2020-05-07 15:54       ` Lorenz Bauer
2020-05-07 16:43         ` Daniel Borkmann
2020-05-07 21:25           ` Jakub Kicinski
2020-05-11  9:31             ` Lorenz Bauer
2020-05-11  9:29           ` Lorenz Bauer
2020-05-12 21:25             ` Daniel Borkmann
2020-05-13 14:14               ` Lorenz Bauer
2020-06-01 17:48                 ` Alan Maguire
2020-06-01 20:13                   ` Daniel Borkmann
2020-06-01 21:25                     ` Alan Maguire
2020-06-02 10:13                       ` Lorenz Bauer
2020-06-02 15:01                         ` Daniel Borkmann
