Re: Checksum behaviour of bpf_redirected packets

From: Alan Maguire <alan.maguire@oracle.com>
To: Lorenz Bauer <lmb@cloudflare.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	bpf <bpf@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Jakub Kicinski <kuba@kernel.org>
Subject: Re: Checksum behaviour of bpf_redirected packets
Date: Mon, 1 Jun 2020 18:48:20 +0100 (BST)	[thread overview]
Message-ID: <alpine.LRH.2.21.2006011839430.623@localhost> (raw)
In-Reply-To: <CACAyw9_LPEOvHbmP8UrpwVkwYT57rKWRisai=Z7kbKxOPh5XNQ@mail.gmail.com>



On Wed, 13 May 2020, Lorenz Bauer wrote:

> > > Option 1: always downgrade UNNECESSARY to NONE
> > > - Easiest to back port
> > > - The helper is safe by default
> > > - Performance impact unclear
> > > - No escape hatch for Cilium
> > >
> > > Option 2: add a flag to force CHECKSUM_NONE
> > > - New UAPI, can this be backported?
> > > - The helper isn't safe by default, needs documentation
> > > - Escape hatch for Cilium
> > >
> > > Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
> > > - New UAPI, can this be backported?
> > > - The helper is safe by default
> > > - Escape hatch for Cilium (though you'd need to detect availability of the
> > >    flag somehow)
> >
> > This seems most reasonable to me; I can try and cook a proposal for tomorrow as
> > potential fix. Even if we add a flag, this is still backportable to stable (as
> > long as the overall patch doesn't get too complex and the backport itself stays
> > compatible uapi-wise to latest kernels. We've done that before.). I happen to
> > have two ixgbe NICs on some of my test machines which seem to be setting the
> > CHECKSUM_UNNECESSARY, so I'll run some experiments from over here as well.
> 
> Great! I'm happy to test, of course.
> 

I had a go at implementing option 3 as a few colleagues ran into this 
problem. They confirmed the fix below resolved the issue.  Daniel is
this  roughly what you had in mind? I can submit a patch for the bpf
tree if that's acceptable with the new flag. Do we need a few
tests though?

From 7e0b0c78530f3800e5c40aa1fe87e5db82c5fb59 Mon Sep 17 00:00:00 2001
From: Alan Maguire <alan.maguire@oracle.com>
Date: Mon, 1 Jun 2020 13:10:37 +0200
Subject: [PATCH bpf-next 1/2] bpf: fix bpf_skb_adjust_room decap for
 CHECKSUM_UNNECESSESARY skbs

When hardware verifies checksums for some of the headers it
will set CHECKSUM_UNNECESSESARY and csum_level indicates the
number of consecutive checksums found.  If we de-encapsulate
data however these values become invalid since we likely
just removed the checksum-validated headers.  The best option
in such cases is to revert to CHECKSUM_NONE as all checksums
will then be checked in software.  Otherwise such checks can
be skipped.

Other checksum states are handled via skb_postpull_rcsum().

Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
---
 include/uapi/linux/bpf.h |  7 +++++++
 net/core/filter.c        | 15 ++++++++++++++-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 974ca6e..03ab70c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1646,6 +1646,12 @@ struct bpf_stack_build_id {
  *		* **BPF_F_ADJ_ROOM_FIXED_GSO**: Do not adjust gso_size.
  *		  Adjusting mss in this way is not allowed for datagrams.
  *
+ *		* **BPF_F_ADJ_ROOM_SKIP_CSUM_RESET**: When shrinking skbs
+ *		  marked CHECKSUM_UNNECESSARY, avoid default behavior which
+ *		  resets to CHECKSUM_NONE.  In most cases, this flag will
+ *		  not be needed as the default behavior ensures checksums
+ *		  will be verified in sofware.
+ *
  *		* **BPF_F_ADJ_ROOM_ENCAP_L3_IPV4**,
  *		  **BPF_F_ADJ_ROOM_ENCAP_L3_IPV6**:
  *		  Any new space is reserved to hold a tunnel header.
@@ -3431,6 +3437,7 @@ enum {
 	BPF_F_ADJ_ROOM_ENCAP_L3_IPV6	= (1ULL << 2),
 	BPF_F_ADJ_ROOM_ENCAP_L4_GRE	= (1ULL << 3),
 	BPF_F_ADJ_ROOM_ENCAP_L4_UDP	= (1ULL << 4),
+	BPF_F_ADJ_ROOM_SKIP_CSUM_RESET	= (1ULL << 5),
 };
 
 enum {
diff --git a/net/core/filter.c b/net/core/filter.c
index a6fc234..47c8a31 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3113,7 +3113,8 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 {
 	int ret;
 
-	if (flags & ~BPF_F_ADJ_ROOM_FIXED_GSO)
+	if (flags & ~(BPF_F_ADJ_ROOM_FIXED_GSO |
+		      BPF_F_ADJ_ROOM_SKIP_CSUM_RESET))
 		return -EINVAL;
 
 	if (skb_is_gso(skb) && !skb_is_gso_tcp(skb)) {
@@ -3143,6 +3144,18 @@ static int bpf_skb_net_shrink(struct sk_buff *skb, u32 off, u32 len_diff,
 		shinfo->gso_segs = 0;
 	}
 
+	/*
+	 * Decap should invalidate checksum checks done by hardware.
+	 * skb_csum_unnecessary() is not used as the other conditions
+	 * in that predicate do not need to be considered here; we only
+	 * wish to downgrade CHECKSUM_UNNECESSARY to CHECKSUM_NONE.
+	 */
+	if (unlikely(!(flags & BPF_F_ADJ_ROOM_SKIP_CSUM_RESET) &&
+		     skb->ip_summed == CHECKSUM_UNNECESSARY)) {
+		skb->ip_summed = CHECKSUM_NONE;
+		skb->csum_level = 0;
+	}
+
 	return 0;
 }
 
-- 
1.8.3.1