Re: Checksum behaviour of bpf_redirected packets

From: Jakub Kicinski <kuba@kernel.org>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: Lorenz Bauer <lmb@cloudflare.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	bpf <bpf@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>
Subject: Re: Checksum behaviour of bpf_redirected packets
Date: Thu, 7 May 2020 14:25:18 -0700
Message-ID: <20200507142518.43c22a1b@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
In-Reply-To: <a4830bd4-d998-9e5c-afd5-c5ec5504f1f3@iogearbox.net>

On Thu, 7 May 2020 18:43:47 +0200 Daniel Borkmann wrote:
> > Thanks for the patch, it indeed fixes our problem! I spent some more time
> > trying to understand the checksum offload stuff, here is where I am:
> > 
> > On NICs that don't support hardware offload, ip_summed is CHECKSUM_NONE
> > and everything works by default, since the rest of the stack does
> > checksumming in software.
> > 
> > On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> > will adjust for the data that is being removed from the skb. The rest of the
> > stack will use the correct value, all is well.
> > 
> > However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> > the API of skb_adjust_room doesn't tell us whether the user intends to
> > remove headers or data, and how that will influence csum_level.
> > From my POV, skb_adjust_room currently does the wrong thing.
> > I think we need to fix skb_adjust_room to do the right thing by default,
> > rather than extending the API. We spent a lot of time on tracking this down,
> > so hopefully we can spare others the pain.
> > 
> > As Jakub alludes to, we don't know when and how often to call
> > __skb_decr_checksum_unnecessary so we should just
> > unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> > CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> > enough to land as a fix via the bpf tree (which is important for our
> > production kernel). As a follow up we could add the inverse of the flags you
> > propose via bpf-next.
> > 
> > What do you think?
> 
> My concern is that unconditionally downgrading a packet to CHECKSUM_NONE
> would basically trash performance if we have to fall back to sw in the
> fast path; these helpers are also used in our LB case for DSR, for example.
> I agree that it sucks to expose these implementation details, though. So
> eventually we'd end up with 3 csum flags: inc/dec/reset to none.
> bpf_skb_adjust_room() is already a complex-to-use helper with all its flags,
> where you end up looking into the implementation to understand what it is
> really doing. I'm not sure if we make anything worse, but I do see your
> concern. :/ (We do have the bpf_csum_update() helper as well. I wonder
> whether we should split such control into a different helper.)

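To make the inc/dec/reset options above concrete, here is a minimal
userspace sketch of the bookkeeping, not kernel code. The enum, struct
and function names are made up for illustration. The decrement logic is
meant to mirror what __skb_decr_checksum_unnecessary does, on the
assumption that csum_level counts the validated checksums beyond the
first one:

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's CHECKSUM_* values. */
enum csum_state { CSUM_NONE, CSUM_UNNECESSARY, CSUM_COMPLETE };

/* Hypothetical stand-in for the two sk_buff fields under discussion. */
struct pkt_meta {
	enum csum_state ip_summed;
	uint8_t csum_level;	/* validated checksums beyond the first */
};

/* "dec": consume one validated checksum, e.g. after popping a tunnel
 * header whose checksum the NIC had already verified. */
static void decr_unnecessary(struct pkt_meta *m)
{
	if (m->ip_summed != CSUM_UNNECESSARY)
		return;
	if (m->csum_level == 0)
		m->ip_summed = CSUM_NONE;	/* nothing verified is left */
	else
		m->csum_level--;
}

/* "reset to none": drop every hardware verdict, so the stack has to
 * verify the remaining checksums in software. */
static void reset_to_none(struct pkt_meta *m)
{
	m->ip_summed = CSUM_NONE;
	m->csum_level = 0;
}
```

The difference is that a packet with csum_level 1 keeps its inner
verdict after one decrement, while the reset discards it.
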
Probably stating the obvious, but for decap of UDP tunnels which carry
locally terminated flows we'd probably also want the upgrade from
CHECKSUM_UNNECESSARY to CHECKSUM_COMPLETE, like we do in the kernel
(skb_checksum_try_convert()). Tricky.
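
A rough userspace sketch of why that upgrade works arithmetically, in
case it helps. It assumes even-length buffers and a made-up pseudo-header
sum. All helper names are hypothetical; none of them are kernel APIs.
The idea: once the outer UDP checksum is known to be valid, the
one's-complement sum from the UDP header onwards must equal ~pseudo, and
subtracting the pulled UDP header leaves the sum over the inner packet:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t csum_fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's-complement sum over big-endian 16-bit words (even length only). */
static uint16_t csum_words(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	assert(len % 2 == 0);
	for (size_t i = 0; i < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	return csum_fold32(sum);
}

static uint16_t csum_add16(uint16_t a, uint16_t b)
{
	return csum_fold32((uint32_t)a + b);
}

/* One's-complement subtraction: a - b == a + ~b. */
static uint16_t csum_sub16(uint16_t a, uint16_t b)
{
	return csum_add16(a, (uint16_t)~b);
}

/* 0x0000 and 0xffff both represent zero in one's-complement arithmetic. */
static int csum_same(uint16_t a, uint16_t b)
{
	return (a == 0xffff ? 0 : a) == (b == 0xffff ? 0 : b);
}

/* Model of the upgrade: a valid UDP checksum implies
 * pseudo + sum(UDP header + payload) == 0, so the sum is ~pseudo. */
static uint16_t convert_to_complete(uint16_t pseudo)
{
	return (uint16_t)~pseudo;
}

/* Model of a post-pull adjustment: remove the pulled bytes from the sum. */
static uint16_t postpull(uint16_t csum, const uint8_t *pulled, size_t len)
{
	return csum_sub16(csum, csum_words(pulled, len));
}

/* Returns 1 if the converted and pulled sum matches the inner packet. */
static int decap_demo(void)
{
	const uint16_t pseudo = 0x1234;	/* made-up pseudo-header sum */
	uint8_t pkt[8 + 6] = {
		0x12, 0xb5, 0x12, 0xb5,	/* src/dst port 4789 */
		0x00, 0x0e,		/* UDP length: 14 */
		0x00, 0x00,		/* checksum, filled in below */
		0xde, 0xad, 0xbe, 0xef, 0x00, 0x01 /* stand-in inner packet */
	};
	uint16_t check = (uint16_t)~csum_add16(pseudo,
					       csum_words(pkt, sizeof(pkt)));

	if (check == 0)
		check = 0xffff;	/* UDP sends a computed 0 as 0xffff */
	pkt[6] = (uint8_t)(check >> 8);
	pkt[7] = (uint8_t)(check & 0xff);

	/* What the NIC verified: the total sum is all-ones. */
	if (csum_add16(pseudo, csum_words(pkt, sizeof(pkt))) != 0xffff)
		return 0;

	uint16_t complete = convert_to_complete(pseudo);
	uint16_t inner = postpull(complete, pkt, 8);

	return csum_same(inner, csum_words(pkt + 8, 6));
}
```

decap_demo() never sums the payload to obtain the converted value, which
is the point: the full-packet sum comes for free from the verdict the
hardware already gave us.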