Re: Checksum behaviour of bpf_redirected packets

From: Lorenz Bauer <lmb@cloudflare.com>
To: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	bpf <bpf@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Jakub Kicinski <kuba@kernel.org>
Subject: Re: Checksum behaviour of bpf_redirected packets
Date: Mon, 11 May 2020 10:29:26 +0100
Message-ID: <CACAyw99_GkLrxEj13R1ZJpnw_eWxhZas=72rtR8Pgt_Vq3dbeg@mail.gmail.com>
In-Reply-To: <a4830bd4-d998-9e5c-afd5-c5ec5504f1f3@iogearbox.net>

On Thu, 7 May 2020 at 17:43, Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 5/7/20 5:54 PM, Lorenz Bauer wrote:
> > On Wed, 6 May 2020 at 22:55, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 5/6/20 6:24 PM, Lorenz Bauer wrote:
> >>> On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com> wrote:
> >>>> On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
> >>>>>
> >>>>> In our TC classifier cls_redirect [1], we use the following sequence
> >>>>> of helper calls to
> >>>>> decapsulate a GUE (basically IP + UDP + custom header) encapsulated packet:
> >>>>>
> >>>>>     skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC,
> >>>>>                     BPF_F_ADJ_ROOM_FIXED_GSO)
> >>>>>     bpf_redirect(skb->ifindex, BPF_F_INGRESS)
> >>>>>
> >>>>> It seems like some checksums of the inner headers are not validated in
> >>>>> this case.
> >>>>> For example, a TCP SYN packet with invalid TCP checksum is still accepted by the
> >>>>> network stack and elicits a SYN ACK.
> >>>>>
> >>>>> Is this known but undocumented behaviour or a bug? In either case, is
> >>>>> there a work
> >>>>> around I'm not aware of?
> >>>>
> >>>> I thought inner and outer csums are covered by different flags and driver
> >>>> suppose to set the right one depending on level of in-hw checking it did.
> >>>
> >>> I've figured out what the problem is. We receive the following packet from
> >>> the driver:
> >>>
> >>>       | ETH | IP | UDP | GUE | IP | TCP |
> >>>       skb->ip_summed == CHECKSUM_UNNECESSARY
> >>>
> >>> ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
> >>> checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
> >>> and get the following:
> >>>
> >>>       | ETH | IP | TCP |
> >>>       skb->ip_summed == CHECKSUM_UNNECESSARY
> >>>
> >>> Note that ip_summed is still CHECKSUM_UNNECESSARY. After
> >>> bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
> >>> skb_checksum_init is turned into a no-op due to
> >>> CHECKSUM_UNNECESSARY.
> >>>
> >>> I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
> >>> accordingly. Unfortunately I don't understand how checksums work
> >>> sufficiently. Daniel, it seems like you wrote the helper, could you
> >>> take a look?
> >>
> >> Right, so in the skb_adjust_room() case we're not aware of protocol
> >> specifics. We do handle the csum complete case via skb_postpull_rcsum(),
> >> but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
> >> skb->csum_level of the original skb prior to skb_adjust_room() call
> >> might have been 0 (that is, covering UDP)? So if we'd add the possibility
> >> to __skb_decr_checksum_unnecessary() via flag, then it would become
> >> skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
> >> same for the reverse case. Below is a quick hack (compile tested-only);
> >> would this resolve your case ...
> >
> > Thanks for the patch, it indeed fixes our problem! I spent some more time
> > trying to understand the checksum offload stuff, here is where I am:
> >
> > On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
> > everything works by default since the rest of the stack does checksumming in
> > software.
> >
> > On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> > will adjust for the data that is being removed from the skb. The rest of the
> > stack will use the correct value, all is well.
> >
> > However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> > the API of skb_adjust_room doesn't tell us whether the user intends to
> > remove headers or data, and how that will influence csum_level.
> > > From my POV, skb_adjust_room currently does the wrong thing.
> > I think we need to fix skb_adjust_room to do the right thing by default,
> > rather than extending the API. We spent a lot of time on tracking this down,
> > so hopefully we can spare others the pain.
> >
> > As Jakub alludes to, we don't know when and how often to call
> > __skb_decr_checksum_unnecessary so we should just
> > unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> > CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> > enough to land as a fix via the bpf tree (which is important for our
> > production kernel). As a follow up we could add the inverse of the flags you
> > propose via bpf-next.
> >
> > What do you think?
>
> My concern with unconditionally downgrading a packet to CHECKSUM_NONE would
> basically trash performance if we have to fallback to sw in fast-path, these
> helpers are also used in our LB case for DSR, for example.

Our setup also uses DSR, so I wonder how you manage to avoid this
checksum issue. Why is Cilium not affected by this bug as well? Do you
never pop headers?

FWIW, the only workaround I currently know of is to disable rx
checksumming for ALL inbound traffic via ethtool -K <dev> rx off, which
in theory trashes performance for all RX traffic, not just the traffic
going to the load balancer. I applied this in a couple of production
data centers and did not see an increase in softirq, which is where I
assume this would show up. My guess is that this is because our
RX << TX. Unconditionally setting CHECKSUM_NONE in the helper would be
even less visible, since we could turn rx checksumming back on in the
general case. Is there a way for you to quantify what the impact for
Cilium would be?

> I agree that it
> sucks to expose these implementation details though. So eventually we'd end
> up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
> a complex to use helper with all its flags where you end up looking into the
> implementation detail to understand what it is really doing. I'm not sure if
> we make anything worse, but I do see your concern. :/

Having those flags seems fine to me; you're right that it's already
complicated. My concern, however, is with the current state of the
helper: I think that as it exists right now it's buggy wrt checksum
offload, and we need a backportable fix.

Option 1: always downgrade UNNECESSARY to NONE
- Easiest to backport
- The helper is safe by default
- Performance impact unclear
- No escape hatch for Cilium

Option 2: add a flag to force CHECKSUM_NONE
- New UAPI, can this be backported?
- The helper isn't safe by default, needs documentation
- Escape hatch for Cilium

Option 3: downgrade to CHECKSUM_NONE, add flag to skip this
- New UAPI, can this be backported?
- The helper is safe by default
- Escape hatch for Cilium (though you'd need to detect availability of the
  flag somehow)
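
To make the difference between the three options concrete, here is a
rough userspace model. It is a sketch, not kernel code: struct model_skb,
the pop_option*() functions and both MODEL_F_* flags are names made up
for this illustration and do not exist in the UAPI. Only the
CHECKSUM_* values are meant to mirror the kernel's constants.

```c
#include <stdint.h>

/* Values mirror the kernel's ip_summed constants. */
enum { CHECKSUM_NONE = 0, CHECKSUM_UNNECESSARY = 1, CHECKSUM_COMPLETE = 2 };

/* Hypothetical flags for this sketch only; not existing UAPI. */
#define MODEL_F_FORCE_CSUM_NONE (1u << 0) /* option 2: opt in to the downgrade */
#define MODEL_F_KEEP_CSUM       (1u << 1) /* option 3: opt out of the downgrade */

/* Minimal stand-in for the checksum state carried by an skb. */
struct model_skb {
	uint8_t ip_summed;
	uint8_t csum_level;
};

/* Drop the hardware "already verified" claim; other states are untouched. */
static inline void model_downgrade(struct model_skb *skb)
{
	if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
		skb->ip_summed = CHECKSUM_NONE;
		skb->csum_level = 0;
	}
}

/* Option 1: always downgrade, no flag involved. */
static inline void pop_option1(struct model_skb *skb, uint32_t flags)
{
	(void)flags;
	model_downgrade(skb);
}

/* Option 2: keep today's behaviour unless the caller asks for the downgrade. */
static inline void pop_option2(struct model_skb *skb, uint32_t flags)
{
	if (flags & MODEL_F_FORCE_CSUM_NONE)
		model_downgrade(skb);
}

/* Option 3: downgrade by default, caller may assert that it is not needed. */
static inline void pop_option3(struct model_skb *skb, uint32_t flags)
{
	if (!(flags & MODEL_F_KEEP_CSUM))
		model_downgrade(skb);
}
```

In this model only option 2 leaves a CHECKSUM_UNNECESSARY packet
untouched when the caller passes no flags, which is what I mean by
"not safe by default".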

I guess there is also Option 0, add a flag but don't backport, which to
me is admitting defeat. If we were to do that we'd at least
want to document the problem. Thinking about how to do that already
makes my head spin:

- If you have a NIC that does CHECKSUM_UNNECESSARY
- And you pop network headers
- You will run into this bug
- To fix it you have to disable rx checksum offload
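
The failure mode in that list can be modelled in a few lines of
userspace C. Again this is only a sketch under assumptions: the
model_* names are invented here, model_stack_verifies_l4() is a
simplification of what the receive path does, and
model_decr_checksum_unnecessary() is my reading of the logic in
__skb_decr_checksum_unnecessary(), not a copy of it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values mirror the kernel's ip_summed constants. */
enum { CHECKSUM_NONE = 0, CHECKSUM_UNNECESSARY = 1 };

/* Minimal stand-in for the checksum state carried by an skb. */
struct model_skb {
	uint8_t ip_summed;
	uint8_t csum_level;
};

/* Consume one level of hardware validation, falling back to NONE at level 0. */
static inline void model_decr_checksum_unnecessary(struct model_skb *skb)
{
	if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
		if (skb->csum_level == 0)
			skb->ip_summed = CHECKSUM_NONE;
		else
			skb->csum_level--;
	}
}

/* Simplified: the stack only checksums in software if not told it is unnecessary. */
static inline bool model_stack_verifies_l4(const struct model_skb *skb)
{
	return skb->ip_summed != CHECKSUM_UNNECESSARY;
}

/* Behaviour described in this thread: popping headers leaves checksum state alone. */
static inline void model_pop_current(struct model_skb *skb)
{
	(void)skb;
}

/* Behaviour with the proposed decrement applied when headers are popped. */
static inline void model_pop_with_decr(struct model_skb *skb)
{
	model_decr_checksum_unnecessary(skb);
}
```

With csum_level == 0 (hardware validated only the outer UDP), the
current behaviour leaves the inner TCP checksum unverified, while the
decrement variant makes the stack verify it in software.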

How do users figure out whether a NIC does UNNECESSARY vs. COMPLETE
vs. NONE? I have the luxury of only caring about two different drivers, but
what if I ship BPF (like Cilium does)? Ultimately, I believe vendors would
either ship buggy programs or tell people to unconditionally disable rx
checksumming.

From my POV, I'd prefer option 1 or 3, since I strongly believe that the
helper should be safe by default, and that the user can assert invariants
via flags to get better performance. I could live with option 2 as well,
since I only have to care about a single kernel version.

> (We do have bpf_csum_update()
> helper as well. I wonder whether we should split such control into a different
> helper.)

I'm not sure what you mean, maybe you can elaborate a little?

Lorenz

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com