netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Daniel Borkmann <daniel@iogearbox.net>
To: "Maciej Żenczykowski" <maze@google.com>,
	"Alexei Starovoitov" <ast@kernel.org>
Cc: Linux Network Development Mailing List <netdev@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	BPF Mailing List <bpf@vger.kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Dongseok Yi <dseok.yi@samsung.com>,
	Willem de Bruijn <willemb@google.com>,
	yhs@fb.com, kpsingh@kernel.org, andrii@kernel.org,
	Jakub Kicinski <kuba@kernel.org>,
	songliubraving@fb.com, John Fastabend <john.fastabend@gmail.com>,
	Martin Lau <kafai@fb.com>
Subject: Re: [PATCH bpf-next 2/2] bpf: do not change gso_size during bpf_skb_change_proto()
Date: Wed, 16 Jun 2021 00:16:53 +0200	[thread overview]
Message-ID: <80bb7f49-4b69-169a-a540-f2a46551ef7c@iogearbox.net> (raw)
In-Reply-To: <CANP3RGc8PmPOjTGkDmbjzEVBezcQuNMcg17qpJx2aLU9juM_5w@mail.gmail.com>

On 6/15/21 9:35 AM, Maciej Żenczykowski wrote:
> On Thu, Jun 3, 2021 at 6:52 PM Maciej Żenczykowski
> <zenczykowski@gmail.com> wrote:
>> From: Maciej Żenczykowski <maze@google.com>
>>
>> This is technically a backwards incompatible change in behaviour,
>> but I'm going to argue that it is very unlikely to break things,
>> and likely to fix *far* more then it breaks.
>>
>> In no particular order, various reasons follow:
>>
>> (a) I've long had a bug assigned to myself to debug a super rare kernel
>> crash on Android Pixel phones which can (per stacktrace) be traced back
>> to bpf clat ipv6 to ipv4 protocol conversion causing some sort of ugly
>> failure much later on during transmit deep in the GSO engine, AFAICT
>> precisely because of this change to gso_size, though I've never been able
>> to manually reproduce it.
>> I believe it may be related to the particular network offload support
>> of attached usb ethernet dongle being used for tethering off of an
>> IPv6-only cellular connection.  The reason might be we end up with more
>> segments than max permitted, or with a gso packet with only one segment...
>> (either way we break some assumption and hit a BUG_ON)

Do you happen to have some more debug data from there, e.g. which bug_on
is hit? Do you have some pointers to the driver code where you suspect
this could cause an issue?

>> (b) There is no check that the gso_size is > 20 when reducing it by 20,
>> so we might end up with a negative (or underflowing) gso_size or
>> a gso_size of 0.  This can't possibly be good.
>> Indeed this is probably somehow exploitable (or at least can result
>> in a kernel crash) by delivering crafted packets and perhaps triggering
>> an infinite loop or a divide by zero...
>> As a reminder: gso_size (mss) is related to mtu, but not directly
>> derived from it: gso_size/mss may be significantly smaller then
>> one would get by deriving from local mtu.  And on some nics (which
>> do loose mtu checking on receive, it may even potentially be larger,
>> for example my work pc with 1500 mtu can receive 1520 byte frames
>> [and sometimes does due to bugs in a vendor plat46 implementation]).
>> Indeed even just going from 21 to 1 is potentially problematic because
>> it increases the number of segments by a factor of 21 (think DoS,
>> or some other crash due to too many segments).

Do you have a reproducer for creating such small gso_size from stack, is
this mainly from virtio_net side possible? If it's too small, perhaps the
gso attributes should just be cleared from the skb generally instead of
marking SKB_GSO_DODGY as we otherwise do?

>> (c) It's always safe to not increase the gso_size, because it doesn't
>> result in the max packet size increasing.  So the skb_increase_gso_size()
>> call was always unnecessary for correctness (and outright undesirable, see
>> later).  As such the only part which is potentially dangerous (ie. could
>> cause backwards compatibility issues) is the removal of the
>> skb_decrease_gso_size() call.

Right.

>> (d) If the packets are ultimately destined to the local device, then
>> there is absolutely no benefit to playing around with gso_size.
>> It only matters if the packets will egress the device.  ie. we're
>> either forwarding, or transmitting from the device.

Yep, the issue is that we don't know this at the time of the helper call.

>> (e) This logic only triggers for packets which are GSO.  It does not
>> trigger for skbs which are not GSO.  It will not convert a non-GSO mtu
>> sized packet into a GSO packet (and you don't even know what the mtu is,
>> so you can't even fix it).  As such your transmit path must *already* be
>> able to handle an mtu 20 bytes larger then your receive path (for ipv4
>> to ipv6 translation) - and indeed 28 bytes larger due to ipv4 fragments.
>> Thus removing the skb_decrease_gso_size() call doesn't actually increase
>> the size of the packets your transmit side must be able to handle.
>> ie. to handle non-GSO max-mtu packets, the ipv4/ipv6 device/route mtus
>> must already be set correctly.  Since for example with an ipv4 egress mtu
>> of 1500, ipv4 to ipv6 translation will already build 1520 byte ipv6 frames,
>> so you need a 1520 byte device mtu.  This means if your ipv6 device's
>> egress mtu is 1280, your ipv4 route must be 1260 (and actually 1252,
>> because of the need to handle fragments).  This is to handle normal non-GSO
>> packets.  Thus the reduction is simply not needed for GSO packets,
>> because when they're correctly built, they will already be the right size.

Makes sense to me.

>> (f) TSO/GSO should be able to exactly undo GRO: the number of packets
>> (TCP segments) should not be modified, so that TCP's mss counting works
>> correctly (this matters for congestion control).
>> If protocol conversion changes the gso_size, then the number of TCP segments
>> may increase or decrease.  Packet loss after protocol conversion can result
>> in partial loss of mss segments that the sender sent.  How's the sending
>> TCP stack going to react to receiving ACKs/SACKs in the middle of the
>> segments it sent?
>>
>> (g) skb_{decrease,increase}_gso_size() are already no-ops for GSO_BY_FRAGS
>> case (besides triggering WARN_ON_ONCE). This means you already cannot
>> guarantee that gso_size (and thus resulting packet mtu) is changed.
>> ie. you must assume it won't be changed.
>>
>> (h) changing gso_size is outright buggy for UDP GSO packets, where framing
>> matters (I believe that's also the case for SCTP, but it's already excluded
>> by [g]).  So the only remaining case is TCP, which also doesn't want it
>> (see [f]).
>>
>> (i) see also the reasoning on the previous attempt at fixing this
>> (commit fa7b83bf3b156c767f3e4a25bbf3817b08f3ff8e) which shows
>> that the current behaviour causes TCP packet loss:
>>
>>    In the forwarding path GRO -> BPF 6 to 4 -> GSO for TCP traffic, the
>>    coalesced packet payload can be > MSS, but < MSS + 20.
>>
>>    bpf_skb_proto_6_to_4() will upgrade the MSS and it can be > the payload
>>    length. After then tcp_gso_segment checks for the payload length if it
>>    is <= MSS. The condition is causing the packet to be dropped.
>>
>>    tcp_gso_segment():
>>      [...]
>>      mss = skb_shinfo(skb)->gso_size;
>>      if (unlikely(skb->len <= mss)) goto out;
>>      [...]
>>
>> Thus changing the gso_size is simply a very bad idea.
>> Increasing is unnecessary and buggy, and decreasing can go negative.
>>
>> Cc: Dongseok Yi <dseok.yi@samsung.com>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: Willem de Bruijn <willemb@google.com>
>> Fixes: 6578171a7ff0 ("bpf: add bpf_skb_change_proto helper")
>> Signed-off-by: Maciej Żenczykowski <maze@google.com>
>> ---
>>   net/core/filter.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 04848de3e058..953b6c31b803 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -3263,8 +3263,6 @@ static int bpf_skb_proto_4_to_6(struct sk_buff *skb)
>>                          shinfo->gso_type |=  SKB_GSO_TCPV6;
>>                  }
>>
>> -               /* Due to IPv6 header, MSS needs to be downgraded. */
>> -               skb_decrease_gso_size(shinfo, len_diff);
>>                  /* Header must be checked, and gso_segs recomputed. */
>>                  shinfo->gso_type |= SKB_GSO_DODGY;
>>                  shinfo->gso_segs = 0;
>> @@ -3304,8 +3302,6 @@ static int bpf_skb_proto_6_to_4(struct sk_buff *skb)
>>                          shinfo->gso_type |=  SKB_GSO_TCPV4;
>>                  }
>>
>> -               /* Due to IPv4 header, MSS can be upgraded. */
>> -               skb_increase_gso_size(shinfo, len_diff);
>>                  /* Header must be checked, and gso_segs recomputed. */
>>                  shinfo->gso_type |= SKB_GSO_DODGY;
>>                  shinfo->gso_segs = 0;
>> --
>> 2.32.0.rc1.229.g3e70b5a671-goog
> 
> This patch series (ie. this patch and the previous revert) seems to
> have gotten no response, and we're running out of time, since it
> reverts the newly added api.

The change you're reverting in patch 1/2 is only in net-next, but not in
Linus tree, so there still is a large enough time window for at least the
revert. That said, I presume what you mean here is to just revert the 1/2
in bpf-next and the 2/2 fix targeted for bpf tree, no?

Few follow-up questions:

1) Could we then also cover the case of skb_is_gso(skb) && !skb_is_gso_tcp(skb)
    that we currently reject with -ENOTSUPP, in other words all GSO cases?

2) Do we still need to mark SKB_GSO_DODGY and reset segs? I presume not anymore
    after this change?

3) skb_{decrease,increase}_gso_size() should probably just removed then.

Thanks,
Daniel

  reply	other threads:[~2021-06-15 22:17 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-04  1:52 [PATCH bpf-next 1/2] Revert "bpf: Check for BPF_F_ADJ_ROOM_FIXED_GSO when bpf_skb_change_proto" Maciej Żenczykowski
2021-06-04  1:52 ` [PATCH bpf-next 2/2] bpf: do not change gso_size during bpf_skb_change_proto() Maciej Żenczykowski
2021-06-15  7:35   ` Maciej Żenczykowski
2021-06-15 22:16     ` Daniel Borkmann [this message]
2021-06-16  0:06       ` Maciej Żenczykowski
2021-06-17  0:09         ` [PATCH bpf-next v2 1/4] Revert "bpf: Check for BPF_F_ADJ_ROOM_FIXED_GSO when bpf_skb_change_proto" Maciej Żenczykowski
2021-06-17  0:09           ` [PATCH bpf-next v2 2/4] bpf: do not change gso_size during bpf_skb_change_proto() Maciej Żenczykowski
2021-06-17  0:09           ` [PATCH bpf-next v2 3/4] bpf: support all gso types in bpf_skb_change_proto() Maciej Żenczykowski
2021-06-17  0:09           ` [PATCH bpf-next v2 4/4] bpf: more lenient bpf_skb_net_shrink() with BPF_F_ADJ_ROOM_FIXED_GSO Maciej Żenczykowski
2021-06-24 14:05             ` Daniel Borkmann
2021-06-24 17:13               ` Maciej Żenczykowski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=80bb7f49-4b69-169a-a540-f2a46551ef7c@iogearbox.net \
    --to=daniel@iogearbox.net \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=dseok.yi@samsung.com \
    --cc=john.fastabend@gmail.com \
    --cc=kafai@fb.com \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maze@google.com \
    --cc=netdev@vger.kernel.org \
    --cc=songliubraving@fb.com \
    --cc=willemb@google.com \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).