From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-2.2 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0FB25C54E49 for ; Thu, 7 May 2020 21:06:09 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E90CF208CA for ; Thu, 7 May 2020 21:06:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726690AbgEGVGI (ORCPT ); Thu, 7 May 2020 17:06:08 -0400 Received: from www62.your-server.de ([213.133.104.62]:40234 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726218AbgEGVGH (ORCPT ); Thu, 7 May 2020 17:06:07 -0400 Received: from sslproxy01.your-server.de ([78.46.139.224]) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1jWniO-0004wi-CU; Thu, 07 May 2020 23:05:56 +0200 Received: from [178.195.186.98] (helo=pc-9.home) by sslproxy01.your-server.de with esmtpsa (TLSv1.3:TLS_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1jWniO-0008Ol-3O; Thu, 07 May 2020 23:05:56 +0200 Subject: Re: [PATCH v3] net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu To: =?UTF-8?Q?Maciej_=c5=bbenczykowski?= Cc: Alexei Starovoitov , Linux Network Development Mailing List , Linux Kernel Mailing List , BPF Mailing List , "David S . Miller" , Jakub Kicinski References: <20200507023606.111650-1-zenczykowski@gmail.com> From: Daniel Borkmann Message-ID: Date: Thu, 7 May 2020 23:05:55 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.102.2/25805/Thu May 7 14:14:46 2020) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 5/7/20 6:46 PM, Maciej Żenczykowski wrote: > (a) not clear why the max is SKB_MAX_ALLOC in the first place (this is > PAGE_SIZE << 2, ie. 16K on x86), while lo mtu is 64k Agreed, tbh, it's not clear to me either atm. :) The SKB_MAX_ALLOC constant itself should be replaced with something more appropriate. Also as a small side note, the !skb->dev check should be removed since in tc ingress/egress the skb->dev is never NULL. (See also tc_cls_act_convert_ctx_access() on struct __sk_buff, ifindex conversion.) > (b) hmm, if we're not redirecting, then exceeding the ingress device's > mtu doesn't seem to be a problem. > > Indeed AFAIK this can already happen, some devices will round mtu up > when they configure the device mru buffers. > (ie. you configure L3 mtu 1500, they treat that as L2 1536 or 1532 [-4 > fcs], simply because 3 * 512 is a nice amount of buffers, or they'll > accept not only 1514 L2, but also 1518 L2 or even 1522 L2 for VLAN and > Q-IN-Q -- even if the packets aren't actually VLAN'ed, so your non > VLAN mru might be 1504 or 1508) > > Indeed my corp dell workstation with some standard 1 gigabit > motherboard nic has a standard default mtu of 1500, and I've seen it > receive L3 mtu 1520 packets (apparently due to misconfiguration in our > hardware [cisco? juniper?] ipv4->ipv6 translator which can take 1500 > mtu ipv4 packets and convert them to 1520 mtu ipv6 packets without > fragmenting or generating icmp too big errors). While it's obviously > wrong, it does just work (the network paths themselves are also > obviously 1520 clean). Right, agree on tc ingress side when skb goes further up the stack. > (c) If we are redirecting we'll eventually (after bpf program returns) > hit dev_queue_xmit(), and shouldn't that be what returns an error? You mean whether the check should be inside __dev_queue_xmit() itself instead? Maybe. From a cursory glance the MTU checks are spread in upper layer protos. As mentioned, we would need to have coverage for BPF progs attached on the tc ingress and egress (sch_handle_{ingress,egress}()) hook where for each case an skb can be just TC_ACT_OK'ed (up to stack or down to driver), redirected via ____dev_forward_skb() or dev_queue_xmit(). The ____dev_forward_skb() has us covered and for TC_ACT_OK on tc ingress, we'd only need a generic upper cap. So for the rest of the cases, we'd need to have some sort of sanity check somewhere. > btw. is_skb_forwardable() actually tests > - device is up && (packet is gso || skb->len < dev->mtu + > dev->hard_header_len + VLAN_HLEN); > > which is also wrong and in 2 ways, cause VLAN_HLEN makes no sense on > non ethernet, and the __bpf_skb_max_len function doesn't account for > VLAN... (which possibly has implications if you try to redirect to a > vlan interface) Yeah, it probably would for QinQ which is probably why noone was running into it so far. At least the skb_vlan_push() would first store the tag via __vlan_hwaccel_put_tag() in the skb itself. > I think having an mtu check is useful, but I think the mtu should be > selectable by the bpf program. Because it might not even be device > mtu at all, it might be path mtu which we should be testing against. > It should also be checked for gso frames, since the max post > segmentation size should be enforced. > > I agree we should expose dev->mtu (and dev->hard_header_len and hatype) > > I'll mull this over a bit more, but I'm not convinced this patch isn't ok as is. > There just is probably more we should do. Ok, makes sense. Thanks for looking into it!