* Stack sends oversize UDP packet to the driver @ 2019-01-20 22:26 Michael Chan 2019-01-22 0:36 ` Daniel Axtens [not found] ` <CAF2d9jgskHTb-nmbVo9A2CQhh9T3OnH_vbfGcMBii13oq1teCw@mail.gmail.com> 0 siblings, 2 replies; 12+ messages in thread From: Michael Chan @ 2019-01-20 22:26 UTC (permalink / raw) To: Netdev, David Miller I've received a bug report of oversized UDP packets sent to the bnxt_en driver for transmission. There is no check for illegal length in the driver and it will send a corrupted BD to the NIC if the non-TSO length exceeds the maximum MTU supported by the driver. This ultimately causes the driver to hang. Looking a little deeper, it looks like the route of the SKB was initially to "lo" and therefore no fragmentation was done. And it looks like the route later got changed to the bnxt_en dev before transmission. The user was doing multiple VM reboots and the bad length was happening on the Linux host. I can add a length check in the driver to prevent this. But is there a better way to prevent this in the stack? Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-01-20 22:26 Stack sends oversize UDP packet to the driver Michael Chan @ 2019-01-22 0:36 ` Daniel Axtens 2019-01-22 0:59 ` Michael Chan [not found] ` <CAF2d9jgskHTb-nmbVo9A2CQhh9T3OnH_vbfGcMBii13oq1teCw@mail.gmail.com> 1 sibling, 1 reply; 12+ messages in thread From: Daniel Axtens @ 2019-01-22 0:36 UTC (permalink / raw) To: Michael Chan, Netdev, David Miller Hi Michael, > I've received a bug report of oversized UDP packets sent to the > bnxt_en driver for transmission. There is no check for illegal length > in the driver and it will send a corrupted BD to the NIC if the > non-TSO length exceeds the maximum MTU supported by the driver. This > ultimately causes the driver to hang. > > Looking a little deeper, it looks like the route of the SKB was > initially to "lo" and therefore no fragmentation was done. And it > looks like the route later got changed to the bnxt_en dev before > transmission. The user was doing multiple VM reboots and the bad > length was happening on the Linux host. > > I can add a length check in the driver to prevent this. But is there > a better way to prevent this in the stack? Thanks. I hit a similar sounding issue on a bnx2x - see commit 8914a595110a6eca69a5e275b323f5d09e18f4f9 In that case, a GSO packet with gso_size too large for the firmware was coming to the bnx2x driver from an ibmveth device via Open vSwitch. I also toyed with a fix in the stack and ended up fixing just the driver. I was hoping to get a generic fix in to the stack afterwards, but didn't get anything finished. Looking back at old branches, it looks like I considered adding MTU validation to validate_xmit_skb, but I never got that upstream. My vague recollection is that I ended up caught by edge cases: GSO_DODGY allows an untrusted source to set gso parameters, so that needed to be validated first - and that was complex and potentially slow, and I just got overtaken by more urgent work. (Note that this was a year ago and was in many ways my introduction to TSO/GSO, so I could be completely wrong.) Anyway, I can send you my partial work if it would be helpful. Regards, Daniel ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-01-22 0:36 ` Daniel Axtens @ 2019-01-22 0:59 ` Michael Chan 2019-01-22 18:28 ` Mahesh Bandewar (महेश बंडेवार) 0 siblings, 1 reply; 12+ messages in thread From: Michael Chan @ 2019-01-22 0:59 UTC (permalink / raw) To: Daniel Axtens; +Cc: Netdev, David Miller On Mon, Jan 21, 2019 at 4:36 PM Daniel Axtens <dja@axtens.net> wrote: > I hit a similar sounding issue on a bnx2x - see commit > 8914a595110a6eca69a5e275b323f5d09e18f4f9 > > In that case, a GSO packet with gso_size too large for the firmware was > coming to the bnx2x driver from an ibmveth device via Open vSwitch. I > also toyed with a fix in the stack and ended up fixing just the driver. Hi Daniel, yes I remember the issue that you reported at that time. The current issue is similar but for non-TSO UDP packets. > > I was hoping to get a generic fix in to the stack afterwards, but didn't > get anything finished. Looking back at old branches, it looks like I > considered adding MTU validation to validate_xmit_skb, MTU validation in validate_xmit_skb() would definitely prevent the current reported issue. Perhaps we can add a WARN() when an invalid length is detected so that the underlying issue will be reported and fixed. The current issue is that somehow the dst of the SKB gets changed to "lo" and fragmentation is not done for the proper output device. Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-01-22 0:59 ` Michael Chan @ 2019-01-22 18:28 ` Mahesh Bandewar (महेश बंडेवार) 2019-01-22 20:09 ` David Miller 2019-01-30 9:07 ` Michael Chan 0 siblings, 2 replies; 12+ messages in thread From: Mahesh Bandewar (महेश बंडेवार) @ 2019-01-22 18:28 UTC (permalink / raw) To: Michael Chan, Daniel Axtens Cc: Netdev, David Miller, Eric Dumazet, Willem de Bruijn [-- Attachment #1: Type: text/plain, Size: 2578 bytes --] On Mon, Jan 21, 2019 at 4:59 PM Michael Chan <michael.chan@broadcom.com> wrote: > > On Mon, Jan 21, 2019 at 4:36 PM Daniel Axtens <dja@axtens.net> wrote: > > > I hit a similar sounding issue on a bnx2x - see commit > > 8914a595110a6eca69a5e275b323f5d09e18f4f9 > > > > In that case, a GSO packet with gso_size too large for the firmware was > > coming to the bnx2x driver from an ibmveth device via Open vSwitch. I > > also toyed with a fix in the stack and ended up fixing just the driver. > > Hi Daniel, yes I remember the issue that you reported at that time. > The current issue is similar but for non-TSO UDP packets. > > > > > I was hoping to get a generic fix in to the stack afterwards, but didn't > > get anything finished. Looking back at old branches, it looks like I > > considered adding MTU validation to validate_xmit_skb, > > MTU validation in validate_xmit_skb() would definitely prevent the > current reported issue. Perhaps we can add a WARN() when an invalid > length is detected so that the underlying issue will be reported and > fixed. The current issue is that somehow the dst of the SKB gets > changed to "lo" and fragmentation is not done for the proper output > device. Thanks. [Looks like my earlier reply didn't make it to most of the people in the list, So let me include that here also] >>> We have been battling with something similar. The suspicion is that when an egress packet gets transferred from L4 to L3 some strange sequence of event causes dst invalidation which results into dst device to change to 'lo' which has the MTU of 65k, this makes the stack pass to a frame larger than the MTU to the driver which results into the undesired outcome you have mentioned. I have cooked few patches to address this issue, however, I'm finding it really hard to reproduce this issue in my test environment. If you could try them and see if those fixes the issue, we can refine them if necessary and eventually formalize patches. <<< Actually while looking at this issue I did look at the fix Daniel mentioned (8914a595110a6eca69a5e275b323f5d09e18f4f9) and wanted to reach out to him to see if the solution that I'm trying to implement in the stack works in that scenario too. The idea behind the fix is very simple and it is to create a dst-only (unregistered) device with a very low MTU and use it instead of 'lo' while invalidating the dst. This would make it *not* forward packets to driver which might need fragmentation. Attaching the patches to this mail. [Let me know if I should provide them in a different way] thanks, --mahesh.. [-- Attachment #2: 0001-loopback-create-blackhole-net-device-similar-to-loop.patch --] [-- Type: application/octet-stream, Size: 4587 bytes --] From 9c248c001e5f90a60ebf5289806992ebb0dfdec5 Mon Sep 17 00:00:00 2001 From: Mahesh Bandewar <maheshb@google.com> Date: Thu, 4 Oct 2018 13:33:38 -0700 Subject: [RFC PATCH 1/2] loopback: create blackhole net device similar to loopack. Create a blackhole net device that can be used for "dead" dst entries instead of loopback device. This blackhole device differs from loopback in few aspects: (a) It's not per-ns. (b) MTU on this device is ETH_MIN_MTU (c) The xmit function is essentially kfree_skb(). and (d) since it's not registed it wont have ifindex. Lower MTU effectively make the device not pass the MTU check during the route check when a dst associated with the skb is dead. Change-Id: I19fb3dfd7a91c8cba6c29274e80269371fc6d54a Signed-off-by: Mahesh Bandewar <maheshb@google.com> --- drivers/net/loopback.c | 72 ++++++++++++++++++++++++++++++++++----- include/linux/netdevice.h | 2 ++ 2 files changed, 65 insertions(+), 9 deletions(-) diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 2df7f60fe052..a67daa16f49a 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -59,6 +59,13 @@ #include <net/net_namespace.h> #include <linux/u64_stats_sync.h> +/* blackhole_netdev - a device used for dsts that are marked expired! + * This is global device (instead of per-net-ns) since it's not needed + * to be per-ns and gets initialized at boot time. + */ +struct net_device *blackhole_netdev; +EXPORT_SYMBOL(blackhole_netdev); + /* The higher levels take care of making this non-reentrant (it's * called with bh's disabled). */ @@ -166,12 +173,14 @@ static const struct net_device_ops loopback_ops = { .ndo_set_mac_address = eth_mac_addr, }; -/* The loopback device is special. There is only one instance - * per network namespace. - */ -static void loopback_setup(struct net_device *dev) +static void gen_lo_setup(struct net_device *dev, + unsigned int mtu, + const struct ethtool_ops *eth_ops, + const struct header_ops *hdr_ops, + const struct net_device_ops *dev_ops, + void (*dev_destructor)(struct net_device *dev)) { - dev->mtu = 64 * 1024; + dev->mtu = mtu; dev->hard_header_len = ETH_HLEN; /* 14 */ dev->min_header_len = ETH_HLEN; /* 14 */ dev->addr_len = ETH_ALEN; /* 6 */ @@ -190,11 +199,20 @@ static void loopback_setup(struct net_device *dev) | NETIF_F_NETNS_LOCAL | NETIF_F_VLAN_CHALLENGED | NETIF_F_LOOPBACK; - dev->ethtool_ops = &loopback_ethtool_ops; - dev->header_ops = ð_header_ops; - dev->netdev_ops = &loopback_ops; + dev->ethtool_ops = eth_ops; + dev->header_ops = hdr_ops; + dev->netdev_ops = dev_ops; dev->needs_free_netdev = true; - dev->priv_destructor = loopback_dev_free; + dev->priv_destructor = dev_destructor; +} + +/* The loopback device is special. There is only one instance + * per network namespace. + */ +static void loopback_setup(struct net_device *dev) +{ + gen_lo_setup(dev, (64 * 1024), &loopback_ethtool_ops, ð_header_ops, + &loopback_ops, loopback_dev_free); } /* Setup and register the loopback device. */ @@ -229,3 +247,39 @@ static __net_init int loopback_net_init(struct net *net) struct pernet_operations __net_initdata loopback_net_ops = { .init = loopback_net_init, }; + +/* blackhole netdevice */ +static netdev_tx_t blackhole_netdev_xmit(struct sk_buff *skb, + struct net_device *dev) +{ + kfree_skb(skb); + net_warn_ratelimited("%s(): Dropping skb.\n", __func__); + return NETDEV_TX_OK; +} + +static const struct net_device_ops blackhole_netdev_ops = { + .ndo_start_xmit = blackhole_netdev_xmit, +}; + +/* This is a dst-dummy device used specifically for invalidated + * DSTs and unlike loopback, this is not per-ns. + */ +static void blackhole_netdev_setup(struct net_device *dev) +{ + gen_lo_setup(dev, ETH_MIN_MTU, NULL, NULL, &blackhole_netdev_ops, NULL); +} + +/* Setup and register the blackhole_netdev. */ +static int __init blackhole_netdev_init(void) +{ + blackhole_netdev = alloc_netdev(0, "blackhole_dev", NET_NAME_UNKNOWN, + blackhole_netdev_setup); + if (!blackhole_netdev) + return -ENOMEM; + + dev_init_scheduler(blackhole_netdev); + + return 0; +} + +device_initcall(blackhole_netdev_init); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a57b9a853aab..6a52b112ad82 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -4838,4 +4838,6 @@ do { \ #define PTYPE_HASH_SIZE (16) #define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1) +extern struct net_device *blackhole_netdev; + #endif /* _LINUX_NETDEVICE_H */ -- 2.20.1.321.g9e740568ce-goog [-- Attachment #3: 0002-blackhole_netdev-use-blackhole_netdev-to-invalidate-.patch --] [-- Type: application/octet-stream, Size: 2003 bytes --] From ad615fa0994f3e671a9e97121be24936c06d6b36 Mon Sep 17 00:00:00 2001 From: Mahesh Bandewar <maheshb@google.com> Date: Thu, 4 Oct 2018 13:40:31 -0700 Subject: [RFC PATCH 2/2] blackhole_netdev: use blackhole_netdev to invalidate dst entries Use blackhole_netdev instead of 'lo' device with lower MTU when marking dst "dead". Change-Id: I6c5cf4bb6aafe50ec3ad31defa03906b00b2cd8b Signed-off-by: Mahesh Bandewar <maheshb@google.com> --- net/core/dst.c | 2 +- net/ipv4/route.c | 3 +-- net/ipv6/route.c | 2 +- 3 files changed, 3 insertions(+), 4 deletions(-) diff --git a/net/core/dst.c b/net/core/dst.c index a263309df115..a07b05d0a826 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -176,7 +176,7 @@ void dst_dev_put(struct dst_entry *dst) dst->ops->ifdown(dst, dev, true); dst->input = dst_discard; dst->output = dst_discard_out; - dst->dev = dev_net(dst->dev)->loopback_dev; + dst->dev = blackhole_netdev; dev_hold(dst->dev); dev_put(dev); } diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 99be68b15da0..012c64ebbd0c 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1485,7 +1485,6 @@ static void ipv4_dst_destroy(struct dst_entry *dst) void rt_flush_dev(struct net_device *dev) { - struct net *net = dev_net(dev); struct rtable *rt; int cpu; @@ -1496,7 +1495,7 @@ void rt_flush_dev(struct net_device *dev) list_for_each_entry(rt, &ul->head, rt_uncached) { if (rt->dst.dev != dev) continue; - rt->dst.dev = net->loopback_dev; + rt->dst.dev = blackhole_netdev; dev_hold(rt->dst.dev); dev_put(dev); } diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 8e11f9a557b1..695f3aeb1f4d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -179,7 +179,7 @@ static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev) } if (rt_dev == dev) { - rt->dst.dev = loopback_dev; + rt->dst.dev = blackhole_netdev; dev_hold(rt->dst.dev); dev_put(rt_dev); } -- 2.20.1.321.g9e740568ce-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-01-22 18:28 ` Mahesh Bandewar (महेश बंडेवार) @ 2019-01-22 20:09 ` David Miller 2019-01-30 9:07 ` Michael Chan 1 sibling, 0 replies; 12+ messages in thread From: David Miller @ 2019-01-22 20:09 UTC (permalink / raw) To: maheshb; +Cc: michael.chan, dja, netdev, edumazet, willemb From: Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com> Date: Tue, 22 Jan 2019 10:28:58 -0800 > On Mon, Jan 21, 2019 at 4:59 PM Michael Chan <michael.chan@broadcom.com> wrote: >> >> On Mon, Jan 21, 2019 at 4:36 PM Daniel Axtens <dja@axtens.net> wrote: >> >> > I hit a similar sounding issue on a bnx2x - see commit >> > 8914a595110a6eca69a5e275b323f5d09e18f4f9 >> > >> > In that case, a GSO packet with gso_size too large for the firmware was >> > coming to the bnx2x driver from an ibmveth device via Open vSwitch. I >> > also toyed with a fix in the stack and ended up fixing just the driver. >> >> Hi Daniel, yes I remember the issue that you reported at that time. >> The current issue is similar but for non-TSO UDP packets. I just want to step in and say that I really don't want drivers to have to deal with this problem, if possible. Drivers are hard enough to write already. However, some aspects of this problem, like the MTU changes, route flaps, packet mirroring, etc. are all in one way or another asynchonous to the data fast path and therefore we may end up having to do something like this afterall right at the last step before ->ndo_start_xmit() or similar. Just my $0.02 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-01-22 18:28 ` Mahesh Bandewar (महेश बंडेवार) 2019-01-22 20:09 ` David Miller @ 2019-01-30 9:07 ` Michael Chan 2019-01-31 1:00 ` Mahesh Bandewar (महेश बंडेवार) 1 sibling, 1 reply; 12+ messages in thread From: Michael Chan @ 2019-01-30 9:07 UTC (permalink / raw) To: Mahesh Bandewar (महेश बंडेवार) Cc: Daniel Axtens, Netdev, David Miller, Eric Dumazet, Willem de Bruijn On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com> wrote: > > The idea behind the fix is very simple and it is to create a dst-only > (unregistered) device with a very low MTU and use it instead of 'lo' > while invalidating the dst. This would make it *not* forward packets > to driver which might need fragmentation. > We tested the 2 patches many times and including an overnight test. I can confirm that the oversize UDP packets are no longer seen with the patches applied. However, I don't see the blackhole xmit function getting called to free the SKBs though. Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-01-30 9:07 ` Michael Chan @ 2019-01-31 1:00 ` Mahesh Bandewar (महेश बंडेवार) 2019-02-05 19:35 ` Michael Chan 0 siblings, 1 reply; 12+ messages in thread From: Mahesh Bandewar (महेश बंडेवार) @ 2019-01-31 1:00 UTC (permalink / raw) To: Michael Chan Cc: Daniel Axtens, Netdev, David Miller, Eric Dumazet, Willem de Bruijn On Wed, Jan 30, 2019 at 1:07 AM Michael Chan <michael.chan@broadcom.com> wrote: > > On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार) > <maheshb@google.com> wrote: > > > > > The idea behind the fix is very simple and it is to create a dst-only > > (unregistered) device with a very low MTU and use it instead of 'lo' > > while invalidating the dst. This would make it *not* forward packets > > to driver which might need fragmentation. > > > > We tested the 2 patches many times and including an overnight test. I > can confirm that the oversize UDP packets are no longer seen with the > patches applied. However, I don't see the blackhole xmit function > getting called to free the SKBs though. > Thanks for the confirmation Michael. The blackhole device mtu is really small, so I would assume the fragmentation code dropped those packets before calling the xmit function (in ip_fragment), you could verify that with icmp counters. > Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-01-31 1:00 ` Mahesh Bandewar (महेश बंडेवार) @ 2019-02-05 19:35 ` Michael Chan 2019-02-07 4:51 ` Mahesh Bandewar (महेश बंडेवार) 0 siblings, 1 reply; 12+ messages in thread From: Michael Chan @ 2019-02-05 19:35 UTC (permalink / raw) To: Mahesh Bandewar (महेश बंडेवार) Cc: Daniel Axtens, Netdev, David Miller, Eric Dumazet, Willem de Bruijn On Wed, Jan 30, 2019 at 5:00 PM Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com> wrote: > > On Wed, Jan 30, 2019 at 1:07 AM Michael Chan <michael.chan@broadcom.com> wrote: > > > > On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार) > > <maheshb@google.com> wrote: > > > > > > > > The idea behind the fix is very simple and it is to create a dst-only > > > (unregistered) device with a very low MTU and use it instead of 'lo' > > > while invalidating the dst. This would make it *not* forward packets > > > to driver which might need fragmentation. > > > > > > > We tested the 2 patches many times and including an overnight test. I > > can confirm that the oversize UDP packets are no longer seen with the > > patches applied. However, I don't see the blackhole xmit function > > getting called to free the SKBs though. > > > Thanks for the confirmation Michael. The blackhole device mtu is > really small, so I would assume the fragmentation code dropped those > packets before calling the xmit function (in ip_fragment), you could > verify that with icmp counters. > I've looked at this a little more. The blackhole_dev is not IFF_UP | IFF_RUNNING, right? May be that's why the packets are never getting to the xmit function? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-02-05 19:35 ` Michael Chan @ 2019-02-07 4:51 ` Mahesh Bandewar (महेश बंडेवार) 2019-02-08 20:26 ` Mahesh Bandewar (महेश बंडेवार) 0 siblings, 1 reply; 12+ messages in thread From: Mahesh Bandewar (महेश बंडेवार) @ 2019-02-07 4:51 UTC (permalink / raw) To: Michael Chan Cc: Daniel Axtens, Netdev, David Miller, Eric Dumazet, Willem de Bruijn On Tue, Feb 5, 2019 at 11:36 AM Michael Chan <michael.chan@broadcom.com> wrote: > > On Wed, Jan 30, 2019 at 5:00 PM Mahesh Bandewar (महेश बंडेवार) > <maheshb@google.com> wrote: > > > > On Wed, Jan 30, 2019 at 1:07 AM Michael Chan <michael.chan@broadcom.com> wrote: > > > > > > On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार) > > > <maheshb@google.com> wrote: > > > > > > > > > > > The idea behind the fix is very simple and it is to create a dst-only > > > > (unregistered) device with a very low MTU and use it instead of 'lo' > > > > while invalidating the dst. This would make it *not* forward packets > > > > to driver which might need fragmentation. > > > > > > > > > > We tested the 2 patches many times and including an overnight test. I > > > can confirm that the oversize UDP packets are no longer seen with the > > > patches applied. However, I don't see the blackhole xmit function > > > getting called to free the SKBs though. > > > > > Thanks for the confirmation Michael. The blackhole device mtu is > > really small, so I would assume the fragmentation code dropped those > > packets before calling the xmit function (in ip_fragment), you could > > verify that with icmp counters. > > > > I've looked at this a little more. The blackhole_dev is not IFF_UP | > IFF_RUNNING, right? May be that's why the packets are never getting > to the xmit function? Yes, so I added those two flags and ended up writing a test-module for the device (which I will include while posting the patch-series). However, adding those flags is also not sufficient since the qdisc is initialized to noop_qdisc so qdisc enqueue will drop packets before hitting the ndo_start_xmit(). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-02-07 4:51 ` Mahesh Bandewar (महेश बंडेवार) @ 2019-02-08 20:26 ` Mahesh Bandewar (महेश बंडेवार) 2019-02-12 8:55 ` Michael Chan 0 siblings, 1 reply; 12+ messages in thread From: Mahesh Bandewar (महेश बंडेवार) @ 2019-02-08 20:26 UTC (permalink / raw) To: Michael Chan Cc: Daniel Axtens, Netdev, David Miller, Eric Dumazet, Willem de Bruijn [-- Attachment #1: Type: text/plain, Size: 2315 bytes --] On Wed, Feb 6, 2019 at 8:51 PM Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com> wrote: > > On Tue, Feb 5, 2019 at 11:36 AM Michael Chan <michael.chan@broadcom.com> wrote: > > > > On Wed, Jan 30, 2019 at 5:00 PM Mahesh Bandewar (महेश बंडेवार) > > <maheshb@google.com> wrote: > > > > > > On Wed, Jan 30, 2019 at 1:07 AM Michael Chan <michael.chan@broadcom.com> wrote: > > > > > > > > On Tue, Jan 22, 2019 at 10:29 AM Mahesh Bandewar (महेश बंडेवार) > > > > <maheshb@google.com> wrote: > > > > > > > > > > > > > > The idea behind the fix is very simple and it is to create a dst-only > > > > > (unregistered) device with a very low MTU and use it instead of 'lo' > > > > > while invalidating the dst. This would make it *not* forward packets > > > > > to driver which might need fragmentation. > > > > > > > > > > > > > We tested the 2 patches many times and including an overnight test. I > > > > can confirm that the oversize UDP packets are no longer seen with the > > > > patches applied. However, I don't see the blackhole xmit function > > > > getting called to free the SKBs though. > > > > > > > Thanks for the confirmation Michael. The blackhole device mtu is > > > really small, so I would assume the fragmentation code dropped those > > > packets before calling the xmit function (in ip_fragment), you could > > > verify that with icmp counters. > > > > > > > I've looked at this a little more. The blackhole_dev is not IFF_UP | > > IFF_RUNNING, right? May be that's why the packets are never getting > > to the xmit function? > Yes, so I added those two flags and ended up writing a test-module for > the device (which I will include while posting the patch-series). > However, adding those flags is also not sufficient since the qdisc is > initialized to noop_qdisc so qdisc enqueue will drop packets before > hitting the ndo_start_xmit(). I have another version of the fix (with help from Eric) and this should hit the .ndo_start_xmit() of the blackhole_dev. I'm adding these flags during the setup and then calling dev_activate() to change noop qdisc to null qdisc. Please give this patch set a try and let me know if the blackhole_dev xmit path gets exercised in your test scenario. [-- Attachment #2: 0001-loopback-create-blackhole-net-device-similar-to-loop.patch --] [-- Type: application/octet-stream, Size: 4714 bytes --] From 8aa152e0feb929479ef3f2f93fdd2edcf6f463b8 Mon Sep 17 00:00:00 2001 From: Mahesh Bandewar <maheshb@google.com> Date: Thu, 4 Oct 2018 13:33:38 -0700 Subject: [PATCH 1/3] loopback: create blackhole net device similar to loopack. Create a blackhole net device that can be used for "dead" dst entries instead of loopback device. This blackhole device differs from loopback in few aspects: (a) It's not per-ns. (b) MTU on this device is ETH_MIN_MTU (c) The xmit function is essentially kfree_skb(). and (d) since it's not registed it wont have ifindex. Lower MTU effectively make the device not pass the MTU check during the route check when a dst associated with the skb is dead. Change-Id: I19fb3dfd7a91c8cba6c29274e80269371fc6d54a Signed-off-by: Mahesh Bandewar <maheshb@google.com> --- drivers/net/loopback.c | 76 ++++++++++++++++++++++++++++++++++----- include/linux/netdevice.h | 2 ++ 2 files changed, 69 insertions(+), 9 deletions(-) diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 2df7f60fe052..48d7842d5f9c 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -59,6 +59,13 @@ #include <net/net_namespace.h> #include <linux/u64_stats_sync.h> +/* blackhole_netdev - a device used for dsts that are marked expired! + * This is global device (instead of per-net-ns) since it's not needed + * to be per-ns and gets initialized at boot time. + */ +struct net_device *blackhole_netdev; +EXPORT_SYMBOL(blackhole_netdev); + /* The higher levels take care of making this non-reentrant (it's * called with bh's disabled). */ @@ -166,12 +173,14 @@ static const struct net_device_ops loopback_ops = { .ndo_set_mac_address = eth_mac_addr, }; -/* The loopback device is special. There is only one instance - * per network namespace. - */ -static void loopback_setup(struct net_device *dev) +static void gen_lo_setup(struct net_device *dev, + unsigned int mtu, + const struct ethtool_ops *eth_ops, + const struct header_ops *hdr_ops, + const struct net_device_ops *dev_ops, + void (*dev_destructor)(struct net_device *dev)) { - dev->mtu = 64 * 1024; + dev->mtu = mtu; dev->hard_header_len = ETH_HLEN; /* 14 */ dev->min_header_len = ETH_HLEN; /* 14 */ dev->addr_len = ETH_ALEN; /* 6 */ @@ -190,11 +199,20 @@ static void loopback_setup(struct net_device *dev) | NETIF_F_NETNS_LOCAL | NETIF_F_VLAN_CHALLENGED | NETIF_F_LOOPBACK; - dev->ethtool_ops = &loopback_ethtool_ops; - dev->header_ops = ð_header_ops; - dev->netdev_ops = &loopback_ops; + dev->ethtool_ops = eth_ops; + dev->header_ops = hdr_ops; + dev->netdev_ops = dev_ops; dev->needs_free_netdev = true; - dev->priv_destructor = loopback_dev_free; + dev->priv_destructor = dev_destructor; +} + +/* The loopback device is special. There is only one instance + * per network namespace. + */ +static void loopback_setup(struct net_device *dev) +{ + gen_lo_setup(dev, (64 * 1024), &loopback_ethtool_ops, ð_header_ops, + &loopback_ops, loopback_dev_free); } /* Setup and register the loopback device. */ @@ -229,3 +247,43 @@ static __net_init int loopback_net_init(struct net *net) struct pernet_operations __net_initdata loopback_net_ops = { .init = loopback_net_init, }; + +/* blackhole netdevice */ +static netdev_tx_t blackhole_netdev_xmit(struct sk_buff *skb, + struct net_device *dev) +{ + kfree_skb(skb); + net_warn_ratelimited("%s(): Dropping skb.\n", __func__); + return NETDEV_TX_OK; +} + +static const struct net_device_ops blackhole_netdev_ops = { + .ndo_start_xmit = blackhole_netdev_xmit, +}; + +/* This is a dst-dummy device used specifically for invalidated + * DSTs and unlike loopback, this is not per-ns. + */ +static void blackhole_netdev_setup(struct net_device *dev) +{ + gen_lo_setup(dev, ETH_MIN_MTU, NULL, NULL, &blackhole_netdev_ops, NULL); +} + +/* Setup and register the blackhole_netdev. */ +static int __init blackhole_netdev_init(void) +{ + blackhole_netdev = alloc_netdev(0, "blackhole_dev", NET_NAME_UNKNOWN, + blackhole_netdev_setup); + if (!blackhole_netdev) + return -ENOMEM; + + dev_init_scheduler(blackhole_netdev); + dev_activate(blackhole_netdev); + + blackhole_netdev->flags |= IFF_UP | IFF_RUNNING; + dev_net_set(blackhole_netdev, &init_net); + + return 0; +} + +device_initcall(blackhole_netdev_init); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 1d95e634f3fe..c10cba10388a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -4848,4 +4848,6 @@ do { \ #define PTYPE_HASH_SIZE (16) #define PTYPE_HASH_MASK (PTYPE_HASH_SIZE - 1) +extern struct net_device *blackhole_netdev; + #endif /* _LINUX_NETDEVICE_H */ -- 2.20.1.791.gb4d0f1c61a-goog [-- Attachment #3: 0002-blackhole_netdev-use-blackhole_netdev-to-invalidate-.patch --] [-- Type: application/octet-stream, Size: 2000 bytes --] From 8ca028c8a51f0e8b9376c195fe96086d4dfa18ac Mon Sep 17 00:00:00 2001 From: Mahesh Bandewar <maheshb@google.com> Date: Thu, 4 Oct 2018 13:40:31 -0700 Subject: [PATCH 2/3] blackhole_netdev: use blackhole_netdev to invalidate dst entries Use blackhole_netdev instead of 'lo' device with lower MTU when marking dst "dead". Change-Id: I6c5cf4bb6aafe50ec3ad31defa03906b00b2cd8b Signed-off-by: Mahesh Bandewar <maheshb@google.com> --- net/core/dst.c | 2 +- net/ipv4/route.c | 3 +-- net/ipv6/route.c | 2 +- 3 files changed, 3 insertions(+), 4 deletions(-) diff --git a/net/core/dst.c b/net/core/dst.c index a263309df115..a07b05d0a826 100644 --- a/net/core/dst.c +++ b/net/core/dst.c @@ -176,7 +176,7 @@ void dst_dev_put(struct dst_entry *dst) dst->ops->ifdown(dst, dev, true); dst->input = dst_discard; dst->output = dst_discard_out; - dst->dev = dev_net(dst->dev)->loopback_dev; + dst->dev = blackhole_netdev; dev_hold(dst->dev); dev_put(dev); } diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 16259ea9df54..4e4f1b36e306 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1485,7 +1485,6 @@ static void ipv4_dst_destroy(struct dst_entry *dst) void rt_flush_dev(struct net_device *dev) { - struct net *net = dev_net(dev); struct rtable *rt; int cpu; @@ -1496,7 +1495,7 @@ void rt_flush_dev(struct net_device *dev) list_for_each_entry(rt, &ul->head, rt_uncached) { if (rt->dst.dev != dev) continue; - rt->dst.dev = net->loopback_dev; + rt->dst.dev = blackhole_netdev; dev_hold(rt->dst.dev); dev_put(dev); } diff --git a/net/ipv6/route.c b/net/ipv6/route.c index dc066fdf7e46..aca0e5651f5b 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -179,7 +179,7 @@ static void rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev) } if (rt_dev == dev) { - rt->dst.dev = loopback_dev; + rt->dst.dev = blackhole_netdev; dev_hold(rt->dst.dev); dev_put(rt_dev); } -- 2.20.1.791.gb4d0f1c61a-goog [-- Attachment #4: 0003-blackhole_dev-add-a-selftest.patch --] [-- Type: application/octet-stream, Size: 6342 bytes --] From 992832040ffe08b106bbf21c0fa8a7317655d530 Mon Sep 17 00:00:00 2001 From: Mahesh Bandewar <maheshb@google.com> Date: Wed, 10 Oct 2018 15:25:01 -0700 Subject: [PATCH 3/3] blackhole_dev: add a selftest Change-Id: Ia06c3edc124c652ee03833eb60a5c6229d4c4ec5 Signed-off-by: Mahesh Bandewar <maheshb@google.com> --- lib/Kconfig.debug | 9 ++ lib/Makefile | 1 + lib/test_blackhole_dev.c | 100 ++++++++++++++++++ tools/testing/selftests/net/Makefile | 3 +- tools/testing/selftests/net/config | 1 + .../selftests/net/test_blackhole_dev.sh | 11 ++ 6 files changed, 124 insertions(+), 1 deletion(-) create mode 100644 lib/test_blackhole_dev.c create mode 100755 tools/testing/selftests/net/test_blackhole_dev.sh diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index d4df5b24d75e..593ecd4bc815 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1901,6 +1901,15 @@ config TEST_BPF If unsure, say N. +config TEST_BLACKHOLE_DEV + tristate "Test BPF filter functionality" + depends on m && NET + help + This builds the "test_blackhole_dev" module that validates the + data path through this blackhole netdev. + + If unsure, say N. + config FIND_BIT_BENCHMARK tristate "Test find_bit functions" help diff --git a/lib/Makefile b/lib/Makefile index e1b59da71418..733f66ceb51a 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -76,6 +76,7 @@ obj-$(CONFIG_TEST_KMOD) += test_kmod.o obj-$(CONFIG_TEST_DEBUG_VIRTUAL) += test_debug_virtual.o obj-$(CONFIG_TEST_MEMCAT_P) += test_memcat_p.o obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o +obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o ifeq ($(CONFIG_DEBUG_KOBJECT),y) CFLAGS_kobject.o += -DDEBUG diff --git a/lib/test_blackhole_dev.c b/lib/test_blackhole_dev.c new file mode 100644 index 000000000000..4c40580a99a3 --- /dev/null +++ b/lib/test_blackhole_dev.c @@ -0,0 +1,100 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * This module tests the blackhole_dev that is created during the + * net subsystem initialization. The test this module performs is + * by injecting an skb into the stack with skb->dev as the + * blackhole_dev and expects kernel to behave in a sane manner + * (in other words, *not crash*)! + * + * Copyright (c) 2018, Mahesh Bandewar <maheshb@google.com> + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/printk.h> +#include <linux/skbuff.h> +#include <linux/netdevice.h> +#include <linux/udp.h> +#include <linux/ipv6.h> + +#include <net/dst.h> + +#define SKB_SIZE 256 +#define HEAD_SIZE (14+40+8) /* Ether + IPv6 + UDP */ +#define TAIL_SIZE 32 /* random tail-room */ + +#define UDP_PORT 1234 + +static int __init test_blackholedev_init(void) +{ + struct ipv6hdr *ip6h; + struct sk_buff *skb; + struct ethhdr *ethh; + struct udphdr *uh; + int data_len; + int ret; + + skb = alloc_skb(SKB_SIZE, GFP_KERNEL); + if (!skb) + return -ENOMEM; + + /* Reserve head-room for the headers */ + skb_reserve(skb, HEAD_SIZE); + + /* Add data to the skb */ + data_len = SKB_SIZE - (HEAD_SIZE + TAIL_SIZE); + memset(__skb_put(skb, data_len), 0xf, data_len); + + /* Add protocol data */ + /* (Transport) UDP */ + uh = (struct udphdr *)skb_push(skb, sizeof(struct udphdr)); + skb_set_transport_header(skb, 0); + uh->source = uh->dest = htons(UDP_PORT); + uh->len = htons(data_len); + uh->check = 0; + /* (Network) IPv6 */ + ip6h = (struct ipv6hdr *)skb_push(skb, sizeof(struct ipv6hdr)); + skb_set_network_header(skb, 0); + ip6h->hop_limit = 32; + ip6h->payload_len = data_len + sizeof(struct udphdr); + ip6h->nexthdr = IPPROTO_UDP; + ip6h->saddr = in6addr_loopback; + ip6h->daddr = in6addr_loopback; + /* Ether */ + ethh = (struct ethhdr *)skb_push(skb, sizeof(struct ethhdr)); + skb_set_mac_header(skb, 0); + + skb->protocol = htons(ETH_P_IPV6); + skb->pkt_type = PACKET_HOST; + skb->dev = blackhole_netdev; + + /* Now attempt to send the packet */ + ret = dev_queue_xmit(skb); + + switch (ret) { + case NET_XMIT_SUCCESS: + pr_warn("dev_queue_xmit() returned NET_XMIT_SUCCESS\n"); + break; + case NET_XMIT_DROP: + pr_warn("dev_queue_xmit() returned NET_XMIT_DROP\n"); + break; + case NET_XMIT_CN: + pr_warn("dev_queue_xmit() returned NET_XMIT_CN\n"); + break; + default: + pr_err("dev_queue_xmit() returned UNKNOWN(%d)\n", ret); + } + + return 0; +} + +static void __exit test_blackholedev_exit(void) +{ + pr_warn("test_blackholedev module terminating.\n"); +} + +module_init(test_blackholedev_init); +module_exit(test_blackholedev_exit); + +MODULE_AUTHOR("Mahesh Bandewar <maheshb@google.com>"); +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index f8f3e90700c0..8952b2039fdc 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -4,8 +4,9 @@ CFLAGS = -Wall -Wl,--no-as-needed -O2 -g CFLAGS += -I../../../../usr/include/ +<<<<<<< HEAD TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh \ - rtnetlink.sh xfrm_policy.sh + rtnetlink.sh xfrm_policy.sh test_blackhole_dev.sh TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh TEST_PROGS += udpgro_bench.sh udpgro.sh test_vxlan_under_vrf.sh reuseport_addr_any.sh diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config index 5821bdd98d20..4553c7367aad 100644 --- a/tools/testing/selftests/net/config +++ b/tools/testing/selftests/net/config @@ -28,3 +28,4 @@ CONFIG_NF_TABLES_IPV6=y CONFIG_NF_TABLES_IPV4=y CONFIG_NFT_CHAIN_NAT_IPV6=m CONFIG_NFT_CHAIN_NAT_IPV4=m +CONFIG_TEST_BLACKHOLE_DEV=m diff --git a/tools/testing/selftests/net/test_blackhole_dev.sh b/tools/testing/selftests/net/test_blackhole_dev.sh new file mode 100755 index 000000000000..3119b80e711f --- /dev/null +++ b/tools/testing/selftests/net/test_blackhole_dev.sh @@ -0,0 +1,11 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# Runs blackhole-dev test using blackhole-dev kernel module + +if /sbin/modprobe -q test_blackhole_dev ; then + /sbin/modprobe -q -r test_blackhole_dev; + echo "test_blackhole_dev: ok"; +else + echo "test_blackhole_dev: [FAIL]"; + exit 1; +fi -- 2.20.1.791.gb4d0f1c61a-goog ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: Stack sends oversize UDP packet to the driver 2019-02-08 20:26 ` Mahesh Bandewar (महेश बंडेवार) @ 2019-02-12 8:55 ` Michael Chan 0 siblings, 0 replies; 12+ messages in thread From: Michael Chan @ 2019-02-12 8:55 UTC (permalink / raw) To: Mahesh Bandewar (महेश बंडेवार) Cc: Daniel Axtens, Netdev, David Miller, Eric Dumazet, Willem de Bruijn On Fri, Feb 8, 2019 at 12:26 PM Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com> wrote: > > On Wed, Feb 6, 2019 at 8:51 PM Mahesh Bandewar (महेश बंडेवार) > <maheshb@google.com> wrote: > > > > On Tue, Feb 5, 2019 at 11:36 AM Michael Chan <michael.chan@broadcom.com> wrote: > > > I've looked at this a little more. The blackhole_dev is not IFF_UP | > > > IFF_RUNNING, right? May be that's why the packets are never getting > > > to the xmit function? > > Yes, so I added those two flags and ended up writing a test-module for > > the device (which I will include while posting the patch-series). > > However, adding those flags is also not sufficient since the qdisc is > > initialized to noop_qdisc so qdisc enqueue will drop packets before > > hitting the ndo_start_xmit(). > > I have another version of the fix (with help from Eric) and this > should hit the .ndo_start_xmit() of the blackhole_dev. I'm adding > these flags during the setup and then calling dev_activate() to change > noop qdisc to null qdisc. Please give this patch set a try and let me > know if the blackhole_dev xmit path gets exercised in your test > scenario. The new version still works in the sense that no oversize packets are seen in the NIC driver's xmit function. But I still don't see any packets hitting the blackhole's xmit function. I'm not 100% sure but I think the blackhole dev has no IP address and so the UDP packets are dropped in ip_finish_output2() because there is no neigh. Something like that. ^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <CAF2d9jgskHTb-nmbVo9A2CQhh9T3OnH_vbfGcMBii13oq1teCw@mail.gmail.com>]
* Re: Stack sends oversize UDP packet to the driver [not found] ` <CAF2d9jgskHTb-nmbVo9A2CQhh9T3OnH_vbfGcMBii13oq1teCw@mail.gmail.com> @ 2019-01-22 0:45 ` Michael Chan 0 siblings, 0 replies; 12+ messages in thread From: Michael Chan @ 2019-01-22 0:45 UTC (permalink / raw) To: Mahesh Bandewar (महेश बंडेवार) Cc: Netdev, David Miller, Eric Dumazet, Willem de Bruijn On Mon, Jan 21, 2019 at 4:13 PM Mahesh Bandewar (महेश बंडेवार) <maheshb@google.com> wrote: > I have cooked few patches to address this issue, however, I'm finding it really hard to reproduce this issue in my test environment. If you could try them and see if those fixes the issue, we can refine them if necessary and eventually formalize patches. > > Hi Mahesh, yes, I can have the patches tested, so please send them my way. Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2019-02-12 8:56 UTC | newest] Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-01-20 22:26 Stack sends oversize UDP packet to the driver Michael Chan 2019-01-22 0:36 ` Daniel Axtens 2019-01-22 0:59 ` Michael Chan 2019-01-22 18:28 ` Mahesh Bandewar (महेश बंडेवार) 2019-01-22 20:09 ` David Miller 2019-01-30 9:07 ` Michael Chan 2019-01-31 1:00 ` Mahesh Bandewar (महेश बंडेवार) 2019-02-05 19:35 ` Michael Chan 2019-02-07 4:51 ` Mahesh Bandewar (महेश बंडेवार) 2019-02-08 20:26 ` Mahesh Bandewar (महेश बंडेवार) 2019-02-12 8:55 ` Michael Chan [not found] ` <CAF2d9jgskHTb-nmbVo9A2CQhh9T3OnH_vbfGcMBii13oq1teCw@mail.gmail.com> 2019-01-22 0:45 ` Michael Chan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).