* [PATCH 1/2] net: Add __dev_forward_skb [not found] <<20140407075347.GA26461@gondor.apana.org.au> @ 2014-04-07 7:55 ` Herbert Xu 2014-04-07 18:48 ` Stephen Hemminger 2014-04-07 7:55 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu 1 sibling, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-07 7:55 UTC (permalink / raw) To: David S. Miller, netdev This patch adds the helper __dev_forward_skb which is identical to dev_forward_skb except that it doesn't actually inject the skb into the stack. This is useful where we wish to have finer control over how the packet is injected, e.g., via netif_rx_ni or netif_receive_skb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- include/linux/netdevice.h | 1 + net/core/dev.c | 42 ++++++++++++++++++++++++------------------ 2 files changed, 25 insertions(+), 18 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 775cc95..4ab86d1 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2619,6 +2619,7 @@ int dev_get_phys_port_id(struct net_device *dev, struct netdev_phys_port_id *ppid); int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq); +int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb); diff --git a/net/core/dev.c b/net/core/dev.c index 7570634..bd6ca09 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1661,6 +1661,29 @@ bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb) } EXPORT_SYMBOL_GPL(is_skb_forwardable); +int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +{ + if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) { + if (skb_copy_ubufs(skb, GFP_ATOMIC)) { + atomic_long_inc(&dev->rx_dropped); + kfree_skb(skb); + return NET_RX_DROP; + } + } + + if (unlikely(!is_skb_forwardable(dev, skb))) { + atomic_long_inc(&dev->rx_dropped); + kfree_skb(skb); + return NET_RX_DROP; + } + + skb_scrub_packet(skb, true); + skb->protocol = eth_type_trans(skb, dev); + + return 0; +} +EXPORT_SYMBOL_GPL(__dev_forward_skb); + /** * dev_forward_skb - loopback an skb to another netif * @@ -1681,24 +1704,7 @@ EXPORT_SYMBOL_GPL(is_skb_forwardable); */ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) { - if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) { - if (skb_copy_ubufs(skb, GFP_ATOMIC)) { - atomic_long_inc(&dev->rx_dropped); - kfree_skb(skb); - return NET_RX_DROP; - } - } - - if (unlikely(!is_skb_forwardable(dev, skb))) { - atomic_long_inc(&dev->rx_dropped); - kfree_skb(skb); - return NET_RX_DROP; - } - - skb_scrub_packet(skb, true); - skb->protocol = eth_type_trans(skb, dev); - - return netif_rx_internal(skb); + return __dev_forward_skb(dev, skb) ?: netif_rx_internal(skb); } EXPORT_SYMBOL_GPL(dev_forward_skb); ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] net: Add __dev_forward_skb 2014-04-07 7:55 ` [PATCH 1/2] net: Add __dev_forward_skb Herbert Xu @ 2014-04-07 18:48 ` Stephen Hemminger 2014-04-08 7:15 ` Herbert Xu 0 siblings, 1 reply; 28+ messages in thread From: Stephen Hemminger @ 2014-04-07 18:48 UTC (permalink / raw) To: Herbert Xu; +Cc: David S. Miller, netdev On Mon, 07 Apr 2014 15:55:35 +0800 Herbert Xu <herbert@gondor.apana.org.au> wrote: > +} > +EXPORT_SYMBOL_GPL(__dev_forward_skb); > + Don't export a symbol without a followup patch that uses it. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] net: Add __dev_forward_skb 2014-04-07 18:48 ` Stephen Hemminger @ 2014-04-08 7:15 ` Herbert Xu 2014-04-08 16:27 ` David Miller 0 siblings, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-08 7:15 UTC (permalink / raw) To: Stephen Hemminger; +Cc: David S. Miller, netdev On Mon, Apr 07, 2014 at 11:48:54AM -0700, Stephen Hemminger wrote: > On Mon, 07 Apr 2014 15:55:35 +0800 > Herbert Xu <herbert@gondor.apana.org.au> wrote: > > > +} > > +EXPORT_SYMBOL_GPL(__dev_forward_skb); > > + > > Don't export a symbol without a followup patch that uses it. Sorry but I screwed the threading so the follow-up patch is in its own thread. But I assure you that this symbol is used in the follow-up :) -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] net: Add __dev_forward_skb 2014-04-08 7:15 ` Herbert Xu @ 2014-04-08 16:27 ` David Miller 0 siblings, 0 replies; 28+ messages in thread From: David Miller @ 2014-04-08 16:27 UTC (permalink / raw) To: herbert; +Cc: stephen, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue, 8 Apr 2014 15:15:18 +0800 > On Mon, Apr 07, 2014 at 11:48:54AM -0700, Stephen Hemminger wrote: >> On Mon, 07 Apr 2014 15:55:35 +0800 >> Herbert Xu <herbert@gondor.apana.org.au> wrote: >> >> > +} >> > +EXPORT_SYMBOL_GPL(__dev_forward_skb); >> > + >> >> Don't export a symbol without a followup patch that uses it. > > Sorry but I screwed the threading so the follow-up patch is in its > own thread. But I assure you that this symbol is used in the > follow-up :) Yeah I was going to mention that to Stephen too :-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/2] macvlan: Move broadcasts into a work queue [not found] <<20140407075347.GA26461@gondor.apana.org.au> 2014-04-07 7:55 ` [PATCH 1/2] net: Add __dev_forward_skb Herbert Xu @ 2014-04-07 7:55 ` Herbert Xu 2014-04-07 14:07 ` Eric Dumazet 2014-04-08 16:55 ` Stephen Hemminger 1 sibling, 2 replies; 28+ messages in thread From: Herbert Xu @ 2014-04-07 7:55 UTC (permalink / raw) To: David S. Miller, netdev Currently broadcasts are handled in network RX context, where the packets are sent through netif_rx. This means that the number of macvlans will be constrained by the capacity of netif_rx. For example, setting up 4096 macvlans practically causes all broadcast packets to be dropped as the default netif_rx queue size simply can't handle 4096 skbs being stuffed into it all at once. Fundamentally, we need to ensure that the amount of work handled in each netif_rx backlog run is constrained. As broadcasts are anything but constrained, it either needs to be limited per run or moved to process context. This patch picks the second option and moves all broadcast handling bar the trivial case of packets going to a single interface into a work queue. Obviously there also needs to be a limit on how many broadcast packets we postpone in this way. I've arbitrarily chosen tx_queue_len of the master device as the limit (act_mirred also happens to use this parameter in a similar way). In order to ensure we don't exceed the backlog queue we will use netif_rx_ni instead of netif_rx for broadcast packets. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- drivers/net/macvlan.c | 115 ++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 92 insertions(+), 23 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 753a8c2..ab7c925 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -30,6 +30,7 @@ #include <linux/if_link.h> #include <linux/if_macvlan.h> #include <linux/hash.h> +#include <linux/workqueue.h> #include <net/rtnetlink.h> #include <net/xfrm.h> @@ -40,10 +41,18 @@ struct macvlan_port { struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; struct list_head vlans; struct rcu_head rcu; + struct sk_buff_head bc_queue; + struct work_struct bc_work; bool passthru; int count; }; +struct macvlan_skb_cb { + const struct macvlan_dev *src; +}; + +#define MACVLAN_SKB_CB(__skb) ((struct macvlan_skb_cb *)&((__skb)->cb[0])) + static void macvlan_port_destroy(struct net_device *dev); static struct macvlan_port *macvlan_port_get_rcu(const struct net_device *dev) @@ -120,7 +129,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev = vlan->dev; if (local) - return dev_forward_skb(dev, skb); + return __dev_forward_skb(dev, skb); skb->dev = dev; if (ether_addr_equal_64bits(eth->h_dest, dev->broadcast)) @@ -128,7 +137,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, else skb->pkt_type = PACKET_MULTICAST; - return netif_rx(skb); + return 0; } static u32 macvlan_hash_mix(const struct macvlan_dev *vlan) @@ -175,32 +184,32 @@ static void macvlan_broadcast(struct sk_buff *skb, if (likely(nskb)) err = macvlan_broadcast_one( nskb, vlan, eth, - mode == MACVLAN_MODE_BRIDGE); + mode == MACVLAN_MODE_BRIDGE) ?: + netif_rx_ni(nskb); macvlan_count_rx(vlan, skb->len + ETH_HLEN, err == NET_RX_SUCCESS, 1); } } } -/* called under rcu_read_lock() from netif_receive_skb */ -static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) +static void macvlan_process_broadcast(struct work_struct *w) { - struct macvlan_port *port; - struct sk_buff *skb = *pskb; - const struct ethhdr *eth = eth_hdr(skb); - const struct macvlan_dev *vlan; - const struct macvlan_dev *src; - struct net_device *dev; - unsigned int len = 0; - int ret = NET_RX_DROP; + struct macvlan_port *port = container_of(w, struct macvlan_port, + bc_work); + struct sk_buff *skb; + struct sk_buff_head list; + + skb_queue_head_init(&list); + + spin_lock_bh(&port->bc_queue.lock); + skb_queue_splice_tail_init(&port->bc_queue, &list); + spin_unlock_bh(&port->bc_queue.lock); + + while ((skb = __skb_dequeue(&list))) { + const struct macvlan_dev *src = MACVLAN_SKB_CB(skb)->src; + + rcu_read_lock(); - port = macvlan_port_get_rcu(skb->dev); - if (is_multicast_ether_addr(eth->h_dest)) { - skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN); - if (!skb) - return RX_HANDLER_CONSUMED; - eth = eth_hdr(skb); - src = macvlan_hash_lookup(port, eth->h_source); if (!src) /* frame comes from an external address */ macvlan_broadcast(skb, port, NULL, @@ -213,20 +222,76 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) macvlan_broadcast(skb, port, src->dev, MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE); - else if (src->mode == MACVLAN_MODE_BRIDGE) + else /* * flood only to VEPA ports, bridge ports * already saw the frame on the way out. */ macvlan_broadcast(skb, port, src->dev, MACVLAN_MODE_VEPA); - else { + + rcu_read_unlock(); + + kfree_skb(skb); + } +} + +static void macvlan_broadcast_enqueue(struct macvlan_port *port, + struct sk_buff *skb) +{ + int err = -ENOMEM; + + skb = skb_clone(skb, GFP_ATOMIC); + if (unlikely(!skb)) + goto err; + + spin_lock(&port->bc_queue.lock); + if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) { + __skb_queue_tail(&port->bc_queue, skb); + err = 0; + } + spin_unlock(&port->bc_queue.lock); + + if (unlikely(err)) { +err: + atomic_long_inc(&skb->dev->rx_dropped); + return; + } + + schedule_work(&port->bc_work); +} + +/* called under rcu_read_lock() from netif_receive_skb */ +static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) +{ + struct macvlan_port *port; + struct sk_buff *skb = *pskb; + const struct ethhdr *eth = eth_hdr(skb); + const struct macvlan_dev *vlan; + const struct macvlan_dev *src; + struct net_device *dev; + unsigned int len = 0; + int ret = NET_RX_DROP; + + port = macvlan_port_get_rcu(skb->dev); + if (is_multicast_ether_addr(eth->h_dest)) { + skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN); + if (!skb) + return RX_HANDLER_CONSUMED; + eth = eth_hdr(skb); + src = macvlan_hash_lookup(port, eth->h_source); + if (src && src->mode != MACVLAN_MODE_VEPA && + src->mode != MACVLAN_MODE_BRIDGE) { /* forward to original port. */ vlan = src; - ret = macvlan_broadcast_one(skb, vlan, eth, 0); + ret = macvlan_broadcast_one(skb, vlan, eth, 0) ?: + netif_rx(skb); goto out; } + MACVLAN_SKB_CB(skb)->src = src; + macvlan_broadcast_enqueue(port, skb); + return RX_HANDLER_PASS; } @@ -764,6 +829,9 @@ static int macvlan_port_create(struct net_device *dev) for (i = 0; i < MACVLAN_HASH_SIZE; i++) INIT_HLIST_HEAD(&port->vlan_hash[i]); + skb_queue_head_init(&port->bc_queue); + INIT_WORK(&port->bc_work, macvlan_process_broadcast); + err = netdev_rx_handler_register(dev, macvlan_handle_frame, port); if (err) kfree(port); @@ -776,6 +844,7 @@ static void macvlan_port_destroy(struct net_device *dev) { struct macvlan_port *port = macvlan_port_get_rtnl(dev); + cancel_work_sync(&port->bc_work); dev->priv_flags &= ~IFF_MACVLAN_PORT; netdev_rx_handler_unregister(dev); kfree_rcu(port, rcu); ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-07 7:55 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu @ 2014-04-07 14:07 ` Eric Dumazet 2014-04-07 14:23 ` Herbert Xu 2014-04-08 16:55 ` Stephen Hemminger 1 sibling, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2014-04-07 14:07 UTC (permalink / raw) To: Herbert Xu; +Cc: David S. Miller, netdev On Mon, 2014-04-07 at 15:55 +0800, Herbert Xu wrote: > Currently broadcasts are handled in network RX context, where > the packets are sent through netif_rx. This means that the number > of macvlans will be constrained by the capacity of netif_rx. > > For example, setting up 4096 macvlans practically causes all > broadcast packets to be dropped as the default netif_rx queue > size simply can't handle 4096 skbs being stuffed into it all > at once. > > Fundamentally, we need to ensure that the amount of work handled > in each netif_rx backlog run is constrained. As broadcasts are > anything but constrained, it either needs to be limited per run > or moved to process context. > > This patch picks the second option and moves all broadcast handling > bar the trivial case of packets going to a single interface into > a work queue. Obviously there also needs to be a limit on how > many broadcast packets we postpone in this way. I've arbitrarily > chosen tx_queue_len of the master device as the limit (act_mirred > also happens to use this parameter in a similar way). > > In order to ensure we don't exceed the backlog queue we will use > netif_rx_ni instead of netif_rx for broadcast packets. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > --- > Hi Herbert. I suppose its a net-next material ? Memory allocations (one incoming message -> ~4096 duplications) probably should use GFP_KERNEL. This might need a change from rcu to simple mutex for macvlan_broadcast() scan of all macvlans. cond_resched() could help macvlan_process_broadcast() to not hog cpu. Anyway, 4.000 incoming messages are duplicated into 16.000.000 messages, it takes half a minute to process on a single cpu. You might need multiple workqueue to split the load on all online cpus ;) Thanks ! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-07 14:07 ` Eric Dumazet @ 2014-04-07 14:23 ` Herbert Xu 2014-04-08 16:48 ` Ben Greear 0 siblings, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-07 14:23 UTC (permalink / raw) To: Eric Dumazet; +Cc: David S. Miller, netdev On Mon, Apr 07, 2014 at 07:07:42AM -0700, Eric Dumazet wrote: > > I suppose its a net-next material ? Oh yes absolutely. > Memory allocations (one incoming message -> ~4096 duplications) probably > should use GFP_KERNEL. This might need a change from rcu to simple mutex > for macvlan_broadcast() scan of all macvlans. Good point. However, as this change will be non-trivial we can do this as a follow-up. > cond_resched() could help macvlan_process_broadcast() to not hog cpu. > > Anyway, 4.000 incoming messages are duplicated into 16.000.000 messages, > it takes half a minute to process on a single cpu. You might need > multiple workqueue to split the load on all online cpus ;) You're welcome to add further improvements :) Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-07 14:23 ` Herbert Xu @ 2014-04-08 16:48 ` Ben Greear 2014-04-08 17:23 ` Herbert Xu 0 siblings, 1 reply; 28+ messages in thread From: Ben Greear @ 2014-04-08 16:48 UTC (permalink / raw) To: Herbert Xu; +Cc: Eric Dumazet, David S. Miller, netdev On 04/07/2014 07:23 AM, Herbert Xu wrote: > On Mon, Apr 07, 2014 at 07:07:42AM -0700, Eric Dumazet wrote: >> >> I suppose its a net-next material ? > > Oh yes absolutely. > >> Memory allocations (one incoming message -> ~4096 duplications) probably >> should use GFP_KERNEL. This might need a change from rcu to simple mutex >> for macvlan_broadcast() scan of all macvlans. > > Good point. However, as this change will be non-trivial we can > do this as a follow-up. > >> cond_resched() could help macvlan_process_broadcast() to not hog cpu. >> >> Anyway, 4.000 incoming messages are duplicated into 16.000.000 messages, >> it takes half a minute to process on a single cpu. You might need >> multiple workqueue to split the load on all online cpus ;) > > You're welcome to add further improvements :) At least for ARPs, maybe we could inspect the packet and only deliver to interfaces configured with the MAC that is being ARPed for (or ones that are in promisc)? This would need to be a configurable optimization probably, since perhaps someone depends on the current behaviour.... Thanks, Ben > > Thanks, > -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-08 16:48 ` Ben Greear @ 2014-04-08 17:23 ` Herbert Xu 2014-04-11 1:40 ` David Miller 0 siblings, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-08 17:23 UTC (permalink / raw) To: Ben Greear; +Cc: Eric Dumazet, David S. Miller, netdev On Tue, Apr 08, 2014 at 09:48:40AM -0700, Ben Greear wrote: > > At least for ARPs, maybe we could inspect the packet and only > deliver to interfaces configured with the MAC that is being > ARPed for (or ones that are in promisc)? Yes this would definitely be a great optimisation. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-08 17:23 ` Herbert Xu @ 2014-04-11 1:40 ` David Miller 2014-04-11 1:59 ` Herbert Xu 0 siblings, 1 reply; 28+ messages in thread From: David Miller @ 2014-04-11 1:40 UTC (permalink / raw) To: herbert; +Cc: greearb, eric.dumazet, netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 9 Apr 2014 01:23:49 +0800 > On Tue, Apr 08, 2014 at 09:48:40AM -0700, Ben Greear wrote: >> >> At least for ARPs, maybe we could inspect the packet and only >> deliver to interfaces configured with the MAC that is being >> ARPed for (or ones that are in promisc)? > > Yes this would definitely be a great optimisation. Yes and you can use the direct rather than the workqueue path. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-11 1:40 ` David Miller @ 2014-04-11 1:59 ` Herbert Xu 2014-04-11 2:09 ` Eric Dumazet 0 siblings, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-11 1:59 UTC (permalink / raw) To: David Miller; +Cc: greearb, eric.dumazet, netdev On Thu, Apr 10, 2014 at 09:40:23PM -0400, David Miller wrote: > From: Herbert Xu <herbert@gondor.apana.org.au> > Date: Wed, 9 Apr 2014 01:23:49 +0800 > > > On Tue, Apr 08, 2014 at 09:48:40AM -0700, Ben Greear wrote: > >> > >> At least for ARPs, maybe we could inspect the packet and only > >> deliver to interfaces configured with the MAC that is being > >> ARPed for (or ones that are in promisc)? > > > > Yes this would definitely be a great optimisation. > > Yes and you can use the direct rather than the workqueue path. This would be a worthy optimisation for bridge.c too. In fact, I wonder if it would be possible to share some of the broadcast/ multicast logic between macvlan and bridge. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-11 1:59 ` Herbert Xu @ 2014-04-11 2:09 ` Eric Dumazet 2014-04-11 2:13 ` Herbert Xu 0 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2014-04-11 2:09 UTC (permalink / raw) To: Herbert Xu; +Cc: David Miller, greearb, netdev On Fri, 2014-04-11 at 09:59 +0800, Herbert Xu wrote: > On Thu, Apr 10, 2014 at 09:40:23PM -0400, David Miller wrote: > > From: Herbert Xu <herbert@gondor.apana.org.au> > > Date: Wed, 9 Apr 2014 01:23:49 +0800 > > > > > On Tue, Apr 08, 2014 at 09:48:40AM -0700, Ben Greear wrote: > > >> > > >> At least for ARPs, maybe we could inspect the packet and only > > >> deliver to interfaces configured with the MAC that is being > > >> ARPed for (or ones that are in promisc)? > > > > > > Yes this would definitely be a great optimisation. > > > > Yes and you can use the direct rather than the workqueue path. > > This would be a worthy optimisation for bridge.c too. In fact, > I wonder if it would be possible to share some of the broadcast/ > multicast logic between macvlan and bridge. But many ARP messages are broadcasted, particularly when you restart an hypervisor with thousand of macvlan. This is the moment we have this horrible quadratic behavior in macvlan. I do not understand the idea... ARP filter would require to inspect the queried IPv4 address, and macvlan do not currently have a list of IPv4 addresses per port. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-11 2:09 ` Eric Dumazet @ 2014-04-11 2:13 ` Herbert Xu 2014-04-11 8:45 ` David Laight 2014-04-11 16:17 ` Ben Greear 0 siblings, 2 replies; 28+ messages in thread From: Herbert Xu @ 2014-04-11 2:13 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, greearb, netdev On Thu, Apr 10, 2014 at 07:09:30PM -0700, Eric Dumazet wrote: > On Fri, 2014-04-11 at 09:59 +0800, Herbert Xu wrote: > > On Thu, Apr 10, 2014 at 09:40:23PM -0400, David Miller wrote: > > > From: Herbert Xu <herbert@gondor.apana.org.au> > > > Date: Wed, 9 Apr 2014 01:23:49 +0800 > > > > > > > On Tue, Apr 08, 2014 at 09:48:40AM -0700, Ben Greear wrote: > > > >> > > > >> At least for ARPs, maybe we could inspect the packet and only > > > >> deliver to interfaces configured with the MAC that is being > > > >> ARPed for (or ones that are in promisc)? > > > > > > > > Yes this would definitely be a great optimisation. > > > > > > Yes and you can use the direct rather than the workqueue path. > > > > This would be a worthy optimisation for bridge.c too. In fact, > > I wonder if it would be possible to share some of the broadcast/ > > multicast logic between macvlan and bridge. > > But many ARP messages are broadcasted, particularly when you restart > an hypervisor with thousand of macvlan. This is the moment we have > this horrible quadratic behavior in macvlan. > > I do not understand the idea... > > ARP filter would require to inspect the queried IPv4 address, and > macvlan do not currently have a list of IPv4 addresses per port. Indeed. Thanks for snapping us out of our collective daydream :) -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-11 2:13 ` Herbert Xu @ 2014-04-11 8:45 ` David Laight 2014-04-11 16:11 ` Ben Greear 2014-04-11 16:17 ` Ben Greear 1 sibling, 1 reply; 28+ messages in thread From: David Laight @ 2014-04-11 8:45 UTC (permalink / raw) To: 'Herbert Xu', Eric Dumazet; +Cc: David Miller, greearb, netdev > > ARP filter would require to inspect the queried IPv4 address, and > > macvlan do not currently have a list of IPv4 addresses per port. > > Indeed. Thanks for snapping us out of our collective daydream :) You could save the info from ARP responses. Whether you want that much layer-breaking is another matter. By the sound of it, it might be worth doing the deep packet inspection at startup. David ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-11 8:45 ` David Laight @ 2014-04-11 16:11 ` Ben Greear 2014-04-11 16:20 ` Ben Greear 0 siblings, 1 reply; 28+ messages in thread From: Ben Greear @ 2014-04-11 16:11 UTC (permalink / raw) To: David Laight; +Cc: 'Herbert Xu', Eric Dumazet, David Miller, netdev On 04/11/2014 01:45 AM, David Laight wrote: >>> ARP filter would require to inspect the queried IPv4 address, and >>> macvlan do not currently have a list of IPv4 addresses per port. >> >> Indeed. Thanks for snapping us out of our collective daydream :) > > You could save the info from ARP responses. > > Whether you want that much layer-breaking is another matter. > By the sound of it, it might be worth doing the deep packet > inspection at startup. I wouldn't want to keep history, but in most network scenarios, having only the interface with the matching MAC answer the APR is good enough. I've too much else to worry about to code this up now though. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-11 16:11 ` Ben Greear @ 2014-04-11 16:20 ` Ben Greear 0 siblings, 0 replies; 28+ messages in thread From: Ben Greear @ 2014-04-11 16:20 UTC (permalink / raw) To: David Laight; +Cc: 'Herbert Xu', Eric Dumazet, David Miller, netdev On 04/11/2014 09:11 AM, Ben Greear wrote: > On 04/11/2014 01:45 AM, David Laight wrote: >>>> ARP filter would require to inspect the queried IPv4 address, and >>>> macvlan do not currently have a list of IPv4 addresses per port. >>> >>> Indeed. Thanks for snapping us out of our collective daydream :) >> >> You could save the info from ARP responses. >> >> Whether you want that much layer-breaking is another matter. >> By the sound of it, it might be worth doing the deep packet >> inspection at startup. > > I wouldn't want to keep history, but in most network scenarios, > having only the interface with the matching MAC answer the APR is good > enough. I meant to say matching IP instead of matching MAC. Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-11 2:13 ` Herbert Xu 2014-04-11 8:45 ` David Laight @ 2014-04-11 16:17 ` Ben Greear 1 sibling, 0 replies; 28+ messages in thread From: Ben Greear @ 2014-04-11 16:17 UTC (permalink / raw) To: Herbert Xu; +Cc: Eric Dumazet, David Miller, netdev On 04/10/2014 07:13 PM, Herbert Xu wrote: > On Thu, Apr 10, 2014 at 07:09:30PM -0700, Eric Dumazet wrote: >> On Fri, 2014-04-11 at 09:59 +0800, Herbert Xu wrote: >>> On Thu, Apr 10, 2014 at 09:40:23PM -0400, David Miller wrote: >>>> From: Herbert Xu <herbert@gondor.apana.org.au> >>>> Date: Wed, 9 Apr 2014 01:23:49 +0800 >>>> >>>>> On Tue, Apr 08, 2014 at 09:48:40AM -0700, Ben Greear wrote: >>>>>> >>>>>> At least for ARPs, maybe we could inspect the packet and only >>>>>> deliver to interfaces configured with the MAC that is being >>>>>> ARPed for (or ones that are in promisc)? >>>>> >>>>> Yes this would definitely be a great optimisation. >>>> >>>> Yes and you can use the direct rather than the workqueue path. >>> >>> This would be a worthy optimisation for bridge.c too. In fact, >>> I wonder if it would be possible to share some of the broadcast/ >>> multicast logic between macvlan and bridge. >> >> But many ARP messages are broadcasted, particularly when you restart >> an hypervisor with thousand of macvlan. This is the moment we have >> this horrible quadratic behavior in macvlan. >> >> I do not understand the idea... >> >> ARP filter would require to inspect the queried IPv4 address, and >> macvlan do not currently have a list of IPv4 addresses per port. > > Indeed. Thanks for snapping us out of our collective daydream :) arp-filter can already do this, so you could just re-use that behaviour, I think? Thanks, Ben -- Ben Greear <greearb@candelatech.com> Candela Technologies Inc http://www.candelatech.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-07 7:55 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu 2014-04-07 14:07 ` Eric Dumazet @ 2014-04-08 16:55 ` Stephen Hemminger 2014-04-08 17:14 ` Joe Perches 2014-04-09 8:50 ` Herbert Xu 1 sibling, 2 replies; 28+ messages in thread From: Stephen Hemminger @ 2014-04-08 16:55 UTC (permalink / raw) To: Herbert Xu; +Cc: David S. Miller, netdev On Mon, 07 Apr 2014 15:55:36 +0800 Herbert Xu <herbert@gondor.apana.org.au> wrote: > +static void macvlan_broadcast_enqueue(struct macvlan_port *port, > + struct sk_buff *skb) > +{ > + int err = -ENOMEM; > + > + skb = skb_clone(skb, GFP_ATOMIC); > + if (unlikely(!skb)) > + goto err; > + > + spin_lock(&port->bc_queue.lock); > + if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) { > + __skb_queue_tail(&port->bc_queue, skb); > + err = 0; > + } > + spin_unlock(&port->bc_queue.lock); > + > + if (unlikely(err)) { > +err: > + atomic_long_inc(&skb->dev->rx_dropped); > + return; > + } > + unlikely() with goto is redundant and unnecessary. IMHO jumping into a block is bad code style Why not just move err: to the end as in: skb = skb_clone(skb, GFP_ATOMIC); if (unlikely(!skb)) goto err; spin_lock(&port->bc_queue.lock); if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) { __skb_queue_tail(&port->bc_queue, skb); err = 0; } spin_unlock(&port->bc_queue.lock); if (err) goto err; schedule_work(&port->bc_work); return; err: atomic_long_inc(&skb->dev->rx_dropped); } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-08 16:55 ` Stephen Hemminger @ 2014-04-08 17:14 ` Joe Perches 2014-04-09 8:50 ` Herbert Xu 1 sibling, 0 replies; 28+ messages in thread From: Joe Perches @ 2014-04-08 17:14 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Herbert Xu, David S. Miller, netdev On Tue, 2014-04-08 at 09:55 -0700, Stephen Hemminger wrote: > On Mon, 07 Apr 2014 15:55:36 +0800 > Herbert Xu <herbert@gondor.apana.org.au> wrote: > > > +static void macvlan_broadcast_enqueue(struct macvlan_port *port, > > + struct sk_buff *skb) > > +{ > > + int err = -ENOMEM; > > + > > + skb = skb_clone(skb, GFP_ATOMIC); > > + if (unlikely(!skb)) > > + goto err; > > + > > + spin_lock(&port->bc_queue.lock); > > + if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) { > > + __skb_queue_tail(&port->bc_queue, skb); > > + err = 0; > > + } > > + spin_unlock(&port->bc_queue.lock); > > + > > + if (unlikely(err)) { > > +err: > > + atomic_long_inc(&skb->dev->rx_dropped); > > + return; > > + } > > + > > unlikely() with goto is redundant and unnecessary. > > IMHO jumping into a block is bad code style > > Why not just move err: to the end as in: > > > skb = skb_clone(skb, GFP_ATOMIC); > if (unlikely(!skb)) > goto err; > > spin_lock(&port->bc_queue.lock); > if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) { > __skb_queue_tail(&port->bc_queue, skb); > err = 0; > } > spin_unlock(&port->bc_queue.lock); > > if (err) > goto err; > > schedule_work(&port->bc_work); > return; > err: > > atomic_long_inc(&skb->dev->rx_dropped); > } Perhaps the spin_unlock should be duplicated instead and the "int err" removed. static void macvlan_broadcast_enqueue(struct macvlan_port *port, struct sk_buff *skb) { skb = skb_clone(skb, GFP_ATOMIC); if (!skb) goto err; spin_lock(&port->bc_queue.lock); if (skb_queue_len(&port->bc_queue) >= skb->dev->tx_queue_len) goto err_unlock; __skb_queue_tail(&port->bc_queue, skb); spin_unlock(&port->bc_queue.lock); schedule_work(&port->bc_work); return; err_unlock: spin_unlock(&port->bc_queue.lock); err: atomic_long_inc(&skb->dev->rx_dropped); } ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-08 16:55 ` Stephen Hemminger 2014-04-08 17:14 ` Joe Perches @ 2014-04-09 8:50 ` Herbert Xu 2014-04-09 10:10 ` David Laight 1 sibling, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-09 8:50 UTC (permalink / raw) To: Stephen Hemminger; +Cc: David S. Miller, netdev On Tue, Apr 08, 2014 at 09:55:23AM -0700, Stephen Hemminger wrote: > > unlikely() with goto is redundant and unnecessary. > > IMHO jumping into a block is bad code style > > Why not just move err: to the end as in: Good point. Here is an updated patch. macvlan: Move broadcasts into a work queue Currently broadcasts are handled in network RX context, where the packets are sent through netif_rx. This means that the number of macvlans will be constrained by the capacity of netif_rx. For example, setting up 4096 macvlans practically causes all broadcast packets to be dropped as the default netif_rx queue size simply can't handle 4096 skbs being stuffed into it all at once. Fundamentally, we need to ensure that the amount of work handled in each netif_rx backlog run is constrained. As broadcasts are anything but constrained, it either needs to be limited per run or moved to process context. This patch picks the second option and moves all broadcast handling bar the trivial case of packets going to a single interface into a work queue. Obviously there also needs to be a limit on how many broadcast packets we postpone in this way. I've arbitrarily chosen tx_queue_len of the master device as the limit (act_mirred also happens to use this parameter in a similar way). In order to ensure we don't exceed the backlog queue we will use netif_rx_ni instead of netif_rx for broadcast packets. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 753a8c2..8b8220f 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -30,6 +30,7 @@ #include <linux/if_link.h> #include <linux/if_macvlan.h> #include <linux/hash.h> +#include <linux/workqueue.h> #include <net/rtnetlink.h> #include <net/xfrm.h> @@ -40,10 +41,18 @@ struct macvlan_port { struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; struct list_head vlans; struct rcu_head rcu; + struct sk_buff_head bc_queue; + struct work_struct bc_work; bool passthru; int count; }; +struct macvlan_skb_cb { + const struct macvlan_dev *src; +}; + +#define MACVLAN_SKB_CB(__skb) ((struct macvlan_skb_cb *)&((__skb)->cb[0])) + static void macvlan_port_destroy(struct net_device *dev); static struct macvlan_port *macvlan_port_get_rcu(const struct net_device *dev) @@ -120,7 +129,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev = vlan->dev; if (local) - return dev_forward_skb(dev, skb); + return __dev_forward_skb(dev, skb); skb->dev = dev; if (ether_addr_equal_64bits(eth->h_dest, dev->broadcast)) @@ -128,7 +137,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, else skb->pkt_type = PACKET_MULTICAST; - return netif_rx(skb); + return 0; } static u32 macvlan_hash_mix(const struct macvlan_dev *vlan) @@ -175,32 +184,32 @@ static void macvlan_broadcast(struct sk_buff *skb, if (likely(nskb)) err = macvlan_broadcast_one( nskb, vlan, eth, - mode == MACVLAN_MODE_BRIDGE); + mode == MACVLAN_MODE_BRIDGE) ?: + netif_rx_ni(nskb); macvlan_count_rx(vlan, skb->len + ETH_HLEN, err == NET_RX_SUCCESS, 1); } } } -/* called under rcu_read_lock() from netif_receive_skb */ -static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) +static void macvlan_process_broadcast(struct work_struct *w) { - struct macvlan_port *port; - struct sk_buff *skb = *pskb; - const struct ethhdr *eth = eth_hdr(skb); - const struct macvlan_dev *vlan; - const struct macvlan_dev *src; - struct net_device *dev; - unsigned int len = 0; - int ret = NET_RX_DROP; + struct macvlan_port *port = container_of(w, struct macvlan_port, + bc_work); + struct sk_buff *skb; + struct sk_buff_head list; + + skb_queue_head_init(&list); + + spin_lock_bh(&port->bc_queue.lock); + skb_queue_splice_tail_init(&port->bc_queue, &list); + spin_unlock_bh(&port->bc_queue.lock); + + while ((skb = __skb_dequeue(&list))) { + const struct macvlan_dev *src = MACVLAN_SKB_CB(skb)->src; + + rcu_read_lock(); - port = macvlan_port_get_rcu(skb->dev); - if (is_multicast_ether_addr(eth->h_dest)) { - skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN); - if (!skb) - return RX_HANDLER_CONSUMED; - eth = eth_hdr(skb); - src = macvlan_hash_lookup(port, eth->h_source); if (!src) /* frame comes from an external address */ macvlan_broadcast(skb, port, NULL, @@ -213,20 +222,77 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) macvlan_broadcast(skb, port, src->dev, MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE); - else if (src->mode == MACVLAN_MODE_BRIDGE) + else /* * flood only to VEPA ports, bridge ports * already saw the frame on the way out. */ macvlan_broadcast(skb, port, src->dev, MACVLAN_MODE_VEPA); - else { + + rcu_read_unlock(); + + kfree_skb(skb); + } +} + +static void macvlan_broadcast_enqueue(struct macvlan_port *port, + struct sk_buff *skb) +{ + int err = -ENOMEM; + + skb = skb_clone(skb, GFP_ATOMIC); + if (!skb) + goto err; + + spin_lock(&port->bc_queue.lock); + if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) { + __skb_queue_tail(&port->bc_queue, skb); + err = 0; + } + spin_unlock(&port->bc_queue.lock); + + if (err) + goto err; + + schedule_work(&port->bc_work); + return; + +err: + atomic_long_inc(&skb->dev->rx_dropped); +} + +/* called under rcu_read_lock() from netif_receive_skb */ +static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) +{ + struct macvlan_port *port; + struct sk_buff *skb = *pskb; + const struct ethhdr *eth = eth_hdr(skb); + const struct macvlan_dev *vlan; + const struct macvlan_dev *src; + struct net_device *dev; + unsigned int len = 0; + int ret = NET_RX_DROP; + + port = macvlan_port_get_rcu(skb->dev); + if (is_multicast_ether_addr(eth->h_dest)) { + skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN); + if (!skb) + return RX_HANDLER_CONSUMED; + eth = eth_hdr(skb); + src = macvlan_hash_lookup(port, eth->h_source); + if (src && src->mode != MACVLAN_MODE_VEPA && + src->mode != MACVLAN_MODE_BRIDGE) { /* forward to original port. */ vlan = src; - ret = macvlan_broadcast_one(skb, vlan, eth, 0); + ret = macvlan_broadcast_one(skb, vlan, eth, 0) ?: + netif_rx(skb); goto out; } + MACVLAN_SKB_CB(skb)->src = src; + macvlan_broadcast_enqueue(port, skb); + return RX_HANDLER_PASS; } @@ -764,6 +830,9 @@ static int macvlan_port_create(struct net_device *dev) for (i = 0; i < MACVLAN_HASH_SIZE; i++) INIT_HLIST_HEAD(&port->vlan_hash[i]); + skb_queue_head_init(&port->bc_queue); + INIT_WORK(&port->bc_work, macvlan_process_broadcast); + err = netdev_rx_handler_register(dev, macvlan_handle_frame, port); if (err) kfree(port); @@ -776,6 +845,7 @@ static void macvlan_port_destroy(struct net_device *dev) { struct macvlan_port *port = macvlan_port_get_rtnl(dev); + cancel_work_sync(&port->bc_work); dev->priv_flags &= ~IFF_MACVLAN_PORT; netdev_rx_handler_unregister(dev); kfree_rcu(port, rcu); Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 28+ messages in thread
* RE: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-09 8:50 ` Herbert Xu @ 2014-04-09 10:10 ` David Laight 2014-04-10 12:59 ` Herbert Xu 0 siblings, 1 reply; 28+ messages in thread From: David Laight @ 2014-04-09 10:10 UTC (permalink / raw) To: 'Herbert Xu', Stephen Hemminger; +Cc: David S. Miller, netdev From: Herbert Xu ... > This patch picks the second option and moves all broadcast handling > bar the trivial case of packets going to a single interface into > a work queue. Obviously there also needs to be a limit on how > many broadcast packets we postpone in this way. I've arbitrarily > chosen tx_queue_len of the master device as the limit (act_mirred > also happens to use this parameter in a similar way). > > In order to ensure we don't exceed the backlog queue we will use > netif_rx_ni instead of netif_rx for broadcast packets. Should you limit the number of broadcasts queued for transmit on each interface as well as the number of postponed broadcasts. It probably isn't a good idea to completely fill an interface's transmit queue with broadcasts. David ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-09 10:10 ` David Laight @ 2014-04-10 12:59 ` Herbert Xu 2014-04-10 14:50 ` David Laight 0 siblings, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-10 12:59 UTC (permalink / raw) To: David Laight; +Cc: Stephen Hemminger, David S. Miller, netdev On Wed, Apr 09, 2014 at 10:10:16AM +0000, David Laight wrote: > From: Herbert Xu > ... > > This patch picks the second option and moves all broadcast handling > > bar the trivial case of packets going to a single interface into > > a work queue. Obviously there also needs to be a limit on how > > many broadcast packets we postpone in this way. I've arbitrarily > > chosen tx_queue_len of the master device as the limit (act_mirred > > also happens to use this parameter in a similar way). > > > > In order to ensure we don't exceed the backlog queue we will use > > netif_rx_ni instead of netif_rx for broadcast packets. > > Should you limit the number of broadcasts queued for transmit > on each interface as well as the number of postponed broadcasts. > > It probably isn't a good idea to completely fill an interface's > transmit queue with broadcasts. These are *received* packets so I don't see how they're going to fill up the transmit queues. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* RE: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-10 12:59 ` Herbert Xu @ 2014-04-10 14:50 ` David Laight 0 siblings, 0 replies; 28+ messages in thread From: David Laight @ 2014-04-10 14:50 UTC (permalink / raw) To: 'Herbert Xu'; +Cc: Stephen Hemminger, David S. Miller, netdev From: Herbert Xu > On Wed, Apr 09, 2014 at 10:10:16AM +0000, David Laight wrote: > > From: Herbert Xu > > ... > > > This patch picks the second option and moves all broadcast handling > > > bar the trivial case of packets going to a single interface into > > > a work queue. Obviously there also needs to be a limit on how > > > many broadcast packets we postpone in this way. I've arbitrarily > > > chosen tx_queue_len of the master device as the limit (act_mirred > > > also happens to use this parameter in a similar way). > > > > > > In order to ensure we don't exceed the backlog queue we will use > > > netif_rx_ni instead of netif_rx for broadcast packets. > > > > Should you limit the number of broadcasts queued for transmit > > on each interface as well as the number of postponed broadcasts. > > > > It probably isn't a good idea to completely fill an interface's > > transmit queue with broadcasts. > > These are *received* packets so I don't see how they're going > to fill up the transmit queues. I was thinking of a bridge - where the packets get transmitted. In this case they get put on the interfaces receive queue, but the same thing applies. Maybe there isn't actually a queue at that point? David ^ permalink raw reply [flat|nested] 28+ messages in thread
* [0/2] macvlan: Handle broadcasts in work queue @ 2014-04-07 7:53 Herbert Xu 2014-04-17 5:45 ` [PATCH 1/2] net: Add __dev_forward_skb Herbert Xu 2014-04-17 5:45 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu 0 siblings, 2 replies; 28+ messages in thread From: Herbert Xu @ 2014-04-07 7:53 UTC (permalink / raw) To: David S. Miller, netdev Hi: These two patches fix a problem where the macvlan driver tries to do too much work in netif_receive_skb. In particular, for broadcasts it'll send one skb to each macvlan interface all within a single netif_receive_skb call. This is way too much for netif_rx to handle once the number of macvlans exceeds a certain level (low thousands). Even if netif_rx could handle it, doing this amount of work in a single netif_receive_skb call seriously skews the balance of the network stack. These patches fix this by moving broadcasts to process context so that the scheduler can deal with the fairness issue. The immediate problem of the netif_rx limit is handled by using netif_rx_ni. Similar issues exist elsewhere in our stack regarding broadcasts, e.g., the bridge driver. Luckily, with the bridge driver at least we happen to have a limit of 256 ports per bridge. Still, doing 256 actions in one netif_receive_skb is less than ideal. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 1/2] net: Add __dev_forward_skb 2014-04-07 7:53 [0/2] macvlan: Handle broadcasts in " Herbert Xu @ 2014-04-17 5:45 ` Herbert Xu 2014-04-20 22:19 ` David Miller 2014-04-17 5:45 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu 1 sibling, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-17 5:45 UTC (permalink / raw) To: David S. Miller, netdev This patch adds the helper __dev_forward_skb which is identical to dev_forward_skb except that it doesn't actually inject the skb into the stack. This is useful where we wish to have finer control over how the packet is injected, e.g., via netif_rx_ni or netif_receive_skb. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> --- include/linux/netdevice.h | 1 + net/core/dev.c | 42 ++++++++++++++++++++++++------------------ 2 files changed, 25 insertions(+), 18 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 775cc95..4ab86d1 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2619,6 +2619,7 @@ int dev_get_phys_port_id(struct net_device *dev, struct netdev_phys_port_id *ppid); int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq); +int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb); int dev_forward_skb(struct net_device *dev, struct sk_buff *skb); bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb); diff --git a/net/core/dev.c b/net/core/dev.c index 7570634..bd6ca09 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1661,6 +1661,29 @@ bool is_skb_forwardable(struct net_device *dev, struct sk_buff *skb) } EXPORT_SYMBOL_GPL(is_skb_forwardable); +int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +{ + if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) { + if (skb_copy_ubufs(skb, GFP_ATOMIC)) { + atomic_long_inc(&dev->rx_dropped); + kfree_skb(skb); + return NET_RX_DROP; + } + } + + if (unlikely(!is_skb_forwardable(dev, skb))) { + atomic_long_inc(&dev->rx_dropped); + kfree_skb(skb); + return NET_RX_DROP; + } + + skb_scrub_packet(skb, true); + skb->protocol = eth_type_trans(skb, dev); + + return 0; +} +EXPORT_SYMBOL_GPL(__dev_forward_skb); + /** * dev_forward_skb - loopback an skb to another netif * @@ -1681,24 +1704,7 @@ EXPORT_SYMBOL_GPL(is_skb_forwardable); */ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) { - if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) { - if (skb_copy_ubufs(skb, GFP_ATOMIC)) { - atomic_long_inc(&dev->rx_dropped); - kfree_skb(skb); - return NET_RX_DROP; - } - } - - if (unlikely(!is_skb_forwardable(dev, skb))) { - atomic_long_inc(&dev->rx_dropped); - kfree_skb(skb); - return NET_RX_DROP; - } - - skb_scrub_packet(skb, true); - skb->protocol = eth_type_trans(skb, dev); - - return netif_rx_internal(skb); + return __dev_forward_skb(dev, skb) ?: netif_rx_internal(skb); } EXPORT_SYMBOL_GPL(dev_forward_skb); -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 1/2] net: Add __dev_forward_skb 2014-04-17 5:45 ` [PATCH 1/2] net: Add __dev_forward_skb Herbert Xu @ 2014-04-20 22:19 ` David Miller 0 siblings, 0 replies; 28+ messages in thread From: David Miller @ 2014-04-20 22:19 UTC (permalink / raw) To: herbert; +Cc: netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Thu, 17 Apr 2014 13:45:03 +0800 > This patch adds the helper __dev_forward_skb which is identical to > dev_forward_skb except that it doesn't actually inject the skb into > the stack. This is useful where we wish to have finer control over > how the packet is injected, e.g., via netif_rx_ni or netif_receive_skb. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Applied. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-07 7:53 [0/2] macvlan: Handle broadcasts in " Herbert Xu 2014-04-17 5:45 ` [PATCH 1/2] net: Add __dev_forward_skb Herbert Xu @ 2014-04-17 5:45 ` Herbert Xu 2014-04-20 22:20 ` David Miller 1 sibling, 1 reply; 28+ messages in thread From: Herbert Xu @ 2014-04-17 5:45 UTC (permalink / raw) To: David S. Miller, netdev Currently broadcasts are handled in network RX context, where the packets are sent through netif_rx. This means that the number of macvlans will be constrained by the capacity of netif_rx. For example, setting up 4096 macvlans practically causes all broadcast packets to be dropped as the default netif_rx queue size simply can't handle 4096 skbs being stuffed into it all at once. Fundamentally, we need to ensure that the amount of work handled in each netif_rx backlog run is constrained. As broadcasts are anything but constrained, it either needs to be limited per run or moved to process context. This patch picks the second option and moves all broadcast handling bar the trivial case of packets going to a single interface into a work queue. Obviously there also needs to be a limit on how many broadcast packets we postpone in this way. I've arbitrarily chosen tx_queue_len of the master device as the limit (act_mirred also happens to use this parameter in a similar way). In order to ensure we don't exceed the backlog queue we will use netif_rx_ni instead of netif_rx for broadcast packets. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 753a8c2..8b8220f 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -30,6 +30,7 @@ #include <linux/if_link.h> #include <linux/if_macvlan.h> #include <linux/hash.h> +#include <linux/workqueue.h> #include <net/rtnetlink.h> #include <net/xfrm.h> @@ -40,10 +41,18 @@ struct macvlan_port { struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; struct list_head vlans; struct rcu_head rcu; + struct sk_buff_head bc_queue; + struct work_struct bc_work; bool passthru; int count; }; +struct macvlan_skb_cb { + const struct macvlan_dev *src; +}; + +#define MACVLAN_SKB_CB(__skb) ((struct macvlan_skb_cb *)&((__skb)->cb[0])) + static void macvlan_port_destroy(struct net_device *dev); static struct macvlan_port *macvlan_port_get_rcu(const struct net_device *dev) @@ -120,7 +129,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev = vlan->dev; if (local) - return dev_forward_skb(dev, skb); + return __dev_forward_skb(dev, skb); skb->dev = dev; if (ether_addr_equal_64bits(eth->h_dest, dev->broadcast)) @@ -128,7 +137,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, else skb->pkt_type = PACKET_MULTICAST; - return netif_rx(skb); + return 0; } static u32 macvlan_hash_mix(const struct macvlan_dev *vlan) @@ -175,32 +184,32 @@ static void macvlan_broadcast(struct sk_buff *skb, if (likely(nskb)) err = macvlan_broadcast_one( nskb, vlan, eth, - mode == MACVLAN_MODE_BRIDGE); + mode == MACVLAN_MODE_BRIDGE) ?: + netif_rx_ni(nskb); macvlan_count_rx(vlan, skb->len + ETH_HLEN, err == NET_RX_SUCCESS, 1); } } } -/* called under rcu_read_lock() from netif_receive_skb */ -static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) +static void macvlan_process_broadcast(struct work_struct *w) { - struct macvlan_port *port; - struct sk_buff *skb = *pskb; - const struct ethhdr *eth = eth_hdr(skb); - const struct macvlan_dev *vlan; - const struct macvlan_dev *src; - struct net_device *dev; - unsigned int len = 0; - int ret = NET_RX_DROP; + struct macvlan_port *port = container_of(w, struct macvlan_port, + bc_work); + struct sk_buff *skb; + struct sk_buff_head list; + + skb_queue_head_init(&list); + + spin_lock_bh(&port->bc_queue.lock); + skb_queue_splice_tail_init(&port->bc_queue, &list); + spin_unlock_bh(&port->bc_queue.lock); + + while ((skb = __skb_dequeue(&list))) { + const struct macvlan_dev *src = MACVLAN_SKB_CB(skb)->src; + + rcu_read_lock(); - port = macvlan_port_get_rcu(skb->dev); - if (is_multicast_ether_addr(eth->h_dest)) { - skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN); - if (!skb) - return RX_HANDLER_CONSUMED; - eth = eth_hdr(skb); - src = macvlan_hash_lookup(port, eth->h_source); if (!src) /* frame comes from an external address */ macvlan_broadcast(skb, port, NULL, @@ -213,20 +222,77 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) macvlan_broadcast(skb, port, src->dev, MACVLAN_MODE_VEPA | MACVLAN_MODE_BRIDGE); - else if (src->mode == MACVLAN_MODE_BRIDGE) + else /* * flood only to VEPA ports, bridge ports * already saw the frame on the way out. */ macvlan_broadcast(skb, port, src->dev, MACVLAN_MODE_VEPA); - else { + + rcu_read_unlock(); + + kfree_skb(skb); + } +} + +static void macvlan_broadcast_enqueue(struct macvlan_port *port, + struct sk_buff *skb) +{ + int err = -ENOMEM; + + skb = skb_clone(skb, GFP_ATOMIC); + if (!skb) + goto err; + + spin_lock(&port->bc_queue.lock); + if (skb_queue_len(&port->bc_queue) < skb->dev->tx_queue_len) { + __skb_queue_tail(&port->bc_queue, skb); + err = 0; + } + spin_unlock(&port->bc_queue.lock); + + if (err) + goto err; + + schedule_work(&port->bc_work); + return; + +err: + atomic_long_inc(&skb->dev->rx_dropped); +} + +/* called under rcu_read_lock() from netif_receive_skb */ +static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb) +{ + struct macvlan_port *port; + struct sk_buff *skb = *pskb; + const struct ethhdr *eth = eth_hdr(skb); + const struct macvlan_dev *vlan; + const struct macvlan_dev *src; + struct net_device *dev; + unsigned int len = 0; + int ret = NET_RX_DROP; + + port = macvlan_port_get_rcu(skb->dev); + if (is_multicast_ether_addr(eth->h_dest)) { + skb = ip_check_defrag(skb, IP_DEFRAG_MACVLAN); + if (!skb) + return RX_HANDLER_CONSUMED; + eth = eth_hdr(skb); + src = macvlan_hash_lookup(port, eth->h_source); + if (src && src->mode != MACVLAN_MODE_VEPA && + src->mode != MACVLAN_MODE_BRIDGE) { /* forward to original port. */ vlan = src; - ret = macvlan_broadcast_one(skb, vlan, eth, 0); + ret = macvlan_broadcast_one(skb, vlan, eth, 0) ?: + netif_rx(skb); goto out; } + MACVLAN_SKB_CB(skb)->src = src; + macvlan_broadcast_enqueue(port, skb); + return RX_HANDLER_PASS; } @@ -764,6 +830,9 @@ static int macvlan_port_create(struct net_device *dev) for (i = 0; i < MACVLAN_HASH_SIZE; i++) INIT_HLIST_HEAD(&port->vlan_hash[i]); + skb_queue_head_init(&port->bc_queue); + INIT_WORK(&port->bc_work, macvlan_process_broadcast); + err = netdev_rx_handler_register(dev, macvlan_handle_frame, port); if (err) kfree(port); @@ -776,6 +845,7 @@ static void macvlan_port_destroy(struct net_device *dev) { struct macvlan_port *port = macvlan_port_get_rtnl(dev); + cancel_work_sync(&port->bc_work); dev->priv_flags &= ~IFF_MACVLAN_PORT; netdev_rx_handler_unregister(dev); kfree_rcu(port, rcu); Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 2/2] macvlan: Move broadcasts into a work queue 2014-04-17 5:45 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu @ 2014-04-20 22:20 ` David Miller 0 siblings, 0 replies; 28+ messages in thread From: David Miller @ 2014-04-20 22:20 UTC (permalink / raw) To: herbert; +Cc: netdev From: Herbert Xu <herbert@gondor.apana.org.au> Date: Thu, 17 Apr 2014 13:45:59 +0800 > Currently broadcasts are handled in network RX context, where > the packets are sent through netif_rx. This means that the number > of macvlans will be constrained by the capacity of netif_rx. > > For example, setting up 4096 macvlans practically causes all > broadcast packets to be dropped as the default netif_rx queue > size simply can't handle 4096 skbs being stuffed into it all > at once. > > Fundamentally, we need to ensure that the amount of work handled > in each netif_rx backlog run is constrained. As broadcasts are > anything but constrained, it either needs to be limited per run > or moved to process context. > > This patch picks the second option and moves all broadcast handling > bar the trivial case of packets going to a single interface into > a work queue. Obviously there also needs to be a limit on how > many broadcast packets we postpone in this way. I've arbitrarily > chosen tx_queue_len of the master device as the limit (act_mirred > also happens to use this parameter in a similar way). > > In order to ensure we don't exceed the backlog queue we will use > netif_rx_ni instead of netif_rx for broadcast packets. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Applied. ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2014-04-20 22:20 UTC | newest] Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <<20140407075347.GA26461@gondor.apana.org.au> 2014-04-07 7:55 ` [PATCH 1/2] net: Add __dev_forward_skb Herbert Xu 2014-04-07 18:48 ` Stephen Hemminger 2014-04-08 7:15 ` Herbert Xu 2014-04-08 16:27 ` David Miller 2014-04-07 7:55 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu 2014-04-07 14:07 ` Eric Dumazet 2014-04-07 14:23 ` Herbert Xu 2014-04-08 16:48 ` Ben Greear 2014-04-08 17:23 ` Herbert Xu 2014-04-11 1:40 ` David Miller 2014-04-11 1:59 ` Herbert Xu 2014-04-11 2:09 ` Eric Dumazet 2014-04-11 2:13 ` Herbert Xu 2014-04-11 8:45 ` David Laight 2014-04-11 16:11 ` Ben Greear 2014-04-11 16:20 ` Ben Greear 2014-04-11 16:17 ` Ben Greear 2014-04-08 16:55 ` Stephen Hemminger 2014-04-08 17:14 ` Joe Perches 2014-04-09 8:50 ` Herbert Xu 2014-04-09 10:10 ` David Laight 2014-04-10 12:59 ` Herbert Xu 2014-04-10 14:50 ` David Laight 2014-04-07 7:53 [0/2] macvlan: Handle broadcasts in " Herbert Xu 2014-04-17 5:45 ` [PATCH 1/2] net: Add __dev_forward_skb Herbert Xu 2014-04-20 22:19 ` David Miller 2014-04-17 5:45 ` [PATCH 2/2] macvlan: Move broadcasts into a work queue Herbert Xu 2014-04-20 22:20 ` David Miller
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.