linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH net-next V2 0/6] XDP rx handler
@ 2018-08-13  3:17 Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 1/6] net: core: factor out generic XDP check and process routine Jason Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-13  3:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: ast, daniel, jbrouer, mst, Jason Wang

Hi:

This series tries to implement XDP support for rx handlers. This would
be useful for doing native XDP on stacked devices like macvlan, bridge
or even bond.

The idea is simple: let the stacked device register an XDP rx handler.
When the driver's XDP program returns XDP_PASS, the driver calls a new
helper, xdp_do_pass(), which tries to pass the XDP buff to the XDP rx
handler directly. The XDP rx handler may then decide how to proceed: it
could consume the buff, ask the driver to drop the packet, or ask the
driver to fall back to the normal skb path.
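
For the driver side, the usage looks roughly like the sketch below
(simplified; my_rx_queue, build_skb_and_receive() and drop_buf() are
placeholders for a driver's own structures and routines, not part of
this series):

static void my_receive_xdp(struct my_rx_queue *rq, struct xdp_buff *xdp)
{
	struct bpf_prog *prog = rcu_dereference(rq->xdp_prog);
	u32 act;

	if (!prog)
		goto skb_path;		/* no XDP prog attached: unchanged */

	act = bpf_prog_run_xdp(prog, xdp);
	switch (act) {
	case XDP_PASS:
		/* New step: offer the buff to the stacked device's XDP
		 * rx handler before building an skb.
		 */
		switch (xdp_do_pass(xdp)) {
		case RX_XDP_HANDLER_CONSUMED:
			return;			/* handler now owns the buff */
		case RX_XDP_HANDLER_DROP:
			goto drop;		/* handler asked us to drop it */
		case RX_XDP_HANDLER_FALLBACK:
		default:
			break;			/* continue to the skb path */
		}
		goto skb_path;
	case XDP_TX:
	case XDP_REDIRECT:
		/* XDP_TX/XDP_REDIRECT handling elided; unchanged */
		return;
	case XDP_DROP:
	default:
		goto drop;
	}

skb_path:
	build_skb_and_receive(rq, xdp);		/* normal skb path as before */
	return;
drop:
	drop_buf(rq, xdp);
}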

A sample XDP rx handler was implemented for macvlan, and virtio-net
(mergeable buffer case) was converted to call xdp_do_pass() as an
example. For ease of comparison, generic XDP support for rx handlers
was also implemented.

Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
shows about 83% improvement.

Please review.

Thanks

Jason Wang (6):
  net: core: factor out generic XDP check and process routine
  net: core: generic XDP support for stacked device
  net: core: introduce XDP rx handler
  macvlan: count the number of vlan in source mode
  macvlan: basic XDP support
  virtio-net: support XDP rx handler

 drivers/net/macvlan.c      | 189 +++++++++++++++++++++++++++++++++++++++++++--
 drivers/net/virtio_net.c   |  11 +++
 include/linux/filter.h     |   1 +
 include/linux/if_macvlan.h |   1 +
 include/linux/netdevice.h  |  12 +++
 net/core/dev.c             |  69 +++++++++++++----
 net/core/filter.c          |  28 +++++++
 7 files changed, 293 insertions(+), 18 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH net-next V2 1/6] net: core: factor out generic XDP check and process routine
  2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
@ 2018-08-13  3:17 ` Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 2/6] net: core: generic XDP support for stacked device Jason Wang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-13  3:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: ast, daniel, jbrouer, mst, Jason Wang

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 net/core/dev.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f68122f..605c66e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4392,13 +4392,9 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(do_xdp_generic);
 
-static int netif_rx_internal(struct sk_buff *skb)
+static int netif_do_generic_xdp(struct sk_buff *skb)
 {
-	int ret;
-
-	net_timestamp_check(netdev_tstamp_prequeue, skb);
-
-	trace_netif_rx(skb);
+	int ret = XDP_PASS;
 
 	if (static_branch_unlikely(&generic_xdp_needed_key)) {
 		int ret;
@@ -4408,15 +4404,28 @@ static int netif_rx_internal(struct sk_buff *skb)
 		ret = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
 		rcu_read_unlock();
 		preempt_enable();
-
-		/* Consider XDP consuming the packet a success from
-		 * the netdev point of view we do not want to count
-		 * this as an error.
-		 */
-		if (ret != XDP_PASS)
-			return NET_RX_SUCCESS;
 	}
 
+	return ret;
+}
+
+static int netif_rx_internal(struct sk_buff *skb)
+{
+	int ret;
+
+	net_timestamp_check(netdev_tstamp_prequeue, skb);
+
+	trace_netif_rx(skb);
+
+	ret = netif_do_generic_xdp(skb);
+
+	/* Consider XDP consuming the packet a success from
+	 * the netdev point of view we do not want to count
+	 * this as an error.
+	 */
+	if (ret != XDP_PASS)
+		return NET_RX_SUCCESS;
+
 #ifdef CONFIG_RPS
 	if (static_key_false(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH net-next V2 2/6] net: core: generic XDP support for stacked device
  2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 1/6] net: core: factor out generic XDP check and process routine Jason Wang
@ 2018-08-13  3:17 ` Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 3/6] net: core: introduce XDP rx handler Jason Wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-13  3:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: ast, daniel, jbrouer, mst, Jason Wang

A stacked device usually changes skb->dev to its own device and returns
RX_HANDLER_ANOTHER during rx handler processing. But we don't call the
generic XDP routine at that point, which means generic XDP can't work
for stacked devices.

Fix this by calling netif_do_generic_xdp() if the rx handler returns
RX_HANDLER_ANOTHER. This allows us to do generic XDP on stacked devices,
e.g. macvlan.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 net/core/dev.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 605c66e..a77ce08 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4822,6 +4822,11 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc,
 			ret = NET_RX_SUCCESS;
 			goto out;
 		case RX_HANDLER_ANOTHER:
+			ret = netif_do_generic_xdp(skb);
+			if (ret != XDP_PASS) {
+				ret = NET_RX_SUCCESS;
+				goto out;
+			}
 			goto another_round;
 		case RX_HANDLER_EXACT:
 			deliver_exact = true;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH net-next V2 3/6] net: core: introduce XDP rx handler
  2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 1/6] net: core: factor out generic XDP check and process routine Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 2/6] net: core: generic XDP support for stacked device Jason Wang
@ 2018-08-13  3:17 ` Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 4/6] macvlan: count the number of vlan in source mode Jason Wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-13  3:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: ast, daniel, jbrouer, mst, Jason Wang

This patch introduces the XDP rx handler. It will be used by stacked
devices that depend on an rx handler to get a fast packet processing
path based on XDP.

The idea is simple: when the XDP program returns XDP_PASS, instead of
building an skb immediately, the driver calls xdp_do_pass() to check
whether there is an XDP rx handler; if there is, it passes the XDP
buffer to the XDP rx handler first.

An XDP rx handler has two main tasks: the first is to check whether the
setup or the packet can be processed through the XDP buff directly. The
second is to run the XDP program. An XDP rx handler can return several
different results, defined by rx_xdp_handler_result_t:

RX_XDP_HANDLER_CONSUMED: the XDP buff was consumed.
RX_XDP_HANDLER_DROP: the XDP rx handler asks the driver to drop the packet.
RX_XDP_HANDLER_FALLBACK: the XDP rx handler can not process the packet
(e.g. cloning would be needed), and we need to fall back to the normal
skb path to deal with it.

Consider the following configuration: a level 0 (L0) device which has
an rx handler for a level 1 (L1) device, which in turn has an rx handler
for a level 2 (L2) device.

L2 device
    |
L1 device
    |
L0 device

With the help of XDP rx handlers, we can attach an XDP program at each
layer, or even run a native XDP handler for L2 without an XDP prog
attached to the L1 device:

(XDP prog for L2 device)
    |
L2 XDP rx handler for L1
    |
(XDP prog for L1 device)
    |
L1 XDP rx handler for L0
    |
XDP prog for L0 device

It works like this: when the XDP program for the L0 device returns
XDP_PASS, we first check for an XDP rx handler and pass the XDP buff to
it if there is one. The L1 XDP rx handler is then called and runs the
XDP program for L1. When the L1 XDP program returns XDP_PASS, or there
is no XDP program attached to L1, we call xdp_do_pass() to pass the buff
to the XDP rx handler registered on L1. The XDP buff is then passed to
the L2 XDP rx handler, and so on, which runs the L2 XDP program if there
is one. If there is no L2 XDP program, or the XDP program returns
XDP_PASS, the handler usually builds an skb and calls netif_rx() for a
local receive. If any of the XDP rx handlers returns
RX_XDP_HANDLER_FALLBACK, the code returns to the L0 device, which builds
an skb and goes through the normal rx handler path for skbs.
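
For a stacked device, the usage is roughly the sketch below (the foo_*
names are placeholders for illustration only; the macvlan patch later in
this series is the real example):

static rx_xdp_handler_result_t foo_handle_xdp(struct net_device *lowerdev,
					      struct xdp_buff *xdp)
{
	struct foo_dev *foo = foo_lookup_dest(lowerdev, xdp);	/* placeholder */
	rx_xdp_handler_result_t ret;
	struct bpf_prog *prog;

	/* Not for us, or too complex to handle as an XDP buff. */
	if (!foo)
		return RX_XDP_HANDLER_FALLBACK;

	prog = rcu_dereference(foo->xdp_prog);
	if (prog && bpf_prog_run_xdp(prog, xdp) != XDP_PASS)
		return RX_XDP_HANDLER_DROP;	/* real code also handles XDP_TX etc. */

	/* Chain further up the stack via xdp_do_pass(), as described above. */
	ret = xdp_do_pass(xdp);
	if (ret != RX_XDP_HANDLER_FALLBACK)
		return ret;

	/* Nobody above us took it: consume locally, e.g. build skb + netif_rx(). */
	foo_receive_locally(foo, xdp);
	return RX_XDP_HANDLER_CONSUMED;
}

/* At setup time (under RTNL), alongside netdev_rx_handler_register(): */
err = netdev_rx_xdp_handler_register(lowerdev, foo_handle_xdp);
/* ... and on teardown: */
netdev_rx_xdp_handler_unregister(lowerdev);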

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 include/linux/filter.h    |  1 +
 include/linux/netdevice.h | 12 ++++++++++++
 net/core/dev.c            | 29 +++++++++++++++++++++++++++++
 net/core/filter.c         | 28 ++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd73..7cc8e69 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -791,6 +791,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
+rx_handler_result_t xdp_do_pass(struct xdp_buff *xdp);
 void xdp_do_flush_map(void);
 
 void bpf_warn_invalid_xdp_action(u32 act);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 282e2e9..21f0a9e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -421,6 +421,14 @@ enum rx_handler_result {
 typedef enum rx_handler_result rx_handler_result_t;
 typedef rx_handler_result_t rx_handler_func_t(struct sk_buff **pskb);
 
+enum rx_xdp_handler_result {
+	RX_XDP_HANDLER_CONSUMED,
+	RX_XDP_HANDLER_DROP,
+	RX_XDP_HANDLER_FALLBACK,
+};
+typedef enum rx_xdp_handler_result rx_xdp_handler_result_t;
+typedef rx_xdp_handler_result_t rx_xdp_handler_func_t(struct net_device *dev,
+						      struct xdp_buff *xdp);
 void __napi_schedule(struct napi_struct *n);
 void __napi_schedule_irqoff(struct napi_struct *n);
 
@@ -1898,6 +1906,7 @@ struct net_device {
 	struct bpf_prog __rcu	*xdp_prog;
 	unsigned long		gro_flush_timeout;
 	rx_handler_func_t __rcu	*rx_handler;
+	rx_xdp_handler_func_t __rcu *rx_xdp_handler;
 	void __rcu		*rx_handler_data;
 
 #ifdef CONFIG_NET_CLS_ACT
@@ -3530,7 +3539,10 @@ bool netdev_is_rx_handler_busy(struct net_device *dev);
 int netdev_rx_handler_register(struct net_device *dev,
 			       rx_handler_func_t *rx_handler,
 			       void *rx_handler_data);
+int netdev_rx_xdp_handler_register(struct net_device *dev,
+				   rx_xdp_handler_func_t *rx_xdp_handler);
 void netdev_rx_handler_unregister(struct net_device *dev);
+void netdev_rx_xdp_handler_unregister(struct net_device *dev);
 
 bool dev_valid_name(const char *name);
 int dev_ioctl(struct net *net, unsigned int cmd, struct ifreq *ifr,
diff --git a/net/core/dev.c b/net/core/dev.c
index a77ce08..b4e8949 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4638,6 +4638,12 @@ bool netdev_is_rx_handler_busy(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_is_rx_handler_busy);
 
+static bool netdev_is_rx_xdp_handler_busy(struct net_device *dev)
+{
+	ASSERT_RTNL();
+	return dev && rtnl_dereference(dev->rx_xdp_handler);
+}
+
 /**
  *	netdev_rx_handler_register - register receive handler
  *	@dev: device to register a handler for
@@ -4670,6 +4676,22 @@ int netdev_rx_handler_register(struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_register);
 
+int netdev_rx_xdp_handler_register(struct net_device *dev,
+				   rx_xdp_handler_func_t *rx_xdp_handler)
+{
+	if (netdev_is_rx_xdp_handler_busy(dev))
+		return -EBUSY;
+
+	if (dev->priv_flags & IFF_NO_RX_HANDLER)
+		return -EINVAL;
+
+	rcu_assign_pointer(dev->rx_xdp_handler, rx_xdp_handler);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(netdev_rx_xdp_handler_register);
+
+
 /**
  *	netdev_rx_handler_unregister - unregister receive handler
  *	@dev: device to unregister a handler from
@@ -4692,6 +4714,13 @@ void netdev_rx_handler_unregister(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
+void netdev_rx_xdp_handler_unregister(struct net_device *dev)
+{
+	ASSERT_RTNL();
+	RCU_INIT_POINTER(dev->rx_xdp_handler, NULL);
+}
+EXPORT_SYMBOL_GPL(netdev_rx_xdp_handler_unregister);
+
 /*
  * Limit the use of PFMEMALLOC reserves to those protocols that implement
  * the special handling of PFMEMALLOC skbs.
diff --git a/net/core/filter.c b/net/core/filter.c
index 587bbfb..9ea3797 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3312,6 +3312,34 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+rx_handler_result_t xdp_do_pass(struct xdp_buff *xdp)
+{
+	rx_xdp_handler_result_t ret;
+	rx_xdp_handler_func_t *rx_xdp_handler;
+	struct net_device *dev = xdp->rxq->dev;
+
+	ret = RX_XDP_HANDLER_FALLBACK;
+	rx_xdp_handler = rcu_dereference(dev->rx_xdp_handler);
+
+	if (rx_xdp_handler) {
+		ret = rx_xdp_handler(dev, xdp);
+		switch (ret) {
+		case RX_XDP_HANDLER_CONSUMED:
+			/* Fall through */
+		case RX_XDP_HANDLER_DROP:
+			/* Fall through */
+		case RX_XDP_HANDLER_FALLBACK:
+			break;
+		default:
+			BUG();
+			break;
+		}
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(xdp_do_pass);
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH net-next V2 4/6] macvlan: count the number of vlan in source mode
  2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
                   ` (2 preceding siblings ...)
  2018-08-13  3:17 ` [RFC PATCH net-next V2 3/6] net: core: introduce XDP rx handler Jason Wang
@ 2018-08-13  3:17 ` Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 5/6] macvlan: basic XDP support Jason Wang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-13  3:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: ast, daniel, jbrouer, mst, Jason Wang

This patch counts the number of vlans in source mode. This will be used
when implementing the XDP rx handler for macvlan.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvlan.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index cfda146..b7c814d 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -53,6 +53,7 @@ struct macvlan_port {
 	struct hlist_head	vlan_source_hash[MACVLAN_HASH_SIZE];
 	DECLARE_BITMAP(mc_filter, MACVLAN_MC_FILTER_SZ);
 	unsigned char           perm_addr[ETH_ALEN];
+	unsigned long           source_count;
 };
 
 struct macvlan_source_entry {
@@ -1433,6 +1434,9 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
 	if (err)
 		goto unregister_netdev;
 
+	if (vlan->mode == MACVLAN_MODE_SOURCE)
+		port->source_count++;
+
 	list_add_tail_rcu(&vlan->list, &port->vlans);
 	netif_stacked_transfer_operstate(lowerdev, dev);
 	linkwatch_fire_event(dev);
@@ -1477,6 +1481,7 @@ static int macvlan_changelink(struct net_device *dev,
 			      struct netlink_ext_ack *extack)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
+	struct macvlan_port *port = vlan->port;
 	enum macvlan_mode mode;
 	bool set_mode = false;
 	enum macvlan_macaddr_mode macmode;
@@ -1491,8 +1496,10 @@ static int macvlan_changelink(struct net_device *dev,
 		    (vlan->mode == MACVLAN_MODE_PASSTHRU))
 			return -EINVAL;
 		if (vlan->mode == MACVLAN_MODE_SOURCE &&
-		    vlan->mode != mode)
+		    vlan->mode != mode) {
 			macvlan_flush_sources(vlan->port, vlan);
+			port->source_count--;
+		}
 	}
 
 	if (data && data[IFLA_MACVLAN_FLAGS]) {
@@ -1510,8 +1517,13 @@ static int macvlan_changelink(struct net_device *dev,
 		}
 		vlan->flags = flags;
 	}
-	if (set_mode)
+	if (set_mode) {
 		vlan->mode = mode;
+		if (mode == MACVLAN_MODE_SOURCE &&
+		    vlan->mode != mode) {
+			port->source_count++;
+		}
+	}
 	if (data && data[IFLA_MACVLAN_MACADDR_MODE]) {
 		if (vlan->mode != MACVLAN_MODE_SOURCE)
 			return -EINVAL;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH net-next V2 5/6] macvlan: basic XDP support
  2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
                   ` (3 preceding siblings ...)
  2018-08-13  3:17 ` [RFC PATCH net-next V2 4/6] macvlan: count the number of vlan in source mode Jason Wang
@ 2018-08-13  3:17 ` Jason Wang
  2018-08-13  3:17 ` [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler Jason Wang
  2018-08-14  0:32 ` [RFC PATCH net-next V2 0/6] " Alexei Starovoitov
  6 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-13  3:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: ast, daniel, jbrouer, mst, Jason Wang

This patch implements basic XDP support for macvlan. The implementation
is split into two parts:

1) XDP rx handler on the underlying device:

We register an XDP rx handler (macvlan_handle_xdp) on the underlying
device. In this handler, we let the following cases go to the slow path
(RX_XDP_HANDLER_FALLBACK):

- The packet is a multicast packet.
- A vlan is in source mode.
- The destination mac address does not match any vlan.

If none of the above cases is true, it means we can go for the XDP path
directly. We change the dev to the destination vlan's device and
continue processing the XDP buff there.

2) If we find a destination vlan, we try to run its XDP prog.

If the XDP prog returns XDP_PASS, we call xdp_do_pass() to pass the buff
to an upper layer XDP rx handler. This is needed for e.g. macvtap to
work. If RX_XDP_HANDLER_FALLBACK is returned, we build an skb and call
netif_rx() to finish the receive. Otherwise we just return the result to
the lower device. For XDP_TX, we build an skb and use the generic XDP
transmission routine for simplicity. This could be optimized on top.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/macvlan.c      | 173 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/if_macvlan.h |   1 +
 2 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index b7c814d..42b747c 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -34,6 +34,7 @@
 #include <net/rtnetlink.h>
 #include <net/xfrm.h>
 #include <linux/netpoll.h>
+#include <linux/bpf.h>
 
 #define MACVLAN_HASH_BITS	8
 #define MACVLAN_HASH_SIZE	(1<<MACVLAN_HASH_BITS)
@@ -436,6 +437,122 @@ static void macvlan_forward_source(struct sk_buff *skb,
 	}
 }
 
+struct sk_buff *macvlan_xdp_build_skb(struct net_device *dev,
+				      struct xdp_buff *xdp)
+{
+	int len;
+	int buflen = xdp->data_end - xdp->data_hard_start;
+	int headroom = xdp->data - xdp->data_hard_start;
+	struct sk_buff *skb;
+
+	len = SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) + headroom +
+	      SKB_DATA_ALIGN(buflen);
+
+	skb = build_skb(xdp->data_hard_start, len);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	__skb_put(skb, xdp->data_end - xdp->data);
+
+	skb->protocol = eth_type_trans(skb, dev);
+	skb->dev = dev;
+
+	return skb;
+}
+
+static rx_xdp_handler_result_t macvlan_receive_xdp(struct net_device *dev,
+						   struct xdp_buff *xdp)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	struct bpf_prog *xdp_prog;
+	struct sk_buff *skb;
+	u32 act = XDP_PASS;
+	rx_xdp_handler_result_t ret;
+	int err;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(vlan->xdp_prog);
+
+	if (xdp_prog)
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
+
+	switch (act) {
+	case XDP_PASS:
+		ret = xdp_do_pass(xdp);
+		if (ret != RX_XDP_HANDLER_FALLBACK) {
+			rcu_read_unlock();
+			return ret;
+		}
+		skb = macvlan_xdp_build_skb(dev, xdp);
+		if (!skb) {
+			act = XDP_DROP;
+			break;
+		}
+		rcu_read_unlock();
+		netif_rx(skb);
+		macvlan_count_rx(vlan, skb->len, true, false);
+		goto out;
+	case XDP_TX:
+		skb = macvlan_xdp_build_skb(dev, xdp);
+		if (!skb) {
+			act = XDP_DROP;
+			break;
+		}
+		generic_xdp_tx(skb, xdp_prog);
+		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(dev, xdp, xdp_prog);
+		xdp_do_flush_map();
+		if (err)
+			act = XDP_DROP;
+		break;
+	case XDP_DROP:
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		break;
+	}
+
+	rcu_read_unlock();
+out:
+	if (act == XDP_DROP)
+		return RX_XDP_HANDLER_DROP;
+
+	return RX_XDP_HANDLER_CONSUMED;
+}
+
+/* called under rcu_read_lock() from XDP handler */
+static rx_xdp_handler_result_t macvlan_handle_xdp(struct net_device *dev,
+						  struct xdp_buff *xdp)
+{
+	const struct ethhdr *eth = (const struct ethhdr *)xdp->data;
+	struct macvlan_port *port;
+	struct macvlan_dev *vlan;
+
+	if (is_multicast_ether_addr(eth->h_dest))
+		return RX_XDP_HANDLER_FALLBACK;
+
+	port = macvlan_port_get_rcu(dev);
+	if (port->source_count)
+		return RX_XDP_HANDLER_FALLBACK;
+
+	if (macvlan_passthru(port))
+		vlan = list_first_or_null_rcu(&port->vlans,
+					      struct macvlan_dev, list);
+	else
+		vlan = macvlan_hash_lookup(port, eth->h_dest);
+
+	if (!vlan)
+		return RX_XDP_HANDLER_FALLBACK;
+
+	dev = vlan->dev;
+	if (unlikely(!(dev->flags & IFF_UP)))
+		return RX_XDP_HANDLER_DROP;
+
+	return macvlan_receive_xdp(dev, xdp);
+}
+
 /* called under rcu_read_lock() from netif_receive_skb */
 static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
 {
@@ -1089,6 +1206,44 @@ static int macvlan_dev_get_iflink(const struct net_device *dev)
 	return vlan->lowerdev->ifindex;
 }
 
+static int macvlan_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+			   struct netlink_ext_ack *extack)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	struct bpf_prog *old_prog = rtnl_dereference(vlan->xdp_prog);
+
+	rcu_assign_pointer(vlan->xdp_prog, prog);
+
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+static u32 macvlan_xdp_query(struct net_device *dev)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	const struct bpf_prog *xdp_prog = rtnl_dereference(vlan->xdp_prog);
+
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
+	return 0;
+}
+
+static int macvlan_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return macvlan_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_QUERY_PROG:
+		xdp->prog_id = macvlan_xdp_query(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct ethtool_ops macvlan_ethtool_ops = {
 	.get_link		= ethtool_op_get_link,
 	.get_link_ksettings	= macvlan_ethtool_get_link_ksettings,
@@ -1121,6 +1276,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
 #endif
 	.ndo_get_iflink		= macvlan_dev_get_iflink,
 	.ndo_features_check	= passthru_features_check,
+	.ndo_bpf		= macvlan_xdp,
 };
 
 void macvlan_common_setup(struct net_device *dev)
@@ -1173,10 +1329,20 @@ static int macvlan_port_create(struct net_device *dev)
 	INIT_WORK(&port->bc_work, macvlan_process_broadcast);
 
 	err = netdev_rx_handler_register(dev, macvlan_handle_frame, port);
-	if (err)
+	if (err) {
 		kfree(port);
-	else
-		dev->priv_flags |= IFF_MACVLAN_PORT;
+		goto out;
+	}
+
+	err = netdev_rx_xdp_handler_register(dev, macvlan_handle_xdp);
+	if (err) {
+		netdev_rx_handler_unregister(dev);
+		kfree(port);
+		goto out;
+	}
+
+	dev->priv_flags |= IFF_MACVLAN_PORT;
+out:
 	return err;
 }
 
@@ -1187,6 +1353,7 @@ static void macvlan_port_destroy(struct net_device *dev)
 
 	dev->priv_flags &= ~IFF_MACVLAN_PORT;
 	netdev_rx_handler_unregister(dev);
+	netdev_rx_xdp_handler_unregister(dev);
 
 	/* After this point, no packet can schedule bc_work anymore,
 	 * but we need to cancel it and purge left skbs if any.
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 2e55e4c..7c7059b 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -34,6 +34,7 @@ struct macvlan_dev {
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	struct netpoll		*netpoll;
 #endif
+	struct bpf_prog __rcu   *xdp_prog;
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler
  2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
                   ` (4 preceding siblings ...)
  2018-08-13  3:17 ` [RFC PATCH net-next V2 5/6] macvlan: basic XDP support Jason Wang
@ 2018-08-13  3:17 ` Jason Wang
  2018-08-14  9:22   ` Jesper Dangaard Brouer
  2018-08-14  0:32 ` [RFC PATCH net-next V2 0/6] " Alexei Starovoitov
  6 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2018-08-13  3:17 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: ast, daniel, jbrouer, mst, Jason Wang

This patch adds XDP rx handler support to virtio-net. This is
straightforward: just call xdp_do_pass() and behave according to its
return value.

The test was done by using XDP_DROP (xdp1) for macvlan on top of
virtio-net. PPS in skb mode was ~1.2Mpps while PPS in native XDP mode
was ~2.2Mpps, an improvement of about 83%.

Note: for the RFC, only the mergeable buffer case was implemented.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/virtio_net.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 62311dd..1e22ad9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -777,6 +777,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		rx_xdp_handler_result_t ret;
 		struct xdp_frame *xdpf;
 		struct page *xdp_page;
 		struct xdp_buff xdp;
@@ -825,6 +826,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 
 		switch (act) {
 		case XDP_PASS:
+			ret = xdp_do_pass(&xdp);
+			if (ret == RX_XDP_HANDLER_DROP)
+				goto drop;
+			if (ret != RX_XDP_HANDLER_FALLBACK) {
+				if (unlikely(xdp_page != page))
+					put_page(page);
+				rcu_read_unlock();
+				goto xdp_xmit;
+			}
 			/* recalculate offset to account for any header
 			 * adjustments. Note other cases do not build an
 			 * skb and avoid using offset
@@ -881,6 +891,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		case XDP_ABORTED:
 			trace_xdp_exception(vi->dev, xdp_prog, act);
 			/* fall through */
+drop:
 		case XDP_DROP:
 			if (unlikely(xdp_page != page))
 				__free_pages(xdp_page, 0);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
                   ` (5 preceding siblings ...)
  2018-08-13  3:17 ` [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler Jason Wang
@ 2018-08-14  0:32 ` Alexei Starovoitov
  2018-08-14  7:59   ` Jason Wang
  6 siblings, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2018-08-14  0:32 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, ast, daniel, jbrouer, mst

On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
> Hi:
> 
> This series tries to implement XDP support for rx hanlder. This would
> be useful for doing native XDP on stacked device like macvlan, bridge
> or even bond.
> 
> The idea is simple, let stacked device register a XDP rx handler. And
> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
> which will try to pass XDP buff to XDP rx handler directly. XDP rx
> handler may then decide how to proceed, it could consume the buff, ask
> driver to drop the packet or ask the driver to fallback to normal skb
> path.
> 
> A sample XDP rx handler was implemented for macvlan. And virtio-net
> (mergeable buffer case) was converted to call xdp_do_pass() as an
> example. For ease comparision, generic XDP support for rx handler was
> also implemented.
> 
> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
> shows about 83% improvement.

I'm missing the motivation for this.
It seems performance of such solution is ~1M packet per second.
What would be a real life use case for such feature ?

Another concern is that XDP users expect to get line rate performance
and native XDP delivers it. 'generic XDP' is a fallback only
mechanism to operate on NICs that don't have native XDP yet.
Toshiaki's veth XDP work fits XDP philosophy and allows
high speed networking to be done inside containers after veth.
It's trying to get to line rate inside container.
This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
and the users will get confused with forever slow modes of XDP.

Please explain the problem you're trying to solve.
"look, here I can to XDP on top of macvlan" is not an explanation of the problem.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-14  0:32 ` [RFC PATCH net-next V2 0/6] " Alexei Starovoitov
@ 2018-08-14  7:59   ` Jason Wang
  2018-08-14 10:17     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2018-08-14  7:59 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: netdev, linux-kernel, ast, daniel, jbrouer, mst



On 2018年08月14日 08:32, Alexei Starovoitov wrote:
> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to implement XDP support for rx hanlder. This would
>> be useful for doing native XDP on stacked device like macvlan, bridge
>> or even bond.
>>
>> The idea is simple, let stacked device register a XDP rx handler. And
>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>> handler may then decide how to proceed, it could consume the buff, ask
>> driver to drop the packet or ask the driver to fallback to normal skb
>> path.
>>
>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>> example. For ease comparision, generic XDP support for rx handler was
>> also implemented.
>>
>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>> shows about 83% improvement.
> I'm missing the motiviation for this.
> It seems performance of such solution is ~1M packet per second.

Note that it was measured with virtio-net, which is kind of slow.

> What would be a real life use case for such feature ?

I had another run on top of 10G mlx4 and macvlan:

XDP_DROP on mlx4: 14.0Mpps
XDP_DROP on macvlan: 10.05Mpps

Perf shows macvlan_hash_lookup() and the indirect call to
macvlan_handle_xdp() are the reasons for the drop. I think the numbers
are acceptable, and we could try more optimizations on top.

So the real life use case is to have a fast XDP path for rx handler
based devices:

- For containers, we can run XDP on macvlan (~70% of wire speed). This
allows a container specific policy.
- For VMs, we can implement a macvtap XDP rx handler on top. This allows
us to forward packets to a VM without building an skb in the macvtap
setup.
- The idea could be used by other rx handler based devices like bridge;
we may get an XDP fast forwarding path for bridge.

>
> Another concern is that XDP users expect to get line rate performance
> and native XDP delivers it. 'generic XDP' is a fallback only
> mechanism to operate on NICs that don't have native XDP yet.

So I can replace the generic XDP TX routine with a native one for macvlan.

> Toshiaki's veth XDP work fits XDP philosophy and allows
> high speed networking to be done inside containers after veth.
> It's trying to get to line rate inside container.

This is one of the goals of this series as well. I agree the veth XDP
work looks pretty fine, but I believe it only works for a specific setup
since it depends on XDP_REDIRECT, which is supported by few drivers (and
there is no VF driver support). And in order to make it work for an end
user, the XDP program still needs logic like a hash (map) lookup to
determine the destination veth.

> This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
> and the users will get confused with forever slow modes of XDP.
>
> Please explain the problem you're trying to solve.
> "look, here I can to XDP on top of macvlan" is not an explanation of the problem.
>

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler
  2018-08-13  3:17 ` [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler Jason Wang
@ 2018-08-14  9:22   ` Jesper Dangaard Brouer
  2018-08-14 13:01     ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Jesper Dangaard Brouer @ 2018-08-14  9:22 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, ast, daniel, mst

On Mon, 13 Aug 2018 11:17:30 +0800
Jason Wang <jasowang@redhat.com> wrote:

> This patch tries to add the support of XDP rx handler to
> virtio-net. This is straight-forward, just call xdp_do_pass() and
> behave depends on its return value.
> 
> Test was done by using XDP_DROP (xdp1) for macvlan on top of
> virtio-net. PPS of SKB mode was ~1.2Mpps while PPS of native XDP mode
> was ~2.2Mpps. About 83% improvement was measured.

I'm not convinced...

Why are you not using XDP_REDIRECT, which is already implemented in
receive_mergeable (which you modify below)?

The macvlan driver just needs to implement ndo_xdp_xmit(), and then you
can redirect (with an XDP prog from the physical driver into the guest).
It should be much faster...


> Notes: for RFC, only mergeable buffer case was implemented.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/net/virtio_net.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 62311dd..1e22ad9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -777,6 +777,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  	rcu_read_lock();
>  	xdp_prog = rcu_dereference(rq->xdp_prog);
>  	if (xdp_prog) {
> +		rx_xdp_handler_result_t ret;
>  		struct xdp_frame *xdpf;
>  		struct page *xdp_page;
>  		struct xdp_buff xdp;
> @@ -825,6 +826,15 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  
>  		switch (act) {
>  		case XDP_PASS:
> +			ret = xdp_do_pass(&xdp);
> +			if (ret == RX_XDP_HANDLER_DROP)
> +				goto drop;
> +			if (ret != RX_XDP_HANDLER_FALLBACK) {
> +				if (unlikely(xdp_page != page))
> +					put_page(page);
> +				rcu_read_unlock();
> +				goto xdp_xmit;
> +			}
>  			/* recalculate offset to account for any header
>  			 * adjustments. Note other cases do not build an
>  			 * skb and avoid using offset
> @@ -881,6 +891,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  		case XDP_ABORTED:
>  			trace_xdp_exception(vi->dev, xdp_prog, act);
>  			/* fall through */
> +drop:
>  		case XDP_DROP:
>  			if (unlikely(xdp_page != page))
>  				__free_pages(xdp_page, 0);



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-14  7:59   ` Jason Wang
@ 2018-08-14 10:17     ` Jesper Dangaard Brouer
  2018-08-14 13:20       ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Jesper Dangaard Brouer @ 2018-08-14 10:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst

On Tue, 14 Aug 2018 15:59:01 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
> > On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:  
> >> Hi:
> >>
> >> This series tries to implement XDP support for rx hanlder. This would
> >> be useful for doing native XDP on stacked device like macvlan, bridge
> >> or even bond.
> >>
> >> The idea is simple, let stacked device register a XDP rx handler. And
> >> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
> >> which will try to pass XDP buff to XDP rx handler directly. XDP rx
> >> handler may then decide how to proceed, it could consume the buff, ask
> >> driver to drop the packet or ask the driver to fallback to normal skb
> >> path.
> >>
> >> A sample XDP rx handler was implemented for macvlan. And virtio-net
> >> (mergeable buffer case) was converted to call xdp_do_pass() as an
> >> example. For ease comparision, generic XDP support for rx handler was
> >> also implemented.
> >>
> >> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
> >> shows about 83% improvement.  
> > I'm missing the motiviation for this.
> > It seems performance of such solution is ~1M packet per second.  
> 
> Notice it was measured by virtio-net which is kind of slow.
> 
> > What would be a real life use case for such feature ?  
> 
> I had another run on top of 10G mlx4 and macvlan:
> 
> XDP_DROP on mlx4: 14.0Mpps
> XDP_DROP on macvlan: 10.05Mpps
> 
> Perf shows macvlan_hash_lookup() and indirect call to 
> macvlan_handle_xdp() are the reasons for the number drop. I think the 
> numbers are acceptable. And we could try more optimizations on top.
> 
> So here's real life use case is trying to have an fast XDP path for rx 
> handler based device:
> 
> - For containers, we can run XDP for macvlan (~70% of wire speed). This 
> allows a container specific policy.
> - For VM, we can implement macvtap XDP rx handler on top. This allow us 
> to forward packet to VM without building skb in the setup of macvtap.
> - The idea could be used by other rx handler based device like bridge, 
> we may have a XDP fast forwarding path for bridge.
> 
> >
> > Another concern is that XDP users expect to get line rate performance
> > and native XDP delivers it. 'generic XDP' is a fallback only
> > mechanism to operate on NICs that don't have native XDP yet.  
> 
> So I can replace generic XDP TX routine with a native one for macvlan.

If you simply implement ndo_xdp_xmit() for macvlan, and instead use
XDP_REDIRECT, then we are basically done.


> > Toshiaki's veth XDP work fits XDP philosophy and allows
> > high speed networking to be done inside containers after veth.
> > It's trying to get to line rate inside container.  
> 
> This is one of the goal of this series as well. I agree veth XDP work 
> looks pretty fine, but it only work for a specific setup I believe since 
> it depends on XDP_REDIRECT which is supported by few drivers (and 
> there's no VF driver support). 

The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
argument that only a few drivers implement this.  Especially since all
drivers also need to be extended with your proposed xdp_do_pass() call.

(rant) The thing that is delaying XDP_REDIRECT adoption in drivers is
that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
has to allocate HW TX-queue resources.  If we disconnect the RX and TX
sides of redirect, then we can implement the RX-side in an afternoon.


> And in order to make it work for a end 
> user, the XDP program still need logic like hash(map) lookup to 
> determine the destination veth.

That _is_ the general idea behind XDP and eBPF, that we need to add logic
that determines the destination.  The kernel provides the basic
mechanisms for moving/redirecting packets fast, and someone else
builds an orchestration tool like Cilium, that adds the needed logic.

Did you notice that we (Ahern) added bpf_fib_lookup, a FIB route lookup
accessible from XDP?

For macvlan, I imagine that we could add a BPF helper that allows you
to lookup/call macvlan_hash_lookup().

 
> > This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
> > and the users will get confused with forever slow modes of XDP.
> >
> > Please explain the problem you're trying to solve.
> > "look, here I can to XDP on top of macvlan" is not an explanation of the problem.
> >  


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler
  2018-08-14  9:22   ` Jesper Dangaard Brouer
@ 2018-08-14 13:01     ` Jason Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-14 13:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, linux-kernel, ast, daniel, mst



On 2018年08月14日 17:22, Jesper Dangaard Brouer wrote:
> On Mon, 13 Aug 2018 11:17:30 +0800
> Jason Wang<jasowang@redhat.com>  wrote:
>
>> This patch tries to add the support of XDP rx handler to
>> virtio-net. This is straight-forward, just call xdp_do_pass() and
>> behave depends on its return value.
>>
>> Test was done by using XDP_DROP (xdp1) for macvlan on top of
>> virtio-net. PPS of SKB mode was ~1.2Mpps while PPS of native XDP mode
>> was ~2.2Mpps. About 83% improvement was measured.
> I'm not convinced...
>
> Why are you not using XDP_REDIRECT, which is already implemented in
> receive_mergeable (which you modify below).
>
> The macvlan driver just need to implement ndo_xdp_xmit(), and then you
> can redirect (with XDP prog from physical driver into the guest).  It
> should be much faster...
>
>

Macvlan is different from macvtap. For host RX, macvtap delivers the
packet to a pointer ring which can be accessed through a socket, but
macvlan delivers the packet to the normal networking stack. As an
example of an XDP rx handler, this series just tries to make native XDP
work for macvlan; the macvtap path will still use skbs (but it's not
hard to add it on top).

Consider the case of fast forwarding between host and guest. For TAP,
XDP_REDIRECT works perfectly since, from the host point of view, host RX
is guest TX and guest RX is host TX. But for macvtap, which is based on
macvlan, transmitting a packet to macvtap/macvlan means transmitting
packets to the underlying device, which is either a physical NIC or
another macvlan device. That's why we can't use XDP_REDIRECT with
ndo_xdp_xmit().

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-14 10:17     ` Jesper Dangaard Brouer
@ 2018-08-14 13:20       ` Jason Wang
  2018-08-14 14:03         ` David Ahern
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2018-08-14 13:20 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst



On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
> On Tue, 14 Aug 2018 15:59:01 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>> Hi:
>>>>
>>>> This series tries to implement XDP support for rx hanlder. This would
>>>> be useful for doing native XDP on stacked device like macvlan, bridge
>>>> or even bond.
>>>>
>>>> The idea is simple, let stacked device register a XDP rx handler. And
>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>> handler may then decide how to proceed, it could consume the buff, ask
>>>> driver to drop the packet or ask the driver to fallback to normal skb
>>>> path.
>>>>
>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>> example. For ease comparision, generic XDP support for rx handler was
>>>> also implemented.
>>>>
>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>>>> shows about 83% improvement.
>>> I'm missing the motiviation for this.
>>> It seems performance of such solution is ~1M packet per second.
>> Notice it was measured by virtio-net which is kind of slow.
>>
>>> What would be a real life use case for such feature ?
>> I had another run on top of 10G mlx4 and macvlan:
>>
>> XDP_DROP on mlx4: 14.0Mpps
>> XDP_DROP on macvlan: 10.05Mpps
>>
>> Perf shows macvlan_hash_lookup() and indirect call to
>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>> numbers are acceptable. And we could try more optimizations on top.
>>
>> So here's real life use case is trying to have an fast XDP path for rx
>> handler based device:
>>
>> - For containers, we can run XDP for macvlan (~70% of wire speed). This
>> allows a container specific policy.
>> - For VM, we can implement macvtap XDP rx handler on top. This allow us
>> to forward packet to VM without building skb in the setup of macvtap.
>> - The idea could be used by other rx handler based device like bridge,
>> we may have a XDP fast forwarding path for bridge.
>>
>>> Another concern is that XDP users expect to get line rate performance
>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>> mechanism to operate on NICs that don't have native XDP yet.
>> So I can replace generic XDP TX routine with a native one for macvlan.
> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
> XDP_REDIRECT, then we are basically done.

As I replied in another thread, this is probably not true. Its
ndo_xdp_xmit() would just need to call the underlying device's
ndo_xdp_xmit(), except for the case of bridge mode.
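
(Roughly, such a pass-through ndo_xdp_xmit() could look like the sketch
below; this is illustrative only, not part of this series, and assumes
the current batched ndo_xdp_xmit() signature:)

static int macvlan_xdp_xmit(struct net_device *dev, int n,
			    struct xdp_frame **frames, u32 flags)
{
	struct macvlan_dev *vlan = netdev_priv(dev);
	struct net_device *lowerdev = vlan->lowerdev;

	if (!lowerdev->netdev_ops->ndo_xdp_xmit)
		return -EOPNOTSUPP;

	/* Bridge mode (macvlan to macvlan on the same port) would still
	 * need the skb path; this only hands frames to the lower device.
	 */
	return lowerdev->netdev_ops->ndo_xdp_xmit(lowerdev, n, frames, flags);
}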

>
>
>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>> high speed networking to be done inside containers after veth.
>>> It's trying to get to line rate inside container.
>> This is one of the goal of this series as well. I agree veth XDP work
>> looks pretty fine, but it only work for a specific setup I believe since
>> it depends on XDP_REDIRECT which is supported by few drivers (and
>> there's no VF driver support).
> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
> argument that only a few drivers implement this.  Especially since all
> drivers also need to be extended with your proposed xdp_do_pass() call.
>
> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
> have to allocate HW TX-queue resources.  If we disconnect RX and TX
> side of redirect, then we can implement RX-side in an afternoon.

That's exactly the point: ndo_xdp_xmit() may require per-CPU TX queues,
which breaks the assumptions of some drivers. And since we don't
disconnect RX and TX, it looks to me like a partial implementation is
even worse. Consider that a user could redirect from mlx4 to ixgbe but
not from ixgbe to mlx4.

>
>
>> And in order to make it work for a end
>> user, the XDP program still need logic like hash(map) lookup to
>> determine the destination veth.
> That _is_ the general idea behind XDP and eBPF, that we need to add logic
> that determine the destination.  The kernel provides the basic
> mechanisms for moving/redirecting packets fast, and someone else
> builds an orchestration tool like Cilium, that adds the needed logic.

Yes, so my reply was about the performance concern. I meant that the
hash lookup will keep it from hitting wire speed anyway.

>
> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
> accessible from XDP.

Yes.

>
> For macvlan, I imagine that we could add a BPF helper that allows you
> to lookup/call macvlan_hash_lookup().

That's true, but we still need a method to feed macvlan with an XDP
buff. I'm not sure whether this could be treated as another kind of
redirection, but ndo_xdp_xmit() can not be used for this case for sure.
Compared to redirection, the XDP rx handler has its own advantages:

1) It uses the existing API and userspace tools to set up the network
topology instead of inventing new tools and a specific API. This means a
user can just set up macvlan (macvtap, bridge or other) as usual and
simply attach XDP programs to both the macvlan and its underlying
device.
2) It eases the processing of complex logic: XDP can not do cloning or
reference counting. We can defer those cases and let the normal
networking stack deal with such packets seamlessly. I believe this is
one of the advantages of XDP. It lets us focus on the fast path and
greatly simplifies the code.

Like ndo_xdp_xmit(), the XDP rx handler is used to feed an rx handler
with an XDP buff. It's just another basic mechanism. Policy is still
done by the XDP program itself.

Thanks

>
>   
>>> This XDP rx handler stuff is destined to stay at 1Mpps speeds forever
>>> and the users will get confused with forever slow modes of XDP.
>>>
>>> Please explain the problem you're trying to solve.
>>> "look, here I can to XDP on top of macvlan" is not an explanation of the problem.
>>>   
>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-14 13:20       ` Jason Wang
@ 2018-08-14 14:03         ` David Ahern
  2018-08-15  0:29           ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: David Ahern @ 2018-08-14 14:03 UTC (permalink / raw)
  To: Jason Wang, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst

On 8/14/18 7:20 AM, Jason Wang wrote:
> 
> 
> On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
>> On Tue, 14 Aug 2018 15:59:01 +0800
>> Jason Wang <jasowang@redhat.com> wrote:
>>
>>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>> Hi:
>>>>>
>>>>> This series tries to implement XDP support for rx hanlder. This would
>>>>> be useful for doing native XDP on stacked device like macvlan, bridge
>>>>> or even bond.
>>>>>
>>>>> The idea is simple, let stacked device register a XDP rx handler. And
>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>> handler may then decide how to proceed, it could consume the buff, ask
>>>>> driver to drop the packet or ask the driver to fallback to normal skb
>>>>> path.
>>>>>
>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>> example. For ease comparision, generic XDP support for rx handler was
>>>>> also implemented.
>>>>>
>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>>>>> shows about 83% improvement.
>>>> I'm missing the motiviation for this.
>>>> It seems performance of such solution is ~1M packet per second.
>>> Notice it was measured by virtio-net which is kind of slow.
>>>
>>>> What would be a real life use case for such feature ?
>>> I had another run on top of 10G mlx4 and macvlan:
>>>
>>> XDP_DROP on mlx4: 14.0Mpps
>>> XDP_DROP on macvlan: 10.05Mpps
>>>
>>> Perf shows macvlan_hash_lookup() and indirect call to
>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>> numbers are acceptable. And we could try more optimizations on top.
>>>
>>> So here's real life use case is trying to have an fast XDP path for rx
>>> handler based device:
>>>
>>> - For containers, we can run XDP for macvlan (~70% of wire speed). This
>>> allows a container specific policy.
>>> - For VM, we can implement macvtap XDP rx handler on top. This allow us
>>> to forward packet to VM without building skb in the setup of macvtap.
>>> - The idea could be used by other rx handler based device like bridge,
>>> we may have a XDP fast forwarding path for bridge.
>>>
>>>> Another concern is that XDP users expect to get line rate performance
>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>> mechanism to operate on NICs that don't have native XDP yet.
>>> So I can replace generic XDP TX routine with a native one for macvlan.
>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>> XDP_REDIRECT, then we are basically done.
> 
> As I replied in another thread this probably not true. Its
> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
> except for the case of bridge mode.
> 
>>
>>
>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>> high speed networking to be done inside containers after veth.
>>>> It's trying to get to line rate inside container.
>>> This is one of the goal of this series as well. I agree veth XDP work
>>> looks pretty fine, but it only work for a specific setup I believe since
>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>> there's no VF driver support).
>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>> argument that only a few drivers implement this.  Especially since all
>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>
>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>> side of redirect, then we can implement RX-side in an afternoon.
> 
> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
> which breaks assumptions of some drivers. And since we don't disconnect
> RX and TX, it looks to me the partial implementation is even worse?
> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
> 
>>
>>
>>> And in order to make it work for a end
>>> user, the XDP program still need logic like hash(map) lookup to
>>> determine the destination veth.
>> That _is_ the general idea behind XDP and eBPF, that we need to add logic
>> that determine the destination.  The kernel provides the basic
>> mechanisms for moving/redirecting packets fast, and someone else
>> builds an orchestration tool like Cilium, that adds the needed logic.
> 
> Yes, so my reply is for the concern about performance. I meant anyway
> the hash lookup will make it not hit the wire speed.
> 
>>
>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>> accessible from XDP.
> 
> Yes.
> 
>>
>> For macvlan, I imagine that we could add a BPF helper that allows you
>> to lookup/call macvlan_hash_lookup().
> 
> That's true but we still need a method to feed macvlan with XDP buff.
> I'm not sure if this could be treated as another kind of redirection,
> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
> redirection, XDP rx handler has its own advantages:
> 
> 1) Use the exist API and userspace to setup the network topology instead
> of inventing new tools and its own specific API. This means user can
> just setup macvlan (macvtap, bridge or other) as usual and simply attach
> XDP programs to both macvlan and its under layer device.
> 2) Ease the processing of complex logic, XDP can not do cloning or
> reference counting. We can differ those cases and let normal networking
> stack to deal with such packets seamlessly. I believe this is one of the
> advantage of XDP. This makes us to focus on the fast path and greatly
> simplify the codes.
> 
> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
> buff. It's just another basic mechanism. Policy is still done by XDP
> program itself.
> 

I have been looking into handling stacked devices via lookup helper
functions. The idea is that a program only needs to be installed on the
root netdev (i.e., the one representing the physical port), and it can
use helpers to create an efficient pipeline to decide what to do with
the packet in the presence of stacked devices.

For example, anyone doing pure L3 could do:

{port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...

  --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT

port is the netdev associated with the ingress_ifindex in the xdp_md
context, vlan is the vlan in the packet or the assigned PVID if
relevant. From there l2dev could be a bond or bridge device for example,
and l3dev is the one with a network address (vlan netdev, bond netdev, etc).

I have L3 forwarding working for vlan devices and bonds. I had not
considered macvlans specifically yet, but it should be straightforward
to add.
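
A rough sketch of the pure-L3 part from the XDP program's point of view
(IPv4 only, no VLAN handling; bpf_fib_lookup() and bpf_redirect() are
existing helpers, the proposed l2dev/l3dev resolution is only marked as
a comment, and the include paths assume current libbpf):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#ifndef AF_INET
#define AF_INET 2
#endif

SEC("xdp")
int xdp_l3_fwd(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	struct iphdr *iph = data + sizeof(*eth);
	struct bpf_fib_lookup fib = {};

	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	/* The proposed "find l2dev/l3dev" helpers would resolve a different
	 * ifindex here; this sketch just uses the ingress device itself.
	 */
	fib.family	= AF_INET;
	fib.tos		= iph->tos;
	fib.l4_protocol	= iph->protocol;
	fib.tot_len	= bpf_ntohs(iph->tot_len);
	fib.ipv4_src	= iph->saddr;
	fib.ipv4_dst	= iph->daddr;
	fib.ifindex	= ctx->ingress_ifindex;

	if (bpf_fib_lookup(ctx, &fib, sizeof(fib), 0) != BPF_FIB_LKUP_RET_SUCCESS)
		return XDP_PASS;	/* let the stack handle it */

	/* header rewrite from the FIB result, then redirect */
	__builtin_memcpy(eth->h_dest, fib.dmac, ETH_ALEN);
	__builtin_memcpy(eth->h_source, fib.smac, ETH_ALEN);
	return bpf_redirect(fib.ifindex, 0);
}

char _license[] SEC("license") = "GPL";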


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-14 14:03         ` David Ahern
@ 2018-08-15  0:29           ` Jason Wang
  2018-08-15  5:35             ` Alexei Starovoitov
  2018-08-15 17:17             ` David Ahern
  0 siblings, 2 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-15  0:29 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst



On 2018年08月14日 22:03, David Ahern wrote:
> On 8/14/18 7:20 AM, Jason Wang wrote:
>>
>> On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
>>> On Tue, 14 Aug 2018 15:59:01 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>
>>>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>>> Hi:
>>>>>>
>>>>>> This series tries to implement XDP support for rx hanlder. This would
>>>>>> be useful for doing native XDP on stacked device like macvlan, bridge
>>>>>> or even bond.
>>>>>>
>>>>>> The idea is simple, let stacked device register a XDP rx handler. And
>>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>>> handler may then decide how to proceed, it could consume the buff, ask
>>>>>> driver to drop the packet or ask the driver to fallback to normal skb
>>>>>> path.
>>>>>>
>>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>>> example. For ease comparision, generic XDP support for rx handler was
>>>>>> also implemented.
>>>>>>
>>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan (XDP_DROP)
>>>>>> shows about 83% improvement.
>>>>> I'm missing the motiviation for this.
>>>>> It seems performance of such solution is ~1M packet per second.
>>>> Notice it was measured by virtio-net which is kind of slow.
>>>>
>>>>> What would be a real life use case for such feature ?
>>>> I had another run on top of 10G mlx4 and macvlan:
>>>>
>>>> XDP_DROP on mlx4: 14.0Mpps
>>>> XDP_DROP on macvlan: 10.05Mpps
>>>>
>>>> Perf shows macvlan_hash_lookup() and indirect call to
>>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>>> numbers are acceptable. And we could try more optimizations on top.
>>>>
>>>> So here's real life use case is trying to have an fast XDP path for rx
>>>> handler based device:
>>>>
>>>> - For containers, we can run XDP for macvlan (~70% of wire speed). This
>>>> allows a container specific policy.
>>>> - For VM, we can implement macvtap XDP rx handler on top. This allow us
>>>> to forward packet to VM without building skb in the setup of macvtap.
>>>> - The idea could be used by other rx handler based device like bridge,
>>>> we may have a XDP fast forwarding path for bridge.
>>>>
>>>>> Another concern is that XDP users expect to get line rate performance
>>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>>> mechanism to operate on NICs that don't have native XDP yet.
>>>> So I can replace generic XDP TX routine with a native one for macvlan.
>>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>>> XDP_REDIRECT, then we are basically done.
>> As I replied in another thread this probably not true. Its
>> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
>> except for the case of bridge mode.
>>
>>>
>>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>>> high speed networking to be done inside containers after veth.
>>>>> It's trying to get to line rate inside container.
>>>> This is one of the goal of this series as well. I agree veth XDP work
>>>> looks pretty fine, but it only work for a specific setup I believe since
>>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>>> there's no VF driver support).
>>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>>> argument that only a few drivers implement this.  Especially since all
>>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>>
>>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>>> side of redirect, then we can implement RX-side in an afternoon.
>> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
>> which breaks assumptions of some drivers. And since we don't disconnect
>> RX and TX, it looks to me the partial implementation is even worse?
>> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>>
>>>
>>>> And in order to make it work for a end
>>>> user, the XDP program still need logic like hash(map) lookup to
>>>> determine the destination veth.
>>> That _is_ the general idea behind XDP and eBPF, that we need to add logic
>>> that determine the destination.  The kernel provides the basic
>>> mechanisms for moving/redirecting packets fast, and someone else
>>> builds an orchestration tool like Cilium, that adds the needed logic.
>> Yes, so my reply is for the concern about performance. I meant anyway
>> the hash lookup will make it not hit the wire speed.
>>
>>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>>> accessible from XDP.
>> Yes.
>>
>>> For macvlan, I imagine that we could add a BPF helper that allows you
>>> to lookup/call macvlan_hash_lookup().
>> That's true but we still need a method to feed macvlan with XDP buff.
>> I'm not sure if this could be treated as another kind of redirection,
>> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
>> redirection, XDP rx handler has its own advantages:
>>
>> 1) Use the exist API and userspace to setup the network topology instead
>> of inventing new tools and its own specific API. This means user can
>> just setup macvlan (macvtap, bridge or other) as usual and simply attach
>> XDP programs to both macvlan and its under layer device.
>> 2) Ease the processing of complex logic, XDP can not do cloning or
>> reference counting. We can differ those cases and let normal networking
>> stack to deal with such packets seamlessly. I believe this is one of the
>> advantage of XDP. This makes us to focus on the fast path and greatly
>> simplify the codes.
>>
>> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
>> buff. It's just another basic mechanism. Policy is still done by XDP
>> program itself.
>>
> I have been looking into handling stacked devices via lookup helper
> functions. The idea is that a program only needs to be installed on the
> root netdev (ie., the one representing the physical port), and it can
> use helpers to create an efficient pipeline to decide what to do with
> the packet in the presence of stacked devices.
>
> For example, anyone doing pure L3 could do:
>
> {port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...
>
>    --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT
>
> port is the netdev associated with the ingress_ifindex in the xdp_md
> context, vlan is the vlan in the packet or the assigned PVID if
> relevant. From there l2dev could be a bond or bridge device for example,
> and l3dev is the one with a network address (vlan netdev, bond netdev, etc).

This looks less flexible, since the topology is hard-coded in the XDP
program itself and it requires all logic to be implemented in the
program on the root netdev.

>
> I have L3 forwarding working for vlan devices and bonds. I had not
> considered macvlans specifically yet, but it should be straightforward
> to add.
>

Yes, and all of these could be done through the XDP rx handler as
well, and it can do even more with rather simple logic:

1. Each macvlan has its own namespace and may want its own bpf logic.
2. Reuse the existing topology information for dealing with more
complex setups like macvlan on top of bond or team. The bpf program
does not need to care about topology. If you look at the code, there
is not even a need to attach an XDP program to each stacked device:
the call to xdp_do_pass() can try to pass the XDP buff to the upper
device even if no XDP program is attached at the current layer.
3. Deliver the XDP buff to userspace through macvtap.
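
For illustration only, the driver side of this could look roughly like
the sketch below. xdp_do_pass() is the helper added by this series;
the verdict names and the surrounding driver code are made-up
placeholders here, not the exact API from the patches:

#include <linux/netdevice.h>
#include <linux/filter.h>

/* Sketch only: the verdict names and the two callees are illustrative
 * placeholders, not the API defined in the patches.
 */
enum xdp_rx_handler_verdict {
	XDP_RX_HANDLER_CONSUMED,	/* stacked device took the buff */
	XDP_RX_HANDLER_DROP,		/* driver should drop the frame */
	XDP_RX_HANDLER_FALLBACK,	/* fall back to the normal skb path */
};

/* Called in the driver RX path after its own XDP program (if any)
 * returned XDP_PASS.
 */
static void rx_one_buff(struct net_device *dev, struct xdp_buff *xdp)
{
	switch (xdp_do_pass(xdp)) {
	case XDP_RX_HANDLER_CONSUMED:
		return;				/* e.g. macvlan/macvtap took it */
	case XDP_RX_HANDLER_DROP:
		recycle_and_count_drop(dev, xdp);	/* placeholder */
		return;
	case XDP_RX_HANDLER_FALLBACK:
	default:
		build_skb_and_receive(dev, xdp);	/* placeholder */
		return;
	}
}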

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-15  0:29           ` Jason Wang
@ 2018-08-15  5:35             ` Alexei Starovoitov
  2018-08-15  7:04               ` Jason Wang
  2018-08-15 17:17             ` David Ahern
  1 sibling, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2018-08-15  5:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst

On Wed, Aug 15, 2018 at 08:29:45AM +0800, Jason Wang wrote:
> 
> Looks less flexible since the topology is hard coded in the XDP program
> itself and this requires all logic to be implemented in the program on the
> root netdev.
> 
> > 
> > I have L3 forwarding working for vlan devices and bonds. I had not
> > considered macvlans specifically yet, but it should be straightforward
> > to add.
> > 
> 
> Yes, and all these could be done through XDP rx handler as well, and it can
> do even more with rather simple logic:
> 
> 1 macvlan has its own namespace, and want its own bpf logic.
> 2 Ruse the exist topology information for dealing with more complex setup
> like macvlan on top of bond and team. There's no need to bpf program to care
> about topology. If you look at the code, there's even no need to attach XDP
> on each stacked device. The calling of xdp_do_pass() can try to pass XDP
> buff to upper device even if there's no XDP program attached to current
> layer.
> 3 Deliver XDP buff to userspace through macvtap.

I think I'm getting what you're trying to achieve.
You actually don't want any bpf programs in there at all.
You want macvlan builtin logic to act on raw packet frames.
It would have been less confusing if you said so from the beginning.
I think there is little value in such work, since something still
needs to process these raw frames eventually. If it's XDP with BPF
progs, then they can maintain the speed, but in that case there is no
need for macvlan; the first layer can be normal xdp+bpf+xdp_redirect
just fine. In the case where there is no xdp+bpf in the final
processing, the frames are converted to skbs and the performance is
lost, so there is no need for builtin macvlan acting on raw xdp frames
either. Just keep the existing macvlan acting on skbs.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-15  5:35             ` Alexei Starovoitov
@ 2018-08-15  7:04               ` Jason Wang
  2018-08-16  2:49                 ` Alexei Starovoitov
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2018-08-15  7:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst, Toshiaki Makita



On 2018年08月15日 13:35, Alexei Starovoitov wrote:
> On Wed, Aug 15, 2018 at 08:29:45AM +0800, Jason Wang wrote:
>> Looks less flexible since the topology is hard coded in the XDP program
>> itself and this requires all logic to be implemented in the program on the
>> root netdev.
>>
>>> I have L3 forwarding working for vlan devices and bonds. I had not
>>> considered macvlans specifically yet, but it should be straightforward
>>> to add.
>>>
>> Yes, and all these could be done through XDP rx handler as well, and it can
>> do even more with rather simple logic:
>>
>> 1 macvlan has its own namespace, and want its own bpf logic.
>> 2 Ruse the exist topology information for dealing with more complex setup
>> like macvlan on top of bond and team. There's no need to bpf program to care
>> about topology. If you look at the code, there's even no need to attach XDP
>> on each stacked device. The calling of xdp_do_pass() can try to pass XDP
>> buff to upper device even if there's no XDP program attached to current
>> layer.
>> 3 Deliver XDP buff to userspace through macvtap.
> I think I'm getting what you're trying to achieve.
> You actually don't want any bpf programs in there at all.
> You want macvlan builtin logic to act on raw packet frames.

The built-in logic is just used to find the destination macvlan
device. It could also be done through another bpf program. Instead of
inventing lots of generic infrastructure in the kernel with its own
specific userspace API, the built-in logic has its own advantages:

- support for hundreds or even thousands of macvlans
- use of existing tools to configure the network
- immunity to topology changes

> It would have been less confusing if you said so from the beginning.

The name "XDP rx handler" is probably not good. Something like "stacked 
deivce XDP" might be better.

> I think there is little value in such work, since something still
> needs to process this raw frames eventually. If it's XDP with BPF progs
> than they can maintain the speed, but in such case there is no need
> for macvlan. The first layer can be normal xdp+bpf+xdp_redirect just fine.

I'm a little bit confused. We allow a per-veth XDP program, so I
believe a per-macvlan XDP program makes sense as well? This allows
great flexibility and there's no need to care about topology in the
bpf program. The configuration is also greatly simplified. The only
difference is that we can use xdp_redirect for veth since it is a pair
device: we can transmit XDP frames to one veth and do XDP on its peer.
This does not work for macvlan, which is based on an rx handler.

Actually, for the case of veth, if we implement an XDP rx handler for
the bridge, it can work seamlessly with veth, like:

eth0(XDP_PASS) -> [bridge XDP rx handler and ndo_xdp_xmit()] -> veth ---
veth (XDP).

Besides the usage for containers, we can implement a macvtap RX
handler which allows fast packet forwarding to userspace.

> In case where there is no xdp+bpf in final processing, the frames are
> converted to skb and performance is lost, so in such cases there is no
> need for builtin macvlan acting on raw xdp frames either. Just keep
> existing macvlan acting on skbs.
>

Yes, this is how veth works as well.

Actually, the idea is not limited to macvlan but applies to any device
that is based on an rx handler. Consider the case of bonding: this
allows setting a very simple XDP program on the slaves and keeping a
single main-logic XDP program on the bond instead of duplicating it in
all slaves.

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-15  0:29           ` Jason Wang
  2018-08-15  5:35             ` Alexei Starovoitov
@ 2018-08-15 17:17             ` David Ahern
  2018-08-16  3:34               ` Jason Wang
  1 sibling, 1 reply; 27+ messages in thread
From: David Ahern @ 2018-08-15 17:17 UTC (permalink / raw)
  To: Jason Wang, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst

On 8/14/18 6:29 PM, Jason Wang wrote:
> 
> 
> On 2018年08月14日 22:03, David Ahern wrote:
>> On 8/14/18 7:20 AM, Jason Wang wrote:
>>>
>>> On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
>>>> On Tue, 14 Aug 2018 15:59:01 +0800
>>>> Jason Wang <jasowang@redhat.com> wrote:
>>>>
>>>>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>>>> Hi:
>>>>>>>
>>>>>>> This series tries to implement XDP support for rx hanlder. This
>>>>>>> would
>>>>>>> be useful for doing native XDP on stacked device like macvlan,
>>>>>>> bridge
>>>>>>> or even bond.
>>>>>>>
>>>>>>> The idea is simple, let stacked device register a XDP rx handler.
>>>>>>> And
>>>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>>>> handler may then decide how to proceed, it could consume the
>>>>>>> buff, ask
>>>>>>> driver to drop the packet or ask the driver to fallback to normal
>>>>>>> skb
>>>>>>> path.
>>>>>>>
>>>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>>>> example. For ease comparision, generic XDP support for rx handler
>>>>>>> was
>>>>>>> also implemented.
>>>>>>>
>>>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan
>>>>>>> (XDP_DROP)
>>>>>>> shows about 83% improvement.
>>>>>> I'm missing the motiviation for this.
>>>>>> It seems performance of such solution is ~1M packet per second.
>>>>> Notice it was measured by virtio-net which is kind of slow.
>>>>>
>>>>>> What would be a real life use case for such feature ?
>>>>> I had another run on top of 10G mlx4 and macvlan:
>>>>>
>>>>> XDP_DROP on mlx4: 14.0Mpps
>>>>> XDP_DROP on macvlan: 10.05Mpps
>>>>>
>>>>> Perf shows macvlan_hash_lookup() and indirect call to
>>>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>>>> numbers are acceptable. And we could try more optimizations on top.
>>>>>
>>>>> So here's real life use case is trying to have an fast XDP path for rx
>>>>> handler based device:
>>>>>
>>>>> - For containers, we can run XDP for macvlan (~70% of wire speed).
>>>>> This
>>>>> allows a container specific policy.
>>>>> - For VM, we can implement macvtap XDP rx handler on top. This
>>>>> allow us
>>>>> to forward packet to VM without building skb in the setup of macvtap.
>>>>> - The idea could be used by other rx handler based device like bridge,
>>>>> we may have a XDP fast forwarding path for bridge.
>>>>>
>>>>>> Another concern is that XDP users expect to get line rate performance
>>>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>>>> mechanism to operate on NICs that don't have native XDP yet.
>>>>> So I can replace generic XDP TX routine with a native one for macvlan.
>>>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>>>> XDP_REDIRECT, then we are basically done.
>>> As I replied in another thread this probably not true. Its
>>> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
>>> except for the case of bridge mode.
>>>
>>>>
>>>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>>>> high speed networking to be done inside containers after veth.
>>>>>> It's trying to get to line rate inside container.
>>>>> This is one of the goal of this series as well. I agree veth XDP work
>>>>> looks pretty fine, but it only work for a specific setup I believe
>>>>> since
>>>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>>>> there's no VF driver support).
>>>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>>>> argument that only a few drivers implement this.  Especially since all
>>>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>>>
>>>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>>>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>>>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>>>> side of redirect, then we can implement RX-side in an afternoon.
>>> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
>>> which breaks assumptions of some drivers. And since we don't disconnect
>>> RX and TX, it looks to me the partial implementation is even worse?
>>> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>>>
>>>>
>>>>> And in order to make it work for a end
>>>>> user, the XDP program still need logic like hash(map) lookup to
>>>>> determine the destination veth.
>>>> That _is_ the general idea behind XDP and eBPF, that we need to add
>>>> logic
>>>> that determine the destination.  The kernel provides the basic
>>>> mechanisms for moving/redirecting packets fast, and someone else
>>>> builds an orchestration tool like Cilium, that adds the needed logic.
>>> Yes, so my reply is for the concern about performance. I meant anyway
>>> the hash lookup will make it not hit the wire speed.
>>>
>>>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>>>> accessible from XDP.
>>> Yes.
>>>
>>>> For macvlan, I imagine that we could add a BPF helper that allows you
>>>> to lookup/call macvlan_hash_lookup().
>>> That's true but we still need a method to feed macvlan with XDP buff.
>>> I'm not sure if this could be treated as another kind of redirection,
>>> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
>>> redirection, XDP rx handler has its own advantages:
>>>
>>> 1) Use the exist API and userspace to setup the network topology instead
>>> of inventing new tools and its own specific API. This means user can
>>> just setup macvlan (macvtap, bridge or other) as usual and simply attach
>>> XDP programs to both macvlan and its under layer device.
>>> 2) Ease the processing of complex logic, XDP can not do cloning or
>>> reference counting. We can differ those cases and let normal networking
>>> stack to deal with such packets seamlessly. I believe this is one of the
>>> advantage of XDP. This makes us to focus on the fast path and greatly
>>> simplify the codes.
>>>
>>> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
>>> buff. It's just another basic mechanism. Policy is still done by XDP
>>> program itself.
>>>
>> I have been looking into handling stacked devices via lookup helper
>> functions. The idea is that a program only needs to be installed on the
>> root netdev (ie., the one representing the physical port), and it can
>> use helpers to create an efficient pipeline to decide what to do with
>> the packet in the presence of stacked devices.
>>
>> For example, anyone doing pure L3 could do:
>>
>> {port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...
>>
>>    --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT
>>
>> port is the netdev associated with the ingress_ifindex in the xdp_md
>> context, vlan is the vlan in the packet or the assigned PVID if
>> relevant. From there l2dev could be a bond or bridge device for example,
>> and l3dev is the one with a network address (vlan netdev, bond netdev,
>> etc).
> 
> Looks less flexible since the topology is hard coded in the XDP program
> itself and this requires all logic to be implemented in the program on
> the root netdev.

Nothing about the topology is hard coded. The idea is to mimic a
hardware pipeline and acknowledge that a port device can have
arbitrary layers stacked on it - multiple vlan devices, bonds,
macvlans, etc.

> 
>>
>> I have L3 forwarding working for vlan devices and bonds. I had not
>> considered macvlans specifically yet, but it should be straightforward
>> to add.
>>
> 
> Yes, and all these could be done through XDP rx handler as well, and it
> can do even more with rather simple logic:

From a forwarding perspective I suspect the rx handler approach is going
to have much more overhead (i.e., higher latency per packet and hence
lower throughput) as the layers determine which one to use (e.g., is the
FIB lookup done on the port device, vlan device, or macvlan device on
the vlan device).

> 
> 1 macvlan has its own namespace, and want its own bpf logic.
> 2 Ruse the exist topology information for dealing with more complex
> setup like macvlan on top of bond and team. There's no need to bpf
> program to care about topology. If you look at the code, there's even no
> need to attach XDP on each stacked device. The calling of xdp_do_pass()
> can try to pass XDP buff to upper device even if there's no XDP program
> attached to current layer.
> 3 Deliver XDP buff to userspace through macvtap.
> 
> Thanks


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-15  7:04               ` Jason Wang
@ 2018-08-16  2:49                 ` Alexei Starovoitov
  2018-08-16  4:21                   ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2018-08-16  2:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst, Toshiaki Makita

On Wed, Aug 15, 2018 at 03:04:35PM +0800, Jason Wang wrote:
> 
> > > 3 Deliver XDP buff to userspace through macvtap.
> > I think I'm getting what you're trying to achieve.
> > You actually don't want any bpf programs in there at all.
> > You want macvlan builtin logic to act on raw packet frames.
> 
> The built-in logic is just used to find the destination macvlan device. It
> could be done by through another bpf program. Instead of inventing lots of
> generic infrastructure on kernel with specific userspace API, built-in logic
> has its own advantages:
> 
> - support hundreds or even thousands of macvlans

are you saying an xdp bpf program cannot handle thousands of macvlans?

> - using exist tools to configure network
> - immunity to topology changes

what do you mean specifically?

> 
> Besides the usage for containers, we can implement macvtap RX handler which
> allows a fast packet forwarding to userspace.

and try to reinvent af_xdp? the motivation for the patchset still escapes me.

> Actually, the idea is not limited to macvlan but for all device that is
> based on rx handler. Consider the case of bonding, this allows to set a very
> simple XDP program on slaves and keep a single main logic XDP program on the
> bond instead of duplicating it in all slaves.

I think such a mixed environment of hardcoded in-kernel things like
bond together with xdp programs will be difficult to manage and debug.
How is an admin supposed to debug it? Say something in the chain of
nic -> native xdp -> bond with your xdp rx -> veth -> xdp prog -> consumer
is dropping a packet. If all forwarding decisions are done by bpf
progs, the progs will have a packet tracing facility (like cilium
does) to show the packet flow end-to-end. It works brilliantly, like
traceroute within a host. But when you have things like macvlan, bond,
or bridge in the middle that can also act on the packet, the admin
will have a hard time.

Essentially what you're proposing is to make all kernel builtin packet
steering/forwarding facilities understand raw xdp frames. That's a lot
of code, and at the end of the chain you'd need a fast xdp frame
consumer, otherwise the perf benefits are lost. If that consumer is an
xdp bpf program, why bother with xdp-fied macvlan or bond? If that
consumer is the tcp stack, then forwarding via an xdp-fied bond is no
faster than via an skb-based bond.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-15 17:17             ` David Ahern
@ 2018-08-16  3:34               ` Jason Wang
  2018-08-16  4:05                 ` Alexei Starovoitov
  2018-08-17 21:15                 ` David Ahern
  0 siblings, 2 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-16  3:34 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst



On 2018年08月16日 01:17, David Ahern wrote:
> On 8/14/18 6:29 PM, Jason Wang wrote:
>>
>> On 2018年08月14日 22:03, David Ahern wrote:
>>> On 8/14/18 7:20 AM, Jason Wang wrote:
>>>> On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 14 Aug 2018 15:59:01 +0800
>>>>> Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>>>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> This series tries to implement XDP support for rx hanlder. This
>>>>>>>> would
>>>>>>>> be useful for doing native XDP on stacked device like macvlan,
>>>>>>>> bridge
>>>>>>>> or even bond.
>>>>>>>>
>>>>>>>> The idea is simple, let stacked device register a XDP rx handler.
>>>>>>>> And
>>>>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>>>>> handler may then decide how to proceed, it could consume the
>>>>>>>> buff, ask
>>>>>>>> driver to drop the packet or ask the driver to fallback to normal
>>>>>>>> skb
>>>>>>>> path.
>>>>>>>>
>>>>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>>>>> example. For ease comparision, generic XDP support for rx handler
>>>>>>>> was
>>>>>>>> also implemented.
>>>>>>>>
>>>>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan
>>>>>>>> (XDP_DROP)
>>>>>>>> shows about 83% improvement.
>>>>>>> I'm missing the motiviation for this.
>>>>>>> It seems performance of such solution is ~1M packet per second.
>>>>>> Notice it was measured by virtio-net which is kind of slow.
>>>>>>
>>>>>>> What would be a real life use case for such feature ?
>>>>>> I had another run on top of 10G mlx4 and macvlan:
>>>>>>
>>>>>> XDP_DROP on mlx4: 14.0Mpps
>>>>>> XDP_DROP on macvlan: 10.05Mpps
>>>>>>
>>>>>> Perf shows macvlan_hash_lookup() and indirect call to
>>>>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>>>>> numbers are acceptable. And we could try more optimizations on top.
>>>>>>
>>>>>> So here's real life use case is trying to have an fast XDP path for rx
>>>>>> handler based device:
>>>>>>
>>>>>> - For containers, we can run XDP for macvlan (~70% of wire speed).
>>>>>> This
>>>>>> allows a container specific policy.
>>>>>> - For VM, we can implement macvtap XDP rx handler on top. This
>>>>>> allow us
>>>>>> to forward packet to VM without building skb in the setup of macvtap.
>>>>>> - The idea could be used by other rx handler based device like bridge,
>>>>>> we may have a XDP fast forwarding path for bridge.
>>>>>>
>>>>>>> Another concern is that XDP users expect to get line rate performance
>>>>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>>>>> mechanism to operate on NICs that don't have native XDP yet.
>>>>>> So I can replace generic XDP TX routine with a native one for macvlan.
>>>>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>>>>> XDP_REDIRECT, then we are basically done.
>>>> As I replied in another thread this probably not true. Its
>>>> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
>>>> except for the case of bridge mode.
>>>>
>>>>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>>>>> high speed networking to be done inside containers after veth.
>>>>>>> It's trying to get to line rate inside container.
>>>>>> This is one of the goal of this series as well. I agree veth XDP work
>>>>>> looks pretty fine, but it only work for a specific setup I believe
>>>>>> since
>>>>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>>>>> there's no VF driver support).
>>>>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>>>>> argument that only a few drivers implement this.  Especially since all
>>>>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>>>>
>>>>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>>>>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>>>>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>>>>> side of redirect, then we can implement RX-side in an afternoon.
>>>> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
>>>> which breaks assumptions of some drivers. And since we don't disconnect
>>>> RX and TX, it looks to me the partial implementation is even worse?
>>>> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>>>>
>>>>>> And in order to make it work for a end
>>>>>> user, the XDP program still need logic like hash(map) lookup to
>>>>>> determine the destination veth.
>>>>> That _is_ the general idea behind XDP and eBPF, that we need to add
>>>>> logic
>>>>> that determine the destination.  The kernel provides the basic
>>>>> mechanisms for moving/redirecting packets fast, and someone else
>>>>> builds an orchestration tool like Cilium, that adds the needed logic.
>>>> Yes, so my reply is for the concern about performance. I meant anyway
>>>> the hash lookup will make it not hit the wire speed.
>>>>
>>>>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>>>>> accessible from XDP.
>>>> Yes.
>>>>
>>>>> For macvlan, I imagine that we could add a BPF helper that allows you
>>>>> to lookup/call macvlan_hash_lookup().
>>>> That's true but we still need a method to feed macvlan with XDP buff.
>>>> I'm not sure if this could be treated as another kind of redirection,
>>>> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
>>>> redirection, XDP rx handler has its own advantages:
>>>>
>>>> 1) Use the exist API and userspace to setup the network topology instead
>>>> of inventing new tools and its own specific API. This means user can
>>>> just setup macvlan (macvtap, bridge or other) as usual and simply attach
>>>> XDP programs to both macvlan and its under layer device.
>>>> 2) Ease the processing of complex logic, XDP can not do cloning or
>>>> reference counting. We can differ those cases and let normal networking
>>>> stack to deal with such packets seamlessly. I believe this is one of the
>>>> advantage of XDP. This makes us to focus on the fast path and greatly
>>>> simplify the codes.
>>>>
>>>> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
>>>> buff. It's just another basic mechanism. Policy is still done by XDP
>>>> program itself.
>>>>
>>> I have been looking into handling stacked devices via lookup helper
>>> functions. The idea is that a program only needs to be installed on the
>>> root netdev (ie., the one representing the physical port), and it can
>>> use helpers to create an efficient pipeline to decide what to do with
>>> the packet in the presence of stacked devices.
>>>
>>> For example, anyone doing pure L3 could do:
>>>
>>> {port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...
>>>
>>>     --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT
>>>
>>> port is the netdev associated with the ingress_ifindex in the xdp_md
>>> context, vlan is the vlan in the packet or the assigned PVID if
>>> relevant. From there l2dev could be a bond or bridge device for example,
>>> and l3dev is the one with a network address (vlan netdev, bond netdev,
>>> etc).
>> Looks less flexible since the topology is hard coded in the XDP program
>> itself and this requires all logic to be implemented in the program on
>> the root netdev.
> Nothing about the topology is hard coded. The idea is to mimic a
> hardware pipeline and acknowledging that a port device can have an
> arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc

I may be missing something, but BPF forbids loops. Without a loop, how
can we make sure all stacked devices are enumerated correctly without
knowing the topology in advance?

>
>>> I have L3 forwarding working for vlan devices and bonds. I had not
>>> considered macvlans specifically yet, but it should be straightforward
>>> to add.
>>>
>> Yes, and all these could be done through XDP rx handler as well, and it
>> can do even more with rather simple logic:
>  From a forwarding perspective I suspect the rx handler approach is going
> to have much more overhead (ie., higher latency per packet and hence
> lower throughput) as the layers determine which one to use (e.g., is the
> FIB lookup done on the port device, vlan device, or macvlan device on
> the vlan device).

Well, if we want stacked devices to behave correctly, this is probably
the only way. E.g. in the above figure, to make "find l2dev" work
correctly, we still need device-specific logic, which would be very
similar to what the XDP rx handler does.

Thanks

>
>> 1 macvlan has its own namespace, and want its own bpf logic.
>> 2 Ruse the exist topology information for dealing with more complex
>> setup like macvlan on top of bond and team. There's no need to bpf
>> program to care about topology. If you look at the code, there's even no
>> need to attach XDP on each stacked device. The calling of xdp_do_pass()
>> can try to pass XDP buff to upper device even if there's no XDP program
>> attached to current layer.
>> 3 Deliver XDP buff to userspace through macvtap.
>>
>> Thanks


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-16  3:34               ` Jason Wang
@ 2018-08-16  4:05                 ` Alexei Starovoitov
  2018-08-16  4:24                   ` Jason Wang
  2018-08-17 21:15                 ` David Ahern
  1 sibling, 1 reply; 27+ messages in thread
From: Alexei Starovoitov @ 2018-08-16  4:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst

On Thu, Aug 16, 2018 at 11:34:20AM +0800, Jason Wang wrote:
> > Nothing about the topology is hard coded. The idea is to mimic a
> > hardware pipeline and acknowledging that a port device can have an
> > arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc
> 
> I may miss something but BPF forbids loop. Without a loop how can we make
> sure all stacked devices is enumerated correctly without knowing the
> topology in advance?

not following. why do you need a loop to implement macvlan as an xdp prog?
if a loop is needed, such an algorithm is not going to scale whether
it's implemented as a bpf program or as in-kernel c code.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-16  2:49                 ` Alexei Starovoitov
@ 2018-08-16  4:21                   ` Jason Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-16  4:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst, Toshiaki Makita



On 2018年08月16日 10:49, Alexei Starovoitov wrote:
> On Wed, Aug 15, 2018 at 03:04:35PM +0800, Jason Wang wrote:
>>>> 3 Deliver XDP buff to userspace through macvtap.
>>> I think I'm getting what you're trying to achieve.
>>> You actually don't want any bpf programs in there at all.
>>> You want macvlan builtin logic to act on raw packet frames.
>> The built-in logic is just used to find the destination macvlan device. It
>> could be done by through another bpf program. Instead of inventing lots of
>> generic infrastructure on kernel with specific userspace API, built-in logic
>> has its own advantages:
>>
>> - support hundreds or even thousands of macvlans
> are you saying xdp bpf program cannot handle thousands macvlans?

Correct me if I'm wrong. It works well when the macvlans require
similar logic. But let's consider the case where each macvlan wants
its own specific logic. Is it possible to have thousands of different
policies and actions in a single BPF program? With the XDP rx handler,
there's no need for the root device to care about them; each macvlan
only needs to care about itself. This is similar to the way a qdisc
can be attached to each stacked device.

>
>> - using exist tools to configure network
>> - immunity to topology changes
> what do you mean specifically?

Take the above example: if some macvlans are deleted or created, we
need to notify and update the policies in the root device. This
requires a userspace control program to monitor those changes and
notify the BPF program through maps. Unless the BPF program is
designed for some specific configurations and setups, it would not be
an easy task.

>
>> Besides the usage for containers, we can implement macvtap RX handler which
>> allows a fast packet forwarding to userspace.
> and try to reinvent af_xdp? the motivation for the patchset still escapes me.

Nope, macvtap is used for forwarding packets to a VM. This just tries
to deliver the XDP buff to the VM instead of an skb. A similar idea
was used for TUN/TAP and showed impressive improvements.

>
>> Actually, the idea is not limited to macvlan but for all device that is
>> based on rx handler. Consider the case of bonding, this allows to set a very
>> simple XDP program on slaves and keep a single main logic XDP program on the
>> bond instead of duplicating it in all slaves.
> I think such mixed environment of hardcoded in-kernel things like bond
> mixed together with xdp programs will be difficult to manage and debug.
> How admin suppose to debug it?

Well, we already have an in-kernel XDP_TX routine. It should not be
harder than that.

>   Say something in the chain of
> nic -> native xdp -> bond with your xdp rx -> veth -> xdp prog -> consumer
> is dropping a packet. If all forwarding decisions are done by bpf progs
> the progs will have packet tracing facility (like cilium does) to
> show packet flow end-to-end. It works briliantly like traceroute within a host.

Does this work well for a veth pair as well? If yes, it should work
for the rx handler too, or maybe it has some hard-coded logic like
"ok, the packet goes to veth, I'm sure it will be delivered to its
peer"? The idea of this series is not to forbid forwarding decisions
made by bpf progs; if the code does this by accident, we can introduce
a flag to disable/enable the XDP rx handler.

And I believe redirection is only part of the XDP usage; we may still
want things like XDP_TX.

> But when you have things like macvlan, bond, bridge in the middle
> that can also act on packet, the admin will have a hard time.

I admit it may require help from the admin, but it gives us more
flexibility.

>
> Essentially what you're proposing is to make all kernel builtin packet
> steering/forwarding facilities to understand raw xdp frames.

Probably not; at least this series just focuses on the rx handler.
Fewer than 10 devices use that.

> That's a lot of code
> and at the end of the chain you'd need fast xdp frame consumer otherwise
> perf benefits are lost.

The performance is lost but it is still the same as the skb path. And
besides redirection, we do have other consumers like XDP_TX.

>   If that consumer is xdp bpf program
> why bother with xdp-fied macvlan or bond?

For macvlan, we may want to have different policies for different
devices. For bond, we don't want to duplicate the XDP logic in each
slave, and only the bond knows which slave could be used for XDP_TX.

>   If that consumer is tcp stack
> than forwarding via xdp-fied bond is no faster than via skb-based bond.
>

Yes.

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-16  4:05                 ` Alexei Starovoitov
@ 2018-08-16  4:24                   ` Jason Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-08-16  4:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst



On 2018年08月16日 12:05, Alexei Starovoitov wrote:
> On Thu, Aug 16, 2018 at 11:34:20AM +0800, Jason Wang wrote:
>>> Nothing about the topology is hard coded. The idea is to mimic a
>>> hardware pipeline and acknowledging that a port device can have an
>>> arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc
>> I may miss something but BPF forbids loop. Without a loop how can we make
>> sure all stacked devices is enumerated correctly without knowing the
>> topology in advance?
> not following. why do you need a loop to implement macvlan as an xdp prog?
> if loop is needed, such algorithm is not going to scale whether
> it's implemented as bpf program or as in-kernel c code.

David said the port can have arbitrary layers stacked on it. So if we
try to enumerate them before making forwarding decisions purely in a
BPF program, it looks to me like a loop is needed here.

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-16  3:34               ` Jason Wang
  2018-08-16  4:05                 ` Alexei Starovoitov
@ 2018-08-17 21:15                 ` David Ahern
  2018-08-20  6:34                   ` Jason Wang
  1 sibling, 1 reply; 27+ messages in thread
From: David Ahern @ 2018-08-17 21:15 UTC (permalink / raw)
  To: Jason Wang, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst

On 8/15/18 9:34 PM, Jason Wang wrote:
> I may miss something but BPF forbids loop. Without a loop how can we
> make sure all stacked devices is enumerated correctly without knowing
> the topology in advance?

netdev_for_each_upper_dev_rcu

BPF helpers allow programs to do lookups in kernel tables, in this case
the ability to find an upper device that would receive the packet.
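
As a rough illustration (this is not code from any posted patch), a
kernel-side helper along those lines could walk the upper devices and
return the ifindex of the one whose address matches the frame's
destination MAC:

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

/* Illustrative only.  Caller must hold rcu_read_lock(). */
static int xdp_upper_dev_by_dmac(struct net_device *dev, const u8 *dmac)
{
	struct net_device *upper;
	struct list_head *iter;

	netdev_for_each_upper_dev_rcu(dev, upper, iter) {
		if (ether_addr_equal(upper->dev_addr, dmac))
			return upper->ifindex;
	}
	return 0;	/* no stacked device claims this address */
}

This only covers a single layer; multiple stacked layers would need
more logic, as the follow-ups below discuss.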

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-17 21:15                 ` David Ahern
@ 2018-08-20  6:34                   ` Jason Wang
  2018-09-05 17:20                     ` David Ahern
  0 siblings, 1 reply; 27+ messages in thread
From: Jason Wang @ 2018-08-20  6:34 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst



On 2018年08月18日 05:15, David Ahern wrote:
> On 8/15/18 9:34 PM, Jason Wang wrote:
>> I may miss something but BPF forbids loop. Without a loop how can we
>> make sure all stacked devices is enumerated correctly without knowing
>> the topology in advance?
> netdev_for_each_upper_dev_rcu
>
> BPF helpers allow programs to do lookups in kernel tables, in this case
> the ability to find an upper device that would receive the packet.

So if I understand correctly, you mean using
netdev_for_each_upper_dev_rcu() inside a BPF helper? If yes, I think
we may still need device-specific logic. E.g. for macvlan,
netdev_for_each_upper_dev_rcu() enumerates all macvlan devices on top
of a lower device, but what we need is the one macvlan that matches
the dst mac address, which is similar to what the XDP rx handler does.
And it would become more complicated if we have multiple layers of
devices.

So let's consider a simple case where we have 5 macvlan devices:

macvlan0: do some packet filtering before passing packets to the TCP/IP stack
macvlan1: modify packets and redirect them to another interface
macvlan2: modify packets and transmit them back through XDP_TX
macvlan3: deliver packets to AF_XDP
macvtap0: deliver raw XDP packets to a VM

So, with the XDP rx handler, all we need to do is attach a different
XDP program to each macvlan device (a minimal per-device sketch
follows below). Your idea is to do everything in the root device's XDP
program. This looks complicated and inflexible since it needs to care
about a lot of things, e.g. adding/removing actions and policies. And
the XDP program needs to call a BPF helper that uses
netdev_for_each_upper_dev_rcu() to work correctly with stacked
devices.
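
As a sketch of how small such a per-device program can be, the
macvlan0 case above might be no more than something like this
(libbpf-style headers assumed; the IPv4-only filter is just an
arbitrary example policy):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int macvlan0_filter(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;

	if ((void *)(eth + 1) > data_end)
		return XDP_DROP;

	/* example policy: this macvlan only accepts IPv4 traffic */
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_DROP;

	return XDP_PASS;	/* hand the frame to macvlan0's stack */
}

char _license[] SEC("license") = "GPL";

The other macvlans would each carry their own equally small program
(redirect, XDP_TX, AF_XDP, and so on) without the root device's
program having to know about any of them.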

Thanks

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-08-20  6:34                   ` Jason Wang
@ 2018-09-05 17:20                     ` David Ahern
  2018-09-06  5:12                       ` Jason Wang
  0 siblings, 1 reply; 27+ messages in thread
From: David Ahern @ 2018-09-05 17:20 UTC (permalink / raw)
  To: Jason Wang, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst

[ sorry for the delay; focused on the nexthop RFC ]

On 8/20/18 12:34 AM, Jason Wang wrote:
> 
> 
> On 2018年08月18日 05:15, David Ahern wrote:
>> On 8/15/18 9:34 PM, Jason Wang wrote:
>>> I may miss something but BPF forbids loop. Without a loop how can we
>>> make sure all stacked devices is enumerated correctly without knowing
>>> the topology in advance?
>> netdev_for_each_upper_dev_rcu
>>
>> BPF helpers allow programs to do lookups in kernel tables, in this case
>> the ability to find an upper device that would receive the packet.
> 
> So if I understand correctly, you mean using
> netdev_for_each_upper_dev_rcu() inside a BPF helper? If yes, I think we
> may still need device specific logic. E.g for macvlan,
> netdev_for_each_upper_dev_rcu() enumerates all macvlan devices on top a
> lower device. But what we need is one of the macvlan that matches the
> dst mac address which is similar to what XDP rx handler did. And it
> would become more complicated if we have multiple layers of device.

My device lookup helper takes the base port index (starting device),
vlan protocol, vlan tag and dest mac. So, yes, the mac address is used
to uniquely identify the stacked device.

> 
> So let's consider a simple case, consider we have 5 macvlan devices:
> 
> macvlan0: doing some packet filtering before passing packets to TCP/IP
> stack
> macvlan1: modify packets and redirect to another interface
> macvlan2: modify packets and transmit packet back through XDP_TX
> macvlan3: deliver packets to AF_XDP
> macvtap0: deliver packets raw XDP to VM
> 
> So, with XDP rx handler, what we need to just to attach five different
> XDP programs to each macvlan device. Your idea is to do all things in
> the root device XDP program. This looks complicated and not flexible
> since it needs to care a lot of things, e.g adding/removing
> actions/policies. And XDP program needs to call BPF helper that use
> netdev_for_each_upper_dev_rcu() to work correctly with stacked device.
> 

A nic port can have all kinds of combinations stacked on top of it -
vlans, bonds, bridges, vlans on bonds and bridges, macvlans, etc. I
suspect trying to install a program for layer 3 forwarding on each one
and iteratively running the programs would kill the performance gained
from forwarding with xdp.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
  2018-09-05 17:20                     ` David Ahern
@ 2018-09-06  5:12                       ` Jason Wang
  0 siblings, 0 replies; 27+ messages in thread
From: Jason Wang @ 2018-09-06  5:12 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst



On 2018年09月06日 01:20, David Ahern wrote:
> [ sorry for the delay; focused on the nexthop RFC ]

No problem. Your comments are appreciated.

> On 8/20/18 12:34 AM, Jason Wang wrote:
>>
>> On 2018年08月18日 05:15, David Ahern wrote:
>>> On 8/15/18 9:34 PM, Jason Wang wrote:
>>>> I may miss something but BPF forbids loop. Without a loop how can we
>>>> make sure all stacked devices is enumerated correctly without knowing
>>>> the topology in advance?
>>> netdev_for_each_upper_dev_rcu
>>>
>>> BPF helpers allow programs to do lookups in kernel tables, in this case
>>> the ability to find an upper device that would receive the packet.
>> So if I understand correctly, you mean using
>> netdev_for_each_upper_dev_rcu() inside a BPF helper? If yes, I think we
>> may still need device specific logic. E.g for macvlan,
>> netdev_for_each_upper_dev_rcu() enumerates all macvlan devices on top a
>> lower device. But what we need is one of the macvlan that matches the
>> dst mac address which is similar to what XDP rx handler did. And it
>> would become more complicated if we have multiple layers of device.
> My device lookup helper takes the base port index (starting device),
> vlan protocol, vlan tag and dest mac. So, yes, the mac address is used
> to uniquely identify the stacked device.

Ok.

>
>> So let's consider a simple case, consider we have 5 macvlan devices:
>>
>> macvlan0: doing some packet filtering before passing packets to TCP/IP
>> stack
>> macvlan1: modify packets and redirect to another interface
>> macvlan2: modify packets and transmit packet back through XDP_TX
>> macvlan3: deliver packets to AF_XDP
>> macvtap0: deliver packets raw XDP to VM
>>
>> So, with XDP rx handler, what we need to just to attach five different
>> XDP programs to each macvlan device. Your idea is to do all things in
>> the root device XDP program. This looks complicated and not flexible
>> since it needs to care a lot of things, e.g adding/removing
>> actions/policies. And XDP program needs to call BPF helper that use
>> netdev_for_each_upper_dev_rcu() to work correctly with stacked device.
>>
> Stacking on top of a nic port can have all kinds of combinations of
> vlans, bonds, bridges, vlans on bonds and bridges, macvlans, etc. I
> suspect trying to install a program for layer 3 forwarding on each one
> and iteratively running the programs would kill the performance gained
> from forwarding with xdp.

Yes, the performance may drop, but it's still much faster than the
generic XDP path.

One reason for the drop is the device-specific logic like mac address
matching, which is also needed in the case of a single XDP program on
the root device. For macvlan, if we allow attaching XDP to macvlan, we
can offload the mac address lookup to hardware through L2 forwarding
offload, which I believe could eliminate the performance drop. The
only overhead introduced by the XDP rx handler itself is probably the
indirect calls, and we can try to amortize them by introducing some
kind of batching on top. As for the issue of multiple XDP program
iterations: for this RFC, if we have N stacked devices, there's no
need to attach an XDP program at each layer. The only thing needed is
the XDP_PASS action in the root device; then you can attach an XDP
program to any one or more of the stacked devices on top.

So the RFC is not intended to replace any existing solution; it just
provides some flexibility for having native XDP on stacked devices
(which are based on rx handlers) while benefiting from existing tools
to do the configuration. If a user wants to do everything in the root
device, that should work well without any issues.

Thanks



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-09-06  5:12 UTC | newest]

Thread overview: 27+ messages
2018-08-13  3:17 [RFC PATCH net-next V2 0/6] XDP rx handler Jason Wang
2018-08-13  3:17 ` [RFC PATCH net-next V2 1/6] net: core: factor out generic XDP check and process routine Jason Wang
2018-08-13  3:17 ` [RFC PATCH net-next V2 2/6] net: core: generic XDP support for stacked device Jason Wang
2018-08-13  3:17 ` [RFC PATCH net-next V2 3/6] net: core: introduce XDP rx handler Jason Wang
2018-08-13  3:17 ` [RFC PATCH net-next V2 4/6] macvlan: count the number of vlan in source mode Jason Wang
2018-08-13  3:17 ` [RFC PATCH net-next V2 5/6] macvlan: basic XDP support Jason Wang
2018-08-13  3:17 ` [RFC PATCH net-next V2 6/6] virtio-net: support XDP rx handler Jason Wang
2018-08-14  9:22   ` Jesper Dangaard Brouer
2018-08-14 13:01     ` Jason Wang
2018-08-14  0:32 ` [RFC PATCH net-next V2 0/6] " Alexei Starovoitov
2018-08-14  7:59   ` Jason Wang
2018-08-14 10:17     ` Jesper Dangaard Brouer
2018-08-14 13:20       ` Jason Wang
2018-08-14 14:03         ` David Ahern
2018-08-15  0:29           ` Jason Wang
2018-08-15  5:35             ` Alexei Starovoitov
2018-08-15  7:04               ` Jason Wang
2018-08-16  2:49                 ` Alexei Starovoitov
2018-08-16  4:21                   ` Jason Wang
2018-08-15 17:17             ` David Ahern
2018-08-16  3:34               ` Jason Wang
2018-08-16  4:05                 ` Alexei Starovoitov
2018-08-16  4:24                   ` Jason Wang
2018-08-17 21:15                 ` David Ahern
2018-08-20  6:34                   ` Jason Wang
2018-09-05 17:20                     ` David Ahern
2018-09-06  5:12                       ` Jason Wang
