netdev.vger.kernel.org archive mirror
* [RFC PATCH 00/12] Implement XDP bpf_redirect variants
@ 2017-07-07 17:34 John Fastabend
  2017-07-07 17:34 ` [RFC PATCH 01/12] ixgbe: NULL xdp_tx rings on resource cleanup John Fastabend
                   ` (12 more replies)
  0 siblings, 13 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:34 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

This series adds two new XDP helper routines, bpf_redirect() and
bpf_redirect_map(). The first variant, bpf_redirect(), is meant to be
used the same way it is currently used by the cls_bpf classifier: an
XDP packet is redirected immediately when the helper is called.

The other variant, bpf_redirect_map(map, key, flags), uses a new map
type called devmap. A devmap uses integers as keys and net_devices as
values. The user provides key/ifindex pairs to update the map with
new net_devices. This gives two benefits over the plain
bpf_redirect() variant. First, the datapath bpf program is abstracted
away from hard-coded ifindex values, allowing a single bpf program to
be run in many different environments. Second, and perhaps more
important, the map enables batching packet transmits. The map plus
small driver changes allow all send requests across a NAPI poll loop
to be batched, so driver writers can optimize the driver xmit path
and call expensive operations only once for a batch of xdp_buffs.
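
As a quick sketch of the datapath side (map and section names here
are made up for illustration; see the sample patches referenced
below for complete programs):

  struct bpf_map_def SEC("maps") tx_ports = {
          .type = BPF_MAP_TYPE_DEVMAP,
          .key_size = sizeof(int),
          .value_size = sizeof(int),
          .max_entries = 64,
  };

  SEC("xdp")
  int xdp_redirect_map_prog(struct xdp_md *ctx)
  {
          int vport = 0; /* e.g. computed from packet headers */

          /* plain variant hard-codes the egress device:
           *     return bpf_redirect(ifindex, 0);
           * map variant lets user space bind vport -> ifindex:
           */
          return bpf_redirect_map(&tx_ports, vport, 0);
  }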

The devmap was designed with future work in mind: follow-on patches
could add multicast and broadcast support.

To see in more detail how to leverage the new helpers and map from
the userspace side, please review these two patches:

  xdp: sample program for new bpf_redirect helper
  xdp: bpf redirect with map sample program

I'm sending this as an RFC now because (a) the merge window is
closed, so it seems like a good time to get feedback, and (b) I'm
currently doing a last round of testing to ensure all the features
still work after the latest revisions, as well as a final review.

Any feedback would be welcome. Thanks to Jesper, Andy, and Daniel for
all their input, patches, fixes, testing, review, etc. so far; it is
very much appreciated!

Thanks,
John

---

John Fastabend (12):
      ixgbe: NULL xdp_tx rings on resource cleanup
      net: xdp: support xdp generic on virtual devices
      xdp: add bpf_redirect helper function
      xdp: sample program for new bpf_redirect helper
      net: implement XDP_REDIRECT for xdp generic
      ixgbe: add initial support for xdp redirect
      xdp: add trace event for xdp redirect
      bpf: add devmap, a map for storing net device references
      bpf: add bpf_redirect_map helper routine
      xdp: Add batching support to redirect map
      net: add notifier hooks for devmap bpf map
      xdp: bpf redirect with map sample program


 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |    8 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   60 +++
 include/linux/bpf.h                           |    5 
 include/linux/bpf_types.h                     |    3 
 include/linux/filter.h                        |   14 +
 include/linux/netdevice.h                     |   11 +
 include/trace/events/xdp.h                    |   31 ++
 include/uapi/linux/bpf.h                      |   10 +
 kernel/bpf/Makefile                           |    3 
 kernel/bpf/devmap.c                           |  431 +++++++++++++++++++++++++
 kernel/bpf/verifier.c                         |   14 +
 net/core/dev.c                                |  226 ++++++++-----
 net/core/filter.c                             |  172 ++++++++++
 samples/bpf/Makefile                          |    8 
 samples/bpf/bpf_helpers.h                     |    2 
 samples/bpf/xdp_redirect_kern.c               |   81 +++++
 samples/bpf/xdp_redirect_map_kern.c           |   83 +++++
 samples/bpf/xdp_redirect_map_user.c           |  105 ++++++
 samples/bpf/xdp_redirect_user.c               |  102 ++++++
 tools/testing/selftests/bpf/test_maps.c       |   15 +
 20 files changed, 1283 insertions(+), 101 deletions(-)
 create mode 100644 kernel/bpf/devmap.c
 create mode 100644 samples/bpf/xdp_redirect_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_user.c
 create mode 100644 samples/bpf/xdp_redirect_user.c


* [RFC PATCH 01/12] ixgbe: NULL xdp_tx rings on resource cleanup
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
@ 2017-07-07 17:34 ` John Fastabend
  2017-07-07 17:35 ` [RFC PATCH 02/12] net: xdp: support xdp generic on virtual devices John Fastabend
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:34 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

tx_rings and rx_rings are cleaned up on close paths in the ixgbe
driver; however, xdp_rings are not. Set the xdp_rings to NULL here so
that the pointer can be used to indicate whether the XDP rings are
initialized.
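
For example, later code can then guard access with a simple pointer
test; a hypothetical helper just to illustrate the idea:

  static bool ixgbe_xdp_ring_ready(struct ixgbe_adapter *adapter, int i)
  {
          /* after this patch, non-NULL reliably means "initialized" */
          return adapter->xdp_ring[i] != NULL;
  }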

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index b45fdc9..f1bfae0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -1018,8 +1018,12 @@ static void ixgbe_free_q_vector(struct ixgbe_adapter *adapter, int v_idx)
 	struct ixgbe_q_vector *q_vector = adapter->q_vector[v_idx];
 	struct ixgbe_ring *ring;
 
-	ixgbe_for_each_ring(ring, q_vector->tx)
-		adapter->tx_ring[ring->queue_index] = NULL;
+	ixgbe_for_each_ring(ring, q_vector->tx) {
+		if (ring_is_xdp(ring))
+			adapter->xdp_ring[ring->queue_index] = NULL;
+		else
+			adapter->tx_ring[ring->queue_index] = NULL;
+	}
 
 	ixgbe_for_each_ring(ring, q_vector->rx)
 		adapter->rx_ring[ring->queue_index] = NULL;


* [RFC PATCH 02/12] net: xdp: support xdp generic on virtual devices
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
  2017-07-07 17:34 ` [RFC PATCH 01/12] ixgbe: NULL xdp_tx rings on resource cleanup John Fastabend
@ 2017-07-07 17:35 ` John Fastabend
  2017-07-07 17:35 ` [RFC PATCH 03/12] xdp: add bpf_redirect helper function John Fastabend
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:35 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

XDP generic allows users to test XDP programs, and run them with
degraded performance, on devices that do not yet support XDP
natively. For testing I typically exercise eBPF programs using a set
of veth devices (e.g. created with 'ip link add dev veth0 type veth
peer name veth1'). This allows testing topologies that would
otherwise be difficult to set up, especially in the early stages of
development.

This patch adds an xdp generic hook to the netif_rx_internal()
function, which is called from dev_forward_skb(). With this addition,
attaching XDP programs to veth devices works as expected! I also
noticed multiple drivers using netif_rx(); generic XDP will work for
those devices as well.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 net/core/dev.c |  208 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 113 insertions(+), 95 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index a91572a..28b9939 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3865,6 +3865,107 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 	return NET_RX_DROP;
 }
 
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct bpf_prog *xdp_prog)
+{
+	struct xdp_buff xdp;
+	u32 act = XDP_DROP;
+	void *orig_data;
+	int hlen, off;
+	u32 mac_len;
+
+	/* Reinjected packets coming from act_mirred or similar should
+	 * not get XDP generic processing.
+	 */
+	if (skb_cloned(skb))
+		return XDP_PASS;
+
+	if (skb_linearize(skb))
+		goto do_drop;
+
+	/* The XDP program wants to see the packet starting at the MAC
+	 * header.
+	 */
+	mac_len = skb->data - skb_mac_header(skb);
+	hlen = skb_headlen(skb) + mac_len;
+	xdp.data = skb->data - mac_len;
+	xdp.data_end = xdp.data + hlen;
+	xdp.data_hard_start = skb->data - skb_headroom(skb);
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	off = xdp.data - orig_data;
+	if (off > 0)
+		__skb_pull(skb, off);
+	else if (off < 0)
+		__skb_push(skb, -off);
+
+	switch (act) {
+	case XDP_TX:
+		__skb_push(skb, mac_len);
+		/* fall through */
+	case XDP_PASS:
+		break;
+
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		/* fall through */
+	case XDP_ABORTED:
+		trace_xdp_exception(skb->dev, xdp_prog, act);
+		/* fall through */
+	case XDP_DROP:
+	do_drop:
+		kfree_skb(skb);
+		break;
+	}
+
+	return act;
+}
+
+/* When doing generic XDP we have to bypass the qdisc layer and the
+ * network taps in order to match in-driver-XDP behavior.
+ */
+static void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
+{
+	struct net_device *dev = skb->dev;
+	struct netdev_queue *txq;
+	bool free_skb = true;
+	int cpu, rc;
+
+	txq = netdev_pick_tx(dev, skb, NULL);
+	cpu = smp_processor_id();
+	HARD_TX_LOCK(dev, txq, cpu);
+	if (!netif_xmit_stopped(txq)) {
+		rc = netdev_start_xmit(skb, dev, txq, 0);
+		if (dev_xmit_complete(rc))
+			free_skb = false;
+	}
+	HARD_TX_UNLOCK(dev, txq);
+	if (free_skb) {
+		trace_xdp_exception(dev, xdp_prog, XDP_TX);
+		kfree_skb(skb);
+	}
+}
+
+static struct static_key generic_xdp_needed __read_mostly;
+
+static int do_xdp_generic(struct sk_buff *skb)
+{
+	struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+
+	if (xdp_prog) {
+		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+
+		if (act != XDP_PASS) {
+			if (act == XDP_TX)
+				generic_xdp_tx(skb, xdp_prog);
+			return XDP_DROP;
+		}
+	}
+	return XDP_PASS;
+}
+
 static int netif_rx_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -3872,6 +3973,14 @@ static int netif_rx_internal(struct sk_buff *skb)
 	net_timestamp_check(netdev_tstamp_prequeue, skb);
 
 	trace_netif_rx(skb);
+
+	if (static_key_false(&generic_xdp_needed)) {
+		int ret = do_xdp_generic(skb);
+
+		if (ret != XDP_PASS)
+			return NET_RX_DROP;
+	}
+
 #ifdef CONFIG_RPS
 	if (static_key_false(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
@@ -4338,8 +4447,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
-static struct static_key generic_xdp_needed __read_mostly;
-
 static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
 {
 	struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
@@ -4373,89 +4480,6 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
 	return ret;
 }
 
-static u32 netif_receive_generic_xdp(struct sk_buff *skb,
-				     struct bpf_prog *xdp_prog)
-{
-	struct xdp_buff xdp;
-	u32 act = XDP_DROP;
-	void *orig_data;
-	int hlen, off;
-	u32 mac_len;
-
-	/* Reinjected packets coming from act_mirred or similar should
-	 * not get XDP generic processing.
-	 */
-	if (skb_cloned(skb))
-		return XDP_PASS;
-
-	if (skb_linearize(skb))
-		goto do_drop;
-
-	/* The XDP program wants to see the packet starting at the MAC
-	 * header.
-	 */
-	mac_len = skb->data - skb_mac_header(skb);
-	hlen = skb_headlen(skb) + mac_len;
-	xdp.data = skb->data - mac_len;
-	xdp.data_end = xdp.data + hlen;
-	xdp.data_hard_start = skb->data - skb_headroom(skb);
-	orig_data = xdp.data;
-
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
-
-	off = xdp.data - orig_data;
-	if (off > 0)
-		__skb_pull(skb, off);
-	else if (off < 0)
-		__skb_push(skb, -off);
-
-	switch (act) {
-	case XDP_TX:
-		__skb_push(skb, mac_len);
-		/* fall through */
-	case XDP_PASS:
-		break;
-
-	default:
-		bpf_warn_invalid_xdp_action(act);
-		/* fall through */
-	case XDP_ABORTED:
-		trace_xdp_exception(skb->dev, xdp_prog, act);
-		/* fall through */
-	case XDP_DROP:
-	do_drop:
-		kfree_skb(skb);
-		break;
-	}
-
-	return act;
-}
-
-/* When doing generic XDP we have to bypass the qdisc layer and the
- * network taps in order to match in-driver-XDP behavior.
- */
-static void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
-{
-	struct net_device *dev = skb->dev;
-	struct netdev_queue *txq;
-	bool free_skb = true;
-	int cpu, rc;
-
-	txq = netdev_pick_tx(dev, skb, NULL);
-	cpu = smp_processor_id();
-	HARD_TX_LOCK(dev, txq, cpu);
-	if (!netif_xmit_stopped(txq)) {
-		rc = netdev_start_xmit(skb, dev, txq, 0);
-		if (dev_xmit_complete(rc))
-			free_skb = false;
-	}
-	HARD_TX_UNLOCK(dev, txq);
-	if (free_skb) {
-		trace_xdp_exception(dev, xdp_prog, XDP_TX);
-		kfree_skb(skb);
-	}
-}
-
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4468,17 +4492,11 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 	rcu_read_lock();
 
 	if (static_key_false(&generic_xdp_needed)) {
-		struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog);
-
-		if (xdp_prog) {
-			u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+		int ret = do_xdp_generic(skb);
 
-			if (act != XDP_PASS) {
-				rcu_read_unlock();
-				if (act == XDP_TX)
-					generic_xdp_tx(skb, xdp_prog);
-				return NET_RX_DROP;
-			}
+		if (ret != XDP_PASS) {
+			rcu_read_unlock();
+			return NET_RX_DROP;
 		}
 	}
 


* [RFC PATCH 03/12] xdp: add bpf_redirect helper function
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
  2017-07-07 17:34 ` [RFC PATCH 01/12] ixgbe: NULL xdp_tx rings on resource cleanup John Fastabend
  2017-07-07 17:35 ` [RFC PATCH 02/12] net: xdp: support xdp generic on virtual devices John Fastabend
@ 2017-07-07 17:35 ` John Fastabend
  2017-07-09 13:37   ` Saeed Mahameed
  2017-07-07 17:35 ` [RFC PATCH 04/12] xdp: sample program for new bpf_redirect helper John Fastabend
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:35 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

This adds support for a bpf_redirect helper function to the XDP
infrastructure. For now this only supports redirecting to the egress
path of a port.

In order to support drivers handling an xdp_buff natively, this patch
uses a new ndo operation, ndo_xdp_xmit(), that pushes an xdp_buff to
the specified device.

If the program specifies either (a) an unknown device or (b) a device
that does not support the operation, a BPF warning is thrown and the
XDP_ABORTED error code is returned.
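
From the driver side the contract looks roughly like the sketch
below; everything except xdp_do_redirect() is illustrative (patch 06
wires this up for real in ixgbe):

  static u32 foo_run_xdp(struct net_device *netdev,
                         struct bpf_prog *xdp_prog, struct xdp_buff *xdp)
  {
          u32 act = bpf_prog_run_xdp(xdp_prog, xdp);

          switch (act) {
          case XDP_REDIRECT:
                  /* hands the xdp_buff to the target's ndo_xdp_xmit() */
                  if (xdp_do_redirect(netdev, xdp) < 0)
                          act = XDP_ABORTED;
                  break;
          default:
                  /* XDP_PASS, XDP_TX, XDP_DROP handled as before */
                  break;
          }
          return act;
  }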

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/filter.h    |    4 +++
 include/linux/netdevice.h |    6 +++++
 include/uapi/linux/bpf.h  |    1 +
 net/core/filter.c         |   53 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 64 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1fa26dc..d0a1279 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -667,7 +667,11 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
+
+int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp);
+
 void bpf_warn_invalid_xdp_action(u32 act);
+void bpf_warn_invalid_xdp_redirect(u32 ifindex);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 85f01d6..49e8c12 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -66,6 +66,7 @@
 /* UDP Tunnel offloads */
 struct udp_tunnel_info;
 struct bpf_prog;
+struct xdp_buff;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
 				    const struct ethtool_ops *ops);
@@ -1138,6 +1139,9 @@ struct xfrmdev_ops {
  * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);
  *	This function is used to set or query state related to XDP on the
  *	netdevice. See definition of enum xdp_netdev_command for details.
+ * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
+ *	This function is used to submit a XDP packet for transmit on a
+ *	netdevice.
  *
  */
 struct net_device_ops {
@@ -1323,6 +1327,8 @@ struct net_device_ops {
 						       int needed_headroom);
 	int			(*ndo_xdp)(struct net_device *dev,
 					   struct netdev_xdp *xdp);
+	int			(*ndo_xdp_xmit)(struct net_device *dev,
+						struct xdp_buff *xdp);
 };
 
 /**
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f94b48b..e1f3827 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -689,6 +689,7 @@ enum xdp_action {
 	XDP_DROP,
 	XDP_PASS,
 	XDP_TX,
+	XDP_REDIRECT,
 };
 
 /* user accessible metadata for XDP packet hook
diff --git a/net/core/filter.c b/net/core/filter.c
index b39c869..5c9fe3e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2298,6 +2298,51 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 	.arg2_type	= ARG_ANYTHING,
 };
 
+static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
+{
+	if (dev->netdev_ops->ndo_xdp_xmit) {
+		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		return 0;
+	}
+	bpf_warn_invalid_xdp_redirect(dev->ifindex);
+	return -EOPNOTSUPP;
+}
+
+int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+	dev = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
+	ri->ifindex = 0;
+	if (unlikely(!dev)) {
+		bpf_warn_invalid_xdp_redirect(ri->ifindex);
+		return -EINVAL;
+	}
+
+	return __bpf_tx_xdp(dev, xdp);
+}
+EXPORT_SYMBOL_GPL(xdp_do_redirect);
+
+BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+	if (unlikely(flags))
+		return XDP_ABORTED;
+
+	ri->ifindex = ifindex;
+	ri->flags = flags;
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_proto = {
+	.func           = bpf_xdp_redirect,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_ANYTHING,
+	.arg2_type      = ARG_ANYTHING,
+};
+
 bool bpf_helper_changes_pkt_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
@@ -2790,6 +2835,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_xdp_adjust_head:
 		return &bpf_xdp_adjust_head_proto;
+	case BPF_FUNC_redirect:
+		return &bpf_xdp_redirect_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -3110,6 +3157,12 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+void bpf_warn_invalid_xdp_redirect(u32 ifindex)
+{
+	WARN_ONCE(1, "Illegal XDP redirect to unsupported device ifindex(%i)\n", ifindex);
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_redirect);
+
 static u32 bpf_convert_ctx_access(enum bpf_access_type type,
 				  const struct bpf_insn *si,
 				  struct bpf_insn *insn_buf,


* [RFC PATCH 04/12] xdp: sample program for new bpf_redirect helper
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
                   ` (2 preceding siblings ...)
  2017-07-07 17:35 ` [RFC PATCH 03/12] xdp: add bpf_redirect helper function John Fastabend
@ 2017-07-07 17:35 ` John Fastabend
  2017-07-07 17:36 ` [RFC PATCH 05/12] net: implement XDP_REDIRECT for xdp generic John Fastabend
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:35 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

This implements a sample program for testing bpf_redirect. It reports
the number of packets redirected per second. As input it takes the
ifindex of the device to run the XDP program on and the ifindex of
the interface to redirect packets to.
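
Example invocation, attaching to the device with ifindex 6 and
redirecting to ifindex 7 (the values are illustrative):

  # ./xdp_redirect 6 7

The program then prints the redirected packet rate every two seconds
in "ifindex N: ... pkt/s" form.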

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 samples/bpf/Makefile            |    4 ++
 samples/bpf/xdp_redirect_kern.c |   81 +++++++++++++++++++++++++++++++
 samples/bpf/xdp_redirect_user.c |  102 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 187 insertions(+)
 create mode 100644 samples/bpf/xdp_redirect_kern.c
 create mode 100644 samples/bpf/xdp_redirect_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index e7ec9b8..99a5084 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -36,6 +36,7 @@ hostprogs-y += lwt_len_hist
 hostprogs-y += xdp_tx_iptunnel
 hostprogs-y += test_map_in_map
 hostprogs-y += per_socket_stats_example
+hostprogs-y += xdp_redirect
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -76,6 +77,7 @@ lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
 xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
 test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o
 per_socket_stats_example-objs := $(LIBBPF) cookie_uid_helper_example.o
+xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -111,6 +113,7 @@ always += lwt_len_hist_kern.o
 always += xdp_tx_iptunnel_kern.o
 always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
+always += xdp_redirect_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -146,6 +149,7 @@ HOSTLOADLIBES_tc_l2_redirect += -l elf
 HOSTLOADLIBES_lwt_len_hist += -l elf
 HOSTLOADLIBES_xdp_tx_iptunnel += -lelf
 HOSTLOADLIBES_test_map_in_map += -lelf
+HOSTLOADLIBES_xdp_redirect += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp_redirect_kern.c b/samples/bpf/xdp_redirect_kern.c
new file mode 100644
index 0000000..a34ad45
--- /dev/null
+++ b/samples/bpf/xdp_redirect_kern.c
@@ -0,0 +1,81 @@
+/* Copyright (c) 2016 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") tx_port = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 1,
+};
+
+
+static void swap_src_dst_mac(void *data)
+{
+	unsigned short *p = data;
+	unsigned short dst[3];
+
+	dst[0] = p[0];
+	dst[1] = p[1];
+	dst[2] = p[2];
+	p[0] = p[3];
+	p[1] = p[4];
+	p[2] = p[5];
+	p[3] = dst[0];
+	p[4] = dst[1];
+	p[5] = dst[2];
+}
+
+SEC("xdp_redirect")
+int xdp_redirect_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	int *ifindex, port = 0;
+	long *value;
+	u32 key = 0;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	ifindex = bpf_map_lookup_elem(&tx_port, &port);
+	if (!ifindex)
+		return rc;
+
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	swap_src_dst_mac(data);
+	return bpf_redirect(*ifindex, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_user.c b/samples/bpf/xdp_redirect_user.c
new file mode 100644
index 0000000..761a91d
--- /dev/null
+++ b/samples/bpf/xdp_redirect_user.c
@@ -0,0 +1,102 @@
+/* Copyright (c) 2016 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+static int ifindex_in;
+static int ifindex_out;
+
+static void int_exit(int sig)
+{
+	set_link_xdp_fd(ifindex_in, -1, 0);
+	exit(0);
+}
+
+/* simple per-cpu packet counter; sums the rxcnt map and prints pkt/s
+ */
+static void poll_stats(int interval, int ifindex)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(map_fd[1], &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("ifindex %i: %10llu pkt/s\n",
+			       ifindex, sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	int ret, key = 0;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (ac != 3) {
+		printf("usage: %s IFINDEX_IN IFINDEX_OUT\n", argv[0]);
+		return 1;
+	}
+
+	ifindex_in = strtoul(argv[1], NULL, 0);
+	ifindex_out = strtoul(argv[2], NULL, 0);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+
+	if (set_link_xdp_fd(ifindex_in, prog_fd[0], 0) < 0) {
+		printf("link set xdp fd failed\n");
+		return 1;
+	}
+
+	/* bpf redirect port */
+	ret = bpf_map_update_elem(map_fd[0], &key, &ifindex_out, 0);
+	if (ret) {
+		perror("bpf_update_elem");
+		goto out;
+	}
+
+	poll_stats(2, ifindex_out);
+
+out:
+	return 0;
+}


* [RFC PATCH 05/12] net: implement XDP_REDIRECT for xdp generic
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
                   ` (3 preceding siblings ...)
  2017-07-07 17:35 ` [RFC PATCH 04/12] xdp: sample program for new bpf_redirect helper John Fastabend
@ 2017-07-07 17:36 ` John Fastabend
  2017-07-07 17:36 ` [RFC PATCH 06/12] ixgbe: add initial support for xdp redirect John Fastabend
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:36 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

Add support for XDP_REDIRECT to xdp generic, creating a fallback for
devices that do not yet have native support and allowing test
infrastructure using veth pairs to be built.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/filter.h |    1 +
 net/core/dev.c         |   22 ++++++++++++++++++++--
 net/core/filter.c      |   26 ++++++++++++++++++++++++++
 3 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d0a1279..ccb73d8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -668,6 +668,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb);
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp);
 
 void bpf_warn_invalid_xdp_action(u32 act);
diff --git a/net/core/dev.c b/net/core/dev.c
index 28b9939..2f7b930 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3902,6 +3902,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		__skb_push(skb, -off);
 
 	switch (act) {
+	case XDP_REDIRECT:
 	case XDP_TX:
 		__skb_push(skb, mac_len);
 		/* fall through */
@@ -3956,14 +3957,27 @@ static int do_xdp_generic(struct sk_buff *skb)
 
 	if (xdp_prog) {
 		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+		int err;
 
 		if (act != XDP_PASS) {
-			if (act == XDP_TX)
+			switch (act) {
+			case XDP_REDIRECT:
+				err = xdp_do_generic_redirect(skb->dev, skb);
+				if (err)
+					goto out_redir;
+			/* fallthru to submit skb */
+			case XDP_TX:
 				generic_xdp_tx(skb, xdp_prog);
+				break;
+			}
 			return XDP_DROP;
 		}
 	}
 	return XDP_PASS;
+out_redir:
+	trace_xdp_exception(skb->dev, xdp_prog, XDP_REDIRECT);
+	kfree_skb(skb);
+	return XDP_DROP;
 }
 
 static int netif_rx_internal(struct sk_buff *skb)
@@ -3977,8 +3991,12 @@ static int netif_rx_internal(struct sk_buff *skb)
 	if (static_key_false(&generic_xdp_needed)) {
 		int ret = do_xdp_generic(skb);
 
+		/* Consider XDP consuming the packet a success from
+		 * the netdev point of view; we do not want to count
+		 * this as an error.
+		 */
 		if (ret != XDP_PASS)
-			return NET_RX_DROP;
+			return NET_RX_SUCCESS;
 	}
 
 #ifdef CONFIG_RPS
diff --git a/net/core/filter.c b/net/core/filter.c
index 5c9fe3e..1fbe079 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2323,6 +2323,32 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	unsigned int len;
+
+	dev = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
+	ri->ifindex = 0;
+	if (unlikely(!dev)) {
+		bpf_warn_invalid_xdp_redirect(ri->ifindex);
+		goto err;
+	}
+
+	if (unlikely(!(dev->flags & IFF_UP)))
+		goto err;
+
+	len = dev->mtu + dev->hard_header_len + VLAN_HLEN;
+	if (skb->len > len)
+		goto err;
+
+	skb->dev = dev;
+	return 0;
+err:
+	return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(xdp_do_generic_redirect);
+
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);


* [RFC PATCH 06/12] ixgbe: add initial support for xdp redirect
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
                   ` (4 preceding siblings ...)
  2017-07-07 17:36 ` [RFC PATCH 05/12] net: implement XDP_REDIRECT for xdp generic John Fastabend
@ 2017-07-07 17:36 ` John Fastabend
  2017-07-07 17:36 ` [RFC PATCH 07/12] xdp: add trace event " John Fastabend
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:36 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

There are optimizations we can add after the basic feature is
enabled, but for now keep the patch simple.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   41 ++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f1dbdf2..3db0473 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2214,7 +2214,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
 				     struct xdp_buff *xdp)
 {
-	int result = IXGBE_XDP_PASS;
+	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
 	u32 act;
 
@@ -2231,6 +2231,13 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	case XDP_TX:
 		result = ixgbe_xmit_xdp_ring(adapter, xdp);
 		break;
+	case XDP_REDIRECT:
+		err = xdp_do_redirect(adapter->netdev, xdp);
+		if (!err)
+			result = IXGBE_XDP_TX;
+		else
+			result = IXGBE_XDP_CONSUMED;
+		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 		/* fallthrough */
@@ -9823,6 +9830,37 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	}
 }
 
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_ring *ring;
+	int err;
+
+	if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
+		return -EINVAL;
+
+	/* During program transitions it's possible adapter->xdp_prog is assigned
+	 * but ring has not been configured yet. In this case simply abort xmit.
+	 */
+	ring = adapter->xdp_prog ? adapter->xdp_ring[smp_processor_id()] : NULL;
+	if (unlikely(!ring))
+		return -EINVAL;
+
+	err = ixgbe_xmit_xdp_ring(adapter, xdp);
+	if (err != IXGBE_XDP_TX)
+		return -ENOMEM;
+
+	/* Force memory writes to complete before letting h/w know there
+	 * are new descriptors to fetch.
+	 */
+	wmb();
+
+	ring = adapter->xdp_ring[smp_processor_id()];
+	writel(ring->next_to_use, ring->tail);
+
+	return 0;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -9869,6 +9907,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_xdp *xdp)
 	.ndo_udp_tunnel_del	= ixgbe_del_udp_tunnel_port,
 	.ndo_features_check	= ixgbe_features_check,
 	.ndo_xdp		= ixgbe_xdp,
+	.ndo_xdp_xmit		= ixgbe_xdp_xmit,
 };
 
 /**


* [RFC PATCH 07/12] xdp: add trace event for xdp redirect
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
                   ` (5 preceding siblings ...)
  2017-07-07 17:36 ` [RFC PATCH 06/12] ixgbe: add initial support for xdp redirect John Fastabend
@ 2017-07-07 17:36 ` John Fastabend
  2017-07-07 17:37 ` [RFC PATCH 08/12] bpf: add devmap, a map for storing net device references John Fastabend
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:36 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

This adds a trace event for xdp redirect, which may help when
debugging XDP programs that use the redirect bpf commands.
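
Once the event is enabled (it lands under the xdp trace system, e.g.
events/xdp/xdp_redirect in tracefs), each redirect logs a line in the
TP_printk format added below, along the lines of:

  prog=3fa8552871f4c25a from=eth0 to=eth1 action=REDIRECT

(the prog tag and device names above are made up for illustration).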

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    2 +-
 include/linux/filter.h                        |    4 ++-
 include/trace/events/xdp.h                    |   31 ++++++++++++++++++++++++-
 net/core/filter.c                             |   13 +++++++---
 4 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3db0473..38f7ff9 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2232,7 +2232,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 		result = ixgbe_xmit_xdp_ring(adapter, xdp);
 		break;
 	case XDP_REDIRECT:
-		err = xdp_do_redirect(adapter->netdev, xdp);
+		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
 		if (!err)
 			result = IXGBE_XDP_TX;
 		else
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ccb73d8..997b9c9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -669,7 +669,9 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb);
-int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp);
+int xdp_do_redirect(struct net_device *dev,
+		    struct xdp_buff *xdp,
+		    struct bpf_prog *prog);
 
 void bpf_warn_invalid_xdp_action(u32 act);
 void bpf_warn_invalid_xdp_redirect(u32 ifindex);
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 1b61357..7b1eb7b 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -12,7 +12,8 @@
 	FN(ABORTED)		\
 	FN(DROP)		\
 	FN(PASS)		\
-	FN(TX)
+	FN(TX)			\
+	FN(REDIRECT)
 
 #define __XDP_ACT_TP_FN(x)	\
 	TRACE_DEFINE_ENUM(XDP_##x);
@@ -48,6 +49,34 @@
 		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB))
 );
 
+TRACE_EVENT(xdp_redirect,
+
+	TP_PROTO(const struct net_device *from,
+		 const struct net_device *to,
+		 const struct bpf_prog *xdp, u32 act),
+
+	TP_ARGS(from, to, xdp, act),
+
+	TP_STRUCT__entry(
+		__string(name_from, from->name)
+		__string(name_to, to->name)
+		__array(u8, prog_tag, 8)
+		__field(u32, act)
+	),
+
+	TP_fast_assign(
+		BUILD_BUG_ON(sizeof(__entry->prog_tag) != sizeof(xdp->tag));
+		memcpy(__entry->prog_tag, xdp->tag, sizeof(xdp->tag));
+		__assign_str(name_from, from->name);
+		__assign_str(name_to, to->name);
+		__entry->act = act;
+	),
+
+	TP_printk("prog=%s from=%s to=%s action=%s",
+		  __print_hex_str(__entry->prog_tag, 8),
+		  __get_str(name_from), __get_str(name_to),
+		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB))
+);
 #endif /* _TRACE_XDP_H */
 
 #include <trace/define_trace.h>
diff --git a/net/core/filter.c b/net/core/filter.c
index 1fbe079..441abbb 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -54,6 +54,7 @@
 #include <net/dst.h>
 #include <net/sock_reuseport.h>
 #include <net/busy_poll.h>
+#include <linux/bpf_trace.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2308,18 +2309,22 @@ static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
 	return -EOPNOTSUPP;
 }
 
-int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
+int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
+		    struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct net_device *fwd;
 
-	dev = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
+	fwd = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
 	ri->ifindex = 0;
-	if (unlikely(!dev)) {
+	if (unlikely(!fwd)) {
 		bpf_warn_invalid_xdp_redirect(ri->ifindex);
 		return -EINVAL;
 	}
 
-	return __bpf_tx_xdp(dev, xdp);
+	trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
+
+	return __bpf_tx_xdp(fwd, xdp);
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 


* [RFC PATCH 08/12] bpf: add devmap, a map for storing net device references
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
                   ` (6 preceding siblings ...)
  2017-07-07 17:36 ` [RFC PATCH 07/12] xdp: add trace event " John Fastabend
@ 2017-07-07 17:37 ` John Fastabend
  2017-07-08 18:57   ` Jesper Dangaard Brouer
  2017-07-07 17:37 ` [RFC PATCH 09/12] bpf: add bpf_redirect_map helper routine John Fastabend
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:37 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

Device map (devmap) is a BPF map, primarily useful for networking
applications, that uses a key to look up a reference to a netdevice.

The map provides a clean way for BPF programs to build virtual port
to physical port maps. Additionally, it provides a scoping function
for the redirect action itself allowing multiple optimizations. Future
patches will leverage the map to provide batching at the XDP layer.

Another optimization/feature, which is not yet implemented, would be
to support multiple netdevices per key for efficient multicast and
broadcast support.
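
From user space the devmap is managed like any other BPF map. A rough
sketch using the helpers the samples and selftests already use (error
handling elided; values illustrative):

  __u32 key = 0;     /* virtual port */
  __u32 ifindex = 4; /* egress netdevice */
  int fd;

  fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(key), sizeof(ifindex),
                      256, 0);
  /* bind virtual port 0 to the netdevice with this ifindex */
  bpf_map_update_elem(fd, &key, &ifindex, 0);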

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/bpf_types.h               |    3 
 include/uapi/linux/bpf.h                |    1 
 kernel/bpf/Makefile                     |    3 
 kernel/bpf/devmap.c                     |  264 +++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c                   |    8 +
 tools/testing/selftests/bpf/test_maps.c |   15 ++
 6 files changed, 294 insertions(+)
 create mode 100644 kernel/bpf/devmap.c

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 03bf223..b18056b 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -34,3 +34,6 @@
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
+#ifdef CONFIG_NET
+BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
+#endif
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e1f3827..0a48060 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -104,6 +104,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_LPM_TRIE,
 	BPF_MAP_TYPE_ARRAY_OF_MAPS,
 	BPF_MAP_TYPE_HASH_OF_MAPS,
+	BPF_MAP_TYPE_DEVMAP,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index e1e5e65..48e9270 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -2,6 +2,9 @@ obj-y := core.o
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o
+ifeq ($(CONFIG_NET),y)
+obj-$(CONFIG_BPF_SYSCALL) += devmap.o
+endif
 ifeq ($(CONFIG_PERF_EVENTS),y)
 obj-$(CONFIG_BPF_SYSCALL) += stackmap.o
 endif
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
new file mode 100644
index 0000000..1a87835
--- /dev/null
+++ b/kernel/bpf/devmap.c
@@ -0,0 +1,264 @@
+/* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+/* A devmap's primary use is as a backend map for the XDP BPF helper call
+ * bpf_redirect_map(). Because XDP is mostly concerned with performance we
+ * spent some effort to ensure the datapath with redirect maps does not use
+ * any locking. This is a quick note on the details.
+ *
+ * We have three possible paths to get into the devmap control plane: bpf
+ * syscalls, bpf programs, and driver-side xmit/flush operations. A bpf syscall
+ * will invoke an update, delete, or lookup operation. To ensure updates and
+ * deletes appear atomic from the datapath side, xchg() is used to modify the
+ * netdev_map array. Then, because the datapath does a lookup into the netdev_map
+ * array (read-only) from an RCU critical section, we use call_rcu() to wait for
+ * an rcu grace period before freeing the old data structures. This ensures the
+ * datapath always has a valid copy. However, the datapath does a "flush"
+ * operation that pushes any pending packets in the driver outside the RCU
+ * critical section. Each bpf_dtab_netdev tracks these pending operations using
+ * an atomic per-cpu bitmap. The bpf_dtab_netdev object will not be destroyed
+ * until all bits are cleared indicating outstanding flush operations have
+ * completed.
+ *
+ * BPF syscalls may race with BPF program calls on any of the update, delete
+ * or lookup operations. As noted above, the xchg() operation also keeps the
+ * netdev_map consistent in this case. From the devmap side BPF programs
+ * calling into these operations are the same as multiple user space threads
+ * making system calls.
+ */
+#include <linux/bpf.h>
+#include <linux/jhash.h>
+#include <linux/filter.h>
+#include <linux/rculist_nulls.h>
+#include "percpu_freelist.h"
+#include "bpf_lru_list.h"
+#include "map_in_map.h"
+
+struct bpf_dtab_netdev {
+	struct net_device *dev;
+	int key;
+	struct rcu_head rcu;
+	struct bpf_dtab *dtab;
+};
+
+struct bpf_dtab {
+	struct bpf_map map;
+	struct bpf_dtab_netdev **netdev_map;
+};
+
+static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
+{
+	struct bpf_dtab *dtab;
+	u64 cost;
+	int err;
+
+	/* check sanity of attributes */
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    attr->value_size != 4 || attr->map_flags)
+		return ERR_PTR(-EINVAL);
+
+	/* if value_size is bigger, the user space won't be able to
+	 * access the elements.
+	 */
+	if (attr->value_size > KMALLOC_MAX_SIZE)
+		return ERR_PTR(-E2BIG);
+
+	dtab = kzalloc(sizeof(*dtab), GFP_USER);
+	if (!dtab)
+		return ERR_PTR(-ENOMEM);
+
+	/* mandatory map attributes */
+	dtab->map.map_type = attr->map_type;
+	dtab->map.key_size = attr->key_size;
+	dtab->map.value_size = attr->value_size;
+	dtab->map.max_entries = attr->max_entries;
+	dtab->map.map_flags = attr->map_flags;
+
+	err = -ENOMEM;
+
+	/* make sure page count doesn't overflow */
+	cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_dtab;
+
+	dtab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	/* if map size is larger than memlock limit, reject it early */
+	err = bpf_map_precharge_memlock(dtab->map.pages);
+	if (err)
+		goto free_dtab;
+
+	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
+					      sizeof(struct bpf_dtab_netdev *));
+	if (!dtab->netdev_map)
+		goto free_dtab;
+
+	return &dtab->map;
+
+free_dtab:
+	kfree(dtab);
+	return ERR_PTR(err);
+}
+
+static void dev_map_free(struct bpf_map *map)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	int i;
+
+	/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. Wait for outstanding critical sections in
+	 * these programs to complete. The rcu critical section only guarantees
+	 * no further reads against netdev_map. It does __not__ ensure pending
+	 * flush operations (if any) are complete.
+	 */
+	synchronize_rcu();
+
+	for (i = 0; i < dtab->map.max_entries; i++) {
+		struct bpf_dtab_netdev *dev;
+
+		dev = dtab->netdev_map[i];
+		if (!dev)
+			continue;
+
+		dev_put(dev->dev);
+		kfree(dev);
+	}
+
+	/* At this point bpf program is detached and all pending operations
+	 * _must_ be complete
+	 */
+	bpf_map_area_free(dtab->netdev_map);
+	kfree(dtab);
+}
+
+static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = (u32 *)next_key;
+
+	if (index >= dtab->map.max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == dtab->map.max_entries - 1)
+		return -ENOENT;
+
+	*next = index + 1;
+	return 0;
+}
+
+/* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or
+ * update happens in parallel here, a dev_put() won't happen until after reading the
+ * ifindex.
+ */
+static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	struct bpf_dtab_netdev *dev;
+	u32 i = *(u32 *)key;
+
+	if (i >= map->max_entries)
+		return NULL;
+
+	dev = READ_ONCE(dtab->netdev_map[i]);
+	return dev ? &dev->dev->ifindex : NULL;
+}
+
+static void __dev_map_entry_free(struct rcu_head *rcu)
+{
+	struct bpf_dtab_netdev *old_dev;
+
+	old_dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
+	dev_put(old_dev->dev);
+	kfree(old_dev);
+}
+
+static int dev_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	struct bpf_dtab_netdev *old_dev;
+	int k = *(u32 *)key;
+
+	if (k >= map->max_entries)
+		return -EINVAL;
+
+	/* Use call_rcu() here to ensure any rcu critical sections
+	 * have completed, but this does not guarantee a flush has
+	 * happened yet, because the driver-side rcu_read_lock/unlock
+	 * only protects the running XDP program. For pending flush
+	 * operations the dev and ctx are stored in another per-cpu
+	 * map, and driver teardown ensures all soft irqs are complete
+	 * before the net device is removed once dev_put reaches zero.
+	 */
+	old_dev = xchg(&dtab->netdev_map[k], NULL);
+	if (old_dev)
+		call_rcu(&old_dev->rcu, __dev_map_entry_free);
+	return 0;
+}
+
+static int dev_map_update_elem(struct bpf_map *map, void *key, void *value,
+				u64 map_flags)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_dtab_netdev *dev, *old_dev;
+	u32 i = *(u32 *)key;
+	u32 ifindex = *(u32 *)value;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+
+	if (unlikely(i >= dtab->map.max_entries))
+		return -E2BIG;
+
+	if (unlikely(map_flags == BPF_NOEXIST))
+		return -EEXIST;
+
+	if (!ifindex) {
+		dev = NULL;
+	} else {
+		dev = kmalloc(sizeof(*dev), GFP_ATOMIC | __GFP_NOWARN);
+		if (!dev)
+			return -ENOMEM;
+
+		dev->dev = dev_get_by_index(net, ifindex);
+		if (!dev->dev) {
+			kfree(dev);
+			return -EINVAL;
+		}
+
+		dev->key = i;
+		dev->dtab = dtab;
+	}
+
+	/* Use call_rcu() here to ensure rcu critical sections have completed.
+	 * Remember that the driver-side flush operation will happen before the
+	 * net device is removed.
+	 */
+	old_dev = xchg(&dtab->netdev_map[i], dev);
+	if (old_dev)
+		call_rcu(&old_dev->rcu, __dev_map_entry_free);
+
+	return 0;
+}
+
+const struct bpf_map_ops dev_map_ops = {
+	.map_alloc = dev_map_alloc,
+	.map_free = dev_map_free,
+	.map_get_next_key = dev_map_get_next_key,
+	.map_lookup_elem = dev_map_lookup_elem,
+	.map_update_elem = dev_map_update_elem,
+	.map_delete_elem = dev_map_delete_elem,
+};
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 74ea96e..06073ba 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1294,6 +1294,14 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		    func_id != BPF_FUNC_current_task_under_cgroup)
 			goto error;
 		break;
+	/* devmap returns a pointer to a live net_device ifindex that we cannot
+	 * allow to be modified from the bpf side. So do not allow lookups of
+	 * elements for now.
+	 */
+	case BPF_MAP_TYPE_DEVMAP:
+		if (func_id == BPF_FUNC_map_lookup_elem)
+			goto error;
+		break;
 	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
 	case BPF_MAP_TYPE_HASH_OF_MAPS:
 		if (func_id != BPF_FUNC_map_lookup_elem)
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 79601c8..36d6ac3 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -438,6 +438,21 @@ static void test_arraymap_percpu_many_keys(void)
 	close(fd);
 }
 
+static void test_devmap(int task, void *data)
+{
+	int next_key, fd;
+	__u32 key, value;
+
+	fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(key), sizeof(value),
+			    2, 0);
+	if (fd < 0) {
+		printf("Failed to create devmap '%s'!\n", strerror(errno));
+		exit(1);
+	}
+
+	close(fd);
+}
+
 #define MAP_SIZE (32 * 1024)
 
 static void test_map_large(void)


* [RFC PATCH 09/12] bpf: add bpf_redirect_map helper routine
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect variants John Fastabend
                   ` (7 preceding siblings ...)
  2017-07-07 17:37 ` [RFC PATCH 08/12] bpf: add devmap, a map for storing net device references John Fastabend
@ 2017-07-07 17:37 ` John Fastabend
  2017-07-07 17:37 ` [RFC PATCH 10/12] xdp: Add batching support to redirect map John Fastabend
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:37 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

BPF programs can use the devmap with the bpf_redirect_map() helper
routine to forward packets to a netdevice in the map.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/bpf.h      |    3 +++
 include/uapi/linux/bpf.h |    8 ++++++-
 kernel/bpf/devmap.c      |   12 ++++++++++
 kernel/bpf/verifier.c    |    4 +++
 net/core/filter.c        |   54 ++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5175729..8c2f3e1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -374,4 +374,7 @@ static inline void __bpf_prog_uncharge(struct user_struct *user, u32 pages)
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+/* Map specifics */
+struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0a48060..b95f46d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -346,6 +346,11 @@ enum bpf_attach_type {
  *     @flags: bit 0 - if set, redirect to ingress instead of egress
  *             other bits - reserved
  *     Return: TC_ACT_REDIRECT
+ * int bpf_redirect_map(map, key, flags)
+ *     redirect to endpoint in map
+ *     @map: pointer to the devmap to do the lookup in
+ *     @key: index in map to lookup
+ *     @flags: reserved, must be 0
  *
  * u32 bpf_get_route_realm(skb)
  *     retrieve a dst's tclassid
@@ -569,7 +574,8 @@ enum bpf_attach_type {
 	FN(probe_read_str),		\
 	FN(get_socket_cookie),		\
 	FN(get_socket_uid),		\
-	FN(set_hash),
+	FN(set_hash),			\
+	FN(redirect_map),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 1a87835..36dc13de 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -159,6 +159,18 @@ static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	return 0;
 }
 
+struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	struct bpf_dtab_netdev *dev;
+
+	if (key >= map->max_entries)
+		return NULL;
+
+	dev = READ_ONCE(dtab->netdev_map[key]);
+	return dev ? dev->dev : NULL;
+}
+
 /* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or
  * update happens in parallel here a dev_put wont happen until after reading the
  * ifindex.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 06073ba..1d03956 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1330,6 +1330,10 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 		if (map->map_type != BPF_MAP_TYPE_CGROUP_ARRAY)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP)
+			goto error;
+		break;
 	default:
 		break;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 441abbb..482bda8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1778,6 +1778,7 @@ static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev,
 struct redirect_info {
 	u32 ifindex;
 	u32 flags;
+	struct bpf_map *map;
 };
 
 static DEFINE_PER_CPU(struct redirect_info, redirect_info);
@@ -1791,6 +1792,7 @@ struct redirect_info {
 
 	ri->ifindex = ifindex;
 	ri->flags = flags;
+	ri->map = NULL;
 
 	return TC_ACT_REDIRECT;
 }
@@ -1818,6 +1820,29 @@ int skb_do_redirect(struct sk_buff *skb)
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_redirect_map, struct bpf_map *, map, u32, ifindex, u64, flags)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+
+	if (unlikely(flags))
+		return XDP_ABORTED;
+
+	ri->ifindex = ifindex;
+	ri->flags = flags;
+	ri->map = map;
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_redirect_map_proto = {
+	.func           = bpf_redirect_map,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb)
 {
 	return task_get_classid(skb);
@@ -2309,14 +2334,41 @@ static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
 	return -EOPNOTSUPP;
 }
 
+int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
+			struct bpf_prog *xdp_prog)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_map *map = ri->map;
+	struct net_device *fwd;
+
+	fwd = __dev_map_lookup_elem(map, ri->ifindex);
+	if (!fwd)
+		goto out;
+
+	ri->ifindex = 0;
+	ri->map = NULL;
+
+	trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
+
+	return __bpf_tx_xdp(fwd, xdp);
+out:
+	ri->ifindex = 0;
+	ri->map = NULL;
+	return -EINVAL;
+}
+
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	struct net_device *fwd;
 
+	if (ri->map)
+		return xdp_do_redirect_map(dev, xdp, xdp_prog);
+
 	fwd = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
 	ri->ifindex = 0;
+	ri->map = NULL;
 	if (unlikely(!fwd)) {
 		bpf_warn_invalid_xdp_redirect(ri->ifindex);
 		return -EINVAL;
@@ -2868,6 +2920,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 		return &bpf_xdp_adjust_head_proto;
 	case BPF_FUNC_redirect:
 		return &bpf_xdp_redirect_proto;
+	case BPF_FUNC_redirect_map:
+		return &bpf_redirect_map_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 10/12] xdp: Add batching support to redirect map
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
                   ` (8 preceding siblings ...)
  2017-07-07 17:37 ` [RFC PATCH 09/12] bpf: add bpf_redirect_map helper routine John Fastabend
@ 2017-07-07 17:37 ` John Fastabend
  2017-07-10 17:53   ` Jesper Dangaard Brouer
  2017-07-07 17:38 ` [RFC PATCH 11/12] net: add notifier hooks for devmap bpf map John Fastabend
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:37 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

For performance reasons we want to avoid updating the tail pointer in
the driver tx ring as much as possible. To accomplish this we add
batching support to the redirect path in XDP.

This adds another ndo op "xdp_flush" that is used to inform the driver
that it should bump the tail pointer on the TX ring.
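
The intended split between the two hooks looks roughly like the sketch
below (the example_* names are hypothetical; the real implementation
for ixgbe is in the diff that follows):

static int example_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
{
	struct example_ring *ring = example_xdp_ring(dev, smp_processor_id());

	/* Queue the descriptor only; the expensive tail-register MMIO
	 * write is deferred to ndo_xdp_flush so a NAPI poll can batch
	 * many xmits into a single tail bump.
	 */
	return example_queue_xdp_desc(ring, xdp);
}

static void example_xdp_flush(struct net_device *dev)
{
	struct example_ring *ring = example_xdp_ring(dev, smp_processor_id());

	/* Force memory writes to complete before letting h/w know there
	 * are new descriptors to fetch.
	 */
	wmb();
	writel(ring->next_to_use, ring->tail);
}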

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   25 +++++++
 include/linux/bpf.h                           |    2 +
 include/linux/filter.h                        |    7 ++
 include/linux/netdevice.h                     |    5 +
 kernel/bpf/devmap.c                           |   84 +++++++++++++++++++++++++
 net/core/filter.c                             |   52 +++++++++++++--
 6 files changed, 161 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 38f7ff9..c093a31 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2415,6 +2415,8 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		 */
 		wmb();
 		writel(ring->next_to_use, ring->tail);
+
+		xdp_do_flush_map();
 	}
 
 	u64_stats_update_begin(&rx_ring->syncp);
@@ -9850,15 +9852,31 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (err != IXGBE_XDP_TX)
 		return -ENOMEM;
 
+	return 0;
+}
+
+static void ixgbe_xdp_flush(struct net_device *dev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_ring *ring;
+
+	/* It's possible the device went down between the xdp xmit and the
+	 * flush, so we need to ensure the device is still up.
+	 */
+	if (unlikely(test_bit(__IXGBE_DOWN, &adapter->state)))
+		return;
+
+	ring = adapter->xdp_prog ? adapter->xdp_ring[smp_processor_id()] : NULL;
+	if (unlikely(!ring))
+		return;
+
 	/* Force memory writes to complete before letting h/w know there
 	 * are new descriptors to fetch.
 	 */
 	wmb();
-
-	ring = adapter->xdp_ring[smp_processor_id()];
 	writel(ring->next_to_use, ring->tail);
 
-	return 0;
+	return;
 }
 
 static const struct net_device_ops ixgbe_netdev_ops = {
@@ -9908,6 +9926,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	.ndo_features_check	= ixgbe_features_check,
 	.ndo_xdp		= ixgbe_xdp,
 	.ndo_xdp_xmit		= ixgbe_xdp_xmit,
+	.ndo_xdp_flush		= ixgbe_xdp_flush,
 };
 
 /**
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8c2f3e1..6483eaf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -376,5 +376,7 @@ static inline void __bpf_prog_uncharge(struct user_struct *user, u32 pages)
 
 /* Map specifics */
 struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key);
+void __dev_map_insert_ctx(struct bpf_map *map, u32 index);
+void __dev_map_flush(struct bpf_map *map);
 
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 997b9c9..1cb8b1b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -668,10 +668,17 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+/* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
+ * same CPU context. Further, for best results no more than a single map
+ * for the do_redirect/do_flush pair should be used. This limitation is
+ * because we only track one map and force a flush when the map changes.
+ * This does not appear to be a real limitation for existing software.
+ */
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb);
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
+void xdp_do_flush_map(void);
 
 void bpf_warn_invalid_xdp_action(u32 act);
 void bpf_warn_invalid_xdp_redirect(u32 ifindex);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49e8c12..400c8df 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1142,7 +1142,9 @@ struct xfrmdev_ops {
  * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
- *
+ * void (*ndo_xdp_flush)(struct net_device *dev);
+ *	This function is used to inform the driver to flush a particular
+ *	xdp TX queue. Must be called on the same CPU as ndo_xdp_xmit.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1329,6 +1331,7 @@ struct net_device_ops {
 					   struct netdev_xdp *xdp);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
 						struct xdp_buff *xdp);
+	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
 /**
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 36dc13de..656e334 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -53,6 +53,7 @@ struct bpf_dtab_netdev {
 struct bpf_dtab {
 	struct bpf_map map;
 	struct bpf_dtab_netdev **netdev_map;
+	unsigned long int __percpu *flush_needed;
 };
 
 static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
@@ -87,6 +88,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 
 	/* make sure page count doesn't overflow */
 	cost = (u64) dtab->map.max_entries * sizeof(struct bpf_dtab_netdev *);
+	cost += BITS_TO_LONGS(attr->max_entries) * sizeof(unsigned long);
 	if (cost >= U32_MAX - PAGE_SIZE)
 		goto free_dtab;
 
@@ -97,6 +99,14 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (err)
 		goto free_dtab;
 
+	/* A per cpu bitfield with a bit per possible net device */
+	dtab->flush_needed = __alloc_percpu(
+				BITS_TO_LONGS(attr->max_entries) *
+				sizeof(unsigned long),
+				__alignof__(unsigned long));
+	if (!dtab->flush_needed)
+		goto free_dtab;
+
 	dtab->netdev_map = bpf_map_area_alloc(dtab->map.max_entries *
 					      sizeof(struct bpf_dtab_netdev *));
 	if (!dtab->netdev_map)
@@ -105,6 +115,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	return &dtab->map;
 
 free_dtab:
+	free_percpu(dtab->flush_needed);
 	kfree(dtab);
 	return ERR_PTR(err);
 }
@@ -112,7 +123,7 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 static void dev_map_free(struct bpf_map *map)
 {
 	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
-	int i;
+	int i, cpu;
 
 	/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
 	 * so the programs (can be more than one that used this map) were
@@ -123,6 +134,18 @@ static void dev_map_free(struct bpf_map *map)
 	 */
 	synchronize_rcu();
 
+	/* To ensure all pending flush operations have completed, wait for the
+	 * flush_needed bitmap to read all-zero on _all_ cpus. Because the
+	 * above synchronize_rcu() ensures the map is disconnected from the
+	 * program, we can assume no new bits will be set.
+	 */
+	for_each_online_cpu(cpu) {
+		unsigned long *bitmap = per_cpu_ptr(dtab->flush_needed, cpu);
+
+		while (!bitmap_empty(bitmap, dtab->map.max_entries))
+			cpu_relax();
+	}
+
 	for (i = 0; i < dtab->map.max_entries; i++) {
 		struct bpf_dtab_netdev *dev;
 
@@ -137,6 +160,7 @@ static void dev_map_free(struct bpf_map *map)
 	/* At this point bpf program is detached and all pending operations
 	 * _must_ be complete
 	 */
+	free_percpu(dtab->flush_needed);
 	bpf_map_area_free(dtab->netdev_map);
 	kfree(dtab);
 }
@@ -159,6 +183,14 @@ static int dev_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	return 0;
 }
 
+void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
+
+	set_bit(key, bitmap);
+}
+
 struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 {
 	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
@@ -171,6 +203,39 @@ struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
 	return dev ? dev->dev : NULL;
 }
 
+/* __dev_map_flush is called from xdp_do_flush_map() which _must_ be called
+ * from the driver before returning from its napi->poll() routine. The poll()
+ * routine is called either from busy_poll context or net_rx_action signaled
+ * from NET_RX_SOFTIRQ. Either way the poll routine must complete before the
+ * net device can be torn down. On devmap tear down we ensure the ctx bitmap
+ * is zeroed before completing to ensure all flush operations have completed.
+ */
+void __dev_map_flush(struct bpf_map *map)
+{
+	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
+	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
+	u32 bit;
+
+	for_each_set_bit(bit, bitmap, map->max_entries) {
+		struct bpf_dtab_netdev *dev = READ_ONCE(dtab->netdev_map[bit]);
+		struct net_device *netdev;
+
+		/* This is possible if the dev entry is removed by user space
+		 * between xdp redirect and flush op.
+		 */
+		if (unlikely(!dev))
+			continue;
+
+		netdev = dev->dev;
+
+		clear_bit(bit, bitmap);
+		if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush))
+			continue;
+
+		netdev->netdev_ops->ndo_xdp_flush(netdev);
+	}
+}
+
 /* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or
  * update happens in parallel here a dev_put wont happen until after reading the
  * ifindex.
@@ -188,11 +253,28 @@ static void *dev_map_lookup_elem(struct bpf_map *map, void *key)
 	return dev ? &dev->dev->ifindex : NULL;
 }
 
+static void dev_map_flush_old(struct bpf_dtab_netdev *old_dev)
+{
+	if (old_dev->dev->netdev_ops->ndo_xdp_flush) {
+		struct net_device *fl = old_dev->dev;
+		unsigned long *bitmap;
+		int cpu;
+
+		for_each_online_cpu(cpu) {
+			bitmap = per_cpu_ptr(old_dev->dtab->flush_needed, cpu);
+			clear_bit(old_dev->key, bitmap);
+
+			fl->netdev_ops->ndo_xdp_flush(old_dev->dev);
+		}
+	}
+}
+
 static void __dev_map_entry_free(struct rcu_head *rcu)
 {
 	struct bpf_dtab_netdev *old_dev;
 
 	old_dev = container_of(rcu, struct bpf_dtab_netdev, rcu);
+	dev_map_flush_old(old_dev);
 	dev_put(old_dev->dev);
 	kfree(old_dev);
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index 482bda8..8b66955 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1779,6 +1779,7 @@ struct redirect_info {
 	u32 ifindex;
 	u32 flags;
 	struct bpf_map *map;
+	struct bpf_map *map_to_flush;
 };
 
 static DEFINE_PER_CPU(struct redirect_info, redirect_info);
@@ -2324,33 +2325,66 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
 	.arg2_type	= ARG_ANYTHING,
 };
 
-static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
+static int __bpf_tx_xdp(struct net_device *dev,
+			struct bpf_map *map,
+			struct xdp_buff *xdp,
+			u32 index)
 {
-	if (dev->netdev_ops->ndo_xdp_xmit) {
-		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
-		return 0;
+	int err;
+
+	if (!dev->netdev_ops->ndo_xdp_xmit) {
+		bpf_warn_invalid_xdp_redirect(dev->ifindex);
+		return -EOPNOTSUPP;
 	}
-	bpf_warn_invalid_xdp_redirect(dev->ifindex);
-	return -EOPNOTSUPP;
+
+	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+	if (err)
+		return err;
+
+	if (map)
+		__dev_map_insert_ctx(map, index);
+	else
+		dev->netdev_ops->ndo_xdp_flush(dev);
+
+	return err;
 }
 
+void xdp_do_flush_map(void)
+{
+	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_map *map = ri->map_to_flush;
+
+	ri->map = NULL;
+	ri->map_to_flush = NULL;
+
+	if (map)
+		__dev_map_flush(map);
+}
+EXPORT_SYMBOL_GPL(xdp_do_flush_map);
+
 int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 			struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	struct bpf_map *map = ri->map;
+	u32 index = ri->ifindex;
 	struct net_device *fwd;
 
-	fwd = __dev_map_lookup_elem(map, ri->ifindex);
+	fwd = __dev_map_lookup_elem(map, index);
 	if (!fwd)
 		goto out;
 
 	ri->ifindex = 0;
+
+	if (ri->map_to_flush && (ri->map_to_flush != map))
+		xdp_do_flush_map();
+
+	ri->map_to_flush = map;
 	ri->map = NULL;
 
 	trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
 
-	return __bpf_tx_xdp(fwd, xdp);
+	return __bpf_tx_xdp(fwd, map, xdp, index);
 out:
 	ri->ifindex = 0;
 	ri->map = NULL;
@@ -2376,7 +2410,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 	trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
 
-	return __bpf_tx_xdp(fwd, xdp);
+	return __bpf_tx_xdp(fwd, NULL, xdp, 0);
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 11/12] net: add notifier hooks for devmap bpf map
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
                   ` (9 preceding siblings ...)
  2017-07-07 17:37 ` [RFC PATCH 10/12] xdp: Add batching support to redirect map John Fastabend
@ 2017-07-07 17:38 ` John Fastabend
  2017-07-07 17:38 ` [RFC PATCH 12/12] xdp: bpf redirect with map sample program John Fastabend
  2017-07-07 17:48 ` [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:38 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

The BPF map devmap holds a refcnt on the net_device structure when
it is in the map. We need to do this so that the net_device cannot
be freed while the map still references it.

However, it's not very convenient to have to manually remove entries
from the map when destroying a net device, so add notifier handlers to
do the cleanup automatically. But this creates a race between map
update/destroy operations (from the BPF syscall and from programs) and
the netdev unregister hook.

Unfortunately, the best I could come up with is either to live with
requiring manual removal of net devices from the map before unregistering
the net device, OR to add a mutex in devmap to ensure the map is not
modified while we are removing a device. The fallout is that BPF
programs can no longer update/delete the map from the BPF program
side, because the mutex may sleep and this cannot be done from inside
an rcu critical section. This is not a real problem though, because I
have not come up with any use case where updating the map from a
program is actually useful in practice. If/when we come up with a
compelling use case we may need to revisit this.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 kernel/bpf/devmap.c   |   73 +++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c |    4 ++-
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 656e334..1191060 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -34,6 +34,17 @@
  * netdev_map consistent in this case. From the devmap side BPF programs
  * calling into these operations are the same as multiple user space threads
  * making system calls.
+ *
+ * Finally, any of the above may race with a netdev_unregister notifier. The
+ * unregister notifier must search the map for entries that reference the
+ * net device being removed and delete them. This is a two step process:
+ * (a) dereference the bpf_dtab_netdev object in netdev_map and (b) check
+ * whether its ifindex is the same as that of the net_device being removed.
+ * Unfortunately, the xchg() operations alone do not protect against this
+ * race. To avoid potentially removing incorrect objects the
+ * dev_map_list_mutex serializes conflicting netdev unregister and BPF
+ * syscall operations. Updates and deletes from a BPF program (done in an
+ * rcu critical section) are blocked because of this mutex.
  */
 #include <linux/bpf.h>
 #include <linux/jhash.h>
@@ -54,8 +65,12 @@ struct bpf_dtab {
 	struct bpf_map map;
 	struct bpf_dtab_netdev **netdev_map;
 	unsigned long int __percpu *flush_needed;
+	struct list_head list;
 };
 
+static DEFINE_MUTEX(dev_map_list_mutex);
+static LIST_HEAD(dev_map_list);
+
 static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 {
 	struct bpf_dtab *dtab;
@@ -112,6 +127,9 @@ static struct bpf_map *dev_map_alloc(union bpf_attr *attr)
 	if (!dtab->netdev_map)
 		goto free_dtab;
 
+	mutex_lock(&dev_map_list_mutex);
+	list_add_tail(&dtab->list, &dev_map_list);
+	mutex_unlock(&dev_map_list_mutex);
 	return &dtab->map;
 
 free_dtab:
@@ -146,6 +164,11 @@ static void dev_map_free(struct bpf_map *map)
 			cpu_relax();
 	}
 
+	/* Although we should no longer have datapath or bpf syscall operations
+	 * at this point, we can still race with the netdev notifier, hence
+	 * the lock.
+	 */
+	mutex_lock(&dev_map_list_mutex);
 	for (i = 0; i < dtab->map.max_entries; i++) {
 		struct bpf_dtab_netdev *dev;
 
@@ -160,6 +183,8 @@ static void dev_map_free(struct bpf_map *map)
 	/* At this point bpf program is detached and all pending operations
 	 * _must_ be complete
 	 */
+	list_del(&dtab->list);
+	mutex_unlock(&dev_map_list_mutex);
 	free_percpu(dtab->flush_needed);
 	bpf_map_area_free(dtab->netdev_map);
 	kfree(dtab);
@@ -296,9 +321,11 @@ static int dev_map_delete_elem(struct bpf_map *map, void *key)
 	 * the driver tear down ensures all soft irqs are complete before
 	 * removing the net device in the case of dev_put equals zero.
 	 */
+	mutex_lock(&dev_map_list_mutex);
 	old_dev = xchg(&dtab->netdev_map[k], NULL);
 	if (old_dev)
 		call_rcu(&old_dev->rcu, __dev_map_entry_free);
+	mutex_unlock(&dev_map_list_mutex);
 	return 0;
 }
 
@@ -341,9 +368,11 @@ static int dev_map_update_elem(struct bpf_map *map, void *key, void *value,
 	 * Remembering the driver side flush operation will happen before the
 	 * net device is removed.
 	 */
+	mutex_lock(&dev_map_list_mutex);
 	old_dev = xchg(&dtab->netdev_map[i], dev);
 	if (old_dev)
 		call_rcu(&old_dev->rcu, __dev_map_entry_free);
+	mutex_unlock(&dev_map_list_mutex);
 
 	return 0;
 }
@@ -356,3 +385,47 @@ static int dev_map_update_elem(struct bpf_map *map, void *key, void *value,
 	.map_update_elem = dev_map_update_elem,
 	.map_delete_elem = dev_map_delete_elem,
 };
+
+static int dev_map_notification(struct notifier_block *notifier,
+				ulong event, void *ptr)
+{
+	struct net_device *netdev = netdev_notifier_info_to_dev(ptr);
+	struct bpf_dtab *dtab;
+	int i;
+
+	switch (event) {
+	case NETDEV_UNREGISTER:
+		mutex_lock(&dev_map_list_mutex);
+		list_for_each_entry(dtab, &dev_map_list, list) {
+			for (i = 0; i < dtab->map.max_entries; i++) {
+				struct bpf_dtab_netdev *dev;
+
+				dev = dtab->netdev_map[i];
+				if (!dev ||
+				    dev->dev->ifindex != netdev->ifindex)
+					continue;
+				dev = xchg(&dtab->netdev_map[i], NULL);
+				if (dev)
+					call_rcu(&dev->rcu,
+						 __dev_map_entry_free);
+			}
+		}
+		mutex_unlock(&dev_map_list_mutex);
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block dev_map_notifier = {
+	.notifier_call = dev_map_notification,
+};
+
+static int __init dev_map_init(void)
+{
+	register_netdevice_notifier(&dev_map_notifier);
+	return 0;
+}
+
+subsys_initcall(dev_map_init);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d03956..8981704 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1299,7 +1299,9 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
 	 * for now.
 	 */
 	case BPF_MAP_TYPE_DEVMAP:
-		if (func_id == BPF_FUNC_map_lookup_elem)
+		if (func_id == BPF_FUNC_map_lookup_elem ||
+		    func_id == BPF_FUNC_map_update_elem ||
+		    func_id == BPF_FUNC_map_delete_elem)
 			goto error;
 		break;
 	case BPF_MAP_TYPE_ARRAY_OF_MAPS:

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 12/12] xdp: bpf redirect with map sample program
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
                   ` (10 preceding siblings ...)
  2017-07-07 17:38 ` [RFC PATCH 11/12] net: add notifier hooks for devmap bpf map John Fastabend
@ 2017-07-07 17:38 ` John Fastabend
  2017-07-07 17:48 ` [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
  12 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:38 UTC (permalink / raw)
  To: netdev, davem; +Cc: brouer, john.fastabend, andy, daniel, ast

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Tested-by: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
---
 samples/bpf/Makefile                |    4 +
 samples/bpf/bpf_helpers.h           |    2 +
 samples/bpf/xdp_redirect_map_kern.c |   83 ++++++++++++++++++++++++++++
 samples/bpf/xdp_redirect_map_user.c |  105 +++++++++++++++++++++++++++++++++++
 4 files changed, 194 insertions(+)
 create mode 100644 samples/bpf/xdp_redirect_map_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 99a5084..4615789 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ hostprogs-y += xdp_tx_iptunnel
 hostprogs-y += test_map_in_map
 hostprogs-y += per_socket_stats_example
 hostprogs-y += xdp_redirect
+hostprogs-y += xdp_redirect_map
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o
@@ -78,6 +79,7 @@ xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
 test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o
 per_socket_stats_example-objs := $(LIBBPF) cookie_uid_helper_example.o
 xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
+xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -114,6 +116,7 @@ always += xdp_tx_iptunnel_kern.o
 always += test_map_in_map_kern.o
 always += cookie_uid_helper_example.o
 always += xdp_redirect_kern.o
+always += xdp_redirect_map_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -150,6 +153,7 @@ HOSTLOADLIBES_lwt_len_hist += -l elf
 HOSTLOADLIBES_xdp_tx_iptunnel += -lelf
 HOSTLOADLIBES_test_map_in_map += -lelf
 HOSTLOADLIBES_xdp_redirect += -lelf
+HOSTLOADLIBES_xdp_redirect_map += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index f4840b8..6afdfd4 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -38,6 +38,8 @@ static int (*bpf_clone_redirect)(void *ctx, int ifindex, int flags) =
 	(void *) BPF_FUNC_clone_redirect;
 static int (*bpf_redirect)(int ifindex, int flags) =
 	(void *) BPF_FUNC_redirect;
+static int (*bpf_redirect_map)(void *map, int key, int flags) =
+	(void *) BPF_FUNC_redirect_map;
 static int (*bpf_perf_event_output)(void *ctx, void *map,
 				    unsigned long long flags, void *data,
 				    int size) =
diff --git a/samples/bpf/xdp_redirect_map_kern.c b/samples/bpf/xdp_redirect_map_kern.c
new file mode 100644
index 0000000..2faf196
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_kern.c
@@ -0,0 +1,83 @@
+/* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") tx_port = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 100,
+};
+
+struct bpf_map_def SEC("maps") rxcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 1,
+};
+
+
+static void swap_src_dst_mac(void *data)
+{
+	unsigned short *p = data;
+	unsigned short dst[3];
+
+	dst[0] = p[0];
+	dst[1] = p[1];
+	dst[2] = p[2];
+	p[0] = p[3];
+	p[1] = p[4];
+	p[2] = p[5];
+	p[3] = dst[0];
+	p[4] = dst[1];
+	p[5] = dst[2];
+}
+
+SEC("xdp_redirect_map")
+int xdp_redirect_map_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	int vport, port = 0, m = 0;
+	long *value;
+	u32 key = 0;
+	u64 nh_off;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	/* constant virtual port */
+	vport = 0;
+
+	/* count packet in global counter */
+	value = bpf_map_lookup_elem(&rxcnt, &key);
+	if (value)
+		*value += 1;
+
+	swap_src_dst_mac(data);
+
+	/* send packet out physical port */
+	return bpf_redirect_map(&tx_port, vport, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c
new file mode 100644
index 0000000..0b8009a
--- /dev/null
+++ b/samples/bpf/xdp_redirect_map_user.c
@@ -0,0 +1,105 @@
+/* Copyright (c) 2017 Covalent IO, Inc. http://covalent.io
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+#include <linux/bpf.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+static int ifindex_in;
+static int ifindex_out;
+
+static void int_exit(int sig)
+{
+	set_link_xdp_fd(ifindex_in, -1, 0);
+	exit(0);
+}
+
+/* simple packet counter: sums the per-cpu rxcnt map and prints pkt/s
+ */
+static void poll_stats(int interval, int ifindex)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 values[nr_cpus], prev[nr_cpus];
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		__u64 sum = 0;
+		__u32 key = 0;
+		int i;
+
+		sleep(interval);
+		assert(bpf_map_lookup_elem(map_fd[1], &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += (values[i] - prev[i]);
+		if (sum)
+			printf("ifindex %i: %10llu pkt/s\n",
+			       ifindex, sum / interval);
+		memcpy(prev, values, sizeof(values));
+	}
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	int ret, key = 0;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (ac != 3) {
+		printf("usage: %s IFINDEX_IN IFINDEX_OUT\n", argv[0]);
+		return 1;
+	}
+
+	ifindex_in = strtoul(argv[1], NULL, 0);
+	ifindex_out = strtoul(argv[2], NULL, 0);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+
+	if (set_link_xdp_fd(ifindex_in, prog_fd[0], 0) < 0) {
+		printf("link set xdp fd failed\n");
+		return 1;
+	}
+
+	printf("map[0] (vports) = %i, map[1] (map) = %i, map[2] (count) = %i\n",
+		map_fd[0], map_fd[1], map_fd[2]);
+
+	/* populate virtual to physical port map */
+	ret = bpf_map_update_elem(map_fd[0], &key, &ifindex_out, 0);
+	if (ret) {
+		perror("bpf_update_elem");
+		goto out;
+	}
+
+	poll_stats(2, ifindex_out);
+
+out:
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
                   ` (11 preceding siblings ...)
  2017-07-07 17:38 ` [RFC PATCH 12/12] xdp: bpf redirect with map sample program John Fastabend
@ 2017-07-07 17:48 ` John Fastabend
  2017-07-08  9:46   ` David Miller
  12 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-07 17:48 UTC (permalink / raw)
  To: netdev, davem
  Cc: brouer, andy, daniel, ast, Alexander Duyck, bjorn.topel,
	Jakub Kicinski, ecree, sgoutham, Yuval.Mintz, saeedm

On 07/07/2017 10:34 AM, John Fastabend wrote:
> This series adds two new XDP helper routines bpf_redirect() and
> bpf_redirect_map(). The first variant bpf_redirect() is meant
> to be used the same way it is currently being used by the cls_bpf
> classifier. An xdp packet will be redirected immediately when this
> is called.

Also other than the typo in the title there ;) I'm going to CC
the driver maintainers working on XDP (makes for a long CC list but)
because we would want to try and get support in as many as possible in
the next merge window.

For this rev I just implemented on ixgbe because I wrote the
original XDP support there. I'll volunteer to do virtio as well.

Thanks,
John

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-07 17:48 ` [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
@ 2017-07-08  9:46   ` David Miller
  2017-07-08 19:06     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: David Miller @ 2017-07-08  9:46 UTC (permalink / raw)
  To: john.fastabend
  Cc: netdev, brouer, andy, daniel, ast, alexander.duyck, bjorn.topel,
	jakub.kicinski, ecree, sgoutham, Yuval.Mintz, saeedm

From: John Fastabend <john.fastabend@gmail.com>
Date: Fri, 07 Jul 2017 10:48:36 -0700

> On 07/07/2017 10:34 AM, John Fastabend wrote:
>> This series adds two new XDP helper routines bpf_redirect() and
>> bpf_redirect_map(). The first variant bpf_redirect() is meant
>> to be used the same way it is currently being used by the cls_bpf
>> classifier. An xdp packet will be redirected immediately when this
>> is called.
> 
> Also other than the typo in the title there ;) I'm going to CC
> the driver maintainers working on XDP (makes for a long CC list but)
> because we would want to try and get support in as many as possible in
> the next merge window.
> 
> For this rev I just implemented on ixgbe because I wrote the
> original XDP support there. I'll volunteer to do virtio as well.

I went over this series a few times and it looks great to me.
You didn't even give me some coding style issues to pick on :-)

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 08/12] bpf: add devmap, a map for storing net device references
  2017-07-07 17:37 ` [RFC PATCH 08/12] bpf: add devmap, a map for storing net device references John Fastabend
@ 2017-07-08 18:57   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-08 18:57 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, davem, andy, daniel, ast, brouer

On Fri, 07 Jul 2017 10:37:12 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Device map (devmap) is a BPF map, primarily useful for networking
> applications, that uses a key to lookup a reference to a netdevice.
> 
> The map provides a clean way for BPF programs to build virtual port
> to physical port maps. Additionally, it provides a scoping function
> for the redirect action itself allowing multiple optimizations. Future
> patches will leverage the map to provide batching at the XDP layer.
> 
> Another optimization/feature, that is not yet implemented, would be
> to support multiple netdevices per key to support efficient multicast
> and broadcast support.

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 74ea96e..06073ba 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1294,6 +1294,14 @@ static int check_map_func_compatibility(struct bpf_map *map, int func_id)
>  		    func_id != BPF_FUNC_current_task_under_cgroup)
>  			goto error;
>  		break;
> +	/* devmap returns a pointer to a live net_device ifindex that we cannot
> +	 * allow to be modified from bpf side. So do not allow lookup elemnts
                                                                      ^^^^^^^
Spelling of elements

> +	 * for now.
> +	 */
> +	case BPF_MAP_TYPE_DEVMAP:
> +		if (func_id == BPF_FUNC_map_lookup_elem)
> +			goto error;
> +		break;

Reviewers, note this limitation from the bpf_prog side.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-08  9:46   ` David Miller
@ 2017-07-08 19:06     ` Jesper Dangaard Brouer
  2017-07-10 18:30       ` Jesper Dangaard Brouer
  2017-07-11 15:36       ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-08 19:06 UTC (permalink / raw)
  To: David Miller
  Cc: john.fastabend, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Sat, 08 Jul 2017 10:46:18 +0100 (WEST)
David Miller <davem@davemloft.net> wrote:

> From: John Fastabend <john.fastabend@gmail.com>
> Date: Fri, 07 Jul 2017 10:48:36 -0700
> 
> > On 07/07/2017 10:34 AM, John Fastabend wrote:  
> >> This series adds two new XDP helper routines bpf_redirect() and
> >> bpf_redirect_map(). The first variant bpf_redirect() is meant
> >> to be used the same way it is currently being used by the cls_bpf
> >> classifier. An xdp packet will be redirected immediately when this
> >> is called.  
> > 
> > Also other than the typo in the title there ;) I'm going to CC
> > the driver maintainers working on XDP (makes for a long CC list but)
> > because we would want to try and get support in as many as possible in
> > the next merge window.
> > 
> > For this rev I just implemented on ixgbe because I wrote the
> > original XDP support there. I'll volunteer to do virtio as well.  
> 
> I went over this series a few times and it looks great to me.
> You didn't even give me some coding style issues to pick on :-)

We (Daniel, Andy and I) have been reviewing and improving on this
patchset the last couple of weeks ;-).  We had some stability issues,
which is why it wasn't published earlier. My plan is to test this
latest patchset again, Monday and Tuesday. I'll try to assess stability
and provide some performance numbers.

I've complained/warned about the danger of redirecting with XDP
without providing (1) a way to debug/see XDP redirects, and (2) a way
for interfaces to opt in to being redirect targets. (1) is solved by
patch 07/12 via a tracepoint. (2) is currently done by only forwarding
to interfaces that have an XDP program loaded themselves; this also
comes from a practical need for NIC drivers to allocate XDP-TX HW
queues.

I'm not satisfied with the (UAPI) name for the new map,
"BPF_MAP_TYPE_DEVMAP", or the filename it is placed in,
"kernel/bpf/devmap.c", as we want to take advantage of compiler
inlining for the next redirect map types: (1) the name doesn't tell
the user that this map is connected to the redirect_map call, and
(2) we want to introduce other kinds of redirect maps (like redirect
to CPUs and sockets), and it would be good if they shared a common
"text" string.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 03/12] xdp: add bpf_redirect helper function
  2017-07-07 17:35 ` [RFC PATCH 03/12] xdp: add bpf_redirect helper function John Fastabend
@ 2017-07-09 13:37   ` Saeed Mahameed
  2017-07-10 17:23     ` John Fastabend
  0 siblings, 1 reply; 43+ messages in thread
From: Saeed Mahameed @ 2017-07-09 13:37 UTC (permalink / raw)
  To: John Fastabend, netdev, davem; +Cc: brouer, andy, daniel, ast



On 7/7/2017 8:35 PM, John Fastabend wrote:
> This adds support for a bpf_redirect helper function to the XDP
> infrastructure. For now this only supports redirecting to the egress
> path of a port.
>
> In order to support drivers handling an xdp_buff natively this patch
> uses a new ndo operation ndo_xdp_xmit() that pushes an xdp_buff
> to the specified device.
>
> If the program specifies either (a) an unknown device or (b) a device
> that does not support the operation a BPF warning is thrown and the
> XDP_ABORTED error code is returned.
>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/linux/filter.h    |    4 +++
>  include/linux/netdevice.h |    6 +++++
>  include/uapi/linux/bpf.h  |    1 +
>  net/core/filter.c         |   53 +++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 64 insertions(+)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1fa26dc..d0a1279 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -667,7 +667,11 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *filter,
>
>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>  				       const struct bpf_insn *patch, u32 len);
> +
> +int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp);
> +
>  void bpf_warn_invalid_xdp_action(u32 act);
> +void bpf_warn_invalid_xdp_redirect(u32 ifindex);
>
>  #ifdef CONFIG_BPF_JIT
>  extern int bpf_jit_enable;
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 85f01d6..49e8c12 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -66,6 +66,7 @@
>  /* UDP Tunnel offloads */
>  struct udp_tunnel_info;
>  struct bpf_prog;
> +struct xdp_buff;
>
>  void netdev_set_default_ethtool_ops(struct net_device *dev,
>  				    const struct ethtool_ops *ops);
> @@ -1138,6 +1139,9 @@ struct xfrmdev_ops {
>   * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);
>   *	This function is used to set or query state related to XDP on the
>   *	netdevice. See definition of enum xdp_netdev_command for details.
> + * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
> + *	This function is used to submit a XDP packet for transmit on a
> + *	netdevice.
>   *
>   */
>  struct net_device_ops {
> @@ -1323,6 +1327,8 @@ struct net_device_ops {
>  						       int needed_headroom);
>  	int			(*ndo_xdp)(struct net_device *dev,
>  					   struct netdev_xdp *xdp);
> +	int			(*ndo_xdp_xmit)(struct net_device *dev,
> +						struct xdp_buff *xdp);
>  };
>
>  /**
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f94b48b..e1f3827 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -689,6 +689,7 @@ enum xdp_action {
>  	XDP_DROP,
>  	XDP_PASS,
>  	XDP_TX,
> +	XDP_REDIRECT,
>  };
>
>  /* user accessible metadata for XDP packet hook
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b39c869..5c9fe3e 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2298,6 +2298,51 @@ static int bpf_skb_trim_rcsum(struct sk_buff *skb, unsigned int new_len)
>  	.arg2_type	= ARG_ANYTHING,
>  };
>
> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
> +{
> +	if (dev->netdev_ops->ndo_xdp_xmit) {
> +		dev->netdev_ops->ndo_xdp_xmit(dev, xdp);

Hi John,

I have some concern here regarding synchronization between the
redirecting device and the target device:

If the target device's NAPI is also doing XDP_TX on the same XDP TX
ring that this NDO might be redirecting xdp packets into, there would
be a race accessing the ring's resources (buffers and descriptors).
Maybe you addressed this issue in the device driver implementation of
this ndo or with some NAPI tricks/assumptions. I guess we have the same
issue if you run the same program to redirect traffic from multiple
netdevices into one netdevice; how do you synchronize access to this
TX ring?

Maybe we need some clear guidelines in this ndo's documentation stating
how to implement it, what the assumptions are for those XDP TX redirect
rings, and from which context this ndo can run.

Can you please elaborate?

Thanks,
Saeed.

> +		return 0;
> +	}
> +	bpf_warn_invalid_xdp_redirect(dev->ifindex);
> +	return -EOPNOTSUPP;
> +}
> +
> +int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp)
> +{
> +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> +
> +	dev = dev_get_by_index_rcu(dev_net(dev), ri->ifindex);
> +	ri->ifindex = 0;
> +	if (unlikely(!dev)) {
> +		bpf_warn_invalid_xdp_redirect(ri->ifindex);
> +		return -EINVAL;
> +	}
> +
> +	return __bpf_tx_xdp(dev, xdp);
> +}
> +EXPORT_SYMBOL_GPL(xdp_do_redirect);
> +
> +BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
> +{
> +	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> +
> +	if (unlikely(flags))
> +		return XDP_ABORTED;
> +
> +	ri->ifindex = ifindex;
> +	ri->flags = flags;
> +	return XDP_REDIRECT;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_redirect_proto = {
> +	.func           = bpf_xdp_redirect,
> +	.gpl_only       = false,
> +	.ret_type       = RET_INTEGER,
> +	.arg1_type      = ARG_ANYTHING,
> +	.arg2_type      = ARG_ANYTHING,
> +};
> +
>  bool bpf_helper_changes_pkt_data(void *func)
>  {
>  	if (func == bpf_skb_vlan_push ||
> @@ -2790,6 +2835,8 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
>  		return &bpf_get_smp_processor_id_proto;
>  	case BPF_FUNC_xdp_adjust_head:
>  		return &bpf_xdp_adjust_head_proto;
> +	case BPF_FUNC_redirect:
> +		return &bpf_xdp_redirect_proto;
>  	default:
>  		return bpf_base_func_proto(func_id);
>  	}
> @@ -3110,6 +3157,12 @@ void bpf_warn_invalid_xdp_action(u32 act)
>  }
>  EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>
> +void bpf_warn_invalid_xdp_redirect(u32 ifindex)
> +{
> +	WARN_ONCE(1, "Illegal XDP redirect to unsupported device ifindex(%i)\n", ifindex);
> +}
> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_redirect);
> +
>  static u32 bpf_convert_ctx_access(enum bpf_access_type type,
>  				  const struct bpf_insn *si,
>  				  struct bpf_insn *insn_buf,
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 03/12] xdp: add bpf_redirect helper function
  2017-07-09 13:37   ` Saeed Mahameed
@ 2017-07-10 17:23     ` John Fastabend
  2017-07-11 14:09       ` Andy Gospodarek
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-10 17:23 UTC (permalink / raw)
  To: Saeed Mahameed, netdev, davem; +Cc: brouer, andy, daniel, ast

On 07/09/2017 06:37 AM, Saeed Mahameed wrote:
> 
> 
> On 7/7/2017 8:35 PM, John Fastabend wrote:
>> This adds support for a bpf_redirect helper function to the XDP
>> infrastructure. For now this only supports redirecting to the egress
>> path of a port.
>>
>> In order to support drivers handling an xdp_buff natively this patch
>> uses a new ndo operation ndo_xdp_xmit() that pushes an xdp_buff
>> to the specified device.
>>
>> If the program specifies either (a) an unknown device or (b) a device
>> that does not support the operation a BPF warning is thrown and the
>> XDP_ABORTED error code is returned.
>>
>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---

[...]

>>
>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
>> +{
>> +    if (dev->netdev_ops->ndo_xdp_xmit) {
>> +        dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
> 
> Hi John,
> 
> I have some concern here regarding synchronization between the
> redirecting device and the target device:
> 
> If the target device's NAPI is also doing XDP_TX on the same XDP TX
> ring that this NDO might be redirecting xdp packets into, there would
> be a race accessing the ring's resources (buffers and descriptors).
> Maybe you addressed this issue in the device driver implementation of
> this ndo or with some NAPI tricks/assumptions. I guess we have the
> same issue if you run the same program to redirect traffic from
> multiple netdevices into one netdevice; how do you synchronize access
> to this TX ring?

The implementation uses a per-cpu TX ring to resolve these races, and
the pair of driver interface API calls, xdp_do_redirect() and
xdp_do_flush_map(), must both complete within a single poll() handler.

This comment was included in the header file to document this,

/* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
 * same CPU context. Further, for best results no more than a single map
 * for the do_redirect/do_flush pair should be used. This limitation is
 * because we only track one map and force a flush when the map changes.
 * This does not appear to be a real limitation for existing software.
 */
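
Concretely, the expected driver-side pattern looks roughly like the
sketch below (a hypothetical example_* driver; only bpf_prog_run_xdp(),
xdp_do_redirect() and xdp_do_flush_map() are interfaces from this
series, and error handling is elided):

static int example_poll(struct napi_struct *napi, int budget)
{
	struct example_priv *priv = container_of(napi, struct example_priv, napi);
	struct xdp_buff xdp;
	int done = 0;

	while (done < budget && example_next_rx_frame(priv, &xdp)) {
		u32 act = bpf_prog_run_xdp(priv->xdp_prog, &xdp);

		if (act == XDP_REDIRECT)
			xdp_do_redirect(priv->netdev, &xdp, priv->xdp_prog);
		/* ... XDP_PASS/XDP_TX/XDP_DROP handling elided ... */
		done++;
	}

	/* Flush exactly once, on the same CPU that did the redirects, so
	 * the batched descriptors hit the wire before we leave poll().
	 */
	xdp_do_flush_map();

	return done;
}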

In general some documentation about implementing XDP would probably be
useful to add in Documentation/networking but this IMO goes beyond just
this patch series.

> 
> Maybe we need some clear guidelines in this ndo's documentation
> stating how to implement it, what the assumptions are for those XDP
> TX redirect rings, and from which context this ndo can run.
> 
> Can you please elaborate?

I think the best implementation is to use a per-cpu TX ring as I did in
this series. If your device is limited in the number of queues for some
reason, some other scheme would need to be devised. Unfortunately, the only
thing I've come up with for this case (using only this series) would both
impact performance and make the code complex.

A nice solution might be to constrain networking "tasks" to only a subset
of cores. For 64+ core systems this might be a good idea. It would allow
avoiding locks by using per-cpu logic, while also keeping networking from
consuming slices of every core in the system. As core count goes up I think
we will eventually need to address this. I believe Eric was thinking along
these lines in his netconf talk, iirc. Obviously this work is way outside
the scope of this series though.


> Thanks,
> Saeed.
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 10/12] xdp: Add batching support to redirect map
  2017-07-07 17:37 ` [RFC PATCH 10/12] xdp: Add batching support to redirect map John Fastabend
@ 2017-07-10 17:53   ` Jesper Dangaard Brouer
  2017-07-10 17:56     ` John Fastabend
  0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-10 17:53 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, davem, andy, daniel, ast, brouer

On Fri, 07 Jul 2017 10:37:59 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 36dc13de..656e334 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
[...]
>  
> +void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
> +{
> +	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
> +	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
> +
> +	set_bit(key, bitmap);
> +}

I don't like that this adds an atomic op (set_bit) per packet on the
fast path. It shows up as #6 in perf top when running xdp_redirect_map.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 10/12] xdp: Add batching support to redirect map
  2017-07-10 17:53   ` Jesper Dangaard Brouer
@ 2017-07-10 17:56     ` John Fastabend
  0 siblings, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-10 17:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, davem, andy, daniel, ast

On 07/10/2017 10:53 AM, Jesper Dangaard Brouer wrote:
> On Fri, 07 Jul 2017 10:37:59 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>> index 36dc13de..656e334 100644
>> --- a/kernel/bpf/devmap.c
>> +++ b/kernel/bpf/devmap.c
> [...]
>>  
>> +void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
>> +{
>> +	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
>> +	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
>> +
>> +	set_bit(key, bitmap);
>> +}
> 
> I don't like that this adds an atomic op (set_bit) per packet on the
> fast path. It shows up as #6 in perf top when running xdp_redirect_map.
> 

It's a per-cpu bitmap, so __set_bit() should be fine here.
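
i.e. the non-atomic variant is sufficient because the bitmap is per-cpu
and only ever touched from the local CPU in NAPI context; the adjusted
helper would read:

void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
{
	struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);

	__set_bit(key, bitmap);	/* non-atomic; avoids the locked RMW */
}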

Thanks,
John

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-08 19:06     ` Jesper Dangaard Brouer
@ 2017-07-10 18:30       ` Jesper Dangaard Brouer
  2017-07-11  0:59         ` John Fastabend
  2017-07-11 15:36       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-10 18:30 UTC (permalink / raw)
  To: David Miller
  Cc: john.fastabend, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Sat, 8 Jul 2017 21:06:17 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: John Fastabend <john.fastabend@gmail.com>
> > Date: Fri, 07 Jul 2017 10:48:36 -0700
> >   
> > > On 07/07/2017 10:34 AM, John Fastabend wrote:    
> > >> This series adds two new XDP helper routines bpf_redirect() and
> > >> bpf_redirect_map(). The first variant bpf_redirect() is meant
> > >> to be used the same way it is currently being used by the cls_bpf
> > >> classifier. An xdp packet will be redirected immediately when this
> > >> is called.    
> > > 
> > > Also other than the typo in the title there ;) I'm going to CC
> > > the driver maintainers working on XDP (makes for a long CC list but)
> > > because we would want to try and get support in as many as possible in
> > > the next merge window.
> > > 
> > > For this rev I just implemented on ixgbe because I wrote the
> > > original XDP support there. I'll volunteer to do virtio as well.    
> > 
> > I went over this series a few times and it looks great to me.
> > You didn't even give me some coding style issues to pick on :-)  
> 
> We (Daniel, Andy and I) have been reviewing and improving on this
> patchset the last couple of weeks ;-).  We had some stability issues,
> which is why it wasn't published earlier. My plan is to test this
> latest patchset again, Monday and Tuesday. I'll try to assess stability
> and provide some performance numbers.


Damn, I thought it was stable. I have been running a lot of performance
tests, and then this just happened :-(

[11357.149486] BUG: unable to handle kernel NULL pointer dereference at 0000000000000210
[11357.157393] IP: __dev_map_flush+0x58/0x90
[11357.161446] PGD 3ff685067 
[11357.161446] P4D 3ff685067 
[11357.164199] PUD 3ff684067 
[11357.166947] PMD 0 
[11357.170396] 
[11357.173981] Oops: 0000 [#1] PREEMPT SMP
[11357.177859] Modules linked in: coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore intel_rapl_perf mxm_wmi i2c_i801 pcspkr sg i2c_co]
[11357.203021] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.12.0-net-next-xdp_redirect02-rfc+ #135
[11357.211706] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[11357.221606] task: ffffffff81c0e480 task.stack: ffffffff81c00000
[11357.227568] RIP: 0010:__dev_map_flush+0x58/0x90
[11357.232138] RSP: 0018:ffff88041fa03de0 EFLAGS: 00010286
[11357.237409] RAX: 0000000000000000 RBX: ffff8803fc996600 RCX: 0000000000000003
[11357.244589] RDX: ffff88040d0bf480 RSI: 0000000000000000 RDI: ffffffff81d901d8
[11357.251767] RBP: ffff88041fa03df8 R08: fffffffffffffffc R09: 0000000700000008
[11357.258940] R10: ffffffff813f11d0 R11: 0000000000000040 R12: ffffe8ffffc014a0
[11357.266119] R13: 0000000000000003 R14: 000000000000003c R15: ffff8803fc9c26c0
[11357.273294] FS:  0000000000000000(0000) GS:ffff88041fa00000(0000) knlGS:0000000000000000
[11357.281454] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11357.287244] CR2: 0000000000000210 CR3: 00000003fc41e000 CR4: 00000000001406f0
[11357.294423] Call Trace:
[11357.296912]  <IRQ>
[11357.298967]  xdp_do_flush_map+0x36/0x40
[11357.302847]  ixgbe_poll+0x7ea/0x1370 [ixgbe]
[11357.307160]  net_rx_action+0x247/0x3e0
[11357.310957]  __do_softirq+0x106/0x2cb
[11357.314664]  irq_exit+0xbe/0xd0
[11357.317851]  do_IRQ+0x4f/0xd0
[11357.320858]  common_interrupt+0x86/0x86
[11357.324733] RIP: 0010:poll_idle+0x2f/0x5a
[11357.328781] RSP: 0018:ffffffff81c03dd0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff8e
[11357.336426] RAX: 0000000000000000 RBX: ffffffff81d689e0 RCX: 0000000000200000
[11357.343605] RDX: 0000000000000000 RSI: ffffffff81c0e480 RDI: ffff88041fa22800
[11357.350783] RBP: ffffffff81c03dd0 R08: 00000000000003c5 R09: 0000000000000018
[11357.357958] R10: 0000000000000327 R11: 0000000000000390 R12: ffff88041fa22800
[11357.365135] R13: ffffffff81d689f8 R14: 0000000000000000 R15: ffffffff81d689e0
[11357.372311]  </IRQ>
[11357.374453]  cpuidle_enter_state+0xf2/0x2e0
[11357.378678]  cpuidle_enter+0x17/0x20
[11357.382299]  call_cpuidle+0x23/0x40
[11357.385834]  do_idle+0xe8/0x190
[11357.389024]  cpu_startup_entry+0x1d/0x20
[11357.392993]  rest_init+0xd5/0xe0
[11357.396268]  start_kernel+0x3d7/0x3e4
[11357.399979]  x86_64_start_reservations+0x2a/0x2c
[11357.404641]  x86_64_start_kernel+0x178/0x18b
[11357.408959]  secondary_startup_64+0x9f/0x9f
[11357.413186]  ? secondary_startup_64+0x9f/0x9f
[11357.417589] Code: 41 89 c5 48 8b 53 60 44 89 e8 48 8d 14 c2 48 8b 12 48 85 d2 74 23 48 8b 3a f0 49 0f b3 04 24 48 85 ff 74 15 48 8b 87 e0 0 
[11357.436565] RIP: __dev_map_flush+0x58/0x90 RSP: ffff88041fa03de0
[11357.442613] CR2: 0000000000000210
[11357.445972] ---[ end trace f7ed232095169a98 ]---
[11357.450637] Kernel panic - not syncing: Fatal exception in interrupt
[11357.457038] Kernel Offset: disabled
[11357.460566] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

(gdb) list *(__dev_map_flush)+0x58
0xffffffff811422a8 is in __dev_map_flush (kernel/bpf/devmap.c:257).
252				continue;
253	
254			netdev = dev->dev;
255	
256			clear_bit(bit, bitmap);
257			if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush))
258				continue;
259	
260			netdev->netdev_ops->ndo_xdp_flush(netdev);
261		}

My analysis is that "netdev->netdev_ops == NULL", as:

 NULL pointer dereference at 0000000000000210
 $ echo $((0x210))
 528

As ndo_xdp_flush is at offset 528 in struct net_device_ops.

 $ pahole -C net_device_ops vmlinux
 void (*ndo_xdp_flush)(struct net_device *); /*   528     8 */

But I cannot see where/what would change netdev->netdev_ops to NULL?!?
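
A quick way to see the arithmetic (illustrative only; the 528 matches the
pahole output above and is config-dependent):

 /* If netdev->netdev_ops is NULL, loading ->ndo_xdp_flush dereferences
  * 0 + offsetof(struct net_device_ops, ndo_xdp_flush),
  * i.e. 0 + 528 = 0x210 -- exactly the CR2 fault address above.
  */
 pr_info("ndo_xdp_flush offset = %zu\n",
         offsetof(struct net_device_ops, ndo_xdp_flush));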


> I've complained/warned about the danger of redirecting with XDP,
> without providing (1) a way to debug/see XDP redirects, (2) a way
> interfaces opt-in they can be redirected. (1) is solved by patch-07/12
> via a tracepoint. (2) is currently done by only forwarding to
> interfaces with an XDP program loaded itself, this also comes from a
> practical need for NIC drivers to allocate XDP-TX HW queues.
> 
> I'm not satisfied with the (UAPI) name for the new map
> "BPF_MAP_TYPE_DEVMAP" and the filename this is placed in
> "kernel/bpf/devmap.c", as we want to take advantage of compiler
> inlining for the next redirect map types.  (1) because the name doesn't
> tell the user that this map is connected to the redirect_map call.
> (2) we want to introduce other kinds of redirect maps (like redirect to
> CPUs and sockets), and it would be good if they shared a common "text"
> string.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-10 18:30       ` Jesper Dangaard Brouer
@ 2017-07-11  0:59         ` John Fastabend
  2017-07-11 14:23           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-11  0:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David Miller
  Cc: netdev, andy, daniel, ast, alexander.duyck, bjorn.topel,
	jakub.kicinski, ecree, sgoutham, Yuval.Mintz, saeedm

On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote:
> On Sat, 8 Jul 2017 21:06:17 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
>> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST)
>> David Miller <davem@davemloft.net> wrote:
>>
>>> From: John Fastabend <john.fastabend@gmail.com>
>>> Date: Fri, 07 Jul 2017 10:48:36 -0700
>>>   
>>>> On 07/07/2017 10:34 AM, John Fastabend wrote:    
>>>>> This series adds two new XDP helper routines bpf_redirect() and
>>>>> bpf_redirect_map(). The first variant bpf_redirect() is meant
>>>>> to be used the same way it is currently being used by the cls_bpf
>>>>> classifier. An xdp packet will be redirected immediately when this
>>>>> is called.    
>>>>
>>>> Also other than the typo in the title there ;) I'm going to CC
>>>> the driver maintainers working on XDP (makes for a long CC list but)
>>>> because we would want to try and get support in as many as possible in
>>>> the next merge window.
>>>>
>>>> For this rev I just implemented on ixgbe because I wrote the
>>>> original XDP support there. I'll volunteer to do virtio as well.    
>>>
>>> I went over this series a few times and it looks great to me.
>>> You didn't even give me some coding style issues to pick on :-)  
>>
>> We (Daniel, Andy and I) have been reviewing and improving on this
>> patchset the last couple of weeks ;-).  We had some stability issues,
>> which is why it wasn't published earlier. My plan is to test this
>> latest patchset again, Monday and Tuesday. I'll try to assess stability
>> and provide some performance numbers.
> 
> 
> Damn, I thought it was stable; I have been running a lot of performance
> tests, and then this just happened :-(

Thanks, I'll take a look through the code and see if I can come up with
why this might happen. I haven't hit it on my tests yet though.

.John

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 03/12] xdp: add bpf_redirect helper function
  2017-07-10 17:23     ` John Fastabend
@ 2017-07-11 14:09       ` Andy Gospodarek
  2017-07-11 18:38         ` John Fastabend
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Gospodarek @ 2017-07-11 14:09 UTC (permalink / raw)
  To: John Fastabend
  Cc: Saeed Mahameed, netdev, David Miller, Jesper Dangaard Brouer,
	Daniel Borkmann, Alexei Starovoitov

On Mon, Jul 10, 2017 at 1:23 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 07/09/2017 06:37 AM, Saeed Mahameed wrote:
>>
>>
>> On 7/7/2017 8:35 PM, John Fastabend wrote:
>>> This adds support for a bpf_redirect helper function to the XDP
>>> infrastructure. For now this only supports redirecting to the egress
>>> path of a port.
>>>
>>> In order to support drivers handling a xdp_buff natively this patch
>>> uses a new ndo operation ndo_xdp_xmit() that pushes a xdp_buff
>>> to the specified device.
>>>
>>> If the program specifies either (a) an unknown device or (b) a device
>>> that does not support the operation a BPF warning is thrown and the
>>> XDP_ABORTED error code is returned.
>>>
>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>> ---
>
> [...]
>
>>>
>>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
>>> +{
>>> +    if (dev->netdev_ops->ndo_xdp_xmit) {
>>> +        dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
>>
>> Hi John,
>>
>> I have some concern here regarding synchronizing between the
>> redirecting device and the target device:
>>
>> if the target device's NAPI is also doing XDP_TX on the same XDP TX
>> ring which this NDO might be redirecting xdp packets into the same
>> ring, there would be a race accessing this ring resources (buffers
>> and descriptors). Maybe you addressed this issue in the device driver
>> implementation of this ndo or with some NAPI tricks/assumptions, I
>> guess we have the same issue for if you run the same program to
>> redirect traffic from multiple netdevices into one netdevice, how do
>> you synchronize accessing this TX ring ?
>
> The implementation uses a per cpu TX ring to resolve these races. And
> the pair of driver interface API calls, xdp_do_redirect() and xdp_do_flush_map()
> must be completed in a single poll() handler.
>
> This comment was included in the header file to document this,
>
> /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
>  * same cpu context. Further for best results no more than a single map
>  * for the do_redirect/do_flush pair should be used. This limitation is
>  * because we only track one map and force a flush when the map changes.
>  * This does not appear to be a real limitation for existing software.
>  */
>
> In general some documentation about implementing XDP would probably be
> useful to add in Documentation/networking but this IMO goes beyond just
> this patch series.
>
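
To make the contract above concrete, the pairing amounts to roughly this
shape in a driver's NAPI poll loop -- a sketch only: the example_* names,
priv-local netdev/xdp_prog, and abbreviated helper signatures are all
assumptions, not code from the series:

	static int example_poll(struct napi_struct *napi, int budget)
	{
		struct example_priv *priv = example_priv(napi);	/* assumed */
		int work = 0;

		while (work < budget && example_rx_pending(priv)) {
			struct xdp_buff xdp;
			u32 act;

			example_fill_xdp_buff(priv, &xdp);	/* assumed */
			act = bpf_prog_run_xdp(priv->xdp_prog, &xdp);
			if (act == XDP_REDIRECT)
				xdp_do_redirect(priv->netdev, &xdp);
			work++;
		}
		xdp_do_flush_map();	/* once per poll, same cpu context */
		return work;
	}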
>>
>> Maybe we need some clear guidelines in this ndo documentation stating
>> how to implement this ndo and what are the assumptions on those XDP
>> TX redirect rings or from which context this ndo can run.
>>
>> can you please elaborate.
>
> I think the best implementation is to use a per cpu TX ring as I did in
> this series. If your device is limited by the number of queues for some
> reason some other scheme would need to be devised. Unfortunately, the only
> thing I've come up for this case (using only this series) would both impact
> performance and make the code complex.
>
> A nice solution might be to constrain networking "tasks" to only a subset
> of cores. For 64+ core systems this might be a good idea. It would allow
> avoiding locking using per_cpu logic but also avoid networking consuming
> slices of every core in the system. As core count goes up I think we will
> eventually need to address this. I believe Eric was thinking along these
> lines with his netconf talk iirc. Obviously this work is way outside the
> scope of this series though.

I agree that it is outside the scope of this series, but I think it is
important to consider the impact of the output queue selection in both
a heterogeneous and homogeneous driver setup, and how tx could be
optimized or even made more reliable; I think that was
part of Saeed's point.

I got base redirect support for bnxt_en working yesterday, but for it
and other drivers that do not necessarily create a ring/queue per core
like ixgbe there is probably a bit more work to do in each driver to
properly track output tx rings/queues than what you have done with
ixgbe.

>
>
>> Thanks,
>> Saeed.
>>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11  0:59         ` John Fastabend
@ 2017-07-11 14:23           ` Jesper Dangaard Brouer
  2017-07-11 18:26             ` John Fastabend
  0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-11 14:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Mon, 10 Jul 2017 17:59:17 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote:
> > On Sat, 8 Jul 2017 21:06:17 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> >> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST)
> >> David Miller <davem@davemloft.net> wrote:
> >>  
> >>> From: John Fastabend <john.fastabend@gmail.com>
> >>> Date: Fri, 07 Jul 2017 10:48:36 -0700
> >>>     
> >>>> On 07/07/2017 10:34 AM, John Fastabend wrote:      
> >>>>> This series adds two new XDP helper routines bpf_redirect() and
> >>>>> bpf_redirect_map(). The first variant bpf_redirect() is meant
> >>>>> to be used the same way it is currently being used by the cls_bpf
> >>>>> classifier. An xdp packet will be redirected immediately when this
> >>>>> is called.      
> >>>>
> >>>> Also other than the typo in the title there ;) I'm going to CC
> >>>> the driver maintainers working on XDP (makes for a long CC list but)
> >>>> because we would want to try and get support in as many as possible in
> >>>> the next merge window.
> >>>>
> >>>> For this rev I just implemented on ixgbe because I wrote the
> >>>> original XDP support there. I'll volunteer to do virtio as well.      
> >>>
> >>> I went over this series a few times and it looks great to me.
> >>> You didn't even give me some coding style issues to pick on :-)    
> >>
> >> We (Daniel, Andy and I) have been reviewing and improving on this
> >> patchset the last couple of weeks ;-).  We had some stability issues,
> >> which is why it wasn't published earlier. My plan is to test this
> >> latest patchset again, Monday and Tuesday. I'll try to assess stability
> >> and provide some performance numbers.  
> > 
> > 
> > Damn, I thought it was stable; I have been running a lot of performance
> > tests, and then this just happened :-(  
> 
> Thanks, I'll take a look through the code and see if I can come up with
> why this might happen. I haven't hit it on my tests yet though.

I've figured out why this happens, and I have a fix, see patch below
with some comments and questions.

The problem is that we can leak map_to_flush in an error path, the fix:

diff --git a/net/core/filter.c b/net/core/filter.c
index 2ccd6ff09493..7f1f48668dcf 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2497,11 +2497,14 @@ int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
        ri->map = NULL;
 
        trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
-
+       // Q: Should we also trace "goto out" (failed lookup)?
+       //    like bpf_warn_invalid_xdp_redirect();
        return __bpf_tx_xdp(fwd, map, xdp, index);
 out:
        ri->ifindex = 0;
-       ri->map = NULL;
+       // XXX: here we could leak ri->map_to_flush, which could be
+       //      picked up later by xdp_do_flush_map()
+       xdp_do_flush_map(); /* Clears ri->map_to_flush + ri->map */
        return -EINVAL;


While debugging this, I noticed that we can have packets in-flight
while the XDP RX rings are being reconfigured.  I wonder if this is an
ixgbe driver XDP-bug?  I think it would be best to add an
RCU barrier after ixgbe_setup_tc().

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ed97aa81a850..4872fbb54ecd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9801,7 +9804,18 @@ static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
 
        /* If transitioning XDP modes reconfigure rings */
        if (!!prog != !!old_prog) {
-               int err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev));
+               // XXX: Warn pkts can be in-flight in old_prog
+               //      while ixgbe_setup_tc() calls ixgbe_close(dev).
+               //
+               // Should we avoid these in-flight packets?
+               // Would it be enough to add an synchronize_rcu()
+               // or rcu_barrier()?
+               // or do we need an napi_synchronize() call here?
+               //
+               int err;
+               netdev_info(dev,
+                           "Calling ixgbe_setup_tc() to reconfig XDP rings\n");
+               err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev));
 
                if (err) {
                        rcu_assign_pointer(adapter->xdp_prog, old_prog);

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-08 19:06     ` Jesper Dangaard Brouer
  2017-07-10 18:30       ` Jesper Dangaard Brouer
@ 2017-07-11 15:36       ` Jesper Dangaard Brouer
  2017-07-11 17:48         ` John Fastabend
  1 sibling, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-11 15:36 UTC (permalink / raw)
  To: David Miller
  Cc: john.fastabend, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Sat, 8 Jul 2017 21:06:17 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> My plan is to test this latest patchset again, Monday and Tuesday.
> I'll try to assess stability and provide some performance numbers.

Performance numbers:

 14378479 pkt/s = XDP_DROP without touching memory
  9222401 pkt/s = xdp1: XDP_DROP with reading packet data
  6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
  4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
  5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap

The performance drop between xdp2 and xdp_redirect was expected due
to the HW-tailptr flush per packet, which is costly.

 (1/6344472-1/4595574)*10^9 = -59.98 ns

The performance drop between xdp2 and xdp_redirect_map is higher than
I expected, which is not good!  The avoidance of the tailptr flush per
packet was expected to give a higher boost.  The cost increased by
40 ns, which is too high compared to the code added (on a 4GHz machine
approx 160 cycles).

 (1/6344472-1/5066243)*10^9 = -39.77 ns

This system doesn't have DDIO, thus we are stalling on cache-misses,
but I was actually expecting that the added code could "hide" behind
these cache-misses.

I'm somewhat surprised to see this large a performance drop.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Results::

 # XDP_DROP with reading packet data
 [jbrouer@canyon bpf]$ sudo ./xdp1 3
 proto 17:    6449727 pkt/s
 proto 17:    9222639 pkt/s
 proto 17:    9222401 pkt/s
 proto 17:    9223083 pkt/s
 proto 17:    9223515 pkt/s
 proto 17:    9222477 pkt/s
 ^C

 # XDP_TX with swap mac
 [jbrouer@canyon bpf]$ sudo ./xdp2 3
 proto 17:     934682 pkt/s
 proto 17:    6344845 pkt/s
 proto 17:    6344472 pkt/s
 proto 17:    6345265 pkt/s
 proto 17:    6345238 pkt/s
 proto 17:    6345338 pkt/s
 ^C

 # XDP_REDIRECT with swap mac (simulate XDP_TX via same ifindex)
 [jbrouer@canyon bpf]$ sudo ./xdp_redirect 3 3
 ifindex 3:     749567 pkt/s
 ifindex 3:    4595025 pkt/s
 ifindex 3:    4595574 pkt/s
 ifindex 3:    4595429 pkt/s
 ifindex 3:    4595340 pkt/s
 ifindex 3:    4595352 pkt/s
 ifindex 3:    4595364 pkt/s
 ^C

  # XDP_REDIRECT with swap mac + devmap (still simulate XDP_TX)
 [jbrouer@canyon bpf]$ sudo ./xdp_redirect_map 3 3
 map[0] (vports) = 4, map[1] (map) = 5, map[2] (count) = 0
 ifindex 3:    3076506 pkt/s
 ifindex 3:    5066282 pkt/s
 ifindex 3:    5066243 pkt/s
 ifindex 3:    5067376 pkt/s
 ifindex 3:    5067226 pkt/s
 ifindex 3:    5067622 pkt/s

My own tools::

 [jbrouer@canyon prototype-kernel]$
   sudo ./xdp_bench01_mem_access_cost --dev ixgbe1 --sec 2 \
    --action XDP_DROP
 XDP_action   pps        pps-human-readable mem      
 XDP_DROP     0          0                  no_touch 
 XDP_DROP     9894401    9,894,401          no_touch 
 XDP_DROP     14377459   14,377,459         no_touch 
 XDP_DROP     14378228   14,378,228         no_touch 
 XDP_DROP     14378400   14,378,400         no_touch 
 XDP_DROP     14378319   14,378,319         no_touch 
 XDP_DROP     14378479   14,378,479         no_touch 
 XDP_DROP     14377332   14,377,332         no_touch 
 XDP_DROP     14378411   14,378,411         no_touch 
 XDP_DROP     14378095   14,378,095         no_touch 
 ^CInterrupted: Removing XDP program on ifindex:3 device:ixgbe1

 [jbrouer@canyon prototype-kernel]$
  sudo ./xdp_bench01_mem_access_cost --dev ixgbe1 --sec 2 \
   --action XDP_DROP --read
 XDP_action   pps        pps-human-readable mem      
 XDP_DROP     0          0                  read     
 XDP_DROP     6994114    6,994,114          read     
 XDP_DROP     8979414    8,979,414          read     
 XDP_DROP     8979636    8,979,636          read     
 XDP_DROP     8980087    8,980,087          read     
 XDP_DROP     8979097    8,979,097          read     
 XDP_DROP     8978970    8,978,970          read     
 ^CInterrupted: Removing XDP program on ifindex:3 device:ixgbe1

 [jbrouer@canyon prototype-kernel]$
  sudo ./xdp_bench01_mem_access_cost --dev ixgbe1 --sec 2 \
  --action XDP_TX --swap --read
 XDP_action   pps        pps-human-readable mem      
 XDP_TX       0          0                  swap_mac 
 XDP_TX       2141556    2,141,556          swap_mac 
 XDP_TX       6171984    6,171,984          swap_mac 
 XDP_TX       6171955    6,171,955          swap_mac 
 XDP_TX       6171767    6,171,767          swap_mac 
 XDP_TX       6171680    6,171,680          swap_mac 
 XDP_TX       6172201    6,172,201          swap_mac 
 ^CInterrupted: Removing XDP program on ifindex:3 device:ixgbe1


Setting tuned-adm network-latency ::

 $ sudo tuned-adm list
 [...]
 Current active profile: network-latency

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 15:36       ` Jesper Dangaard Brouer
@ 2017-07-11 17:48         ` John Fastabend
  2017-07-11 18:01           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-11 17:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, David Miller
  Cc: netdev, andy, daniel, ast, alexander.duyck, bjorn.topel,
	jakub.kicinski, ecree, sgoutham, Yuval.Mintz, saeedm

On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:
> On Sat, 8 Jul 2017 21:06:17 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
>> My plan is to test this latest patchset again, Monday and Tuesday.
>> I'll try to assess stability and provide some performance numbers.
> 
> Performance numbers:
> 
>  14378479 pkt/s = XDP_DROP without touching memory
>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> 
> The performance drop between xdp2 and xdp_redirect was expected due
> to the HW-tailptr flush per packet, which is costly.
> 
>  (1/6344472-1/4595574)*10^9 = -59.98 ns
> 
> The performance drop between xdp2 and xdp_redirect_map is higher than
> I expected, which is not good!  The avoidance of the tailptr flush per
> packet was expected to give a higher boost.  The cost increased by
> 40 ns, which is too high compared to the code added (on a 4GHz machine
> approx 160 cycles).
> 
>  (1/6344472-1/5066243)*10^9 = -39.77 ns
> 
> This system doesn't have DDIO, thus we are stalling on cache-misses,
> but I was actually expecting that the added code could "hide" behind
> these cache-misses.
> 
> I'm somewhat surprised to see this large a performance drop.
> 

Yep, although there is room for optimizations in the code path for sure. And
5Mpps is not horrible; my preference is to get this series in, plus any
small optimizations we come up with while the merge window is closed. Then
follow-up patches can do optimizations.

One easy optimization is to get rid of the atomic bitops. They are not needed
here; we have a per-cpu unsigned long. Another easy one would be to move
some of the checks out of the hotpath. For example, checking for the ndo_xdp_xmit
and flush ops on the net device in the hotpath really should be done in the
slow path (see the sketch after the diff below).

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 1191060..899364d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -213,7 +213,7 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
        struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
        unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
 
-       set_bit(key, bitmap);
+       __set_bit(key, bitmap);
 }
 
 struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
@@ -253,7 +253,7 @@ void __dev_map_flush(struct bpf_map *map)
 
                netdev = dev->dev;
 
-               clear_bit(bit, bitmap);
+               __clear_bit(bit, bitmap);
                if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush))
                        continue;
 
@@ -287,7 +287,7 @@ static void dev_map_flush_old(struct bpf_dtab_netdev *old_dev)
 
                for_each_online_cpu(cpu) {
                        bitmap = per_cpu_ptr(old_dev->dtab->flush_needed, cpu);
-                       clear_bit(old_dev->key, bitmap);
+                       __clear_bit(old_dev->key, bitmap);
 
                        fl->netdev_ops->ndo_xdp_flush(old_dev->dev);
                }
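
To illustrate moving the ndo checks out of the hotpath, a rough sketch
(function name and placement are illustrative, not part of the series) of
validating the ops once at map-update time, so __dev_map_flush() would not
need to re-check them per invocation:

/* Hypothetical slow-path check, run when a netdev is inserted into
 * the devmap; with this guarantee the flush loop could drop its
 * per-invocation ndo_xdp_flush test.
 */
static int dev_map_check_xdp_ops(const struct net_device *netdev)
{
	if (!netdev->netdev_ops->ndo_xdp_xmit ||
	    !netdev->netdev_ops->ndo_xdp_flush)
		return -EOPNOTSUPP;
	return 0;
}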

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 17:48         ` John Fastabend
@ 2017-07-11 18:01           ` Jesper Dangaard Brouer
  2017-07-11 18:29             ` John Fastabend
  2017-07-11 18:44             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-11 18:01 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Tue, 11 Jul 2017 10:48:29 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:
> > On Sat, 8 Jul 2017 21:06:17 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> >> My plan is to test this latest patchset again, Monday and Tuesday.
> >> I'll try to assess stability and provide some performance numbers.  
> > 
> > Performance numbers:
> > 
> >  14378479 pkt/s = XDP_DROP without touching memory
> >   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
> >   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> >   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
> >   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> > 
> > The performance drop between xdp2 and xdp_redirect was expected due
> > to the HW-tailptr flush per packet, which is costly.
> > 
> >  (1/6344472-1/4595574)*10^9 = -59.98 ns
> > 
> > The performance drop between xdp2 and xdp_redirect_map is higher than
> > I expected, which is not good!  The avoidance of the tailptr flush per
> > packet was expected to give a higher boost.  The cost increased by
> > 40 ns, which is too high compared to the code added (on a 4GHz machine
> > approx 160 cycles).
> > 
> >  (1/6344472-1/5066243)*10^9 = -39.77 ns
> > 
> > This system doesn't have DDIO, thus we are stalling on cache-misses,
> > but I was actually expecting that the added code could "hide" behind
> > these cache-misses.
> > 
> > I'm somewhat surprised to see this large a performance drop.
> >   
> 
> Yep, although there is room for optimizations in the code path for sure. And
> 5mpps is not horrible my preference is to get this series in plus any
> small optimization we come up with while the merge window is closed. Then
> follow up patches can do optimizations.

IMHO 5Mpps is a very bad number for XDP.

> One easy optimization is to get rid of the atomic bitops. They are not needed
> here; we have a per-cpu unsigned long. Another easy one would be to move
> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
> and flush ops on the net device in the hotpath really should be done in the
> slow path.

I'm already running with a similar patch as below, but it
(surprisingly) only gave me a 3 ns improvement.  I also tried a
prefetchw() on xdp.data that gave me 10 ns (which is quite good).
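
For reference, that experiment is of this shape (sketch only; it assumes
the driver's local xdp_buff is named xdp and the attached program xdp_prog):

	/* Prefetch packet data for write before running the XDP program,
	 * since swap-mac / XDP_TX / redirect will write into the packet.
	 */
	prefetchw(xdp.data);
	act = bpf_prog_run_xdp(xdp_prog, &xdp);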

I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
has DDIO ... I have high hopes for this, as the major bottleneck on
this i7-4790K CPU @ 4.00GHz is clearly cache-misses.

Something is definitely wrong on this CPU, as perf stats show a very
bad utilization of the CPU pipeline with 0.89 insn per cycle.


> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
> index 1191060..899364d 100644
> --- a/kernel/bpf/devmap.c
> +++ b/kernel/bpf/devmap.c
> @@ -213,7 +213,7 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 key)
>         struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map);
>         unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);
>  
> -       set_bit(key, bitmap);
> +       __set_bit(key, bitmap);
>  }
>  
>  struct net_device  *__dev_map_lookup_elem(struct bpf_map *map, u32 key)
> @@ -253,7 +253,7 @@ void __dev_map_flush(struct bpf_map *map)
>  
>                 netdev = dev->dev;
>  
> -               clear_bit(bit, bitmap);
> +               __clear_bit(bit, bitmap);
>                 if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush))
>                         continue;
>  
> @@ -287,7 +287,7 @@ static void dev_map_flush_old(struct bpf_dtab_netdev *old_dev)
>  
>                 for_each_online_cpu(cpu) {
>                         bitmap = per_cpu_ptr(old_dev->dtab->flush_needed, cpu);
> -                       clear_bit(old_dev->key, bitmap);
> +                       __clear_bit(old_dev->key, bitmap);
>  
>                         fl->netdev_ops->ndo_xdp_flush(old_dev->dev);
>                 }
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 14:23           ` Jesper Dangaard Brouer
@ 2017-07-11 18:26             ` John Fastabend
  2017-07-13 11:14               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-11 18:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm

On 07/11/2017 07:23 AM, Jesper Dangaard Brouer wrote:
> On Mon, 10 Jul 2017 17:59:17 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote:
>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>   
>>>> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST)
>>>> David Miller <davem@davemloft.net> wrote:
>>>>  
>>>>> From: John Fastabend <john.fastabend@gmail.com>
>>>>> Date: Fri, 07 Jul 2017 10:48:36 -0700
>>>>>     
>>>>>> On 07/07/2017 10:34 AM, John Fastabend wrote:      
>>>>>>> This series adds two new XDP helper routines bpf_redirect() and
>>>>>>> bpf_redirect_map(). The first variant bpf_redirect() is meant
>>>>>>> to be used the same way it is currently being used by the cls_bpf
>>>>>>> classifier. An xdp packet will be redirected immediately when this
>>>>>>> is called.      
>>>>>>
>>>>>> Also other than the typo in the title there ;) I'm going to CC
>>>>>> the driver maintainers working on XDP (makes for a long CC list but)
>>>>>> because we would want to try and get support in as many as possible in
>>>>>> the next merge window.
>>>>>>
>>>>>> For this rev I just implemented on ixgbe because I wrote the
>>>>>> original XDP support there. I'll volunteer to do virtio as well.      
>>>>>
>>>>> I went over this series a few times and it looks great to me.
>>>>> You didn't even give me some coding style issues to pick on :-)    
>>>>
>>>> We (Daniel, Andy and I) have been reviewing and improving on this
>>>> patchset the last couple of weeks ;-).  We had some stability issues,
>>>> which is why it wasn't published earlier. My plan is to test this
>>>> latest patchset again, Monday and Tuesday. I'll try to assess stability
>>>> and provide some performance numbers.  
>>>
>>>
>>> Damn, I thought it was stable; I have been running a lot of performance
>>> tests, and then this just happened :-(  
>>
>> Thanks, I'll take a look through the code and see if I can come up with
>> why this might happen. I haven't hit it on my tests yet though.
> 
> I've figured out why this happens, and I have a fix, see patch below
> with some comments and questions.
> 

Awesome!

> The problem is that we can leak map_to_flush in an error path, the fix:
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2ccd6ff09493..7f1f48668dcf 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2497,11 +2497,14 @@ int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
>         ri->map = NULL;
>  
>         trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
> -
> +       // Q: Should we also trace "goto out" (failed lookup)?
> +       //    like bpf_warn_invalid_xdp_redirect();

Maybe another trace event? trace_xdp_redirect_failed()
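
Such a tracepoint could look roughly like this (pure sketch; the name comes
from the suggestion above and the fields are assumptions, mirroring what a
failed-lookup trace would want to record):

TRACE_EVENT(xdp_redirect_failed,

	TP_PROTO(const struct net_device *dev, u32 index, int err),

	TP_ARGS(dev, index, err),

	TP_STRUCT__entry(
		__field(int, ifindex)
		__field(u32, map_index)
		__field(int, err)
	),

	TP_fast_assign(
		__entry->ifindex	= dev->ifindex;
		__entry->map_index	= index;
		__entry->err		= err;
	),

	TP_printk("ifindex=%d map_index=%u err=%d",
		  __entry->ifindex, __entry->map_index, __entry->err)
);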

>         return __bpf_tx_xdp(fwd, map, xdp, index);
>  out:
>         ri->ifindex = 0;
> -       ri->map = NULL;
> +       // XXX: here we could leak ri->map_to_flush, which could be
> +       //      picked up later by xdp_do_flush_map()
> +       xdp_do_flush_map(); /* Clears ri->map_to_flush + ri->map */

+1 

Ah, the map lookup failed and we need to do the flush, nice catch.

>         return -EINVAL;
> 
> 
> While debugging this, I noticed that we can have packets in-flight
> while the XDP RX rings are being reconfigured.  I wonder if this is an
> ixgbe driver XDP-bug?  I think it would be best to add an
> RCU barrier after ixgbe_setup_tc().
> 

Actually I think a synchronize_sched() is needed after the IXGBE_DOWN bit
is set but before the xdp_tx queues are cleaned up. In practice the ixgbe_down/up
sequence has so many msleep() operations for napi cleanup and hardware sync
I would be really surprised if we ever hit this. But for correctness we should
likely add it.
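
Roughly where that could sit (sketch only; the exact placement in the ixgbe
down path is an assumption):

	/* Quiesce in-flight NAPI/XDP users before the XDP TX rings are
	 * reclaimed.  Illustrative placement inside the down sequence.
	 */
	set_bit(__IXGBE_DOWN, &adapter->state);
	synchronize_sched();	/* wait out RCU-sched readers, incl. NAPI poll */
	/* ... now safe to clean the xdp_ring[] resources ... */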

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index ed97aa81a850..4872fbb54ecd 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -9801,7 +9804,18 @@ static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog)
>  
>         /* If transitioning XDP modes reconfigure rings */
>         if (!!prog != !!old_prog) {
> -               int err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev));
> +               // XXX: Warn pkts can be in-flight in old_prog
> +               //      while ixgbe_setup_tc() calls ixgbe_close(dev).
> +               //
> +               // Should we avoid these in-flight packets?
> +               // Would it be enough to add an synchronize_rcu()
> +               // or rcu_barrier()?
> +               // or do we need an napi_synchronize() call here?
> +               //
> +               int err;
> +               netdev_info(dev,
> +                           "Calling ixgbe_setup_tc() to reconfig XDP rings\n");
> +               err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev));
>  
>                 if (err) {
>                         rcu_assign_pointer(adapter->xdp_prog, old_prog);
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 18:01           ` Jesper Dangaard Brouer
@ 2017-07-11 18:29             ` John Fastabend
  2017-07-11 18:44             ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 43+ messages in thread
From: John Fastabend @ 2017-07-11 18:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm

On 07/11/2017 11:01 AM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 10:48:29 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:
>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>   
>>>> My plan is to test this latest patchset again, Monday and Tuesday.
>>>> I'll try to assess stability and provide some performance numbers.  
>>>
>>> Performance numbers:
>>>
>>>  14378479 pkt/s = XDP_DROP without touching memory
>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
>>>
>>> The performance drop between xdp2 and xdp_redirect was expected due
>>> to the HW-tailptr flush per packet, which is costly.
>>>
>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
>>>
>>> The performance drop between xdp2 and xdp_redirect_map is higher than
>>> I expected, which is not good!  The avoidance of the tailptr flush per
>>> packet was expected to give a higher boost.  The cost increased by
>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
>>> approx 160 cycles).
>>>
>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
>>>
>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
>>> but I was actually expecting that the added code could "hide" behind
>>> these cache-misses.
>>>
>>> I'm somewhat surprised to see this large a performance drop.
>>>   
>>
>> Yep, although there is room for optimizations in the code path for sure. And
>> 5Mpps is not horrible; my preference is to get this series in, plus any
>> small optimizations we come up with while the merge window is closed. Then
>> follow-up patches can do optimizations.
> 
> IMHO 5Mpps is a very bad number for XDP.
> 
>> One easy optimization is to get rid of the atomic bitops. They are not needed
>> here; we have a per-cpu unsigned long. Another easy one would be to move
>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
>> and flush ops on the net device in the hotpath really should be done in the
>> slow path.
> 
> I'm already running with a similar patch as below, but it
> (surprisingly) only gave me a 3 ns improvement.  I also tried a
> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
> 

Ah OK good, do the above numbers use both the bitops changes and the
prefetchw()?

> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
> has DDIO ... I have high hopes for this, as the major bottleneck on
> this i7-4790K CPU @ 4.00GHz is clearly cache-misses.
> 
> Something is definitely wrong on this CPU, as perf stats show a very
> bad utilization of the CPU pipeline with 0.89 insn per cycle.
> 

Interesting, the E5-1650 numbers will be good to know. If you have the
perf trace, posting it might help track down some hot spots.

.John

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 03/12] xdp: add bpf_redirect helper function
  2017-07-11 14:09       ` Andy Gospodarek
@ 2017-07-11 18:38         ` John Fastabend
  2017-07-11 19:38           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-11 18:38 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Saeed Mahameed, netdev, David Miller, Jesper Dangaard Brouer,
	Daniel Borkmann, Alexei Starovoitov

On 07/11/2017 07:09 AM, Andy Gospodarek wrote:
> On Mon, Jul 10, 2017 at 1:23 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 07/09/2017 06:37 AM, Saeed Mahameed wrote:
>>>
>>>
>>> On 7/7/2017 8:35 PM, John Fastabend wrote:
>>>> This adds support for a bpf_redirect helper function to the XDP
>>>> infrastructure. For now this only supports redirecting to the egress
>>>> path of a port.
>>>>
>>>> In order to support drivers handling a xdp_buff natively this patch
>>>> uses a new ndo operation ndo_xdp_xmit() that pushes a xdp_buff
>>>> to the specified device.
>>>>
>>>> If the program specifies either (a) an unknown device or (b) a device
>>>> that does not support the operation a BPF warning is thrown and the
>>>> XDP_ABORTED error code is returned.
>>>>
>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>> ---
>>
>> [...]
>>
>>>>
>>>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
>>>> +{
>>>> +    if (dev->netdev_ops->ndo_xdp_xmit) {
>>>> +        dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
>>>
>>> Hi John,
>>>
>>> I have some concern here regarding synchronizing between the
>>> redirecting device and the target device:
>>>
>>> if the target device's NAPI is also doing XDP_TX on the same XDP TX
>>> ring which this NDO might be redirecting xdp packets into the same
>>> ring, there would be a race accessing this ring resources (buffers
>>> and descriptors). Maybe you addressed this issue in the device driver
>>> implementation of this ndo or with some NAPI tricks/assumptions, I
>>> guess we have the same issue for if you run the same program to
>>> redirect traffic from multiple netdevices into one netdevice, how do
>>> you synchronize accessing this TX ring ?
>>
>> The implementation uses a per cpu TX ring to resolve these races. And
>> the pair of driver interface API calls, xdp_do_redirect() and xdp_do_flush_map()
>> must be completed in a single poll() handler.
>>
>> This comment was included in the header file to document this,
>>
>> /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
>>  * same cpu context. Further for best results no more than a single map
>>  * for the do_redirect/do_flush pair should be used. This limitation is
>>  * because we only track one map and force a flush when the map changes.
>>  * This does not appear to be a real limitation for existing software.
>>  */
>>
>> In general some documentation about implementing XDP would probably be
>> useful to add in Documentation/networking but this IMO goes beyond just
>> this patch series.
>>
>>>
>>> Maybe we need some clear guidelines in this ndo documentation stating
>>> how to implement this ndo and what are the assumptions on those XDP
>>> TX redirect rings or from which context this ndo can run.
>>>
>>> can you please elaborate.
>>
>> I think the best implementation is to use a per cpu TX ring as I did in
>> this series. If your device is limited by the number of queues for some
>> reason some other scheme would need to be devised. Unfortunately, the only
>> thing I've come up for this case (using only this series) would both impact
>> performance and make the code complex.
>>
>> A nice solution might be to constrain networking "tasks" to only a subset
>> of cores. For 64+ core systems this might be a good idea. It would allow
>> avoiding locking using per_cpu logic but also avoid networking consuming
>> slices of every core in the system. As core count goes up I think we will
>> eventually need to address this. I believe Eric was thinking along these
>> lines with his netconf talk iirc. Obviously this work is way outside the
>> scope of this series though.
> 
> I agree that it is outside the scope of this series, but I think it is
> important to consider the impact of the output queue selection in both
> a heterogeneous and homogeneous driver setup, and how tx could be
> optimized or even made more reliable; I think that was
> part of Saeed's point.
> 
> I got base redirect support for bnxt_en working yesterday, but for it
> and other drivers that do not necessarily create a ring/queue per core
> like ixgbe there is probably a bit more work to do in each driver to
> properly track output tx rings/queues than what you have done with
> ixgbe.
> 

The problem, in my mind at least, is if you do not have a ring per core,
how does the locking work? I don't see any good way to do this outside
of locking, which I was trying to avoid.
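
A rough sketch of the locked fallback (every name here is hypothetical,
not from the series or any driver):

	/* ndo_xdp_xmit for a driver whose XDP TX rings may be shared
	 * between CPUs; the spinlock serializes redirected packets
	 * against the ring's own XDP_TX use.
	 */
	static int example_ndo_xdp_xmit(struct net_device *dev,
					struct xdp_buff *xdp)
	{
		struct example_ring *ring = example_pick_xdp_ring(dev);
		int ret;

		spin_lock(&ring->tx_lock);
		ret = example_xmit_xdp_buff(ring, xdp);	/* assumed helper */
		spin_unlock(&ring->tx_lock);
		return ret;
	}

The obvious cost is the lock in the hotpath, which is exactly what the
per-cpu ring design avoids.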

.John

>>
>>
>>> Thanks,
>>> Saeed.
>>>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 18:01           ` Jesper Dangaard Brouer
  2017-07-11 18:29             ` John Fastabend
@ 2017-07-11 18:44             ` Jesper Dangaard Brouer
  2017-07-11 18:56               ` John Fastabend
  1 sibling, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-11 18:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Tue, 11 Jul 2017 20:01:36 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 11 Jul 2017 10:48:29 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
> > On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:  
> > > On Sat, 8 Jul 2017 21:06:17 +0200
> > > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > >     
> > >> My plan is to test this latest patchset again, Monday and Tuesday.
> > >> I'll try to assess stability and provide some performance numbers.    
> > > 
> > > Performance numbers:
> > > 
> > >  14378479 pkt/s = XDP_DROP without touching memory
> > >   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
> > >   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> > >   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
> > >   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> > > 
> > > The performance drop between xdp2 and xdp_redirect was expected due
> > > to the HW-tailptr flush per packet, which is costly.
> > > 
> > >  (1/6344472-1/4595574)*10^9 = -59.98 ns
> > > 
> > > The performance drop between xdp2 and xdp_redirect_map is higher than
> > > I expected, which is not good!  The avoidance of the tailptr flush per
> > > packet was expected to give a higher boost.  The cost increased by
> > > 40 ns, which is too high compared to the code added (on a 4GHz machine
> > > approx 160 cycles).
> > > 
> > >  (1/6344472-1/5066243)*10^9 = -39.77 ns
> > > 
> > > This system doesn't have DDIO, thus we are stalling on cache-misses,
> > > but I was actually expecting that the added code could "hide" behind
> > > these cache-misses.
> > > 
> > > I'm somewhat surprised to see this large a performance drop.
> > >     
> > 
> > Yep, although there is room for optimizations in the code path for sure. And
> > 5Mpps is not horrible; my preference is to get this series in, plus any
> > small optimizations we come up with while the merge window is closed. Then
> > follow-up patches can do optimizations.  
> 
> IMHO 5Mpps is a very bad number for XDP.
> 
> > One easy optimization is to get rid of the atomic bitops. They are not needed
> > here; we have a per-cpu unsigned long. Another easy one would be to move
> > some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
> > and flush ops on the net device in the hotpath really should be done in the
> > slow path.  
> 
> I'm already running with a similar patch as below, but it
> (surprisingly) only gave me a 3 ns improvement.  I also tried a
> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
> 
> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
> has DDIO ... I have high hopes for this, as the major bottleneck on
> this i7-4790K CPU @ 4.00GHz is clearly cache-misses.
> 
> Something is definitely wrong on this CPU, as perf stats show a very
> bad utilization of the CPU pipeline with 0.89 insn per cycle.

Wow, getting DDIO working and avoiding the cache-miss was really
_the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
optimization)

13,939,674 pkt/s = XDP_DROP without touching memory
14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
 7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap

Surprisingly, on this DDIO capable CPU it is slightly slower NOT to
read packet memory.

The large performance gap to xdp_redirect is due to the tailptr flush,
which really shows up on this system.  The CPU efficiency is 1.36 insn
per cycle, which for the map variant is 2.15 insn per cycle.

 Gap (1/13221812-1/7596576)*10^9 = -56 ns

The xdp_redirect_map performance is really really good, almost 10G
wirespeed on a single CPU!!!  This is amazing, and we know that this
code is not even optimal yet.  The performance difference to xdp2 is
only around 1 ns.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


 [jbrouer@E5-1650v4 bpf]$ sudo ./xdp1 6
 proto 17:   11919302 pkt/s
 proto 17:   14281659 pkt/s
 proto 17:   14290650 pkt/s
 proto 17:   14283120 pkt/s
 proto 17:   14303023 pkt/s
 proto 17:   14299496 pkt/s

 [jbrouer@E5-1650v4 bpf]$ sudo ./xdp2 6
 proto 0:          1 pkt/s
 proto 17:    3225455 pkt/s
 proto 17:   13266772 pkt/s
 proto 17:   13221812 pkt/s
 proto 17:   13222200 pkt/s
 proto 17:   13225508 pkt/s
 proto 17:   13226274 pkt/s

 [jbrouer@E5-1650v4 bpf]$ sudo ./xdp_redirect 6 6
 ifindex 6:      66040 pkt/s
 ifindex 6:    7029143 pkt/s
 ifindex 6:    7596576 pkt/s
 ifindex 6:    7598499 pkt/s
 ifindex 6:    7597025 pkt/s
 ifindex 6:    7598462 pkt/s

 [jbrouer@E5-1650v4 bpf]$ sudo ./xdp_redirect_map 6 6
 map[0] (vports) = 4, map[1] (map) = 5, map[2] (count) = 0
 ifindex 6:      95429 pkt/s
 ifindex 6:   12156600 pkt/s
 ifindex 6:   13058435 pkt/s
 ifindex 6:   13058515 pkt/s
 ifindex 6:   13059213 pkt/s
 ifindex 6:   13058322 pkt/s
 ifindex 6:   13059342 pkt/s

 [jbrouer@E5-1650v4 prototype-kernel]$
  sudo ./xdp_bench01_mem_access_cost --dev ixgbe2 --action XDP_DROP
 XDP_action   pps        pps-human-readable mem      
 XDP_DROP     0          0                  no_touch 
 XDP_DROP     1          1                  no_touch 
 XDP_DROP     11959667   11,959,667         no_touch 
 XDP_DROP     13939674   13,939,674         no_touch 
 XDP_DROP     13954549   13,954,549         no_touch 
 XDP_DROP     13953897   13,953,897         no_touch 
 XDP_DROP     13963531   13,963,531         no_touch 

 [jbrouer@E5-1650v4 prototype-kernel]$
  sudo ./xdp_bench01_mem_access_cost --dev ixgbe2 --action XDP_DROP --read
 XDP_action   pps        pps-human-readable mem      
 XDP_DROP     0          0                  read     
 XDP_DROP     0          0                  read     
 XDP_DROP     0          0                  read     
 XDP_DROP     8611099    8,611,099          read     
 XDP_DROP     14300230   14,300,230         read     
 XDP_DROP     14293416   14,293,416         read     
 XDP_DROP     14297247   14,297,247         read     
 XDP_DROP     14300563   14,300,563         read     
 XDP_DROP     14299873   14,299,873         read     
 ^CInterrupted: Removing XDP program on ifindex:6 device:ixgbe2

 [jbrouer@E5-1650v4 prototype-kernel]$
 sudo ./xdp_bench01_mem_access_cost --dev ixgbe2 --action XDP_TX --swap
 XDP_action   pps        pps-human-readable mem      
 XDP_TX       1          1                  swap_mac 
 XDP_TX       3007657    3,007,657          swap_mac 
 XDP_TX       13322885   13,322,885         swap_mac 
 XDP_TX       13200845   13,200,845         swap_mac 
 XDP_TX       13189829   13,189,829         swap_mac 
 XDP_TX       13197952   13,197,952         swap_mac 
 XDP_TX       13198856   13,198,856         swap_mac 
 ^CInterrupted: Removing XDP program on ifindex:6 device:ixgbe2

Normal xdp_redirect:

 [jbrouer@localhost ~]$ sudo perf stat -C10 -e L1-icache-load-misses -e cycles:k -e  instructions:k -e cache-misses:k -e   cache-references:k  -e LLC-store-misses:k -e LLC-store -e LLC-load-misses:k -e  LLC-load -e L1-dcache-load-misses -e L1-dcache-loads -e L1-dcache-stores -e ref-cycles -e bus-cycles -r 3 sleep 1

 Performance counter stats for 'CPU(s) 10' (3 runs):

           105,645      L1-icache-load-misses                                         ( +-  0.90% )  (35.48%)
     3,790,481,191      cycles:k                                                      ( +-  0.00% )  (42.67%)
     5,159,580,049      instructions:k            #    1.36  insn per cycle           ( +-  0.02% )  (49.87%)
               931      cache-misses:k            #    0.004 % of all cache refs      ( +- 31.80% )  (49.97%)
        21,394,789      cache-references:k                                            ( +-  0.03% )  (50.07%)
                 0      LLC-store-misses                                              (43.07%)
           840,689      LLC-store                                                     ( +-  0.13% )  (43.16%)
                 0      LLC-load-misses                                               (14.37%)
         8,031,535      LLC-load                                                      ( +-  0.02% )  (14.27%)
        42,211,992      L1-dcache-load-misses     #    2.49% of all L1-dcache hits    ( +-  0.01% )  (21.36%)
     1,692,701,894      L1-dcache-loads                                               ( +-  0.02% )  (28.46%)
       922,985,760      L1-dcache-stores                                              ( +-  0.02% )  (28.37%)
     3,591,805,402      ref-cycles                                                    ( +-  0.00% )  (35.47%)
        99,762,378      bus-cycles                                                    ( +-  0.00% )  (35.47%)

       1.000964284 seconds time elapsed                                          ( +-  0.01% )


xdp_redirect_map::

 [jbrouer@localhost ~]$ sudo perf stat -C10 -e L1-icache-load-misses -e cycles:k -e  instructions:k -e cache-misses:k -e   cache-references:k  -e LLC-store-misses:k -e LLC-store -e LLC-load-misses:k -e  LLC-load -e L1-dcache-load-misses -e L1-dcache-loads -e L1-dcache-stores -e ref-cycles -e bus-cycles -r 3 sleep 1

 Performance counter stats for 'CPU(s) 10' (3 runs):

           103,113      L1-icache-load-misses                                         ( +- 11.60% )  (35.48%)
     3,789,133,467      cycles:k                                                      ( +-  0.03% )  (42.66%)
     8,152,033,594      instructions:k            #    2.15  insn per cycle           ( +-  9.00% )  (49.85%)
             1,414      cache-misses:k            #    0.004 % of all cache refs      ( +- 21.42% )  (49.95%)
        32,480,603      cache-references:k                                            ( +-  8.63% )  (50.05%)
                 0      LLC-store-misses                                              (43.06%)
           786,799      LLC-store                                                     ( +-  1.57% )  (43.14%)
                67      LLC-load-misses                                               ( +-100.00% )  (14.37%)
        12,445,529      LLC-load                                                      ( +-  9.05% )  (14.28%)
        77,013,768      L1-dcache-load-misses     #    2.83% of all L1-dcache hits    ( +-  9.07% )  (21.38%)
     2,725,676,877      L1-dcache-loads                                               ( +-  8.98% )  (28.47%)
     1,566,087,361      L1-dcache-stores                                              ( +-  8.95% )  (28.39%)
     3,590,481,746      ref-cycles                                                    ( +-  0.03% )  (35.48%)
        99,725,522      bus-cycles                                                    ( +-  0.03% )  (35.47%)

       1.000920909 seconds time elapsed                                          ( +-  0.00% )

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 18:44             ` Jesper Dangaard Brouer
@ 2017-07-11 18:56               ` John Fastabend
  2017-07-11 19:19                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-11 18:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm

On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 20:01:36 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
>> On Tue, 11 Jul 2017 10:48:29 -0700
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>
>>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:  
>>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>>     
>>>>> My plan is to test this latest patchset again, Monday and Tuesday.
>>>>> I'll try to assess stability and provide some performance numbers.    
>>>>
>>>> Performance numbers:
>>>>
>>>>  14378479 pkt/s = XDP_DROP without touching memory
>>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
>>>>
>>>> The performance drop between xdp2 and xdp_redirect was expected due
>>>> to the HW-tailptr flush per packet, which is costly.
>>>>
>>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
>>>>
>>>> The performance drop between xdp2 and xdp_redirect_map is higher than
>>>> I expected, which is not good!  The avoidance of the tailptr flush per
>>>> packet was expected to give a higher boost.  The cost increased by
>>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
>>>> approx 160 cycles).
>>>>
>>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
>>>>
>>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
>>>> but I was actually expecting that the added code could "hide" behind
>>>> these cache-misses.
>>>>
>>>> I'm somewhat surprised to see this large a performance drop.
>>>>     
>>>
>>> Yep, although there is room for optimizations in the code path for sure. And
>>> 5Mpps is not horrible; my preference is to get this series in, plus any
>>> small optimizations we come up with while the merge window is closed. Then
>>> follow-up patches can do optimizations.  
>>
>> IMHO 5Mpps is a very bad number for XDP.
>>
>>> One easy optimization is to get rid of the atomic bitops. They are not needed
>>> here; we have a per-cpu unsigned long. Another easy one would be to move
>>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
>>> and flush ops on the net device in the hotpath really should be done in the
>>> slow path.  
>>
>> I'm already running with a similar patch as below, but it
>> (surprisingly) only gave me a 3 ns improvement.  I also tried a
>> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
>>
>> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
>> has DDIO ... I have high hopes for this, as the major bottleneck on
>> this i7-4790K CPU @ 4.00GHz is clearly cache-misses.
>>
>> Something is definitely wrong on this CPU, as perf stats show a very
>> bad utilization of the CPU pipeline with 0.89 insn per cycle.
> 
> Wow, getting DDIO working and avoiding the cache-miss, was really
> _the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
> really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
> optimization)
> 

Very nice :) this was with the prefetchw() removed, right?

> 13,939,674 pkt/s = XDP_DROP without touching memory
> 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
> 13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>  7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
> 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap
> 
> Surprisingly, on this DDIO-capable CPU it is slightly slower NOT to
> read packet memory.
> 
> The large performance gap to xdp_redirect is due to the tailptr flush,
> which really shows up on this system.  The CPU efficiency is 1.36 insn
> per cycle, while the map variant reaches 2.15 insn per cycle.
> 
>  Gap (1/13221812-1/7596576)*10^9 = -56 ns
> 
> The xdp_redirect_map performance is really really good, almost 10G
> wirespeed on a single CPU!!!  This is amazing, and we know that this
> code is not even optimal yet.  The performance difference to xdp2 is
> only around 1 ns.
> 

Great, yeah there are some more likely()/unlikely() hints we could add, and
we could also remove some of the checks in the hotpath, etc.
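
For reference, the bitops change under discussion amounts to roughly the
sketch below. It assumes the devmap keeps a per-cpu bitmap of ports that
need a tailptr flush; the field names are illustrative, not necessarily
the exact ones in this series:

	/* The flush bitmap is per-cpu and only touched from the local
	 * NAPI context, so the atomic (LOCK-prefixed) set_bit() can be
	 * replaced with the cheaper, non-atomic __set_bit() variant.
	 */
	unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);

	__set_bit(bit, bitmap);		/* was: set_bit(bit, bitmap); */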

Thanks for doing this!

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 18:56               ` John Fastabend
@ 2017-07-11 19:19                 ` Jesper Dangaard Brouer
  2017-07-11 19:37                   ` John Fastabend
  0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-11 19:19 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer, Andi Kleen

On Tue, 11 Jul 2017 11:56:21 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 11 Jul 2017 20:01:36 +0200
> > Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >   
> >> On Tue, 11 Jul 2017 10:48:29 -0700
> >> John Fastabend <john.fastabend@gmail.com> wrote:
> >>  
> >>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:    
> >>>> On Sat, 8 Jul 2017 21:06:17 +0200
> >>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >>>>       
> >>>>> My plan is to test this latest patchset again, Monday and Tuesday.
> >>>>> I'll try to assess stability and provide some performance numbers.      
> >>>>
> >>>> Performance numbers:
> >>>>
> >>>>  14378479 pkt/s = XDP_DROP without touching memory
> >>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
> >>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> >>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
> >>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> >>>>
> >>>> The performance drop between xdp2 and xdp_redirect was expected, due
> >>>> to the HW-tailptr flush per packet, which is costly.
> >>>>
> >>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
> >>>>
> >>>> The performance drop between xdp2 and xdp_redirect_map is higher than
> >>>> I expected, which is not good!  The avoidance of the tailptr flush per
> >>>> packet was expected to give a higher boost.  The cost increased by
> >>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
> >>>> approx 160 cycles).
> >>>>
> >>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
> >>>>
> >>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
> >>>> but I was actually expecting that the added code could "hide" behind
> >>>> these cache-misses.
> >>>>
> >>>> I'm somewhat surprised to see this large a performance drop.
> >>>>       
> >>>
> >>> Yep, although there is room for optimizations in the code path for sure. And
> >>> while 5 Mpps is not horrible, my preference is to get this series in, plus any
> >>> small optimizations we come up with while the merge window is closed. Then
> >>> follow-up patches can do optimizations.    
> >>
> >> IMHO 5Mpps is a very bad number for XDP.
> >>  
> >>> One easy optimization is to get rid of the atomic bitops. They are not needed
> >>> here since we have a per-cpu unsigned long. Another easy one would be to move
> >>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
> >>> and flush ops on the net device in the hotpath really should be done in the
> >>> slow path.    
> >>
> >> I'm already running with a similar patch as below, but it
> >> (surprisingly) only gave me a 3 ns improvement.  I also tried a
> >> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
> >>
> >> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
> >> has DDIO ... I have high hopes for this, as the major bottleneck on
> >> this i7-4790K CPU @ 4.00GHz is clearly cache-misses.
> >>
> >> Something is definitely wrong on this CPU, as perf stats show a very
> >> bad utilization of the CPU pipeline with 0.89 insn per cycle.  
> > 
> > Wow, getting DDIO working and avoiding the cache-miss was really
> > _the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
> > really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
> > optimization)
> >   
> 
> Very nice :) this was with the prefetchw() removed, right?

Yes, prefetchw removed.

> > 13,939,674 pkt/s = XDP_DROP without touching memory
> > 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
> > 13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
> >  7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
> > 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap
> > 
> > Surprisingly, on this DDIO-capable CPU it is slightly slower NOT to
> > read packet memory.
> > 
> > The large performance gap to xdp_redirect is due to the tailptr flush,
> > which really shows up on this system.  The CPU efficiency is 1.36 insn
> > per cycle, while the map variant reaches 2.15 insn per cycle.
> > 
> >  Gap (1/13221812-1/7596576)*10^9 = -56 ns
> > 
> > The xdp_redirect_map performance is really really good, almost 10G
> > wirespeed on a single CPU!!!  This is amazing, and we know that this
> > code is not even optimal yet.  The performance difference to xdp2 is
> > only around 1 ns.
> >   
> 
> Great, yeah there are some more likely()/unlikely() hints we could add, and
> we could also remove some of the checks in the hotpath, etc.

Yes, plus inlining some function calls.
 
> Thanks for doing this!

I have a really strange observation... if I change the CPU powersave
settings, then the xdp_redirect_map performance drops in half!  Above
was with "tuned-adm profile powersave" (because this is a really noisy
server, and I'm sitting next to it).  I can see that the CPU under load
goes into "turbo mode", while the rest go into low-power states,
including the Hyper-Thread siblings.

If I change the profile to: # tuned-adm profile network-latency

ifindex 6:   12964879 pkt/s
ifindex 6:   12964683 pkt/s
ifindex 6:   12961497 pkt/s
ifindex 6:   11779966 pkt/s <-- change to tuned-adm profile network-latency
ifindex 6:    6853959 pkt/s
ifindex 6:    6851120 pkt/s
ifindex 6:    6856934 pkt/s
ifindex 6:    6857344 pkt/s
ifindex 6:    6857161 pkt/s

The CPU efficiency goes from 2.35 to 1.24 insn per cycle.

John, do you know some Intel people who could help me understand what
is going on?!? This is very strange...

I tried Andi's toplev tool, which AFAIK indicates that this is a
Frontend problem, e.g. in decoding the instructions?!?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


 sudo  ./toplev.py -I 2000 -l3 -a --core C4  --show-sample

 C4    FE   Frontend_Bound:                               43.45 % Slots [4.80%]
 C4    RET  Retiring:                                     49.27 % Slots [4.80%]
 C4    FE   Frontend_Bound.Frontend_Latency:              33.57 % Slots [4.67%]
 C4    RET  Retiring.Microcode_Sequencer:                  9.05 % Slots [4.67%] BN
 C4-T1 FE   Frontend_Bound.Frontend_Latency.MS_Switches:   1.40 % Clocks[4.67%]
 C4-T1      MUX:                                           4.67 %
 C4-T0 FE   Frontend_Bound.Frontend_Latency.MS_Switches:  33.20 % Clocks[4.67%]
 C4-T0      MUX:                                           4.67 %


[jbrouer@localhost pmu-tools]$ sudo turbostat   # tuned-adm profile powersave
	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz
	-	322	8.78	3668	3600
	0	61	5.12	1200	3600
	6	0	0.01	1235	3600
	1	1	0.09	1225	3600
	7	0	0.02	1212	3600
	2	0	0.02	1243	3600
	8	0	0.02	1307	3600
	3	0	0.04	1205	3600
	9	0	0.01	1207	3600
	4	0	0.00	3801	3600
	10	3800	100.00	3800	3600
	5	0	0.01	1255	3600
	11	0	0.04	1219	3600


[jbrouer@localhost pmu-tools]$ sudo turbostat   # tuned-adm profile network-latency
	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz
	-	3800	100.00	3800	3600
	0	3800	100.00	3800	3600
	6	3800	100.00	3800	3600
	1	3800	100.00	3800	3600
	7	3800	100.00	3800	3600
	2	3800	100.00	3800	3600
	8	3800	100.00	3800	3600
	3	3800	100.00	3800	3600
	9	3800	100.00	3800	3600
	4	3800	100.00	3800	3600
	10	3800	100.00	3800	3600
	5	3800	100.00	3800	3600
	11	3800	100.00	3800	3600

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 19:19                 ` Jesper Dangaard Brouer
@ 2017-07-11 19:37                   ` John Fastabend
  2017-07-16  8:23                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-11 19:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, Andi Kleen, Alexander Duyck, jesse.brandeburg

On 07/11/2017 12:19 PM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 11:56:21 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote:
>>> On Tue, 11 Jul 2017 20:01:36 +0200
>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>   
>>>> On Tue, 11 Jul 2017 10:48:29 -0700
>>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>>  
>>>>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:    
>>>>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>>>>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>>>>       
>>>>>>> My plan is to test this latest patchset again, Monday and Tuesday.
>>>>>>> I'll try to assess stability and provide some performance numbers.      
>>>>>>
>>>>>> Performance numbers:
>>>>>>
>>>>>>  14378479 pkt/s = XDP_DROP without touching memory
>>>>>>   9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>>>>>>   6344472 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>>>>   4595574 pkt/s = xdp_redirect:     XDP_REDIRECT with swap mac (simulate XDP_TX)
>>>>>>   5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
>>>>>>
>>>>>> The performance drop between xdp2 and xdp_redirect was expected, due
>>>>>> to the HW-tailptr flush per packet, which is costly.
>>>>>>
>>>>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
>>>>>>
>>>>>> The performance drop between xdp2 and xdp_redirect_map is higher than
>>>>>> I expected, which is not good!  The avoidance of the tailptr flush per
>>>>>> packet was expected to give a higher boost.  The cost increased by
>>>>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
>>>>>> approx 160 cycles).
>>>>>>
>>>>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
>>>>>>
>>>>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
>>>>>> but I was actually expecting that the added code could "hide" behind
>>>>>> these cache-misses.
>>>>>>
>>>>>> I'm somewhat surprised to see this large a performance drop.
>>>>>>       
>>>>>
>>>>> Yep, although there is room for optimizations in the code path for sure. And
>>>>> while 5 Mpps is not horrible, my preference is to get this series in, plus any
>>>>> small optimizations we come up with while the merge window is closed. Then
>>>>> follow-up patches can do optimizations.    
>>>>
>>>> IMHO 5Mpps is a very bad number for XDP.
>>>>  
>>>>> One easy optimization is to get rid of the atomic bitops. They are not needed
>>>>> here since we have a per-cpu unsigned long. Another easy one would be to move
>>>>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit
>>>>> and flush ops on the net device in the hotpath really should be done in the
>>>>> slow path.    
>>>>
>>>> I'm already running with a similar patch as below, but it
>>>> (surprisingly) only gave me a 3 ns improvement.  I also tried a
>>>> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
>>>>
>>>> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
>>>> has DDIO ... I have high hopes for this, as the major bottleneck on
>>>> this i7-4790K CPU @ 4.00GHz is clearly cache-misses.
>>>>
>>>> Something is definitely wrong on this CPU, as perf stats show a very
>>>> bad utilization of the CPU pipeline with 0.89 insn per cycle.  
>>>
>>> Wow, getting DDIO working and avoiding the cache-miss was really
>>> _the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
>>> really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
>>> optimization)
>>>   
>>
>> Very nice :) this was with the prefetchw() removed, right?
> 
> Yes, prefetchw removed.

Great, I wanted to avoid playing prefetch games for as long as possible.
At least in this initial submission.
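
For reference, the prefetch experiment mentioned above amounts to a
one-liner early in the RX processing loop, before the BPF program
writes into the packet (a sketch, not code from this series):

	prefetchw(xdp.data);	/* warm the cacheline for the mac-swap write */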

> 
>>> 13,939,674 pkt/s = XDP_DROP without touching memory
>>> 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
>>> 13,221,812 pkt/s = xdp2: XDP_TX   with swap mac (writes into pkt)
>>>  7,596,576 pkt/s = xdp_redirect:    XDP_REDIRECT with swap mac (like XDP_TX)
>>> 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap
>>>
>>> Surprisingly, on this DDIO-capable CPU it is slightly slower NOT to
>>> read packet memory.
>>>
>>> The large performance gap to xdp_redirect is due to the tailptr flush,
>>> which really shows up on this system.  The CPU efficiency is 1.36 insn
>>> per cycle, while the map variant reaches 2.15 insn per cycle.
>>>
>>>  Gap (1/13221812-1/7596576)*10^9 = -56 ns
>>>
>>> The xdp_redirect_map performance is really really good, almost 10G
>>> wirespeed on a single CPU!!!  This is amazing, and we know that this
>>> code is not even optimal yet.  The performance difference to xdp2 is
>>> only around 1 ns.
>>>   
>>
>> Great, yeah there are some more likely()/unlikely() hints we could add, and
>> we could also remove some of the checks in the hotpath, etc.
> 
> Yes, plus inlining some function calls.
>  

Yep.

>> Thanks for doing this!
> 
> I have a really strange observation... if I change the CPU powersave
> settings, then the xdp_redirect_map performance drops in half!  Above
> was with "tuned-adm profile powersave" (because this is a really noisy
> server, and I'm sitting next to it).  I can see that the CPU under load
> goes into "turbo mode", while the rest go into low-power states,
> including the Hyper-Thread siblings.
> 
> If I change the profile to: # tuned-adm profile network-latency
> 
> ifindex 6:   12964879 pkt/s
> ifindex 6:   12964683 pkt/s
> ifindex 6:   12961497 pkt/s
> ifindex 6:   11779966 pkt/s <-- change to tuned-adm profile network-latency
> ifindex 6:    6853959 pkt/s
> ifindex 6:    6851120 pkt/s
> ifindex 6:    6856934 pkt/s
> ifindex 6:    6857344 pkt/s
> ifindex 6:    6857161 pkt/s
> 
> The CPU efficiency goes from 2.35 to 1.24 insn per cycle.
> 
> John, do you know some Intel people who could help me understand what
> is going on?!? This is very strange...
> 
> I tried Andi's toplev tool, which AFAIK indicates that this is a
> Frontend problem, e.g. in decoding the instructions?!?
> 

hmm maybe Jesse or Alex have some clues. Adding them to the CC list.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 03/12] xdp: add bpf_redirect helper function
  2017-07-11 18:38         ` John Fastabend
@ 2017-07-11 19:38           ` Jesper Dangaard Brouer
  2017-07-12 11:00             ` Saeed Mahameed
  0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-11 19:38 UTC (permalink / raw)
  To: John Fastabend
  Cc: Andy Gospodarek, Saeed Mahameed, netdev, David Miller,
	Daniel Borkmann, Alexei Starovoitov, brouer

On Tue, 11 Jul 2017 11:38:33 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 07/11/2017 07:09 AM, Andy Gospodarek wrote:
> > On Mon, Jul 10, 2017 at 1:23 PM, John Fastabend
> > <john.fastabend@gmail.com> wrote:  
> >> On 07/09/2017 06:37 AM, Saeed Mahameed wrote:  
> >>>
> >>>
> >>> On 7/7/2017 8:35 PM, John Fastabend wrote:  
> >>>> This adds support for a bpf_redirect helper function to the XDP
> >>>> infrastructure. For now this only supports redirecting to the egress
> >>>> path of a port.
> >>>>
> >>>> In order to support drivers handling an xdp_buff natively, this patch
> >>>> uses a new ndo operation, ndo_xdp_xmit(), that pushes an xdp_buff
> >>>> to the specified device.
> >>>>
> >>>> If the program specifies either (a) an unknown device or (b) a device
> >>>> that does not support the operation a BPF warning is thrown and the
> >>>> XDP_ABORTED error code is returned.
> >>>>
> >>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> >>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
> >>>> ---  
> >>
> >> [...]
> >>  
> >>>>
> >>>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
> >>>> +{
> >>>> +    if (dev->netdev_ops->ndo_xdp_xmit) {
> >>>> +        dev->netdev_ops->ndo_xdp_xmit(dev, xdp);  
> >>>
> >>> Hi John,
> >>>
> >>> I have some concern here regarding synchronization between the
> >>> redirecting device and the target device:
> >>>
> >>> if the target device's NAPI is also doing XDP_TX on the same XDP TX
> >>> ring that this NDO might be redirecting xdp packets into, there
> >>> would be a race accessing this ring's resources (buffers and
> >>> descriptors). Maybe you addressed this issue in the device driver
> >>> implementation of this ndo or with some NAPI tricks/assumptions. I
> >>> guess we have the same issue if you run the same program to
> >>> redirect traffic from multiple netdevices into one netdevice; how do
> >>> you synchronize accessing this TX ring ?  
> >>
> >> The implementation uses a per cpu TX ring to resolve these races. And
> >> the pair of driver interface API calls, xdp_do_redirect() and xdp_do_flush_map()
> >> must be completed in a single poll() handler.
> >>
> >> This comment was included in the header file to document this,
> >>
> >> /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
> >>  * same cpu context. Further for best results no more than a single map
> >>  * for the do_redirect/do_flush pair should be used. This limitation is
> >>  * because we only track one map and force a flush when the map changes.
> >>  * This does not appear to be a real limitation for existing software.
> >>  */
> >>
> >> In general, some documentation about implementing XDP would probably be
> >> useful to add under Documentation/networking, but this IMO goes beyond just
> >> this patch series.
> >>  
> >>>
> >>> Maybe we need some clear guidelines in this ndo documentation stating
> >>> how to implement this ndo and what are the assumptions on those XDP
> >>> TX redirect rings or from which context this ndo can run.
> >>>
> >>> can you please elaborate.  
> >>
> >> I think the best implementation is to use a per cpu TX ring as I did in
> >> this series. If your device is limited by the number of queues for some
> >> reason some other scheme would need to be devised. Unfortunately, the only
> >> thing I've come up for this case (using only this series) would both impact
> >> performance and make the code complex.
> >>
> >> A nice solution might be to constrain networking "tasks" to only a subset
> >> of cores. For 64+ core systems this might be a good idea. It would allow
> >> avoiding locking using per_cpu logic but also avoid networking consuming
> >> slices of every core in the system. As core count goes up I think we will
> >> eventually need to address this. I believe Eric was thinking along these
> >> lines with his netconf talk iirc. Obviously this work is way outside the
> >> scope of this series though.  
> > 
> > I agree that it is outside the scope of this series, but I think it is
> > important to consider the impact of the output queue selection in both
> > a heterogeneous and homogeneous driver setup and how tx could be
> > optimized or even considered to be more reliable and I think that was
> > part of Saeed's point.
> > 
> > I got base redirect support for bnxt_en working yesterday, but for it
> > and other drivers that do not necessarily create a ring/queue per core
> > like ixgbe there is probably a bit more work to do in each driver to
> > properly track output tx rings/queues than what you have done with
> > ixgbe.
> >   
> 
> The problem, in my mind at least, is if you do not have a ring per core
> how does the locking work? I don't see any good way to do this outside
> of locking which I was trying to avoid.

My solution would be to queue the XDP packets in the devmap, and then
bulk xdp_xmit them to the device on flush.  Taking a lock per bulk
amortizes the cost to basically nothing.  The other advantage of this is
improving the instruction-cache (re)usage.
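
As a rough sketch of the idea (the names and the per-device lock are
made up here, this is not code from the series):

	#define XDP_BULK_SIZE 16

	struct xdp_bulk_queue {
		struct xdp_buff *q[XDP_BULK_SIZE];
		unsigned int count;
	};

	/* Flush one destination's queue: one lock round-trip and one
	 * tailptr write cover the whole bulk, instead of per packet.
	 */
	static void bq_flush(struct net_device *dev, struct xdp_bulk_queue *bq)
	{
		unsigned int i;

		spin_lock(&dev->xdp_lock);	/* hypothetical per-device lock */
		for (i = 0; i < bq->count; i++)
			dev->netdev_ops->ndo_xdp_xmit(dev, bq->q[i]);	/* error handling elided */
		spin_unlock(&dev->xdp_lock);
		bq->count = 0;
	}

	/* Called per packet from the redirect path: just enqueue, the
	 * expensive work happens once per bulk in bq_flush().
	 */
	static void bq_enqueue(struct net_device *dev, struct xdp_bulk_queue *bq,
			       struct xdp_buff *xdp)
	{
		if (bq->count == XDP_BULK_SIZE)
			bq_flush(dev, bq);
		bq->q[bq->count++] = xdp;
	}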

One thing I don't like about this patchset is the implicit requirement
it puts on drivers, that they must have a HW TX-ring queue per CPU in the
system.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 03/12] xdp: add bpf_redirect helper function
  2017-07-11 19:38           ` Jesper Dangaard Brouer
@ 2017-07-12 11:00             ` Saeed Mahameed
  0 siblings, 0 replies; 43+ messages in thread
From: Saeed Mahameed @ 2017-07-12 11:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: Andy Gospodarek, netdev, David Miller, Daniel Borkmann,
	Alexei Starovoitov



On 7/11/2017 10:38 PM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 11:38:33 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
>
>> On 07/11/2017 07:09 AM, Andy Gospodarek wrote:
>>> On Mon, Jul 10, 2017 at 1:23 PM, John Fastabend
>>> <john.fastabend@gmail.com> wrote:
>>>> On 07/09/2017 06:37 AM, Saeed Mahameed wrote:
>>>>>
>>>>>
>>>>> On 7/7/2017 8:35 PM, John Fastabend wrote:
>>>>>> This adds support for a bpf_redirect helper function to the XDP
>>>>>> infrastructure. For now this only supports redirecting to the egress
>>>>>> path of a port.
>>>>>>
>>>>>> In order to support drivers handling an xdp_buff natively, this patch
>>>>>> uses a new ndo operation, ndo_xdp_xmit(), that pushes an xdp_buff
>>>>>> to the specified device.
>>>>>>
>>>>>> If the program specifies either (a) an unknown device or (b) a device
>>>>>> that does not support the operation a BPF warning is thrown and the
>>>>>> XDP_ABORTED error code is returned.
>>>>>>
>>>>>> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
>>>>>> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>> ---
>>>>
>>>> [...]
>>>>
>>>>>>
>>>>>> +static int __bpf_tx_xdp(struct net_device *dev, struct xdp_buff *xdp)
>>>>>> +{
>>>>>> +    if (dev->netdev_ops->ndo_xdp_xmit) {
>>>>>> +        dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
>>>>>
>>>>> Hi John,
>>>>>
>>>>> I have some concern here regarding synchronization between the
>>>>> redirecting device and the target device:
>>>>>
>>>>> if the target device's NAPI is also doing XDP_TX on the same XDP TX
>>>>> ring that this NDO might be redirecting xdp packets into, there
>>>>> would be a race accessing this ring's resources (buffers and
>>>>> descriptors). Maybe you addressed this issue in the device driver
>>>>> implementation of this ndo or with some NAPI tricks/assumptions. I
>>>>> guess we have the same issue if you run the same program to
>>>>> redirect traffic from multiple netdevices into one netdevice; how do
>>>>> you synchronize accessing this TX ring ?
>>>>
>>>> The implementation uses a per cpu TX ring to resolve these races. And
>>>> the pair of driver interface API calls, xdp_do_redirect() and xdp_do_flush_map()
>>>> must be completed in a single poll() handler.
>>>>
>>>> This comment was included in the header file to document this,
>>>>
>>>> /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
>>>>  * same cpu context. Further for best results no more than a single map
>>>>  * for the do_redirect/do_flush pair should be used. This limitation is
>>>>  * because we only track one map and force a flush when the map changes.
>>>>  * This does not appear to be a real limitation for existing software.
>>>>  */
>>>>
>>>> In general, some documentation about implementing XDP would probably be
>>>> useful to add under Documentation/networking, but this IMO goes beyond just
>>>> this patch series.
>>>>
>>>>>
>>>>> Maybe we need some clear guidelines in this ndo documentation stating
>>>>> how to implement this ndo and what are the assumptions on those XDP
>>>>> TX redirect rings or from which context this ndo can run.
>>>>>
>>>>> can you please elaborate.
>>>>
>>>> I think the best implementation is to use a per cpu TX ring as I did in
>>>> this series. If your device is limited by the number of queues for some
>>>> reason, some other scheme would need to be devised. Unfortunately, the only
>>>> thing I've come up with for this case (using only this series) would both impact
>>>> performance and make the code complex.

mlx5 and mlx4 have no limitation in regard to the number of queues. My
only concern is that this looks like a very heavy assumption with some
unwanted side effects.

Is this per-cpu TX ring made only for the XDP_REDIRECT action, or is it
shared with the XDP_TX action coming from the same device?

If shared, wouldn't there be a race on preemptible systems, or when a
HW RX interrupt occurs on the target device while XDP_REDIRECT is
taking action, and might then execute an XDP_TX action at the same time
on the same ring?
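
For reference, the contract in the header comment quoted above is that
both calls stay within a single NAPI poll on one CPU. A rough sketch of
the expected driver usage, with all driver details elided and names
hypothetical:

	static int my_napi_poll(struct napi_struct *napi, int budget)
	{
		int work = 0;

		while (work < budget) {
			struct xdp_buff xdp;

			/* ... fill xdp from the next RX descriptor ... */
			switch (bpf_prog_run_xdp(xdp_prog, &xdp)) {
			case XDP_REDIRECT:
				xdp_do_redirect(dev, &xdp, xdp_prog);
				break;
			/* ... other verdicts: DROP, TX, PASS ... */
			}
			work++;
		}
		/* Exactly one flush per poll, on the same CPU. */
		xdp_do_flush_map();
		return work;
	}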

>>>>
>>>> A nice solution might be to constrain networking "tasks" to only a subset
>>>> of cores. For 64+ core systems this might be a good idea. It would allow
>>>> avoiding locking using per_cpu logic but also avoid networking consuming
>>>> slices of every core in the system. As core count goes up I think we will
>>>> eventually need to address this. I believe Eric was thinking along these
>>>> lines with his netconf talk iirc. Obviously this work is way outside the
>>>> scope of this series though.
>>>
>>> I agree that it is outside the scope of this series, but I think it is
>>> important to consider the impact of the output queue selection in both
>>> a heterogeneous and homogeneous driver setup and how tx could be
>>> optimized or even considered to be more reliable and I think that was
>>> part of Saeed's point.
>>>
>>> I got base redirect support for bnxt_en working yesterday, but for it
>>> and other drivers that do not necessarily create a ring/queue per core
>>> like ixgbe there is probably a bit more work to do in each driver to
>>> properly track output tx rings/queues than what you have done with
>>> ixgbe.
>>>
>>
>> The problem, in my mind at least, is if you do not have a ring per core
>> how does the locking work? I don't see any good way to do this outside
>> of locking which I was trying to avoid.

I also agree this is outside the scope of the series, but I also tend
to agree with Andy and Jesper: we need some kind of middle ground to at
least address 64+ core systems and queue-limited drivers/NICs, and to
improve XDP redirect reliability.

I also don't see any solution other than having a Qdisc-like layer
with some locking mechanism. It would be good practice to separate the
XDP redirect work producers from the actual target driver's XDP ring
consumers, which would remove the dependency on, and assumption of, the
per-cpu ring.

The question is: do we want this, and what is the performance impact?

>
> My solution would be to queue the XDP packets in the devmap, and then
> bulk xdp_xmit them to the device on flush.  Taking a lock per bulk
> amortizes the cost to basically nothing.  The other advantage of this is
> improving the instruction-cache (re)usage.
>
> One thing I don't like about this patchset is the implicit requirement
> it puts on drivers, that they must have a HW TX-ring queue per CPU in the
> system.
>

+1, but I am not totally against the current approach, with a future
task to improve it.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 18:26             ` John Fastabend
@ 2017-07-13 11:14               ` Jesper Dangaard Brouer
  2017-07-13 16:16                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-13 11:14 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Tue, 11 Jul 2017 11:26:54 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> On 07/11/2017 07:23 AM, Jesper Dangaard Brouer wrote:
> > On Mon, 10 Jul 2017 17:59:17 -0700
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> >> On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote:  
> >>> On Sat, 8 Jul 2017 21:06:17 +0200
> >>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >>>     
> >>>> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST)
> >>>> David Miller <davem@davemloft.net> wrote:
> >>>>    
> >>>>> From: John Fastabend <john.fastabend@gmail.com>
> >>>>> Date: Fri, 07 Jul 2017 10:48:36 -0700
> >>>>>       
> >>>>>> On 07/07/2017 10:34 AM, John Fastabend wrote:        
> >>>>>>> This series adds two new XDP helper routines bpf_redirect() and
> >>>>>>> bpf_redirect_map(). The first variant bpf_redirect() is meant
> >>>>>>> to be used the same way it is currently being used by the cls_bpf
> >>>>>>> classifier. An xdp packet will be redirected immediately when this
> >>>>>>> is called.        
> >>>>>>
> >>>>>> Also other than the typo in the title there ;) I'm going to CC
> >>>>>> the driver maintainers working on XDP (makes for a long CC list but)
> >>>>>> because we would want to try and get support in as many as possible in
> >>>>>> the next merge window.
> >>>>>>
> >>>>>> For this rev I just implemented on ixgbe because I wrote the
> >>>>>> original XDP support there. I'll volunteer to do virtio as well.        
> >>>>>
> >>>>> I went over this series a few times and it looks great to me.
> >>>>> You didn't even give me some coding style issues to pick on :-)      
> >>>>
> >>>> We (Daniel, Andy and I) have been reviewing and improving on this
> >>>> patchset the last couple of weeks ;-).  We had some stability issues,
> >>>> which is why it wasn't published earlier. My plan is to test this
> >>>> latest patchset again, Monday and Tuesday. I'll try to assess stability
> >>>> and provide some performance numbers.    
> >>>
> >>>
> >>> Damn, I thought it was stable. I have been running a lot of performance
> >>> tests, and then this just happened :-(    
> >>
> >> Thanks, I'll take a look through the code and see if I can come up with
> >> why this might happen. I haven't hit it on my tests yet though.  
> > 
> > I've figured out why this happens, and I have a fix, see patch below
> > with some comments with questions.
> >   
> 
> Awesome!
> 
> > The problem is that we can leak map_to_flush in an error path; the fix:
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 2ccd6ff09493..7f1f48668dcf 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2497,11 +2497,14 @@ int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
> >         ri->map = NULL;
> >  
> >         trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
> > -
> > +       // Q: Should we also trace "goto out" (failed lookup)?
> > +       //    like bpf_warn_invalid_xdp_redirect();  
> 
> Maybe another trace event? trace_xdp_redirect_failed()
> 
> >         return __bpf_tx_xdp(fwd, map, xdp, index);
> >  out:
> >         ri->ifindex = 0;
> > -       ri->map = NULL;
> > +       // XXX: here we could leak ri->map_to_flush, which could be
> > +       //      picked up later by xdp_do_flush_map()
> > +       xdp_do_flush_map(); /* Clears ri->map_to_flush + ri->map */  
> 
> +1 
> 
> ah, map lookup failed and we need to do the flush; nice catch.

I'm still getting crashes (but much harder to provoke), but I figured
out why.  We sort of missed one case where map_to_flush gets set: when
the ndo_xdp_xmit() call starts to fail, the ixgbe driver forgets to
call xdp_do_flush_map() if all packets in that NAPI cycle failed.  We
could blame the driver, but the clean solution is making sure that we
don't set map_to_flush when the __bpf_tx_xdp() call fails. That should
also handle the other case I fixed earlier.  I'll clean up my PoC-fix
patch, test it and provide it here.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-13 11:14               ` Jesper Dangaard Brouer
@ 2017-07-13 16:16                 ` Jesper Dangaard Brouer
  2017-07-13 17:00                   ` John Fastabend
  0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-13 16:16 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, brouer

On Thu, 13 Jul 2017 13:14:30 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> I'm still getting crashes (but much harder to provoke), but I figured
> out why.  We sort of missed one case where map_to_flush gets set: when
> the ndo_xdp_xmit() call starts to fail, the ixgbe driver forgets to
> call xdp_do_flush_map() if all packets in that NAPI cycle failed.  We
> could blame the driver, but the clean solution is making sure that we
> don't set map_to_flush when the __bpf_tx_xdp() call fails. That should
> also handle the other case I fixed earlier.  I'll clean up my PoC-fix
> patch, test it and provide it here.

I changed flow in the function to be:

int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
			struct bpf_prog *xdp_prog)
{
	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
	struct bpf_map *map = ri->map;
	u32 index = ri->ifindex;
	struct net_device *fwd;
	int err = -EINVAL;

	/* Clear the per-cpu redirect state up front, so no exit path
	 * can leave a stale ifindex/map behind for a later redirect.
	 */
	ri->ifindex = 0;
	ri->map = NULL;

	fwd = __dev_map_lookup_elem(map, index);
	if (!fwd)
		goto out;

	/* Redirecting through a different map than last time: flush
	 * the old map first.
	 */
	if (ri->map_to_flush && (ri->map_to_flush != map))
		xdp_do_flush_map();

	/* Only mark the map for flushing when the xmit succeeded, so
	 * a NAPI cycle where every xmit fails (and the driver never
	 * calls xdp_do_flush_map) cannot leave map_to_flush dangling.
	 */
	err = __bpf_tx_xdp(fwd, map, xdp, index);
	if (likely(!err))
		ri->map_to_flush = map;

out:
	trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
	return err;
}


The diff is:

diff --git a/net/core/filter.c b/net/core/filter.c
index 4ca895d6ed51..c50a7ec2cdab 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2483,26 +2483,25 @@ int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
        struct bpf_map *map = ri->map;
        u32 index = ri->ifindex;
        struct net_device *fwd;
+       int err = -EINVAL;
+
+       ri->ifindex = 0;
+       ri->map = NULL;
 
        fwd = __dev_map_lookup_elem(map, index);
        if (!fwd)
                goto out;
 
-       ri->ifindex = 0;
-
        if (ri->map_to_flush && (ri->map_to_flush != map))
                xdp_do_flush_map();
 
-       ri->map_to_flush = map;
-       ri->map = NULL;
+       err = __bpf_tx_xdp(fwd, map, xdp, index);
+       if (likely(!err))
+               ri->map_to_flush = map;
 
-       trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
-
-       return __bpf_tx_xdp(fwd, map, xdp, index);
 out:
-       ri->ifindex = 0;
-       ri->map = NULL;
-       return -EINVAL;
+       trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
+       return err;
 }
 
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-13 16:16                 ` Jesper Dangaard Brouer
@ 2017-07-13 17:00                   ` John Fastabend
  2017-07-13 18:21                     ` David Miller
  0 siblings, 1 reply; 43+ messages in thread
From: John Fastabend @ 2017-07-13 17:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm

On 07/13/2017 09:16 AM, Jesper Dangaard Brouer wrote:
> On Thu, 13 Jul 2017 13:14:30 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> 
>> I'm still getting crashes (but much harder to provoke), but I figured
>> out why.  We sort of missed one case where map_to_flush gets set: when
>> the ndo_xdp_xmit() call starts to fail, the ixgbe driver forgets to
>> call xdp_do_flush_map() if all packets in that NAPI cycle failed.  We
>> could blame the driver, but the clean solution is making sure that we
>> don't set map_to_flush when the __bpf_tx_xdp() call fails. That should
>> also handle the other case I fixed earlier.  I'll clean up my PoC-fix
>> patch, test it and provide it here.
> 
> I changed flow in the function to be:


Great, I'll merge this, the other couple of fixes, and the bitops
optimization, and hopefully then we are set. I'll post a v2 and we can
do some final checks.

Thanks!
John

> 
> int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
> 			struct bpf_prog *xdp_prog)
> {
> 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
> 	struct bpf_map *map = ri->map;
> 	u32 index = ri->ifindex;
> 	struct net_device *fwd;
> 	int err = -EINVAL;
> 
> 	/* Clear the per-cpu redirect state up front, so no exit path
> 	 * can leave a stale ifindex/map behind for a later redirect.
> 	 */
> 	ri->ifindex = 0;
> 	ri->map = NULL;
> 
> 	fwd = __dev_map_lookup_elem(map, index);
> 	if (!fwd)
> 		goto out;
> 
> 	/* Redirecting through a different map than last time: flush
> 	 * the old map first.
> 	 */
> 	if (ri->map_to_flush && (ri->map_to_flush != map))
> 		xdp_do_flush_map();
> 
> 	/* Only mark the map for flushing when the xmit succeeded, so
> 	 * a NAPI cycle where every xmit fails (and the driver never
> 	 * calls xdp_do_flush_map) cannot leave map_to_flush dangling.
> 	 */
> 	err = __bpf_tx_xdp(fwd, map, xdp, index);
> 	if (likely(!err))
> 		ri->map_to_flush = map;
> 
> out:
> 	trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
> 	return err;
> }
> 
> 
> The diff is:
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4ca895d6ed51..c50a7ec2cdab 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2483,26 +2483,25 @@ int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
>         struct bpf_map *map = ri->map;
>         u32 index = ri->ifindex;
>         struct net_device *fwd;
> +       int err = -EINVAL;
> +
> +       ri->ifindex = 0;
> +       ri->map = NULL;
>  
>         fwd = __dev_map_lookup_elem(map, index);
>         if (!fwd)
>                 goto out;
>  
> -       ri->ifindex = 0;
> -
>         if (ri->map_to_flush && (ri->map_to_flush != map))
>                 xdp_do_flush_map();
>  
> -       ri->map_to_flush = map;
> -       ri->map = NULL;
> +       err = __bpf_tx_xdp(fwd, map, xdp, index);
> +       if (likely(!err))
> +               ri->map_to_flush = map;
>  
> -       trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
> -
> -       return __bpf_tx_xdp(fwd, map, xdp, index);
>  out:
> -       ri->ifindex = 0;
> -       ri->map = NULL;
> -       return -EINVAL;
> +       trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT);
> +       return err;
>  }
>  
>  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-13 17:00                   ` John Fastabend
@ 2017-07-13 18:21                     ` David Miller
  0 siblings, 0 replies; 43+ messages in thread
From: David Miller @ 2017-07-13 18:21 UTC (permalink / raw)
  To: john.fastabend
  Cc: brouer, netdev, andy, daniel, ast, alexander.duyck, bjorn.topel,
	jakub.kicinski, ecree, sgoutham, Yuval.Mintz, saeedm

From: John Fastabend <john.fastabend@gmail.com>
Date: Thu, 13 Jul 2017 10:00:15 -0700

> On 07/13/2017 09:16 AM, Jesper Dangaard Brouer wrote:
>> On Thu, 13 Jul 2017 13:14:30 +0200
>> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>> 
>>> I'm still getting crashes (but much harder to provoke), but I figured
>>> out why.  We sort of missed one case where map_to_flush gets set: when
>>> the ndo_xdp_xmit() call starts to fail, the ixgbe driver forgets to
>>> call xdp_do_flush_map() if all packets in that NAPI cycle failed.  We
>>> could blame the driver, but the clean solution is making sure that we
>>> don't set map_to_flush when the __bpf_tx_xdp() call fails. That should
>>> also handle the other case I fixed earlier.  I'll clean up my PoC-fix
>>> patch, test it and provide it here.
>> 
>> I changed flow in the function to be:
> 
> 
> Great, I'll merge this, the other couple of fixes, and the bitops
> optimization, and hopefully then we are set. I'll post a v2 and we can
> do some final checks.

I am so looking forward to merging this, great work everyone.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-11 19:37                   ` John Fastabend
@ 2017-07-16  8:23                     ` Jesper Dangaard Brouer
  2017-07-17 17:04                       ` Jesse Brandeburg
  0 siblings, 1 reply; 43+ messages in thread
From: Jesper Dangaard Brouer @ 2017-07-16  8:23 UTC (permalink / raw)
  To: John Fastabend
  Cc: David Miller, netdev, andy, daniel, ast, alexander.duyck,
	bjorn.topel, jakub.kicinski, ecree, sgoutham, Yuval.Mintz,
	saeedm, Andi Kleen, jesse.brandeburg, brouer


On Tue, 11 Jul 2017 12:37:10 -0700 John Fastabend <john.fastabend@gmail.com> wrote:

> > I have a really strange observation... if I change the CPU powersave
> > settings, then the xdp_redirect_map performance drops in half!  Above
> > was with "tuned-adm profile powersave" (because this is a really noisy
> > server, and I'm sitting next to it).  I can see that the CPU under load
> > goes into "turbo mode", while the rest go into low-power states,
> > including the Hyper-Thread siblings.
> > 
> > If I change the profile to: # tuned-adm profile network-latency
> > 
> > ifindex 6:   12964879 pkt/s
> > ifindex 6:   12964683 pkt/s
> > ifindex 6:   12961497 pkt/s
> > ifindex 6:   11779966 pkt/s <-- change to tuned-adm profile network-latency
> > ifindex 6:    6853959 pkt/s
> > ifindex 6:    6851120 pkt/s
> > ifindex 6:    6856934 pkt/s
> > ifindex 6:    6857344 pkt/s
> > ifindex 6:    6857161 pkt/s
> > 
> > The CPU efficiency goes from 2.35 to 1.24 insn per cycle.
> > 
> > John, do you know some Intel people who could help me understand what
> > is going on?!? This is very strange...
> > 
> > I tried Andi's toplev tool, which AFAIK indicates that this is a
> > Frontend problem, e.g. in decoding the instructions?!?
> >   
> 
> hmm maybe Jesse or Alex have some clues. Adding them to the CC list.

This seems related to Hyper-Threading.  I tried disabling
hyper-threading in the BIOS, and the problem goes away.  That is, the
benchmarks are no longer affected by the CPU tuned-adm profile.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
  2017-07-16  8:23                     ` Jesper Dangaard Brouer
@ 2017-07-17 17:04                       ` Jesse Brandeburg
  0 siblings, 0 replies; 43+ messages in thread
From: Jesse Brandeburg @ 2017-07-17 17:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: John Fastabend, David Miller, netdev, andy, daniel, ast,
	alexander.duyck, bjorn.topel, jakub.kicinski, ecree, sgoutham,
	Yuval.Mintz, saeedm, Andi Kleen

On Sun, 16 Jul 2017 10:23:08 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Tue, 11 Jul 2017 12:37:10 -0700 John Fastabend <john.fastabend@gmail.com> wrote:
> 
>  [...]  
> > 
> > hmm maybe Jesse or Alex have some clues. Adding them to the CC list.  
> 
> This seems related to Hyper-Threading.  I tried disabling
> hyper-threading in the BIOS, and the problem goes away.  That is, the
> benchmarks are no longer affected by the CPU tuned-adm profile.

Wow, nice job figuring that out.  I went and took a look at the
tuned-adm documentation but didn't see anything that stood out as a
cause of the behavior you were seeing.

I suspect your toplev results, showing this is a frontend problem, mesh
nicely with the hyper-threading configuration inducing the bad behavior,
since there is still only one set of execution resources, and the
fetchers would have to stall.  I think you've rediscovered why the
forwarding / routing crowd generally turns off hyper-threading.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-07-17 17:04 UTC | newest]

Thread overview: 43+ messages
2017-07-07 17:34 [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
2017-07-07 17:34 ` [RFC PATCH 01/12] ixgbe: NULL xdp_tx rings on resource cleanup John Fastabend
2017-07-07 17:35 ` [RFC PATCH 02/12] net: xdp: support xdp generic on virtual devices John Fastabend
2017-07-07 17:35 ` [RFC PATCH 03/12] xdp: add bpf_redirect helper function John Fastabend
2017-07-09 13:37   ` Saeed Mahameed
2017-07-10 17:23     ` John Fastabend
2017-07-11 14:09       ` Andy Gospodarek
2017-07-11 18:38         ` John Fastabend
2017-07-11 19:38           ` Jesper Dangaard Brouer
2017-07-12 11:00             ` Saeed Mahameed
2017-07-07 17:35 ` [RFC PATCH 04/12] xdp: sample program for new bpf_redirect helper John Fastabend
2017-07-07 17:36 ` [RFC PATCH 05/12] net: implement XDP_REDIRECT for xdp generic John Fastabend
2017-07-07 17:36 ` [RFC PATCH 06/12] ixgbe: add initial support for xdp redirect John Fastabend
2017-07-07 17:36 ` [RFC PATCH 07/12] xdp: add trace event " John Fastabend
2017-07-07 17:37 ` [RFC PATCH 08/12] bpf: add devmap, a map for storing net device references John Fastabend
2017-07-08 18:57   ` Jesper Dangaard Brouer
2017-07-07 17:37 ` [RFC PATCH 09/12] bpf: add bpf_redirect_map helper routine John Fastabend
2017-07-07 17:37 ` [RFC PATCH 10/12] xdp: Add batching support to redirect map John Fastabend
2017-07-10 17:53   ` Jesper Dangaard Brouer
2017-07-10 17:56     ` John Fastabend
2017-07-07 17:38 ` [RFC PATCH 11/12] net: add notifier hooks for devmap bpf map John Fastabend
2017-07-07 17:38 ` [RFC PATCH 12/12] xdp: bpf redirect with map sample program John Fastabend
2017-07-07 17:48 ` [RFC PATCH 00/12] Implement XDP bpf_redirect vairants John Fastabend
2017-07-08  9:46   ` David Miller
2017-07-08 19:06     ` Jesper Dangaard Brouer
2017-07-10 18:30       ` Jesper Dangaard Brouer
2017-07-11  0:59         ` John Fastabend
2017-07-11 14:23           ` Jesper Dangaard Brouer
2017-07-11 18:26             ` John Fastabend
2017-07-13 11:14               ` Jesper Dangaard Brouer
2017-07-13 16:16                 ` Jesper Dangaard Brouer
2017-07-13 17:00                   ` John Fastabend
2017-07-13 18:21                     ` David Miller
2017-07-11 15:36       ` Jesper Dangaard Brouer
2017-07-11 17:48         ` John Fastabend
2017-07-11 18:01           ` Jesper Dangaard Brouer
2017-07-11 18:29             ` John Fastabend
2017-07-11 18:44             ` Jesper Dangaard Brouer
2017-07-11 18:56               ` John Fastabend
2017-07-11 19:19                 ` Jesper Dangaard Brouer
2017-07-11 19:37                   ` John Fastabend
2017-07-16  8:23                     ` Jesper Dangaard Brouer
2017-07-17 17:04                       ` Jesse Brandeburg
