* [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding
@ 2016-07-12  7:51 Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 01/11] bpf: add XDP prog type for early driver filter Brenden Blanco
                   ` (11 more replies)
  0 siblings, 12 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. The series starts by
introducing a new bpf program type for early packet filtering, run
before an skb has even been allocated (a minimal program sketch follows
the patch summary below).

This is then extended with the ability to modify packet data and send
the packet back out on the same port.

Patch 1 introduces the new prog type and helpers for validating the bpf
  program. A new userspace struct is defined containing only data and
  data_end as fields, with others to follow in the future.
In patch 2, create a new ndo to pass the fd to supported drivers.
In patch 3, expose a new rtnl option to userspace.
In patch 4, enable support in mlx4 driver.
In patch 5, create a sample drop and count program. With single core,
  achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
  packet data access, bpf array lookup, and increment.
In patch 6, add a page recycle facility to mlx4 rx, enabled when xdp is
  active.
In patch 7, add the XDP_TX type to bpf.h.
In patch 8, add a helper in the tx path for writing the tx_desc.
In patch 9, add support in mlx4 for packet data write and forwarding.
In patch 10, turn on packet write support in the bpf verifier.
In patch 11, add a sample program for packet write and forwarding. With
  single core, achieved ~10 Mpps rewrite and forwarding.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
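
As a point of reference, a minimal program for this hook, written against
the uapi introduced in patch 1, might look as follows. This is only a
sketch, not part of the series; the loader conventions mirror the
samples/bpf programs in patches 5 and 11 (bpf_load.c treats any section
name starting with "xdp" as BPF_PROG_TYPE_XDP), and the program name is
made up:

  #include <uapi/linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("xdp_drop_all")
  int xdp_drop_all(struct xdp_md *ctx)
  {
          /* ctx->data and ctx->data_end bound the frame; unused here. */
          return XDP_DROP;
  }

  char _license[] SEC("license") = "GPL";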

v8:
 1/11: Reduce WARN_ONCE to single line. Also, change act param of that
   function to u32 to match return type of bpf_prog_run_xdp.
 2/11: Clarify locking semantics in ndo comment.
 4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.

v7:
 Addressing two of the major discussion points: return codes and ndo.
 The rest will be taken as todo items for separate patches.

 Add an XDP_ABORTED type, which explicitly falls through to DROP. The
 same result must be taken for the default case as well, as it is now
 well-defined API behavior.

 Merge ndo_xdp_* into a single ndo. The style is similar to
 ndo_setup_tc, but with a less unidirectional naming convention. The
 IFLA parameter names are unchanged.

 TODOs:
 Add ethtool per-ring stats for aborted, default cases, maybe even drop
 and tx as well.
 Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
 Saeed.

  1/12: Add XDP_ABORTED enum, reword API comment, and update commit
   message.
  2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
    calling convention.
  3/12: Switch to ndo_xdp callback.
  4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
    ndo_xdp.
 12/12: Dropped, this will need some more work.

v6:
  2/12: drop unnecessary netif_device_present check
  4/12, 6/12, 9/12: Reorder default case statement above drop case to
    remove some copy/paste.

v5:
  0/12: Rebase and remove previous 1/13 patch
  1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
    in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
    out of bounds action is returned by the program. Add a comment to
    bpf.h denoting the undefined nature of out of bounds returns.
  2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
    ndo_xdp_attached().
  3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
    for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
  4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
    bpf_warn_invalid_xdp_action helper.
  5/12: Adjust to using the nested netlink options.
  6/12: kbuild was complaining about overflow of u16 on the tile
    architecture, so bump frag_stride to u32. The page_offset member
    that is computed from this was already u32.

v4:
  2/12: Add inline helper for calling xdp bpf prog under rcu
  3/12: Add detail to ndo comments
  5/12: Remove mlx4_call_xdp and use inline helper instead.
  6/12: Fix checkpatch complaints
  9/12: Introduce new patch 9/12 with common helper for tx_desc write
    Refactor to use common tx_desc write helper
 11/12: Fix checkpatch complaints

v3:
  Rewrite from v2 trying to incorporate feedback from multiple sources.
  Specifically, add ability to forward packets out the same port and
    allow packet modification.
  For packet forwarding, the driver reserves a dedicated set of tx rings
    for exclusive use by xdp. Upon completion, the pages on this ring are
    recycled directly back to a small per-rx-ring page cache without
    being dma unmapped.
  Use of the percpu skb is dropped in favor of a lightweight struct
    xdp_buff. The direct packet access feature is leveraged to remove
    dependence on the skb.
  The mlx4 driver implementation allocates a page-per-packet and maps it
    in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
  Naming is converted to use "xdp" instead of "phys_dev".

v2:
  1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
    Introduce enum for return values from phys_dev hook
  2/5: Move prog->type check to just before invoking ndo
    Change ndo to take a bpf_prog * instead of fd
    Add ndo_bpf_get rather than keeping a bool in the netdev struct
  3/5: Use ndo_bpf_get to fetch bool
  4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing
    mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely
    to set a bpf prog when priv->num_frags > 1
    Rename pseudo_skb to bpf_phys_dev_md
    Implement ndo_bpf_get
    Add dma sync just before invoking prog
    Check for explicit bpf return code rather than nonzero
    Remove increment of rx_dropped
  5/5: Use explicit bpf return code in example
    Update commit log with higher pps numbers

Brenden Blanco (11):
  bpf: add XDP prog type for early driver filter
  net: add ndo to setup/query xdp prog in adapter rx
  rtnl: add option for setting link xdp prog
  net/mlx4_en: add support for fast rx drop bpf program
  Add sample for adding simple drop program to link
  net/mlx4_en: add page recycle to prepare rx ring for tx support
  bpf: add XDP_TX xdp_action for direct forwarding
  net/mlx4_en: break out tx_desc write into separate function
  net/mlx4_en: add xdp forwarding and data write support
  bpf: enable direct packet data write for xdp progs
  bpf: add sample for xdp forwarding and rewrite

 drivers/infiniband/hw/mlx4/qp.c                 |  11 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  17 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 108 +++++++++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      | 120 +++++++++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 253 +++++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  31 ++-
 include/linux/filter.h                          |  18 ++
 include/linux/mlx4/qp.h                         |  18 +-
 include/linux/netdevice.h                       |  34 ++++
 include/uapi/linux/bpf.h                        |  21 ++
 include/uapi/linux/if_link.h                    |  12 ++
 kernel/bpf/verifier.c                           |  18 +-
 net/core/dev.c                                  |  33 ++++
 net/core/filter.c                               |  79 ++++++++
 net/core/rtnetlink.c                            |  64 ++++++
 samples/bpf/Makefile                            |   9 +
 samples/bpf/bpf_load.c                          |   8 +
 samples/bpf/xdp1_kern.c                         |  93 +++++++++
 samples/bpf/xdp1_user.c                         | 181 +++++++++++++++++
 samples/bpf/xdp2_kern.c                         | 114 +++++++++++
 20 files changed, 1154 insertions(+), 88 deletions(-)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c
 create mode 100644 samples/bpf/xdp2_kern.c

-- 
2.8.2


* [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12 13:14   ` Jesper Dangaard Brouer
  2016-07-12  7:51 ` [PATCH v8 02/11] net: add ndo to setup/query xdp prog in adapter rx Brenden Blanco
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. For now, only
the packet start and end pointers are exposed, and only in read mode.

An XDP program must return one of the well-known enum values; all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered.
Out-of-bounds return codes should alias to XDP_ABORTED.
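
For reference, a sketch of how a driver is expected to consume the
verdict (the function and frame parameter names here are hypothetical;
the real mlx4 hookup is in patch 4). Unknown return codes trigger the
one-time warning and are then treated exactly like XDP_ABORTED, i.e.
dropped:

  /* Hypothetical rx-path helper: returns true if the frame should
   * continue to normal skb processing.
   */
  static bool foo_run_xdp(const struct bpf_prog *prog, void *frame,
                          unsigned int len)
  {
          struct xdp_buff xdp = {
                  .data     = frame,
                  .data_end = frame + len,
          };
          u32 act = bpf_prog_run_xdp(prog, &xdp);

          switch (act) {
          case XDP_PASS:
                  return true;
          default:
                  bpf_warn_invalid_xdp_action(act);
                  /* fall through: out-of-bounds codes alias to ABORTED */
          case XDP_ABORTED:
          case XDP_DROP:
                  return false;   /* caller frees or recycles the buffer */
          }
  }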

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/filter.h   | 18 +++++++++++
 include/uapi/linux/bpf.h | 20 ++++++++++++
 kernel/bpf/verifier.c    |  1 +
 net/core/filter.c        | 79 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 118 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fc31ef..15d816a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -368,6 +368,11 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
+struct xdp_buff {
+	void *data;
+	void *data_end;
+};
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf and act_bpf programs
  */
@@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, skb);
 }
 
+static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+				   struct xdp_buff *xdp)
+{
+	u32 ret;
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)xdp);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
 	return max(sizeof(struct bpf_prog),
@@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
+void bpf_warn_invalid_xdp_action(u32 act);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 262a7e8..4282d44 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
 	BPF_PROG_TYPE_TRACEPOINT,
+	BPF_PROG_TYPE_XDP,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -437,4 +438,23 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* User return codes for XDP prog type.
+ * A valid XDP program must return one of these defined values. All other
+ * return codes are reserved for future use. Unknown return codes will result
+ * in packet drop.
+ */
+enum xdp_action {
+	XDP_ABORTED = 0,
+	XDP_DROP,
+	XDP_PASS,
+};
+
+/* user accessible metadata for XDP packet hook
+ * new fields must be added to the end of this structure
+ */
+struct xdp_md {
+	__u32 data;
+	__u32 data_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e206c21..a8d67d0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_XDP:
 		break;
 	default:
 		verbose("verifier is misconfigured\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 10c4a2f..2d770f5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+xdp_func_proto(enum bpf_func_id func_id)
+{
+	return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2436,6 +2442,44 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+				  enum bpf_access_type type)
+{
+	if (off < 0 || off >= sizeof(struct xdp_md))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static bool xdp_is_valid_access(int off, int size,
+				enum bpf_access_type type,
+				enum bpf_reg_type *reg_type)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct xdp_md, data):
+		*reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	return __is_valid_xdp_access(off, size, type);
+}
+
+void bpf_warn_invalid_xdp_action(u32 act)
+{
+	WARN_ONCE(1, "Illegal XDP return value %u, expect packet loss\n", act);
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
+
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 				      int src_reg, int ctx_off,
 				      struct bpf_insn *insn_buf,
@@ -2587,6 +2631,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+				  int src_reg, int ctx_off,
+				  struct bpf_insn *insn_buf,
+				  struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct xdp_md, data):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data));
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto		= sk_filter_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
@@ -2599,6 +2666,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.convert_ctx_access	= bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops xdp_ops = {
+	.get_func_proto		= xdp_func_proto,
+	.is_valid_access	= xdp_is_valid_access,
+	.convert_ctx_access	= xdp_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2614,11 +2687,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_SCHED_ACT,
 };
 
+static struct bpf_prog_type_list xdp_type __read_mostly = {
+	.ops	= &xdp_ops,
+	.type	= BPF_PROG_TYPE_XDP,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
+	bpf_register_prog_type(&xdp_type);
 
 	return 0;
 }
-- 
2.8.2


* [PATCH v8 02/11] net: add ndo to setup/query xdp prog in adapter rx
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 01/11] bpf: add XDP prog type for early driver filter Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 03/11] rtnl: add option for setting link xdp prog Brenden Blanco
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
filter. The single op is used for both setup and query of the xdp
program, modelled after ndo_setup_tc.
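
From a driver author's point of view the op reduces to a small command
switch. A hedged skeleton (foo_priv, foo_xdp_set and priv->prog are
placeholders; the real mlx4 implementation arrives in patch 4):

  static int foo_xdp(struct net_device *dev, struct netdev_xdp *xdp)
  {
          struct foo_priv *priv = netdev_priv(dev);

          switch (xdp->command) {
          case XDP_SETUP_PROG:
                  /* On success the driver takes ownership of xdp->prog
                   * and must bpf_prog_put() any previously attached one.
                   */
                  return foo_xdp_set(dev, xdp->prog);
          case XDP_QUERY_PROG:
                  xdp->prog_attached = !!READ_ONCE(priv->prog);
                  return 0;
          default:
                  return -EINVAL;
          }
  }

The op is then wired up through net_device_ops as ".ndo_xdp = foo_xdp".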

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/netdevice.h | 34 ++++++++++++++++++++++++++++++++++
 net/core/dev.c            | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49736a3..fab9a1c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -63,6 +63,7 @@ struct wpan_dev;
 struct mpls_dev;
 /* UDP Tunnel offloads */
 struct udp_tunnel_info;
+struct bpf_prog;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
 				    const struct ethtool_ops *ops);
@@ -799,6 +800,33 @@ struct tc_to_netdev {
 	};
 };
 
+/* These structures hold the attributes of xdp state that are being passed
+ * to the netdevice through the xdp op.
+ */
+enum xdp_netdev_command {
+	/* Set or clear a bpf program used in the earliest stages of packet
+	 * rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The callee
+	 * is responsible for calling bpf_prog_put on any old progs that are
+	 * stored. In case of error, the callee need not release the new prog
+	 * reference, but on success it takes ownership and must bpf_prog_put
+	 * when it is no longer used.
+	 */
+	XDP_SETUP_PROG,
+	/* Check if a bpf program is set on the device.  The callee should
+	 * return true if a program is currently attached and running.
+	 */
+	XDP_QUERY_PROG,
+};
+
+struct netdev_xdp {
+	enum xdp_netdev_command command;
+	union {
+		/* XDP_SETUP_PROG */
+		struct bpf_prog *prog;
+		/* XDP_QUERY_PROG */
+		bool prog_attached;
+	};
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1087,6 +1115,9 @@ struct tc_to_netdev {
  *	appropriate rx headroom value allows avoiding skb head copy on
  *	forward. Setting a negative value resets the rx headroom to the
  *	default value.
+ * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);
+ *	This function is used to set or query state related to XDP on the
+ *	netdevice. See definition of enum xdp_netdev_command for details.
  *
  */
 struct net_device_ops {
@@ -1271,6 +1302,8 @@ struct net_device_ops {
 						       struct sk_buff *skb);
 	void			(*ndo_set_rx_headroom)(struct net_device *dev,
 						       int needed_headroom);
+	int			(*ndo_xdp)(struct net_device *dev,
+					   struct netdev_xdp *xdp);
 };
 
 /**
@@ -3257,6 +3290,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_xdp_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7894e40..2a9c39f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -94,6 +94,7 @@
 #include <linux/ethtool.h>
 #include <linux/notifier.h>
 #include <linux/skbuff.h>
+#include <linux/bpf.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/busy_poll.h>
@@ -6615,6 +6616,38 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 EXPORT_SYMBOL(dev_change_proto_down);
 
 /**
+ *	dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ *	@dev: device
+ *	@fd: new program fd or negative value to clear
+ *
+ *	Set or clear a bpf program for a device
+ */
+int dev_change_xdp_fd(struct net_device *dev, int fd)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct bpf_prog *prog = NULL;
+	struct netdev_xdp xdp = {};
+	int err;
+
+	if (!ops->ndo_xdp)
+		return -EOPNOTSUPP;
+	if (fd >= 0) {
+		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	xdp.command = XDP_SETUP_PROG;
+	xdp.prog = prog;
+	err = ops->ndo_xdp(dev, &xdp);
+	if (err < 0 && prog)
+		bpf_prog_put(prog);
+
+	return err;
+}
+EXPORT_SYMBOL(dev_change_xdp_fd);
+
+/**
  *	dev_new_index	-	allocate an ifindex
  *	@net: the applicable net namespace
  *
-- 
2.8.2


* [PATCH v8 03/11] rtnl: add option for setting link xdp prog
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 01/11] bpf: add XDP prog type for early driver filter Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 02/11] net: add ndo to setup/query xdp prog in adapter rx Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Set the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
Providing a negative fd clears the program. Getting the fd back via
rtnl is not possible; therefore reading this value merely reports
whether a program is currently attached to the link.
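
The resulting attribute layout, as defined by ifla_xdp_policy below, is
a single nested container:

  IFLA_XDP (NLA_NESTED)
    IFLA_XDP_FD        (NLA_S32)  setlink: program fd, negative to clear
    IFLA_XDP_ATTACHED  (NLA_U8)   getlink: non-zero if a program is attached

The sample in patch 5 (set_link_xdp_fd() in xdp1_user.c) shows how to
build this request from userspace over a raw netlink socket.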

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/if_link.h | 12 +++++++++
 net/core/rtnetlink.c         | 64 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4285ac3..a1b5202 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -156,6 +156,7 @@ enum {
 	IFLA_GSO_MAX_SEGS,
 	IFLA_GSO_MAX_SIZE,
 	IFLA_PAD,
+	IFLA_XDP,
 	__IFLA_MAX
 };
 
@@ -843,4 +844,15 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* XDP section */
+
+enum {
+	IFLA_XDP_UNSPEC,
+	IFLA_XDP_FD,
+	IFLA_XDP_ATTACHED,
+	__IFLA_XDP_MAX,
+};
+
+#define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a9e3805..eba2b82 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -891,6 +891,16 @@ static size_t rtnl_port_size(const struct net_device *dev,
 		return port_self_size;
 }
 
+static size_t rtnl_xdp_size(const struct net_device *dev)
+{
+	size_t xdp_size = nla_total_size(1);	/* XDP_ATTACHED */
+
+	if (!dev->netdev_ops->ndo_xdp)
+		return 0;
+	else
+		return xdp_size;
+}
+
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
 				     u32 ext_filter_mask)
 {
@@ -927,6 +937,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
+	       + rtnl_xdp_size(dev) /* IFLA_XDP */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1211,6 +1222,33 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	struct netdev_xdp xdp_op = {};
+	struct nlattr *xdp;
+	int err;
+
+	if (!dev->netdev_ops->ndo_xdp)
+		return 0;
+	xdp = nla_nest_start(skb, IFLA_XDP);
+	if (!xdp)
+		return -EMSGSIZE;
+	xdp_op.command = XDP_QUERY_PROG;
+	err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
+	if (err)
+		goto err_cancel;
+	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_op.prog_attached);
+	if (err)
+		goto err_cancel;
+
+	nla_nest_end(skb, xdp);
+	return 0;
+
+err_cancel:
+	nla_nest_cancel(skb, xdp);
+	return err;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask)
@@ -1307,6 +1345,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_port_fill(skb, dev, ext_filter_mask))
 		goto nla_put_failure;
 
+	if (rtnl_xdp_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;
@@ -1392,6 +1433,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 	[IFLA_PROTO_DOWN]	= { .type = NLA_U8 },
+	[IFLA_XDP]		= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1429,6 +1471,11 @@ static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
 	[IFLA_PORT_RESPONSE]	= { .type = NLA_U16, },
 };
 
+static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
+	[IFLA_XDP_FD]		= { .type = NLA_S32 },
+	[IFLA_XDP_ATTACHED]	= { .type = NLA_U8 },
+};
+
 static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
 {
 	const struct rtnl_link_ops *ops = NULL;
@@ -2054,6 +2101,23 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_NOTIFY;
 	}
 
+	if (tb[IFLA_XDP]) {
+		struct nlattr *xdp[IFLA_XDP_MAX + 1];
+
+		err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
+				       ifla_xdp_policy);
+		if (err < 0)
+			goto errout;
+
+		if (xdp[IFLA_XDP_FD]) {
+			err = dev_change_xdp_fd(dev,
+						nla_get_s32(xdp[IFLA_XDP_FD]));
+			if (err)
+				goto errout;
+			status |= DO_SETLINK_NOTIFY;
+		}
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if (status & DO_SETLINK_NOTIFY)
-- 
2.8.2


* [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (2 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 03/11] rtnl: add option for setting link xdp prog Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12 12:02   ` Tariq Toukan
                     ` (2 more replies)
  2016-07-12  7:51 ` [PATCH v8 05/11] Add sample for adding simple drop program to link Brenden Blanco
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.

In tc/socket bpf programs, helpers linearize skb fragments as needed
when the program touches the packet data. However, in the pursuit of
speed, XDP programs will not be allowed to use these slower functions,
especially if they involve allocating an skb.

Therefore, disallow MTU settings that would produce a multi-fragment
packet that XDP programs would fail to access. Future enhancements
could increase the allowable MTU.
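
As a rough illustration of the resulting limit (a sketch; the exact
value of the first fragment size FRAG_SZ0 comes from mlx4_en.h, and
ETH_HLEN and VLAN_HLEN are the usual 14 and 4 bytes):

  MLX4_EN_EFF_MTU(mtu) = mtu + ETH_HLEN + 2 * VLAN_HLEN
                       = mtu + 14 + 8

  XDP requires MLX4_EN_EFF_MTU(mtu) <= FRAG_SZ0, so with a 1536-byte
  first fragment the largest usable MTU would be 1536 - 22 = 1514.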

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 51 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 37 +++++++++++++++++--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 +++
 3 files changed, 89 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6083775..369a2ef 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
@@ -2084,6 +2085,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 	if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_TS)
 		mlx4_en_remove_timestamp(mdev);
 
+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 	/* Detach the netdev so tasks would not attempt to access it */
 	mutex_lock(&mdev->state_lock);
 	mdev->pndev[priv->port] = NULL;
@@ -2112,6 +2116,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 		en_err(priv, "Bad MTU size:%d.\n", new_mtu);
 		return -EPERM;
 	}
+	if (priv->prog && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
+		en_err(priv, "MTU size:%d requires frags but XDP prog running",
+		       new_mtu);
+		return -EOPNOTSUPP;
+	}
 	dev->mtu = new_mtu;
 
 	if (netif_running(dev)) {
@@ -2520,6 +2529,46 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+
+	if (priv->num_frags > 1) {
+		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* This xchg is paired with READ_ONCE in the fast path, but is
+	 * also protected from itself via rtnl lock
+	 */
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+static bool mlx4_xdp_attached(struct net_device *dev)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	return !!READ_ONCE(priv->prog);
+}
+
+static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return mlx4_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = mlx4_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2548,6 +2597,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp		= mlx4_xdp,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2584,6 +2634,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp		= mlx4_xdp,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c1b3a9c..adfa123 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -743,6 +743,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -759,6 +760,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;
 
+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -835,6 +838,35 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag.
+		 */
+		if (prog) {
+			struct xdp_buff xdp;
+			dma_addr_t dma;
+			u32 act;
+
+			dma = be64_to_cpu(rx_desc->data[0].addr);
+			dma_sync_single_for_cpu(priv->ddev, dma,
+						priv->frag_info[0].frag_size,
+						DMA_FROM_DEVICE);
+
+			xdp.data = page_address(frags[0].page) +
+							frags[0].page_offset;
+			xdp.data_end = xdp.data + length;
+
+			act = bpf_prog_run_xdp(prog, &xdp);
+			switch (act) {
+			case XDP_PASS:
+				break;
+			default:
+				bpf_warn_invalid_xdp_action(act);
+			case XDP_ABORTED:
+			case XDP_DROP:
+				goto next;
+			}
+		}
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
@@ -1062,10 +1094,7 @@ static const int frag_sizes[] = {
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
-	 * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
-	 */
-	int eff_mtu = dev->mtu + ETH_HLEN + (2 * VLAN_HLEN);
+	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
 	int buf_size = 0;
 	int i = 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d39bf59..35ecfa2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -164,6 +164,10 @@ enum {
 #define MLX4_LOOPBACK_TEST_PAYLOAD (HEADER_COPY_SIZE - ETH_HLEN)
 
 #define MLX4_EN_MIN_MTU		46
+/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
+ * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
+ */
+#define MLX4_EN_EFF_MTU(mtu)	((mtu) + ETH_HLEN + (2 * VLAN_HLEN))
 #define ETH_BCAST		0xffffffffffffULL
 
 #define MLX4_EN_LOOPBACK_RETRIES	5
@@ -590,6 +594,7 @@ struct mlx4_en_priv {
 	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
+	struct bpf_prog *prog;
 
 #ifdef CONFIG_MLX4_EN_DCB
 #define MLX4_EN_DCB_ENABLED	0x3
-- 
2.8.2


* [PATCH v8 05/11] Add sample for adding simple drop program to link
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (3 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP
hook of a link. With the drop-only program, the observed single-core
rate is ~20 Mpps.

Other tests were run; for instance, without the dropcnt increment or
without reading from the packet header, the packet rate was mostly
unchanged.

$ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
proto 17:   20403027 drops/s

./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
Running... ctrl^C to stop
Device: eth4@0
Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
  5056638pps 2427Mb/sec (2427186240bps) errors: 0
Device: eth4@1
Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
  5133311pps 2463Mb/sec (2463989280bps) errors: 0
Device: eth4@2
Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
  5077431pps 2437Mb/sec (2437166880bps) errors: 0
Device: eth4@3
Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
  5043067pps 2420Mb/sec (2420672160bps) errors: 0

perf report --no-children:
 26.05%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 17.84%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
  5.52%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
  4.90%  swapper      [kernel.vmlinux]  [k] poll_idle
  4.14%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  2.78%  ksoftirqd/0  [kernel.vmlinux]  [k] __free_pages_ok
  2.57%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.51%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
  1.94%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.45%  swapper      [mlx4_en]         [k] mlx4_en_alloc_frags
  1.35%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
  1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.04%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5c5
  0.96%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c58d
  0.93%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6ee
  0.92%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6b9
  0.89%  ksoftirqd/0  [kernel.vmlinux]  [k] __alloc_pages_nodemask
  0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c686
  0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5d5
  0.78%  ksoftirqd/0  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
  0.77%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5b4
  0.77%  ksoftirqd/0  [kernel.vmlinux]  [k] net_rx_action

machine specs:
 receiver - Intel E5-1630 v3 @ 3.70GHz
 sender - Intel E5645 @ 2.40GHz
 Mellanox ConnectX-3 @40G

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile    |   4 ++
 samples/bpf/bpf_load.c  |   8 +++
 samples/bpf/xdp1_kern.c |  93 +++++++++++++++++++++++++
 samples/bpf/xdp1_user.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 286 insertions(+)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a98b780..0e4ab3a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -21,6 +21,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += xdp1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -42,6 +43,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -64,6 +66,7 @@ always += test_overhead_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
+always += xdp1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -84,6 +87,7 @@ HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
+HOSTLOADLIBES_xdp1 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 022af71..0cfda23 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -50,6 +50,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
 	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+	bool is_xdp = strncmp(event, "xdp", 3) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -66,6 +67,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_KPROBE;
 	} else if (is_tracepoint) {
 		prog_type = BPF_PROG_TYPE_TRACEPOINT;
+	} else if (is_xdp) {
+		prog_type = BPF_PROG_TYPE_XDP;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -79,6 +82,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	prog_fd[prog_cnt++] = fd;
 
+	if (is_xdp)
+		return 0;
+
 	if (is_socket) {
 		event += 6;
 		if (*event != '/')
@@ -319,6 +325,7 @@ int load_bpf_file(char *path)
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
 			    memcmp(shname_prog, "tracepoint/", 11) == 0 ||
+			    memcmp(shname_prog, "xdp", 3) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -336,6 +343,7 @@ int load_bpf_file(char *path)
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
 		    memcmp(shname, "tracepoint/", 11) == 0 ||
+		    memcmp(shname, "xdp", 3) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
new file mode 100644
index 0000000..e7dd8ac
--- /dev/null
+++ b/samples/bpf/xdp1_kern.c
@@ -0,0 +1,93 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+	struct iphdr *iph = data + nh_off;
+
+	if (iph + 1 > data_end)
+		return 0;
+	return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+	struct ipv6hdr *ip6h = data + nh_off;
+
+	if (ip6h + 1 > data_end)
+		return 0;
+	return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	u64 nh_off;
+	u32 index;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+
+	if (h_proto == htons(ETH_P_IP))
+		index = parse_ipv4(data, nh_off, data_end);
+	else if (h_proto == htons(ETH_P_IPV6))
+		index = parse_ipv6(data, nh_off, data_end);
+	else
+		index = 0;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	return rc;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
new file mode 100644
index 0000000..a5e109e
--- /dev/null
+++ b/samples/bpf/xdp1_user.c
@@ -0,0 +1,181 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+
+static int set_link_xdp_fd(int ifindex, int fd)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+
+	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
+
+static int ifindex;
+
+static void int_exit(int sig)
+{
+	set_link_xdp_fd(ifindex, -1);
+	exit(0);
+}
+
+/* simple per-protocol drop counter
+ */
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+	const unsigned int nr_keys = 256;
+	__u64 values[nr_cpus], prev[nr_keys][nr_cpus];
+	__u32 key;
+	int i;
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		sleep(interval);
+
+		for (key = 0; key < nr_keys; key++) {
+			__u64 sum = 0;
+
+			assert(bpf_lookup_elem(map_fd[0], &key, values) == 0);
+			for (i = 0; i < nr_cpus; i++)
+				sum += (values[i] - prev[key][i]);
+			if (sum)
+				printf("proto %u: %10llu pkt/s\n",
+				       key, sum / interval);
+			memcpy(prev[key], values, sizeof(values));
+		}
+	}
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (ac != 2) {
+		printf("usage: %s IFINDEX\n", argv[0]);
+		return 1;
+	}
+
+	ifindex = strtoul(argv[1], NULL, 0);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+
+	if (set_link_xdp_fd(ifindex, prog_fd[0]) < 0) {
+		printf("link set xdp fd failed\n");
+		return 1;
+	}
+
+	poll_stats(2);
+
+	return 0;
+}
-- 
2.8.2


* [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (4 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 05/11] Add sample for adding simple drop program to link Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12 12:09   ` Tariq Toukan
  2016-07-12 21:18   ` David Miller
  2016-07-12  7:51 ` [PATCH v8 07/11] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

The mlx4 driver by default allocates order-3 pages for the ring to
consume in multiple fragments. When the device has an xdp program, this
behavior will prevent tx actions since the page must be re-mapped in
TODEVICE mode, which cannot be done if the page is still shared.

Start by making the allocator configurable based on whether xdp is
running, such that order-0 pages are always used and never shared.

Since this will stress the page allocator, add a simple page cache to
each rx ring. Pages in the cache are left dma-mapped, and in drop-only
stress tests the page allocator is eliminated from the perf report.
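
For sizing intuition (assuming the standard NAPI_POLL_WEIGHT of 64):

  MLX4_EN_CACHE_SIZE = 2 * NAPI_POLL_WEIGHT = 2 * 64 = 128

i.e. each rx ring keeps up to two napi budgets' worth of dma-mapped
pages on hand, so a stream of dropped or forwarded frames never has to
go back to the page allocator.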

Note that setting an xdp program will now require the rings to be
reconfigured.

Before:
 26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
  6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
  4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
  2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq

After:
 31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
  8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
  7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
  6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
  4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
  4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
  3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
  2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
  1.37%  swapper      [kernel.vmlinux]       [k] menu_select
  1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 46 +++++++++++++++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      | 69 ++++++++++++++++++++++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    | 12 ++++-
 4 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 51a2e82..d3d51fa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -47,7 +47,7 @@
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0xffff)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0xffffffff)
 
-static int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
+int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
 {
 	int i;
 	int err = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 369a2ef..252c893d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2532,21 +2532,59 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
 	struct bpf_prog *old_prog;
+	int port_up = 0;
+	int err;
+
+	/* No need to reconfigure buffers when simply swapping the
+	 * program for a new one.
+	 */
+	if (READ_ONCE(priv->prog) && prog) {
+		/* This xchg is paired with READ_ONCE in the fast path, but is
+		 * also protected from itself via rtnl lock
+		 */
+		old_prog = xchg(&priv->prog, prog);
+		if (old_prog)
+			bpf_prog_put(old_prog);
+		return 0;
+	}
 
 	if (priv->num_frags > 1) {
 		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
 		return -EOPNOTSUPP;
 	}
 
-	/* This xchg is paired with READ_ONCE in the fast path, but is
-	 * also protected from itself via rtnl lock
-	 */
+	mutex_lock(&mdev->state_lock);
+	if (priv->port_up) {
+		port_up = 1;
+		mlx4_en_stop_port(dev, 1);
+	}
+
+	mlx4_en_free_resources(priv);
+
 	old_prog = xchg(&priv->prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
-	return 0;
+	err = mlx4_en_alloc_resources(priv);
+	if (err) {
+		en_err(priv, "Failed reallocating port resources\n");
+		goto out;
+	}
+	if (port_up) {
+		err = mlx4_en_start_port(dev);
+		if (err)
+			en_err(priv, "Failed starting port\n");
+	}
+
+	err = mlx4_en_moderation_update(priv);
+
+out:
+	if (err)
+		priv->prog = NULL;
+	mutex_unlock(&mdev->state_lock);
+	return err;
 }
 
 static bool mlx4_xdp_attached(struct net_device *dev)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index adfa123..f26306c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -57,7 +57,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 	struct page *page;
 	dma_addr_t dma;
 
-	for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
+	for (order = frag_info->order; ;) {
 		gfp_t gfp = _gfp;
 
 		if (order)
@@ -70,7 +70,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 			return -ENOMEM;
 	}
 	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
-			   PCI_DMA_FROMDEVICE);
+			   frag_info->dma_dir);
 	if (dma_mapping_error(priv->ddev, dma)) {
 		put_page(page);
 		return -ENOMEM;
@@ -124,7 +124,8 @@ out:
 	while (i--) {
 		if (page_alloc[i].page != ring_alloc[i].page) {
 			dma_unmap_page(priv->ddev, page_alloc[i].dma,
-				page_alloc[i].page_size, PCI_DMA_FROMDEVICE);
+				page_alloc[i].page_size,
+				priv->frag_info[i].dma_dir);
 			page = page_alloc[i].page;
 			/* Revert changes done by mlx4_alloc_pages */
 			page_ref_sub(page, page_alloc[i].page_size /
@@ -145,7 +146,7 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
 
 	if (next_frag_end > frags[i].page_size)
 		dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
-			       PCI_DMA_FROMDEVICE);
+			       frag_info->dma_dir);
 
 	if (frags[i].page)
 		put_page(frags[i].page);
@@ -176,7 +177,8 @@ out:
 
 		page_alloc = &ring->page_alloc[i];
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-			       page_alloc->page_size, PCI_DMA_FROMDEVICE);
+			       page_alloc->page_size,
+			       priv->frag_info[i].dma_dir);
 		page = page_alloc->page;
 		/* Revert changes done by mlx4_alloc_pages */
 		page_ref_sub(page, page_alloc->page_size /
@@ -201,7 +203,7 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
 		       i, page_count(page_alloc->page));
 
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-				page_alloc->page_size, PCI_DMA_FROMDEVICE);
+				page_alloc->page_size, frag_info->dma_dir);
 		while (page_alloc->page_offset + frag_info->frag_stride <
 		       page_alloc->page_size) {
 			put_page(page_alloc->page);
@@ -244,6 +246,12 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc *frags = ring->rx_info +
 					(index << priv->log_rx_info);
 
+	if (ring->page_cache.index > 0) {
+		frags[0] = ring->page_cache.buf[--ring->page_cache.index];
+		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
+		return 0;
+	}
+
 	return mlx4_en_alloc_frags(priv, rx_desc, frags, ring->page_alloc, gfp);
 }
 
@@ -502,12 +510,39 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 	}
 }
 
+/* When the rx ring is running in page-per-packet mode, a released frame can go
+ * directly into a small cache, to avoid unmapping or touching the page
+ * allocator. In bpf prog performance scenarios, buffers are either forwarded
+ * or dropped, never converted to skbs, so every page can come directly from
+ * this cache when it is sized to be a multiple of the napi budget.
+ */
+bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
+			struct mlx4_en_rx_alloc *frame)
+{
+	struct mlx4_en_page_cache *cache = &ring->page_cache;
+
+	if (cache->index >= MLX4_EN_CACHE_SIZE)
+		return false;
+
+	cache->buf[cache->index++] = *frame;
+	return true;
+}
+
 void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 			     struct mlx4_en_rx_ring **pring,
 			     u32 size, u16 stride)
 {
 	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_rx_ring *ring = *pring;
+	int i;
+
+	for (i = 0; i < ring->page_cache.index; i++) {
+		struct mlx4_en_rx_alloc *frame = &ring->page_cache.buf[i];
+
+		dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
+			       priv->frag_info[0].dma_dir);
+		put_page(frame->page);
+	}
 
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
 	vfree(ring->rx_info);
@@ -863,6 +898,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 				bpf_warn_invalid_xdp_action(act);
 			case XDP_ABORTED:
 			case XDP_DROP:
+				if (mlx4_en_rx_recycle(ring, frags))
+					goto consumed;
 				goto next;
 			}
 		}
@@ -1018,6 +1055,7 @@ next:
 		for (nr = 0; nr < priv->num_frags; nr++)
 			mlx4_en_free_frag(priv, frags, nr);
 
+consumed:
 		++cq->mcq.cons_index;
 		index = (cq->mcq.cons_index) & ring->size_mask;
 		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
@@ -1093,19 +1131,34 @@ static const int frag_sizes[] = {
 
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
+	enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
+	int order = MLX4_EN_ALLOC_PREFER_ORDER;
+	u32 align = SMP_CACHE_BYTES;
 	int buf_size = 0;
 	int i = 0;
 
+	/* bpf requires buffers to be set up as 1 packet per page.
+	 * This only works when num_frags == 1.
+	 */
+	if (priv->prog) {
+		/* This will gain efficient xdp frame recycling at the expense
+		 * of more costly truesize accounting
+		 */
+		align = PAGE_SIZE;
+		order = 0;
+	}
+
 	while (buf_size < eff_mtu) {
+		priv->frag_info[i].order = order;
 		priv->frag_info[i].frag_size =
 			(eff_mtu > buf_size + frag_sizes[i]) ?
 				frag_sizes[i] : eff_mtu - buf_size;
 		priv->frag_info[i].frag_prefix_size = buf_size;
 		priv->frag_info[i].frag_stride =
-				ALIGN(priv->frag_info[i].frag_size,
-				      SMP_CACHE_BYTES);
+				ALIGN(priv->frag_info[i].frag_size, align);
+		priv->frag_info[i].dma_dir = dma_dir;
 		buf_size += priv->frag_info[i].frag_size;
 		i++;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 35ecfa2..0e0ecd1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -259,6 +259,12 @@ struct mlx4_en_rx_alloc {
 	u32		page_size;
 };
 
+#define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
+struct mlx4_en_page_cache {
+	u32 index;
+	struct mlx4_en_rx_alloc buf[MLX4_EN_CACHE_SIZE];
+};
+
 struct mlx4_en_tx_ring {
 	/* cache line used and dirtied in tx completion
 	 * (mlx4_en_free_tx_buf())
@@ -323,6 +329,7 @@ struct mlx4_en_rx_ring {
 	u8  fcs_del;
 	void *buf;
 	void *rx_info;
+	struct mlx4_en_page_cache page_cache;
 	unsigned long bytes;
 	unsigned long packets;
 	unsigned long csum_ok;
@@ -442,7 +449,9 @@ struct mlx4_en_mc_list {
 struct mlx4_en_frag_info {
 	u16 frag_size;
 	u16 frag_prefix_size;
-	u16 frag_stride;
+	u32 frag_stride;
+	enum dma_data_direction dma_dir;
+	int order;
 };
 
 #ifdef CONFIG_MLX4_EN_DCB
@@ -654,6 +663,7 @@ void mlx4_en_set_stats_bitmap(struct mlx4_dev *dev,
 
 void mlx4_en_free_resources(struct mlx4_en_priv *priv);
 int mlx4_en_alloc_resources(struct mlx4_en_priv *priv);
+int mlx4_en_moderation_update(struct mlx4_en_priv *priv);
 
 int mlx4_en_create_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq **pcq,
 		      int entries, int ring, enum cq_type mode, int node);
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v8 07/11] bpf: add XDP_TX xdp_action for direct forwarding
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (5 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

XDP-enabled drivers must transmit received packets back out on the same
port they were received on when a program returns this action.
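
As a rough sketch (not part of this patch, and assuming the SEC() /
bpf_helpers.h conventions used by the samples later in this series), the
smallest program exercising this action simply reflects every frame back
out the port it arrived on:

SEC("xdp1")
int xdp_reflect(struct xdp_md *ctx)
{
	/* bounce the frame back out the receiving port */
	return XDP_TX;
}

A useful version would also rewrite the MAC addresses, as the xdp2
sample added later in this series does.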

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4282d44..a8f1ea1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -447,6 +447,7 @@ enum xdp_action {
 	XDP_ABORTED = 0,
 	XDP_DROP,
 	XDP_PASS,
+	XDP_TX,
 };
 
 /* user accessible metadata for XDP packet hook
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (6 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 07/11] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12 12:16   ` Tariq Toukan
  2016-07-12  7:51 ` [PATCH v8 09/11] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

In preparation for writing the tx descriptor from multiple functions,
create a helper for both normal and blueflame access.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/infiniband/hw/mlx4/qp.c            |  11 +--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 127 +++++++++++++++++------------
 include/linux/mlx4/qp.h                    |  18 ++--
 3 files changed, 90 insertions(+), 66 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 8db8405..768085f 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -232,7 +232,7 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size)
 		}
 	} else {
 		ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1));
-		s = (ctrl->fence_size & 0x3f) << 4;
+		s = (ctrl->qpn_vlan.fence_size & 0x3f) << 4;
 		for (i = 64; i < s; i += 64) {
 			wqe = buf + i;
 			*wqe = cpu_to_be32(0xffffffff);
@@ -264,7 +264,7 @@ static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size)
 		inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl));
 	}
 	ctrl->srcrb_flags = 0;
-	ctrl->fence_size = size / 16;
+	ctrl->qpn_vlan.fence_size = size / 16;
 	/*
 	 * Make sure descriptor is fully written before setting ownership bit
 	 * (because HW can start executing as soon as we do).
@@ -1992,7 +1992,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 			ctrl = get_send_wqe(qp, i);
 			ctrl->owner_opcode = cpu_to_be32(1 << 31);
 			if (qp->sq_max_wqes_per_wr == 1)
-				ctrl->fence_size = 1 << (qp->sq.wqe_shift - 4);
+				ctrl->qpn_vlan.fence_size =
+						1 << (qp->sq.wqe_shift - 4);
 
 			stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift);
 		}
@@ -3169,8 +3170,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		wmb();
 		*lso_wqe = lso_hdr_sz;
 
-		ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ?
-				    MLX4_WQE_CTRL_FENCE : 0) | size;
+		ctrl->qpn_vlan.fence_size = (wr->send_flags & IB_SEND_FENCE ?
+					     MLX4_WQE_CTRL_FENCE : 0) | size;
 
 		/*
 		 * Make sure descriptor is fully written before
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 76aa4d2..c29191e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -700,10 +700,66 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
 
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
+{
+	wmb();
+	/* Since there is no iowrite*_native() that writes the
+	 * value as is, without byteswapping - using the one
+	 * that doesn't do byteswapping in the relevant arch
+	 * endianness.
+	 */
+#if defined(__LITTLE_ENDIAN)
+	iowrite32(
+#else
+	iowrite32be(
+#endif
+		  ring->doorbell_qpn,
+		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
+}
+
+static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
+				  struct mlx4_en_tx_desc *tx_desc,
+				  union mlx4_wqe_qpn_vlan qpn_vlan,
+				  int desc_size, int bf_index,
+				  __be32 op_own, bool bf_ok,
+				  bool send_doorbell)
+{
+	tx_desc->ctrl.qpn_vlan = qpn_vlan;
+
+	if (bf_ok) {
+		op_own |= htonl((bf_index & 0xffff) << 8);
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
+		dma_wmb();
+		tx_desc->ctrl.owner_opcode = op_own;
+
+		wmb();
+
+		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+			     desc_size);
+
+		wmb();
+
+		ring->bf.offset ^= ring->bf.buf_size;
+	} else {
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
+		dma_wmb();
+		tx_desc->ctrl.owner_opcode = op_own;
+		if (send_doorbell)
+			mlx4_en_xmit_doorbell(ring);
+		else
+			ring->xmit_more++;
+	}
+}
+
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
 	struct device *ddev = priv->ddev;
 	struct mlx4_en_tx_ring *ring;
 	struct mlx4_en_tx_desc *tx_desc;
@@ -725,6 +781,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool stop_queue;
 	bool inline_ok;
 	u32 ring_cons;
+	bool bf_ok;
 
 	tx_ind = skb_get_queue_mapping(skb);
 	ring = priv->tx_ring[tx_ind];
@@ -749,9 +806,17 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 	}
 
+	bf_ok = ring->bf_enabled;
 	if (skb_vlan_tag_present(skb)) {
-		vlan_tag = skb_vlan_tag_get(skb);
+		qpn_vlan.vlan_tag = skb_vlan_tag_get(skb);
 		vlan_proto = be16_to_cpu(skb->vlan_proto);
+		if (vlan_proto == ETH_P_8021AD)
+			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
+		else if (vlan_proto == ETH_P_8021Q)
+			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
+		else
+			qpn_vlan.ins_vlan = 0;
+		bf_ok = false;
 	}
 
 	netdev_txq_bql_enqueue_prefetchw(ring->tx_queue);
@@ -771,6 +836,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	else {
 		tx_desc = (struct mlx4_en_tx_desc *) ring->bounce_buf;
 		bounce = true;
+		bf_ok = false;
 	}
 
 	/* Save skb in tx_info ring */
@@ -946,60 +1012,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	real_size = (real_size / 16) & 0x3f;
 
-	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
-	    !skb_vlan_tag_present(skb) && send_doorbell) {
-		tx_desc->ctrl.bf_qpn = ring->doorbell_qpn |
-				       cpu_to_be32(real_size);
-
-		op_own |= htonl((bf_index & 0xffff) << 8);
-		/* Ensure new descriptor hits memory
-		 * before setting ownership of this descriptor to HW
-		 */
-		dma_wmb();
-		tx_desc->ctrl.owner_opcode = op_own;
-
-		wmb();
-
-		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
-			     desc_size);
-
-		wmb();
-
-		ring->bf.offset ^= ring->bf.buf_size;
-	} else {
-		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
-		if (vlan_proto == ETH_P_8021AD)
-			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
-		else if (vlan_proto == ETH_P_8021Q)
-			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
-		else
-			tx_desc->ctrl.ins_vlan = 0;
+	bf_ok &= desc_size <= MAX_BF && send_doorbell;
 
-		tx_desc->ctrl.fence_size = real_size;
+	if (bf_ok)
+		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
+	else
+		qpn_vlan.fence_size = real_size;
 
-		/* Ensure new descriptor hits memory
-		 * before setting ownership of this descriptor to HW
-		 */
-		dma_wmb();
-		tx_desc->ctrl.owner_opcode = op_own;
-		if (send_doorbell) {
-			wmb();
-			/* Since there is no iowrite*_native() that writes the
-			 * value as is, without byteswapping - using the one
-			 * the doesn't do byteswapping in the relevant arch
-			 * endianness.
-			 */
-#if defined(__LITTLE_ENDIAN)
-			iowrite32(
-#else
-			iowrite32be(
-#endif
-				  ring->doorbell_qpn,
-				  ring->bf.uar->map + MLX4_SEND_DOORBELL);
-		} else {
-			ring->xmit_more++;
-		}
-	}
+	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
+			      op_own, bf_ok, send_doorbell);
 
 	if (unlikely(stop_queue)) {
 		/* If queue was emptied after the if (stop_queue) , and before
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index 587cdf9..deaa221 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -291,16 +291,18 @@ enum {
 	MLX4_WQE_CTRL_FORCE_LOOPBACK	= 1 << 0,
 };
 
+union mlx4_wqe_qpn_vlan {
+	struct {
+		__be16	vlan_tag;
+		u8	ins_vlan;
+		u8	fence_size;
+	};
+	__be32		bf_qpn;
+};
+
 struct mlx4_wqe_ctrl_seg {
 	__be32			owner_opcode;
-	union {
-		struct {
-			__be16			vlan_tag;
-			u8			ins_vlan;
-			u8			fence_size;
-		};
-		__be32			bf_qpn;
-	};
+	union mlx4_wqe_qpn_vlan	qpn_vlan;
 	/*
 	 * High 24 bits are SRC remote buffer; low 8 bits are flags:
 	 * [7]   SO (strong ordering)
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v8 09/11] net/mlx4_en: add xdp forwarding and data write support
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (7 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 10/11] bpf: enable direct packet data write for xdp progs Brenden Blanco
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

A user will now be able to loop packets back out of the same port using
a bpf program attached to the xdp hook. Updates to the packet contents
from the bpf program are also supported.

For the packet write feature to work, the rx buffers are now mapped as
bidirectional when the page is allocated. This occurs only when the xdp
hook is active.

When the program returns a TX action, enqueue the packet directly to a
dedicated tx ring, so as to completely avoid any locking. This requires
a tx ring to be allocated 1:1 for each rx ring, as well as the tx
completion to run in the same softirq.

Upon tx completion, this dedicated tx ring recycles pages, without
unmapping, directly back to the original rx ring. In a steady-state
tx/drop workload, effectively 0 page allocs/frees will occur.
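
As a rough illustration of the resulting ring layout (this only
summarizes the changes below; the numbers are an example using the
driver's MLX4_EN_NUM_UP = 8 default):

/* Example: rx_ring_num = 8
 *   rsv_tx_rings = ALIGN(8, MLX4_EN_NUM_UP) = 8 reserved tx rings,
 *   tx_ring_num  = num_tx_rings_p_up * MLX4_EN_NUM_UP + rsv_tx_rings.
 *
 * A frame returned as XDP_TX on rx ring i is enqueued on reserved ring
 *   tx_index = (tx_ring_num - rsv_tx_rings) + i
 * and that ring's completion handler hands the page straight back to
 * rx ring i's page cache via mlx4_en_rx_recycle().
 */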

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  15 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  19 +++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      |  14 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 126 +++++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  14 ++-
 5 files changed, 181 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index d3d51fa..10642b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1694,6 +1694,11 @@ static int mlx4_en_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd)
 	return err;
 }
 
+static int mlx4_en_max_tx_channels(struct mlx4_en_priv *priv)
+{
+	return (MAX_TX_RINGS - priv->rsv_tx_rings) / MLX4_EN_NUM_UP;
+}
+
 static void mlx4_en_get_channels(struct net_device *dev,
 				 struct ethtool_channels *channel)
 {
@@ -1705,7 +1710,7 @@ static void mlx4_en_get_channels(struct net_device *dev,
 	channel->max_tx = MLX4_EN_MAX_TX_RING_P_UP;
 
 	channel->rx_count = priv->rx_ring_num;
-	channel->tx_count = priv->tx_ring_num / MLX4_EN_NUM_UP;
+	channel->tx_count = priv->num_tx_rings_p_up;
 }
 
 static int mlx4_en_set_channels(struct net_device *dev,
@@ -1717,7 +1722,7 @@ static int mlx4_en_set_channels(struct net_device *dev,
 	int err = 0;
 
 	if (channel->other_count || channel->combined_count ||
-	    channel->tx_count > MLX4_EN_MAX_TX_RING_P_UP ||
+	    channel->tx_count > mlx4_en_max_tx_channels(priv) ||
 	    channel->rx_count > MAX_RX_RINGS ||
 	    !channel->tx_count || !channel->rx_count)
 		return -EINVAL;
@@ -1731,7 +1736,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
 	mlx4_en_free_resources(priv);
 
 	priv->num_tx_rings_p_up = channel->tx_count;
-	priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP;
+	priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP +
+							priv->rsv_tx_rings;
 	priv->rx_ring_num = channel->rx_count;
 
 	err = mlx4_en_alloc_resources(priv);
@@ -1740,7 +1746,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
 		goto out;
 	}
 
-	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
+	netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
+							priv->rsv_tx_rings);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
 	if (dev->num_tc)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 252c893d..50fbc8f 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1636,7 +1636,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-			i / priv->num_tx_rings_p_up);
+			i / (priv->tx_ring_num / MLX4_EN_NUM_UP));
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -2022,6 +2022,16 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
 			goto err;
 	}
 
+	/* When rsv_tx_rings is non-zero, each rx ring will have a
+	 * corresponding tx ring, with the tx completion event for that ring
+	 * recycling buffers into the cache.
+	 */
+	for (i = 0; i < priv->rsv_tx_rings; i++) {
+		int j = (priv->tx_ring_num - priv->rsv_tx_rings) + i;
+
+		priv->tx_ring[j]->recycle_ring = priv->rx_ring[i];
+	}
+
 #ifdef CONFIG_RFS_ACCEL
 	priv->dev->rx_cpu_rmap = mlx4_get_cpu_rmap(priv->mdev->dev, priv->port);
 #endif
@@ -2534,9 +2544,12 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_dev *mdev = priv->mdev;
 	struct bpf_prog *old_prog;
+	int rsv_tx_rings;
 	int port_up = 0;
 	int err;
 
+	rsv_tx_rings = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;
+
 	/* No need to reconfigure buffers when simply swapping the
 	 * program for a new one.
 	 */
@@ -2563,6 +2576,10 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 
 	mlx4_en_free_resources(priv);
 
+	priv->rsv_tx_rings = rsv_tx_rings;
+	priv->tx_ring_num = priv->num_tx_rings_p_up * MLX4_EN_NUM_UP +
+							priv->rsv_tx_rings;
+
 	old_prog = xchg(&priv->prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index f26306c..9215940 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -779,7 +779,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
 	struct bpf_prog *prog;
+	int doorbell_pending;
 	struct sk_buff *skb;
+	int tx_index;
 	int index;
 	int nr;
 	unsigned int length;
@@ -796,6 +798,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		return polled;
 
 	prog = READ_ONCE(priv->prog);
+	doorbell_pending = 0;
+	tx_index = (priv->tx_ring_num - priv->rsv_tx_rings) + cq->ring;
 
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
@@ -894,6 +898,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			switch (act) {
 			case XDP_PASS:
 				break;
+			case XDP_TX:
+				if (!mlx4_en_xmit_frame(frags, dev,
+							length, tx_index,
+							&doorbell_pending))
+					goto consumed;
+				break;
 			default:
 				bpf_warn_invalid_xdp_action(act);
 			case XDP_ABORTED:
@@ -1064,6 +1074,9 @@ consumed:
 	}
 
 out:
+	if (doorbell_pending)
+		mlx4_en_xmit_doorbell(priv->tx_ring[tx_index]);
+
 	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
 	mlx4_cq_set_ci(&cq->mcq);
 	wmb(); /* ensure HW sees CQ consumer before we post new buffers */
@@ -1143,6 +1156,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	 * This only works when num_frags == 1.
 	 */
 	if (priv->prog) {
+		dma_dir = PCI_DMA_BIDIRECTIONAL;
 		/* This will gain efficient xdp frame recycling at the expense
 		 * of more costly truesize accounting
 		 */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index c29191e..ba7b5eb 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -274,10 +274,28 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
 	struct mlx4_wqe_data_seg *data = (void *) tx_desc + tx_info->data_offset;
 	void *end = ring->buf + ring->buf_size;
-	struct sk_buff *skb = tx_info->skb;
 	int nr_maps = tx_info->nr_maps;
+	struct sk_buff *skb;
 	int i;
 
+	if (ring->recycle_ring) {
+		struct mlx4_en_rx_alloc frame = {
+			.page = tx_info->page,
+			.dma = tx_info->map0_dma,
+			.page_offset = 0,
+			.page_size = PAGE_SIZE,
+		};
+
+		if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
+			dma_unmap_page(priv->ddev, tx_info->map0_dma,
+				       PAGE_SIZE, priv->frag_info[0].dma_dir);
+			put_page(tx_info->page);
+		}
+		return tx_info->nr_txbb;
+	}
+
+	skb = tx_info->skb;
+
 	/* We do not touch skb here, so prefetch skb->users location
 	 * to speedup consume_skb()
 	 */
@@ -476,6 +494,9 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
 	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
 
+	if (ring->recycle_ring)
+		return done < budget;
+
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
 	/* Wakeup Tx queue if this stopped, and ring is not full.
@@ -1055,3 +1076,106 @@ tx_drop:
 	return NETDEV_TX_OK;
 }
 
+netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
+			       struct net_device *dev, unsigned int length,
+			       int tx_ind, int *doorbell_pending)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
+	struct mlx4_en_tx_ring *ring;
+	struct mlx4_en_tx_desc *tx_desc;
+	struct mlx4_wqe_data_seg *data;
+	struct mlx4_en_tx_info *tx_info;
+	int index, bf_index;
+	bool send_doorbell;
+	int nr_txbb = 1;
+	bool stop_queue;
+	dma_addr_t dma;
+	int real_size;
+	__be32 op_own;
+	u32 ring_cons;
+	bool bf_ok;
+
+	BUILD_BUG_ON_MSG(ALIGN(CTRL_SIZE + DS_SIZE, TXBB_SIZE) != TXBB_SIZE,
+			 "mlx4_en_xmit_frame requires minimum size tx desc");
+
+	ring = priv->tx_ring[tx_ind];
+
+	if (!priv->port_up)
+		goto tx_drop;
+
+	if (mlx4_en_is_tx_ring_full(ring))
+		goto tx_drop;
+
+	/* fetch ring->cons far ahead before needing it to avoid stall */
+	ring_cons = READ_ONCE(ring->cons);
+
+	index = ring->prod & ring->size_mask;
+	tx_info = &ring->tx_info[index];
+
+	bf_ok = ring->bf_enabled;
+
+	/* Track current inflight packets for performance analysis */
+	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
+			 (u32)(ring->prod - ring_cons - 1));
+
+	bf_index = ring->prod;
+	tx_desc = ring->buf + index * TXBB_SIZE;
+	data = &tx_desc->data;
+
+	dma = frame->dma;
+
+	tx_info->page = frame->page;
+	frame->page = NULL;
+	tx_info->map0_dma = dma;
+	tx_info->map0_byte_count = length;
+	tx_info->nr_txbb = nr_txbb;
+	tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
+	tx_info->data_offset = (void *)data - (void *)tx_desc;
+	tx_info->ts_requested = 0;
+	tx_info->nr_maps = 1;
+	tx_info->linear = 1;
+	tx_info->inl = 0;
+
+	dma_sync_single_for_device(priv->ddev, dma, length, PCI_DMA_TODEVICE);
+
+	data->addr = cpu_to_be64(dma);
+	data->lkey = ring->mr_key;
+	dma_wmb();
+	data->byte_count = cpu_to_be32(length);
+
+	/* tx completion can avoid cache line miss for common cases */
+	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
+
+	op_own = cpu_to_be32(MLX4_OPCODE_SEND) |
+		((ring->prod & ring->size) ?
+		 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
+
+	ring->packets++;
+	ring->bytes += tx_info->nr_bytes;
+	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
+
+	ring->prod += nr_txbb;
+
+	stop_queue = mlx4_en_is_tx_ring_full(ring);
+	send_doorbell = stop_queue ||
+				*doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
+	bf_ok &= send_doorbell;
+
+	real_size = ((CTRL_SIZE + nr_txbb * DS_SIZE) / 16) & 0x3f;
+
+	if (bf_ok)
+		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
+	else
+		qpn_vlan.fence_size = real_size;
+
+	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
+			      op_own, bf_ok, send_doorbell);
+	*doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
+
+	return NETDEV_TX_OK;
+
+tx_drop:
+	ring->tx_dropped++;
+	return NETDEV_TX_BUSY;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 0e0ecd1..7309308 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -132,6 +132,7 @@ enum {
 					 MLX4_EN_NUM_UP)
 
 #define MLX4_EN_DEFAULT_TX_WORK		256
+#define MLX4_EN_DOORBELL_BUDGET		8
 
 /* Target number of packets to coalesce with interrupt moderation */
 #define MLX4_EN_RX_COAL_TARGET	44
@@ -219,7 +220,10 @@ enum cq_type {
 
 
 struct mlx4_en_tx_info {
-	struct sk_buff *skb;
+	union {
+		struct sk_buff *skb;
+		struct page *page;
+	};
 	dma_addr_t	map0_dma;
 	u32		map0_byte_count;
 	u32		nr_txbb;
@@ -298,6 +302,7 @@ struct mlx4_en_tx_ring {
 	__be32			mr_key;
 	void			*buf;
 	struct mlx4_en_tx_info	*tx_info;
+	struct mlx4_en_rx_ring	*recycle_ring;
 	u8			*bounce_buf;
 	struct mlx4_qp_context	context;
 	int			qpn;
@@ -604,6 +609,7 @@ struct mlx4_en_priv {
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
 	struct bpf_prog *prog;
+	int rsv_tx_rings;
 
 #ifdef CONFIG_MLX4_EN_DCB
 #define MLX4_EN_DCB_ENABLED	0x3
@@ -678,6 +684,12 @@ void mlx4_en_tx_irq(struct mlx4_cq *mcq);
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 			 void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
+			       struct net_device *dev, unsigned int length,
+			       int tx_ind, int *doorbell_pending);
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
+bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
+			struct mlx4_en_rx_alloc *frame);
 
 int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_tx_ring **pring,
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v8 10/11] bpf: enable direct packet data write for xdp progs
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (8 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 09/11] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12  7:51 ` [PATCH v8 11/11] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
  2016-07-12 14:38 ` [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Tariq Toukan
  11 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

For forwarding to be effective, XDP programs should be allowed to
rewrite packet data.

This requires that all drivers supporting XDP map the packet memory as
TODEVICE or BIDIRECTIONAL before invoking the program.
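
As a rough sketch of what this enables (not taken from this series; it
assumes the same includes and conventions as the xdp2 sample, and the
checksum update is omitted for brevity), a program may now store into
packet data once the usual data_end bounds checks have been done:

SEC("xdp1")
int xdp_clear_tos(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph;
	u64 nh_off = sizeof(*eth);

	if (data + nh_off > data_end)
		return XDP_PASS;
	if (eth->h_proto != htons(ETH_P_IP))
		return XDP_PASS;

	iph = data + nh_off;
	if (iph + 1 > data_end)
		return XDP_PASS;

	iph->tos = 0;	/* direct write into packet data */
	return XDP_PASS;
}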

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 kernel/bpf/verifier.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8d67d0..f72f23b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -653,6 +653,16 @@ static int check_map_access(struct verifier_env *env, u32 regno, int off,
 
 #define MAX_PACKET_OFF 0xffff
 
+static bool may_write_pkt_data(enum bpf_prog_type type)
+{
+	switch (type) {
+	case BPF_PROG_TYPE_XDP:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int check_packet_access(struct verifier_env *env, u32 regno, int off,
 			       int size)
 {
@@ -806,10 +816,15 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
 			err = check_stack_read(state, off, size, value_regno);
 		}
 	} else if (state->regs[regno].type == PTR_TO_PACKET) {
-		if (t == BPF_WRITE) {
+		if (t == BPF_WRITE && !may_write_pkt_data(env->prog->type)) {
 			verbose("cannot write into packet\n");
 			return -EACCES;
 		}
+		if (t == BPF_WRITE && value_regno >= 0 &&
+		    is_pointer_value(env, value_regno)) {
+			verbose("R%d leaks addr into packet\n", value_regno);
+			return -EACCES;
+		}
 		err = check_packet_access(env, regno, off, size);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown_value(state->regs, value_regno);
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH v8 11/11] bpf: add sample for xdp forwarding and rewrite
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (9 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 10/11] bpf: enable direct packet data write for xdp progs Brenden Blanco
@ 2016-07-12  7:51 ` Brenden Blanco
  2016-07-12 14:38 ` [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Tariq Toukan
  11 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-12  7:51 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add a sample that rewrites and forwards packets out on the same
interface. Observed single core forwarding performance of ~10Mpps.

Since the mlx4 driver under test recycles every single packet page, the
perf output shows almost exclusively just the ring management and bpf
program work. Slowdowns are likely occurring due to cache misses.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile    |   5 +++
 samples/bpf/xdp2_kern.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)
 create mode 100644 samples/bpf/xdp2_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0e4ab3a..d2d2b35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
+hostprogs-y += xdp2
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
+# reuse xdp1 source intentionally
+xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
+always += xdp2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
+HOSTLOADLIBES_xdp2 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
new file mode 100644
index 0000000..38fe7e1
--- /dev/null
+++ b/samples/bpf/xdp2_kern.c
@@ -0,0 +1,114 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+static void swap_src_dst_mac(void *data)
+{
+	unsigned short *p = data;
+	unsigned short dst[3];
+
+	dst[0] = p[0];
+	dst[1] = p[1];
+	dst[2] = p[2];
+	p[0] = p[3];
+	p[1] = p[4];
+	p[2] = p[5];
+	p[3] = dst[0];
+	p[4] = dst[1];
+	p[5] = dst[2];
+}
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+	struct iphdr *iph = data + nh_off;
+
+	if (iph + 1 > data_end)
+		return 0;
+	return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+	struct ipv6hdr *ip6h = data + nh_off;
+
+	if (ip6h + 1 > data_end)
+		return 0;
+	return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	u64 nh_off;
+	u32 index;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+
+	if (h_proto == htons(ETH_P_IP))
+		index = parse_ipv4(data, nh_off, data_end);
+	else if (h_proto == htons(ETH_P_IPV6))
+		index = parse_ipv6(data, nh_off, data_end);
+	else
+		index = 0;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	if (index == 17) {
+		swap_src_dst_mac(data);
+		rc = XDP_TX;
+	}
+
+	return rc;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* RE: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-12  7:51 ` [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
@ 2016-07-12 12:02   ` Tariq Toukan
  2016-07-13 11:27   ` David Laight
  2016-07-14  7:25   ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 47+ messages in thread
From: Tariq Toukan @ 2016-07-12 12:02 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann


>Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
>In tc/socket bpf programs, helpers linearize skb fragments as needed when the program touches the packet data. However, in the pursuit of speed, XDP programs will not be allowed to use these slower functions, especially if it involves allocating an skb.
>
>Therefore, disallow MTU settings that would produce a multi-fragment packet that XDP programs would fail to access. Future enhancements could be done to increase the allowable MTU.
>
>Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-12  7:51 ` [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
@ 2016-07-12 12:09   ` Tariq Toukan
  2016-07-12 21:18   ` David Miller
  1 sibling, 0 replies; 47+ messages in thread
From: Tariq Toukan @ 2016-07-12 12:09 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

>The mlx4 driver by default allocates order-3 pages for the ring to consume in multiple fragments. When the device has an xdp program, this behavior will prevent tx actions since the page must be re-mapped in TODEVICE mode, which cannot be done if the page is still shared.
>
>Start by making the allocator configurable based on whether xdp is running, such that order-0 pages are always used and never shared.
>
>Since this will stress the page allocator, add a simple page cache to each rx ring. Pages in the cache are left dma-mapped, and in drop-only stress tests the page allocator is eliminated from the perf report.
>
>Note that setting an xdp program will now require the rings to be reconfigured.
>
>Before:
> 26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
> 17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
>  6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
>  4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
>  3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
>  2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
>  2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
>
>After:
> 31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
>  8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
>  7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
>  6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
>  4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
>  4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
>  3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
>  2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
>  1.37%  swapper      [kernel.vmlinux]       [k] menu_select
>  1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem
>
>Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function
  2016-07-12  7:51 ` [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
@ 2016-07-12 12:16   ` Tariq Toukan
  0 siblings, 0 replies; 47+ messages in thread
From: Tariq Toukan @ 2016-07-12 12:16 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

>In preparation for writing the tx descriptor from multiple functions, create a helper for both normal and blueflame access.
>
>Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>

Reviewed-by: Tariq Toukan <tariqt@mellanox.com>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
  2016-07-12  7:51 ` [PATCH v8 01/11] bpf: add XDP prog type for early driver filter Brenden Blanco
@ 2016-07-12 13:14   ` Jesper Dangaard Brouer
  2016-07-12 14:52     ` Tom Herbert
  0 siblings, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-12 13:14 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, brouer


On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
> 
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. Out
> of bounds return codes should alias to XDP_ABORTED.

This is going to be a serious usability problem when XDP gets extended
with newer features.

How can users know what XDP capabilities a given driver supports?

If the user loads an XDP program using capabilities not available in the
driver, then all his traffic will be hard dropped.


My proposal is to NOT allow XDP programs to be loaded if they want to use
return-codes/features not supported by the driver.  Thus, eBPF programs
register/annotate their needed capabilities upfront (I guess this could
also help HW offload engines).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding
  2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (10 preceding siblings ...)
  2016-07-12  7:51 ` [PATCH v8 11/11] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
@ 2016-07-12 14:38 ` Tariq Toukan
  2016-07-13 15:00   ` Tariq Toukan
  11 siblings, 1 reply; 47+ messages in thread
From: Tariq Toukan @ 2016-07-12 14:38 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann

Regression tests for mlx4_en are currently running; results will be 
ready by tomorrow morning.

Regards,
Tariq

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
  2016-07-12 13:14   ` Jesper Dangaard Brouer
@ 2016-07-12 14:52     ` Tom Herbert
  2016-07-12 16:08       ` Jakub Kicinski
  2016-07-13  4:14       ` Alexei Starovoitov
  0 siblings, 2 replies; 47+ messages in thread
From: Tom Herbert @ 2016-07-12 14:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Tue, Jul 12, 2016 at 6:14 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>
>> Add a new bpf prog type that is intended to run in early stages of the
>> packet rx path. Only minimal packet metadata will be available, hence a
>> new context type, struct xdp_md, is exposed to userspace. So far only
>> expose the packet start and end pointers, and only in read mode.
>>
>> An XDP program must return one of the well known enum values, all other
>> return codes are reserved for future use. Unfortunately, this
>> restriction is hard to enforce at verification time, so take the
>> approach of warning at runtime when such programs are encountered. Out
>> of bounds return codes should alias to XDP_ABORTED.
>
> This is going to be a serious usability problem, when XDP gets extended
> with newer features.
>
> How can users know what XDP capabilities a given driver support?
>
We talked about this at the XDP summit and I have some slides on this
(hope to have the slides posted shortly). Drivers, or more generally XDP
platforms, will provide a list of APIs that they support. APIs would be
contained in a sort of common specification file and would have some
name like basic_xdp_api, adv_switch_api, etc. An XDP program is written
using one of the published APIs, and the API it uses is expressed as
part of the program. At runtime (possibly compile time) the API used by
the program is checked against the list of APIs for the target
platform -- if the API is not in the set then the program is not allowed
to be loaded. To whatever extent possible we should also try to verify
that a program adheres to the API at load and compile time. In the event
that a program violates the API it claims to be using, such as returning
an invalid return code for the API, that is an error condition.
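
Purely to illustrate the shape such a declaration could take (nothing
like this exists in this series; the section name and API string below
are made up), it could be as small as an extra ELF section that the
loader and verifier inspect before attaching the program:

/* hypothetical capability/API annotation in the program source */
char _xdp_api[] SEC("xdp_api") = "basic_xdp_api";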

> If the user loads a XDP program using capabilities not avail in the
> driver, then all his traffic will be hard dropped.
>
>
> My proposal is to NOT allow XDP programs to be loaded if they want to use
> return-codes/features not supported by the driver.  Thus, eBPF programs
> register/annotate their needed capabilities upfront (I guess this could
> also help HW offload engines).
>
Exactly. We just need to define exactly how this is done ;-)

One open issue is whether XDP needs to be binary compatible across
platforms or only source code compatible (the latter would also require
a recompile for each platform). Personally I prefer source code
compatible since that might allow platforms to implement the API
specifically for their needs (like the ordering of fields in a meta
data structure).

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
  2016-07-12 14:52     ` Tom Herbert
@ 2016-07-12 16:08       ` Jakub Kicinski
  2016-07-13  4:14       ` Alexei Starovoitov
  1 sibling, 0 replies; 47+ messages in thread
From: Jakub Kicinski @ 2016-07-12 16:08 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Alexei Starovoitov,
	Or Gerlitz, john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Tue, 12 Jul 2016 07:52:53 -0700, Tom Herbert wrote:
> On Tue, Jul 12, 2016 at 6:14 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> >  
> >> Add a new bpf prog type that is intended to run in early stages of the
> >> packet rx path. Only minimal packet metadata will be available, hence a
> >> new context type, struct xdp_md, is exposed to userspace. So far only
> >> expose the packet start and end pointers, and only in read mode.
> >>
> >> An XDP program must return one of the well known enum values, all other
> >> return codes are reserved for future use. Unfortunately, this
> >> restriction is hard to enforce at verification time, so take the
> >> approach of warning at runtime when such programs are encountered. Out
> >> of bounds return codes should alias to XDP_ABORTED.  
> >
> > This is going to be a serious usability problem, when XDP gets extended
> > with newer features.
> >
> > How can users know what XDP capabilities a given driver support?

I'm personally not a big fan of return codes.  I'm hoping we can express
most decisions by setting fields in metadata.  It provides a better
structure and is easier to detect/translate/optimize.

> We talked about this a the XDP summit and I have some slides on this
> (hope to have slides posted shortly) . Drivers, more generally XDP
> platforms, will provide a list APIs that they support. APIs would be
> contained in a sort common specification files that and would have
> some name like basic_xdp_api, adv_switch_api, etc. An XDP program is
> written using one of the published APIs and the API it uses is
> expressed as part of the program. At runtime (possibly compile time)
> the API used by the program is checked against the list of APIs for
> the target platform-- if the API is not in the set then the program is
> not allowed to be loaded. To whatever extent possible we should also
> try to verify that program adhere's to the API as load and compile
> time. In the event that program violates the API it claims to be
> using, such as return invalid return code for the API, that is an
> error condition.

+1

> > If the user loads a XDP program using capabilities not avail in the
> > driver, then all his traffic will be hard dropped.
> >
> >
> > My proposal is to NOT allow XDP programs to be loaded if they want to use
> > return-codes/features not supported by the driver.  Thus, eBPF programs
> > register/annotate their needed capabilities upfront (I guess this could
> > also help HW offload engines).

Not sure we need annotation; use of an API will probably be easily
detectable (a call of a function, access to a metadata field).  The
verifier could collect that info in-kernel with little effort.

> Exactly. We just need to define exactly how this is done ;-)
> 
> One open issue is whether XDP needs to be binary compatible across
> platforms or source code compatible (also require recompile in latter
> case for each platform). Personally I prefer source code compatible
> since that might allow platforms to implement the API specifically for
> their needs (like the ordering of fields in a meta data structure.

Since XDP is so close to hardware by design, I think we could consider
introducing a per-HW translation stage between the verifier and the
host JIT.  For the extended metadata problem, for example, we could
define metadata as:

meta {
	void *packet_start;
	void *packet_end;
	struct standard_meta *extended_meta;
}

standard_meta {
	vlan;
	timestamp;
	hash;
	...
}

A program coming from user space would just use standard_meta, but a
per-driver/per-HW translator would patch it up.  The extended_meta
pointer would probably become a pointer to a HW-specific queue
descriptor at runtime.  The per-HW translator stage would change the
offsets and add the necessary checks and fix-ups.  It could even be
possible to translate extended_meta accesses into accesses to metadata
prepended to the frame; the translator would need a hint from the
verifier on how to get the packet pointer.

I haven't thought this through in detail; it's just an idea which would
help us keep binary compatibility.  HW-specific optimizations in a
user space compiler would be great, but breaking binary compatibility
is a bit of a scary step.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-12  7:51 ` [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
  2016-07-12 12:09   ` Tariq Toukan
@ 2016-07-12 21:18   ` David Miller
  2016-07-13  0:54     ` Brenden Blanco
  1 sibling, 1 reply; 47+ messages in thread
From: David Miller @ 2016-07-12 21:18 UTC (permalink / raw)
  To: bblanco
  Cc: netdev, jhs, saeedm, kafai, brouer, as754m, alexei.starovoitov,
	gerlitz.or, john.fastabend, hannes, tgraf, tom, daniel

From: Brenden Blanco <bblanco@plumgrid.com>
Date: Tue, 12 Jul 2016 00:51:29 -0700

> +	mlx4_en_free_resources(priv);
> +
>  	old_prog = xchg(&priv->prog, prog);
>  	if (old_prog)
>  		bpf_prog_put(old_prog);
>  
> -	return 0;
> +	err = mlx4_en_alloc_resources(priv);
> +	if (err) {
> +		en_err(priv, "Failed reallocating port resources\n");
> +		goto out;
> +	}
> +	if (port_up) {
> +		err = mlx4_en_start_port(dev);
> +		if (err)
> +			en_err(priv, "Failed starting port\n");

A failed configuration operation should _NEVER_ leave the interface in
an inoperative state like these error paths do.

You must instead preallocate the necessary resources, and only change
the chip's configuration and commit to the new settings once you have
successfully allocated those resources.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-12 21:18   ` David Miller
@ 2016-07-13  0:54     ` Brenden Blanco
  2016-07-13  7:17       ` Tariq Toukan
  0 siblings, 1 reply; 47+ messages in thread
From: Brenden Blanco @ 2016-07-13  0:54 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, jhs, saeedm, kafai, brouer, as754m, alexei.starovoitov,
	gerlitz.or, john.fastabend, hannes, tgraf, tom, daniel

On Tue, Jul 12, 2016 at 02:18:32PM -0700, David Miller wrote:
> From: Brenden Blanco <bblanco@plumgrid.com>
> Date: Tue, 12 Jul 2016 00:51:29 -0700
> 
> > +	mlx4_en_free_resources(priv);
> > +
> >  	old_prog = xchg(&priv->prog, prog);
> >  	if (old_prog)
> >  		bpf_prog_put(old_prog);
> >  
> > -	return 0;
> > +	err = mlx4_en_alloc_resources(priv);
> > +	if (err) {
> > +		en_err(priv, "Failed reallocating port resources\n");
> > +		goto out;
> > +	}
> > +	if (port_up) {
> > +		err = mlx4_en_start_port(dev);
> > +		if (err)
> > +			en_err(priv, "Failed starting port\n");
> 
> A failed configuration operation should _NEVER_ leave the interface in
> an inoperative state like these error paths do.
> 
> You must instead preallocate the necessary resources, and only change
> the chip's configuration and commit to the new settings once you have
> successfully allocated those resources.
I'll see what I can do here.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 01/11] bpf: add XDP prog type for early driver filter
  2016-07-12 14:52     ` Tom Herbert
  2016-07-12 16:08       ` Jakub Kicinski
@ 2016-07-13  4:14       ` Alexei Starovoitov
  1 sibling, 0 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2016-07-13  4:14 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Tue, Jul 12, 2016 at 07:52:53AM -0700, Tom Herbert wrote:
> On Tue, Jul 12, 2016 at 6:14 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Tue, 12 Jul 2016 00:51:24 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> >
> >> Add a new bpf prog type that is intended to run in early stages of the
> >> packet rx path. Only minimal packet metadata will be available, hence a
> >> new context type, struct xdp_md, is exposed to userspace. So far only
> >> expose the packet start and end pointers, and only in read mode.
> >>
> >> An XDP program must return one of the well known enum values, all other
> >> return codes are reserved for future use. Unfortunately, this
> >> restriction is hard to enforce at verification time, so take the
> >> approach of warning at runtime when such programs are encountered. Out
> >> of bounds return codes should alias to XDP_ABORTED.
> >
> > This is going to be a serious usability problem, when XDP gets extended
> > with newer features.
> >
> > How can users know what XDP capabilities a given driver support?
> >
> We talked about this at the XDP summit and I have some slides on this
> (hope to have the slides posted shortly). Drivers, more generally XDP
> platforms, will provide a list of APIs that they support. APIs would be
> contained in common specification files with names like basic_xdp_api,
> adv_switch_api, etc. An XDP program is written against one of the
> published APIs, and the API it uses is expressed as part of the
> program. At runtime (possibly compile time) the API used by the program
> is checked against the list of APIs for the target platform-- if the
> API is not in the set then the program is not allowed to be loaded. To
> whatever extent possible we should also try to verify that the program
> adheres to the API at load and compile time. In the event that a
> program violates the API it claims to be using, such as returning an
> invalid return code for the API, that is an error condition.
> 
> > If the user loads a XDP program using capabilities not avail in the
> > driver, then all his traffic will be hard dropped.
> >
> >
> > My proposal is to NOT allow XDP programs to be loaded if they want to use
> > return-codes/features not supported by the driver.  Thus, eBPF programs
> > register/annotate their needed capabilities upfront (I guess this could
> > also help HW offload engines).
> >
> Exactly. We just need to define exactly how this is done ;-)

yep. As we discussed at the summit, there is a basic set of commands
(drop/pass/tx) that all drivers are expected to perform if they claim
xdp support, and there will be driver/hw specific extensions.
We should be able to exercise all hw capabilities in a vendor neutral way.
It's a bit of a contradiction, obviously: we want xdp to be generic
and at the same time allow hw specific extensions.

> One open issue is whether XDP needs to be binary compatible across
> platforms or source code compatible (the latter also requires a
> recompile for each platform). Personally I prefer source code
> compatible, since that might allow platforms to implement the API
> specifically for their needs (like the ordering of fields in a
> metadata structure).

I agree that for some cases we have to give up on binary compatibility.
I think it will be similar to the cpu world, where applications
are compiled for different flavors of an instruction set.
For example, we might have an encryption facility that is available
in the sw path and offloaded on the most capable NICs.
I don't think annotations will solve that, since annotations
are hints for when the compiler/user space can be trusted, which is
not the case here. It's more likely that we'll have driver
specific metadata that hooks into the verifier framework, so
we can make sure at load time that memory accesses and helper
calls are valid. That will make programs using hw specific
extensions loadable only on the given hw, which I think
is inevitable, since we're operating at the lowest level,
right next to hw, and any sw abstraction/generalization will
only slow things down.
skb is a generic facility where physical and virtual devices
look the same to programs; xdp is a hw specific place.
If we try to generalize things too much at the xdp level,
it will become skb-like. So beyond drop/pass/tx I can only
think of very few 'for all drivers' features.
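
(To make the capability-matching idea above a bit more concrete, a rough
sketch of what a load-time check could look like; everything here is
hypothetical, nothing like this exists in the patchset:)

	/* driver advertises a mask of supported XDP features, the program
	 * declares what it needs; attach is refused on a mismatch
	 */
	#define XDP_CAP_BASIC	(1 << 0)	/* DROP/PASS/ABORTED */
	#define XDP_CAP_TX	(1 << 1)	/* XDP_TX forwarding */

	static int xdp_check_caps(u32 driver_caps, u32 prog_caps)
	{
		if ((driver_caps & prog_caps) != prog_caps)
			return -EOPNOTSUPP;
		return 0;
	}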

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-13  0:54     ` Brenden Blanco
@ 2016-07-13  7:17       ` Tariq Toukan
  2016-07-13 15:40         ` Brenden Blanco
  0 siblings, 1 reply; 47+ messages in thread
From: Tariq Toukan @ 2016-07-13  7:17 UTC (permalink / raw)
  To: Brenden Blanco, David Miller
  Cc: netdev, jhs, saeedm, kafai, brouer, as754m, alexei.starovoitov,
	gerlitz.or, john.fastabend, hannes, tgraf, tom, daniel


On 13/07/2016 3:54 AM, Brenden Blanco wrote:
> On Tue, Jul 12, 2016 at 02:18:32PM -0700, David Miller wrote:
>> From: Brenden Blanco <bblanco@plumgrid.com>
>> Date: Tue, 12 Jul 2016 00:51:29 -0700
>>
>>> +	mlx4_en_free_resources(priv);
>>> +
>>>   	old_prog = xchg(&priv->prog, prog);
>>>   	if (old_prog)
>>>   		bpf_prog_put(old_prog);
>>>   
>>> -	return 0;
>>> +	err = mlx4_en_alloc_resources(priv);
>>> +	if (err) {
>>> +		en_err(priv, "Failed reallocating port resources\n");
>>> +		goto out;
>>> +	}
>>> +	if (port_up) {
>>> +		err = mlx4_en_start_port(dev);
>>> +		if (err)
>>> +			en_err(priv, "Failed starting port\n");
>> A failed configuration operation should _NEVER_ leave the interface in
>> an inoperative state like these error paths do.
>>
>> You must instead preallocate the necessary resources, and only change
>> the chip's configuration and commit to the new settings once you have
>> successfully allocated those resources.
> I'll see what I can do here.
That's exactly what we're doing in a patchset that will be submitted to 
net very soon (this week).
It fixes/refactors these failure flows just like Dave described, 
something like:

     err = mlx4_en_try_alloc_resources(priv, tmp, &new_prof);
     if (err)
         goto out;

     if (priv->port_up) {
         port_up = 1;
         mlx4_en_stop_port(dev, 1);
     }

     mlx4_en_safe_replace_resources(priv, tmp);

     if (port_up) {
         err = mlx4_en_start_port(dev);
         if (err)
             en_err(priv, "Failed starting port\n");
     }

I suggest you keep your code aligned with the current net-next driver, and 
later I will take it and fix it (once merged with net).

Regards,
Tariq

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-12  7:51 ` [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
  2016-07-12 12:02   ` Tariq Toukan
@ 2016-07-13 11:27   ` David Laight
  2016-07-13 14:08     ` Brenden Blanco
  2016-07-14  7:25   ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 47+ messages in thread
From: David Laight @ 2016-07-13 11:27 UTC (permalink / raw)
  To: 'Brenden Blanco', davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann

From: Brenden Blanco
> Sent: 12 July 2016 08:51
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> 
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
> 
> Therefore, disallow MTU settings that would produce a multi-fragment
> packet that XDP programs would fail to access. Future enhancements could
> be done to increase the allowable MTU.

Maybe I'm misunderstanding what is going on here...
But what has the MTU to do with how skb are fragmented?

If the skb come from a reasonably written USB ethernet interface they could
easily have arbitrary fragment boundaries (the frames get packed into USB
buffers).

Outbound skb can also have fragments depending on how they are generated.

	David

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-13 11:27   ` David Laight
@ 2016-07-13 14:08     ` Brenden Blanco
  0 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-13 14:08 UTC (permalink / raw)
  To: David Laight
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Wed, Jul 13, 2016 at 11:27:23AM +0000, David Laight wrote:
> From: Brenden Blanco
> > Sent: 12 July 2016 08:51
> > Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> > 
> > In tc/socket bpf programs, helpers linearize skb fragments as needed
> > when the program touches the packet data. However, in the pursuit of
> > speed, XDP programs will not be allowed to use these slower functions,
> > especially if it involves allocating an skb.
> > 
> > Therefore, disallow MTU settings that would produce a multi-fragment
> > packet that XDP programs would fail to access. Future enhancements could
> > be done to increase the allowable MTU.
> 
> Maybe I'm misunderstanding what is going on here...
> But what has the MTU to do with how skb are fragmented?
This is mlx4 specific... depending on the MTU, the driver will write data
into fragments of 1536, 1536+4096, 1536+4096+4096, etc. bytes.
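
A simplified illustration of that layout (not the actual
mlx4_en_calc_rx_buf() code, just the idea):

	/* first frag holds up to ~1536 bytes, each extra frag adds a page,
	 * so the frag count grows with the effective MTU
	 */
	static int frags_for_mtu(int eff_mtu)	/* eff_mtu = MTU plus headers */
	{
		int frags = 1;
		int space = 1536;

		while (space < eff_mtu) {
			space += 4096;
			frags++;
		}
		return frags;
	}

The XDP check then just refuses any MTU for which that count would
exceed 1.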
> 
> If the skb come from a reasonably written USB ethernet interface they could
> easily have arbitrary fragment boundaries (the frames get packed into USB
> buffers).
The XDP program is operating directly on the packet memory, before any
skb has been allocated. The program also expects a contiguous memory
region to inspect... it's too expensive to linearize the data like we do
in the tc hook case; that kind of fallback costs too much for such a
low level feature. Therefore, XDP can only be turned on in combination
with a cooperative driver; that's the performance tradeoff we're
imposing here.
> 
> Outbound skb can also have fragments depending on how they are generated.
Sure, but XDP won't run on those. This is an rx-only feature.
> 
> 	David

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding
  2016-07-12 14:38 ` [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Tariq Toukan
@ 2016-07-13 15:00   ` Tariq Toukan
  0 siblings, 0 replies; 47+ messages in thread
From: Tariq Toukan @ 2016-07-13 15:00 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann


On 12/07/2016 5:38 PM, Tariq Toukan wrote:
> Regression tests for mlx4_en are currently running, results will be 
> ready by tomorrow morning.
Functional regression results look fine.
>
> Regards,
> Tariq

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-13  7:17       ` Tariq Toukan
@ 2016-07-13 15:40         ` Brenden Blanco
  2016-07-15 21:52           ` Brenden Blanco
  0 siblings, 1 reply; 47+ messages in thread
From: Brenden Blanco @ 2016-07-13 15:40 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, netdev, jhs, saeedm, kafai, brouer, as754m,
	alexei.starovoitov, gerlitz.or, john.fastabend, hannes, tgraf,
	tom, daniel

On Wed, Jul 13, 2016 at 10:17:26AM +0300, Tariq Toukan wrote:
> 
> On 13/07/2016 3:54 AM, Brenden Blanco wrote:
> >On Tue, Jul 12, 2016 at 02:18:32PM -0700, David Miller wrote:
> >>From: Brenden Blanco <bblanco@plumgrid.com>
> >>Date: Tue, 12 Jul 2016 00:51:29 -0700
> >>
> >>>+	mlx4_en_free_resources(priv);
> >>>+
> >>>  	old_prog = xchg(&priv->prog, prog);
> >>>  	if (old_prog)
> >>>  		bpf_prog_put(old_prog);
> >>>-	return 0;
> >>>+	err = mlx4_en_alloc_resources(priv);
> >>>+	if (err) {
> >>>+		en_err(priv, "Failed reallocating port resources\n");
> >>>+		goto out;
> >>>+	}
> >>>+	if (port_up) {
> >>>+		err = mlx4_en_start_port(dev);
> >>>+		if (err)
> >>>+			en_err(priv, "Failed starting port\n");
> >>A failed configuration operation should _NEVER_ leave the interface in
> >>an inoperative state like these error paths do.
> >>
> >>You must instead preallocate the necessary resources, and only change
> >>the chip's configuration and commit to the new settings once you have
> >>successfully allocated those resources.
> >I'll see what I can do here.
> That's exactly what we're doing in a patchset that will be submitted
> to net very soon (this week).
Thanks Tariq!
As an example, I had originally tried to integrate this code into
mlx4_en_set_channels, which seems to have the same problem.
> It fixes/refactors these failure flows just like Dave described,
> something like:
> 
>     err = mlx4_en_try_alloc_resources(priv, tmp, &new_prof);
>     if (err)
>         goto out;
> 
>     if (priv->port_up) {
>         port_up = 1;
>         mlx4_en_stop_port(dev, 1);
>     }
> 
>     mlx4_en_safe_replace_resources(priv, tmp);
> 
>     if (port_up) {
>         err = mlx4_en_start_port(dev);
>         if (err)
>             en_err(priv, "Failed starting port\n");
>     }
> 
> I suggest you keep your code aligned with current net-next driver,
> and later I will take it and fix it (once merged with net).
Another option is to avoid the tx_ring_num change entirely, so as to
keep the majority of the initialized state valid. We would only allocate
a new set of pages and refill the rx rings once we have confirmed there
are enough resources.

So others can follow the discussion, there are multiple reasons to
reconfigure the rings.
1. The rx frags should be page-per-packet
2. The pages should be mapped DMA_BIDIRECTIONAL
3. Each rx ring should have a dedicated tx ring, which is off limits
from the upper stack
4. The dedicated tx ring will have a pointer back to its rx ring for
recycling

#1 and #2 can be done to the side ahead of time, as you are also
suggesting.

Currently, to achieve #3, we increase tx_ring_num while keeping
num_tx_rings_p_up the same. This precipitates a round of
free/alloc_resources, which takes some time and has many opportunities
for failure.
However, we could resurrect an earlier approach that keeps the
tx_ring_num unchanged, and instead just do a
netif_set_real_num_tx_queues(tx_ring_num - rsv_tx_rings) to hide it from
the stack. This would require that there be enough rings ahead of time,
with a simple bounds check like:
if (tx_ring_num < rsv_tx_rings + MLX4_EN_MAX_TX_RING_P_UP) {
	en_err(priv, "XDP requires minimum %d + %d rings\n", rsv_tx_rings,
		MLX4_EN_MAX_TX_RING_P_UP);
	return -EINVAL;
}
The default values for tx_ring_num and rx_ring_num will only hit this
case when operating in a low memory environment, in which case the user
must increase the number of channels manually. I think that is a fair
tradeoff.

The rest of #1, #2, and #4 can be done in a guaranteed fashion once the
buffers are allocated, since it would just be a few loops to refresh the
rx_desc and recycle_ring.
> 
> Regards,
> Tariq

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-12  7:51 ` [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
  2016-07-12 12:02   ` Tariq Toukan
  2016-07-13 11:27   ` David Laight
@ 2016-07-14  7:25   ` Jesper Dangaard Brouer
  2016-07-15  3:30     ` Alexei Starovoitov
  2016-07-15 18:08     ` Tom Herbert
  2 siblings, 2 replies; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-14  7:25 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, brouer


I would really really like to see the XDP program associated with the
RX ring queues, instead of a single XDP program covering the entire NIC.
(Just move the bpf_prog pointer to struct mlx4_en_rx_ring)

So, why is this so important? It is a fundamental architectural choice.

With a single XDP program per NIC, we are no better than DPDK,
where a single application monopolizes the entire NIC.  Recently netmap
added support for running on a single specific queue[1].  This is the
number one argument our customers give for not wanting to run DPDK:
they need to dedicate an entire NIC per high speed application.

As John Fastabend says, his NICs have thousands of queues, and he wants
to bind applications to the queues.  This idea of binding queues to
applications goes all the way back to Van Jacobson's 2006
netchannels[2], where creating an application channel allows for a lock
free single producer single consumer (SPSC) queue directly into the
application.  An XDP program "locked" to a single RX queue can make
these optimizations; a global XDP program cannot.


Why this change now, why can't this wait?

I'm starting to see more and more code assuming that a single global
XDP program owns the NIC.  This will be harder and harder to clean up.
I'm fine with the first patch iteration only supporting setting the XDP
program on all RX queues (e.g. returning ENOSUPPORT for specific
queues). I'm only requesting that this be moved to struct mlx4_en_rx_ring,
and that appropriate refcnt handling is done.



On Tue, 12 Jul 2016 00:51:27 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> 
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
> 
[...]
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 51 ++++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 37 +++++++++++++++++--
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 +++
>  3 files changed, 89 insertions(+), 4 deletions(-)
> 
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index c1b3a9c..adfa123 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -743,6 +743,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
>  	struct mlx4_en_rx_alloc *frags;
>  	struct mlx4_en_rx_desc *rx_desc;
> +	struct bpf_prog *prog;
>  	struct sk_buff *skb;
>  	int index;
>  	int nr;
> @@ -759,6 +760,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	if (budget <= 0)
>  		return polled;
>  
> +	prog = READ_ONCE(priv->prog);

prog = READ_ONCE(ring->prog);

>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
> @@ -835,6 +838,35 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag.
> +		 */
> +		if (prog) {
> +			struct xdp_buff xdp;
> +			dma_addr_t dma;
> +			u32 act;
> +
> +			dma = be64_to_cpu(rx_desc->data[0].addr);
> +			dma_sync_single_for_cpu(priv->ddev, dma,
> +						priv->frag_info[0].frag_size,
> +						DMA_FROM_DEVICE);
> +
> +			xdp.data = page_address(frags[0].page) +
> +							frags[0].page_offset;
> +			xdp.data_end = xdp.data + length;
> +
> +			act = bpf_prog_run_xdp(prog, &xdp);
> +			switch (act) {
> +			case XDP_PASS:
> +				break;
> +			default:
> +				bpf_warn_invalid_xdp_action(act);
> +			case XDP_ABORTED:
> +			case XDP_DROP:
> +				goto next;
> +			}
> +		}
[...]

> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index d39bf59..35ecfa2 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
[...]
> @@ -590,6 +594,7 @@ struct mlx4_en_priv {
>  	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
>  	struct hwtstamp_config hwtstamp_config;
>  	u32 counter_index;
> +	struct bpf_prog *prog;

Move to struct mlx4_en_rx_ring.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-14  7:25   ` Jesper Dangaard Brouer
@ 2016-07-15  3:30     ` Alexei Starovoitov
  2016-07-15  8:21       ` Jesper Dangaard Brouer
  2016-07-15 16:18       ` Tom Herbert
  2016-07-15 18:08     ` Tom Herbert
  1 sibling, 2 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2016-07-15  3:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Thu, Jul 14, 2016 at 09:25:43AM +0200, Jesper Dangaard Brouer wrote:
> 
> I would really really like to see the XDP program associated with the
> RX ring queues, instead of a single XDP program covering the entire NIC.
> (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
> 
> So, why is this so important? It is a fundamental architectural choice.
> 
> With a single XDP program per NIC, then we are not better than DPDK,
> where a single application monopolize the entire NIC.  Recently netmap
> added support for running on a single specific queue[1].  This is the
> number one argument our customers give, for not wanting to run DPDK,
> because they need to dedicate an entire NIC per high speed application.
> 
> As John Fastabend says, his NICs have thousands of queues, and he want
> to bind applications to the queues.  This idea of binding queues to
> applications, goes all the way back to Van Jacobson's 2006
> netchannels[2].  Where creating an application channel allow for lock
> free single producer single consumer (SPSC) queue directly into the
> application.  A XDP program "locked" to a single RX queue can make
> these optimizations, a global XDP programm cannot.
> 
> 
> Why this change now, why can't this wait?
> 
> I'm starting to see more and more code assuming that a single global
> XDP program owns the NIC.  This will be harder and harder to cleanup.
> I'm fine with the first patch iteration only supports setting the XDP
> program on all RX queue (e.g. returns ENOSUPPORT on specific
> queues). Only requesting that this is moved to struct mlx4_en_rx_ring,
> and appropriate refcnt handling is done.

attaching a program to all rings at once is fundamental for correct
operation. As was pointed out in the past, the bpf_prog-pointer-in-the-ring
design loses atomicity of the update. While the new program is
being attached, the old program is still running on other rings.
That is not something user space can compensate for.
So for the current 'one prog for all rings' we cannot do what you're suggesting,
yet it doesn't mean we won't do prog per ring tomorrow. To do that, the other
aspects need to be agreed upon before we jump into implementation:
- what is the way for the program to know which ring it's running on?
  if there is no such way, then attaching the same prog to multiple
  rings is meaningless.
  we can easily extend 'struct xdp_md' in the future if we decide
  that it's worth doing.
- should we allow different programs to attach to different rings?
  we certainly can, but at this point there are only two XDP programs,
  the ILA router and the L4 load balancer. Both require a single program
  on all rings. Before we add a new feature, we need a real use case for it.
- if the program knows the rx ring, should it be able to specify the tx ring?
  It's doable, but it requires locking and performance will tank.

> I'm starting to see more and more code assuming that a single global
> XDP program owns the NIC.  This will be harder and harder to cleanup.

The two xdp programs in the world today want to see all rings at once.
We don't need the extra complexity of figuring out the number of rings and
struggling with lack of atomicity.
There is nothing to 'clean up' at this point.

The reason netmap/dpdk want to run on a given hw ring comes from
the problem that they cannot share the nic with the stack.
XDP is different. It natively integrates with the stack.
XDP_PASS is a programmable way to indicate that the packet is a control
plane packet and should be passed into the stack and further into applications.
netmap/dpdk don't have such an ability, so they have to resort to a
bifurcated driver model.
At this point I don't see a _real_ use case where we want to see
different bpf programs running on different rings, but as soon as
one comes we can certainly add support for it.
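
(To make the atomicity point concrete: the device-level attach in this
series boils down to a single atomic pointer swap, roughly as in patch 4;
the per-ring variant below is only a hypothetical sketch of the
alternative, with a made-up ->prog field in the ring struct.)

	/* device-level: one swap; every napi poll sees either the old
	 * or the new program, never a mix across the attach
	 */
	old_prog = xchg(&priv->prog, prog);
	if (old_prog)
		bpf_prog_put(old_prog);

	/* per-ring: while this loop runs, some rings still execute the
	 * old program and others already run the new one
	 */
	for (i = 0; i < priv->rx_ring_num; i++) {
		old_prog = xchg(&priv->rx_ring[i]->prog, prog);
		if (old_prog)
			bpf_prog_put(old_prog);
	}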

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15  3:30     ` Alexei Starovoitov
@ 2016-07-15  8:21       ` Jesper Dangaard Brouer
  2016-07-15 16:56         ` Alexei Starovoitov
  2016-07-15 16:18       ` Tom Herbert
  1 sibling, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-15  8:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, brouer

On Thu, 14 Jul 2016 20:30:59 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Jul 14, 2016 at 09:25:43AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > I would really really like to see the XDP program associated with the
> > RX ring queues, instead of a single XDP program covering the entire NIC.
> > (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
> > 
> > So, why is this so important? It is a fundamental architectural choice.
> > 
> > With a single XDP program per NIC, then we are not better than DPDK,
> > where a single application monopolize the entire NIC.  Recently netmap
> > added support for running on a single specific queue[1].  This is the
> > number one argument our customers give, for not wanting to run DPDK,
> > because they need to dedicate an entire NIC per high speed application.
> > 
> > As John Fastabend says, his NICs have thousands of queues, and he want
> > to bind applications to the queues.  This idea of binding queues to
> > applications, goes all the way back to Van Jacobson's 2006
> > netchannels[2].  Where creating an application channel allow for lock
> > free single producer single consumer (SPSC) queue directly into the
> > application.  A XDP program "locked" to a single RX queue can make
> > these optimizations, a global XDP programm cannot.
> > 
> > 
> > Why this change now, why can't this wait?
> > 
> > I'm starting to see more and more code assuming that a single global
> > XDP program owns the NIC.  This will be harder and harder to cleanup.
> > I'm fine with the first patch iteration only supports setting the XDP
> > program on all RX queue (e.g. returns ENOSUPPORT on specific
> > queues). Only requesting that this is moved to struct mlx4_en_rx_ring,
> > and appropriate refcnt handling is done.  
> 
> attaching program to all rings at once is a fundamental part for correct
> operation. As was pointed out in the past the bpf_prog pointer
> in the ring design loses atomicity of the update. While the new program is
> being attached the old program is still running on other rings.
> That is not something user space can compensate for.

I don't see a problem with this.  This is how iptables has been
working for years.  The iptables ruleset exchange is atomic, but only
atomic per CPU.  It's been working fine for iptables.

> So for current 'one prog for all rings' we cannot do what you're suggesting,
> yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> aspects need to be agreed upon before we jump into implementation:
> - what is the way for the program to know which ring it's running on?
>   if there is no such way, then attaching the same prog to multiple
>   ring is meaningless.
>   we can easily extend 'struct xdp_md' in the future if we decide
>   that it's worth doing.

Not sure we need to extend 'struct xdp_md' with a ring number, if we
allow assigning the program to a specific queue (if not, we might need to).

The setup sequence would be:
1. userspace sets up an ntuple filter steering to a queue number
2. userspace registers an XDP program on this queue number
3. the kernel XDP program queues packets into an SPSC queue (like netmap)
   (no locking, single RX queue, single producer)
4. userspace reads packets from the SPSC queue (like netmap)

For step 2, the XDP program should return some identifier for the SPSC
queue.  And step 3 is of course a new XDP feature.


> - should we allow different programs to attach to different rings?

Yes, that is one of my main points.  You assume that a single program
owns the entire NIC.  John's proposal was that he can create thousands of
queues, and he wanted to bind these to (e.g. 1000) different applications.


>   we certainly can, but at this point there are only two XDP programs
>   ILA router and L4 load balancer. Both require single program on all rings.
>   Before we add new feature, we need to have real use case for it.
> - if program knows the rx ring, should it be able to specify tx ring?
>   It's doable, but it requires locking and performs will tank.

No, the selection of the TX ring number is something internal.  For
forwarding inside the kernel (_not_ above SPSC), I imagine the XDP
selection of the TX "port" will use some port-table abstraction,
like the mSwitch article:
 "mSwitch: a highly-scalable, modular software switch"
 http://info.iet.unipi.it/~luigi/20150617-mswitch-paper.pdf


> > I'm starting to see more and more code assuming that a single global
> > XDP program owns the NIC.  This will be harder and harder to
> > cleanup.  
> 
> Two xdp programs in the world today want to see all rings at once.
> We don't need extra comlexity of figuring out number of rings and
> struggling with lack of atomicity.
> There is nothing to 'cleanup' at this point.
> 
> The reason netmap/dpdk want to run on a given hw ring come from
> the problem that they cannot share the nic with stack.
> XDP is different. It natively integrates with the stack.
> XDP_PASS is a programmable way to indicate that the packet is a
> control plane packet and should be passed into the stack and further
> into applications. netmap/dpdk don't have such ability, so they have
> to resort to bifurcated driver model.

Netmap has this capability already:
 https://blog.cloudflare.com/partial-kernel-bypass-merged-netmap/

> At this point I don't see a _real_ use case where we want to see
> different bpf programs running on different rings, but as soon as
> it comes we can certainly add support for it.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15  3:30     ` Alexei Starovoitov
  2016-07-15  8:21       ` Jesper Dangaard Brouer
@ 2016-07-15 16:18       ` Tom Herbert
  2016-07-15 16:47         ` Alexei Starovoitov
  1 sibling, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-07-15 16:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Thu, Jul 14, 2016 at 8:30 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Jul 14, 2016 at 09:25:43AM +0200, Jesper Dangaard Brouer wrote:
>>
>> I would really really like to see the XDP program associated with the
>> RX ring queues, instead of a single XDP program covering the entire NIC.
>> (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
>>
>> So, why is this so important? It is a fundamental architectural choice.
>>
>> With a single XDP program per NIC, then we are not better than DPDK,
>> where a single application monopolize the entire NIC.  Recently netmap
>> added support for running on a single specific queue[1].  This is the
>> number one argument our customers give, for not wanting to run DPDK,
>> because they need to dedicate an entire NIC per high speed application.
>>
>> As John Fastabend says, his NICs have thousands of queues, and he want
>> to bind applications to the queues.  This idea of binding queues to
>> applications, goes all the way back to Van Jacobson's 2006
>> netchannels[2].  Where creating an application channel allow for lock
>> free single producer single consumer (SPSC) queue directly into the
>> application.  A XDP program "locked" to a single RX queue can make
>> these optimizations, a global XDP programm cannot.
>>
>>
>> Why this change now, why can't this wait?
>>
>> I'm starting to see more and more code assuming that a single global
>> XDP program owns the NIC.  This will be harder and harder to cleanup.
>> I'm fine with the first patch iteration only supports setting the XDP
>> program on all RX queue (e.g. returns ENOSUPPORT on specific
>> queues). Only requesting that this is moved to struct mlx4_en_rx_ring,
>> and appropriate refcnt handling is done.
>
> attaching program to all rings at once is a fundamental part for correct
> operation. As was pointed out in the past the bpf_prog pointer
> in the ring design loses atomicity of the update. While the new program is
> being attached the old program is still running on other rings.
> That is not something user space can compensate for.
> So for current 'one prog for all rings' we cannot do what you're suggesting,
> yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> aspects need to be agreed upon before we jump into implementation:
> - what is the way for the program to know which ring it's running on?
>   if there is no such way, then attaching the same prog to multiple
>   ring is meaningless.

Why would it need to know? If the user can say 'run this program on
this ring', that should be sufficient.

>   we can easily extend 'struct xdp_md' in the future if we decide
>   that it's worth doing.
> - should we allow different programs to attach to different rings?
>   we certainly can, but at this point there are only two XDP programs
>   ILA router and L4 load balancer. Both require single program on all rings.
>   Before we add new feature, we need to have real use case for it.
> - if program knows the rx ring, should it be able to specify tx ring?
>   It's doable, but it requires locking and performs will tank.
>
>> I'm starting to see more and more code assuming that a single global
>> XDP program owns the NIC.  This will be harder and harder to cleanup.
>

I agree with Jesper on this. Mandating that all rings run the
same program enforces the notion that all rings must be equivalent,
but that is not a requirement in the stack, and it doesn't leverage
features like ntuple filters that are useful for purposely steering
traffic to rings with different purposes. Just one program across all
rings would be very limiting.

> Two xdp programs in the world today want to see all rings at once.

That is only under the initial design. For instance, one thing we
could do for the ILA router is to split SIR-prefixed traffic between
different rings using an ntuple filter. That way we only need to run
the ILA router on rings where we need to do translation; other traffic
would not need to go through that XDP program.

> We don't need extra comlexity of figuring out number of rings and
> struggling with lack of atomicity.

We already have this problem with other per ring configuration.

> There is nothing to 'cleanup' at this point.
>
> The reason netmap/dpdk want to run on a given hw ring come from
> the problem that they cannot share the nic with stack.
> XDP is different. It natively integrates with the stack.
> XDP_PASS is a programmable way to indicate that the packet is a control
> plane packet and should be passed into the stack and further into applications.
> netmap/dpdk don't have such ability, so they have to resort to
> bifurcated driver model.
> At this point I don't see a _real_ use case where we want to see
> different bpf programs running on different rings, but as soon as
> it comes we can certainly add support for it.
>
See ILA example above :-)

Tom

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15 16:18       ` Tom Herbert
@ 2016-07-15 16:47         ` Alexei Starovoitov
  2016-07-15 17:49           ` Tom Herbert
  2016-07-15 19:09           ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2016-07-15 16:47 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
> > attaching program to all rings at once is a fundamental part for correct
> > operation. As was pointed out in the past the bpf_prog pointer
> > in the ring design loses atomicity of the update. While the new program is
> > being attached the old program is still running on other rings.
> > That is not something user space can compensate for.
> > So for current 'one prog for all rings' we cannot do what you're suggesting,
> > yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> > aspects need to be agreed upon before we jump into implementation:
> > - what is the way for the program to know which ring it's running on?
> >   if there is no such way, then attaching the same prog to multiple
> >   ring is meaningless.
> 
> Why would it need to know? If the user can say run this program on
> this ring that should be sufficient.

and the program would have to be recompiled with a #define for every ring?
Ouch. We have to do on-the-fly recompilation for tracing because kernel
data structures are different between different kernels, but for
networking that would be an unnecessary headache.

> >   we can easily extend 'struct xdp_md' in the future if we decide
> >   that it's worth doing.
> > - should we allow different programs to attach to different rings?
> >   we certainly can, but at this point there are only two XDP programs
> >   ILA router and L4 load balancer. Both require single program on all rings.
> >   Before we add new feature, we need to have real use case for it.
> > - if program knows the rx ring, should it be able to specify tx ring?
> >   It's doable, but it requires locking and performs will tank.
> >
> >> I'm starting to see more and more code assuming that a single global
> >> XDP program owns the NIC.  This will be harder and harder to cleanup.
> >
> 
> I agree with Jesper on this. If we mandate that all rings must run the
> same program enforces the notion that all rings must be equivalent,
> but that is not a requirement with the stack and doesn't leverage
> features like ntuple filter that are good to purposely steer traffic
> to rings having different. Just one program across all rings would be
> very limiting.
> 
> > Two xdp programs in the world today want to see all rings at once.
> 
> That is only under the initial design. For instance, one thing we
> could do for the ILA router is to split SIR prefixed traffic between
> different rings using an ntuple filter. That way we only need to run
> the ILA router on rings where we need to do translation, other traffic
> would not need to go through that XDP program.

now we're talking :)
such a use case would indeed be great to have, but we need to
spray sir-prefixed traffic to all rx rings.
99% of the ila router's job is doing the routing, so we need all cpus
to participate. If we can reserve a ring to send non-sir-prefixed
traffic to, then it would be good. Can ntuple do that?
So finally it goes back to what I was proposing all along:
1. we need to attach a program to all rings
2. we need to be able to exclude a few rings from it for control
plane traffic.
we can preserve atomicity of attach with an extra boolean flag per ring
that costs nothing in the fast path, since the cost of the 'if' is
one per napi invocation.
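
(A minimal sketch of that per-ring opt-out; the ring field name here is
hypothetical:)

	/* in mlx4_en_process_rx_cq(), once per napi poll */
	prog = ring->xdp_enabled ? READ_ONCE(priv->prog) : NULL;

The single device-wide pointer keeps the attach atomic, and the per-ring
boolean costs one predictable branch per napi invocation.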

> > We don't need extra comlexity of figuring out number of rings and
> > struggling with lack of atomicity.
> 
> We already have this problem with other per ring configuration.

not really. without atomicity of the program change, the user space
daemon that controls it will struggle to adjust. Consider the case
where we're pushing a new update for the load balancer. In such a case we
want to reuse the established bpf map, since we cannot atomically
move it from old to new, but we want to swap the program that uses it
in one go, otherwise two different programs will be accessing
the same map. Technically that's valid, but differences in the programs
may cause issues. Lack of atomicity is not an intractable problem,
it just makes user space quite a bit more complex for no reason.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15  8:21       ` Jesper Dangaard Brouer
@ 2016-07-15 16:56         ` Alexei Starovoitov
  0 siblings, 0 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2016-07-15 16:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Fri, Jul 15, 2016 at 10:21:23AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > attaching program to all rings at once is a fundamental part for correct
> > operation. As was pointed out in the past the bpf_prog pointer
> > in the ring design loses atomicity of the update. While the new program is
> > being attached the old program is still running on other rings.
> > That is not something user space can compensate for.
> 
> I don't see a problem with this.  This is how iptables have been
> working for years.  The iptables ruleset exchange is atomic, but only
> atomic per CPU.  It's been working fine for iptables.

And how is that a good thing?

> > So for current 'one prog for all rings' we cannot do what you're suggesting,
> > yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
> > aspects need to be agreed upon before we jump into implementation:
> > - what is the way for the program to know which ring it's running on?
> >   if there is no such way, then attaching the same prog to multiple
> >   ring is meaningless.
> >   we can easily extend 'struct xdp_md' in the future if we decide
> >   that it's worth doing.
> 
> Not sure we need to extend 'struct xdp_md' with a ring number. If we
> allow assigning the program to a specific queue (if not we might need to).
> 
> The setup sequence would be:
> 1. userspace setup ntuple filter into a queue number
> 2. userspace register XDP program on this queue number
> 3. kernel XDP program queue packets into SPSC queue (like netmap)
>    (no locking, single RX queue, single producer)
> 4. userspace reads packets from SPSC queue (like netmap)
> 
> For step 2, the XDP program should return some identifier for the SPSC
> queue.  And step 3, is of cause a new XDP feature.

so you want 2 now while having zero code for 3 and 4?
Frankly, I thought with all the talk about zero copy the goal was
to improve networking performance of VM traffic, but the above sounds
like you want to build a new kernel bypass (like netmap).
In such a case there is no need for bpf or xdp at all.
Building a kernel bypass looks like a separate problem to me with
its own headaches. We can certainly discuss it, but let's keep
xdp out of the picture then, since the goals are not aligned.

> > - should we allow different programs to attach to different rings?
> 
> Yes, that is one of my main points.  You assume that a single program
> own the entire NIC.  John's proposal was that he can create 1000's of
> queues, and wanted to bind this to (e.g. 1000) different applications.

reserving rx queues for different applications is yet another very
different problem. I think it would be quite cool to do that, but
I don't see how it's related to what we want to achieve with xdp.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15 16:47         ` Alexei Starovoitov
@ 2016-07-15 17:49           ` Tom Herbert
  2016-07-18  9:10             ` Thomas Graf
  2016-07-15 19:09           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-07-15 17:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Fri, Jul 15, 2016 at 9:47 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
>> > attaching program to all rings at once is a fundamental part for correct
>> > operation. As was pointed out in the past the bpf_prog pointer
>> > in the ring design loses atomicity of the update. While the new program is
>> > being attached the old program is still running on other rings.
>> > That is not something user space can compensate for.
>> > So for current 'one prog for all rings' we cannot do what you're suggesting,
>> > yet it doesn't mean we won't do prog per ring tomorrow. To do that the other
>> > aspects need to be agreed upon before we jump into implementation:
>> > - what is the way for the program to know which ring it's running on?
>> >   if there is no such way, then attaching the same prog to multiple
>> >   ring is meaningless.
>>
>> Why would it need to know? If the user can say run this program on
>> this ring that should be sufficient.
>
> and the program would have to be recompiled with #define for every ring?

Why would we need to recompile? We should be able to run the same
program on different rings; this is just a matter of associating each
ring with a program.

>> We already have this problem with other per ring configuration.
>
> not really. without atomicity of the program change, the user space
> daemon that controls it will struggle to adjust. Consider the case
> where we're pushing new update for loadbalancer. In such case we
> want to reuse the established bpf map, since we cannot atomically
> move it from old to new, but we want to swap the program that uses
> in one go, otherwise two different programs will be accessing
> the same map. Technically it's valid, but difference in the programs
> may cause issues. Lack of atomicity is not intractable problem,
> it just makes user space quite a bit more complex for no reason.
>

I'm really missing why having a program pointer per ring could be so
complicated. This should just be a matter of maintaining a pointer to the
BPF program in each RX queue. If we want to latch together all
the rings to run the same program, then just have an API that does
that-- walk all the queues and set the pointer to the program. If
necessary this can be done atomically by taking the device down for
the operation.

To me, an XDP program is just another attribute of an RX queue; it's
really not special!  We already have very good infrastructure for
managing multiqueue, and pretty much everything in the receive path
operates at the queue level, not the device level-- we should follow
that model.
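
(A rough sketch of the shape being described; the structure and helper
names here are hypothetical, nothing like this exists in the series:)

	struct xdp_rx_queue {			/* stand-in for a driver's rx ring struct */
		struct bpf_prog *xdp_prog;	/* program for this queue only */
	};

	/* "Latch all rings together": walk the queues and point each one at
	 * the same program.  A per-queue attach would just touch one entry.
	 * Refcounting of the old programs is omitted for brevity.
	 */
	static void set_xdp_prog_all(struct xdp_rx_queue *queues, int n,
				     struct bpf_prog *prog)
	{
		int i;

		for (i = 0; i < n; i++)
			queues[i].xdp_prog = prog;
	}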

Tom

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-14  7:25   ` Jesper Dangaard Brouer
  2016-07-15  3:30     ` Alexei Starovoitov
@ 2016-07-15 18:08     ` Tom Herbert
  2016-07-15 18:45       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-07-15 18:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Thu, Jul 14, 2016 at 12:25 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> I would really really like to see the XDP program associated with the
> RX ring queues, instead of a single XDP program covering the entire NIC.
> (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
>
I think it would be helpful to have some concrete implementation to
look at for this. Jesper, can you code up some patches, taking into
account Alexei's concerns about the atomic program update problem?

Thanks,
Tom

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15 18:08     ` Tom Herbert
@ 2016-07-15 18:45       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-15 18:45 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann, brouer

On Fri, 15 Jul 2016 11:08:06 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Thu, Jul 14, 2016 at 12:25 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > I would really really like to see the XDP program associated with the
> > RX ring queues, instead of a single XDP program covering the entire NIC.
> > (Just move the bpf_prog pointer to struct mlx4_en_rx_ring)
> >  
> I think it would be helpful to have some concrete implementation to
> look at for this. Jesper, can you code up some patches, taking into
> account Alexei's concerns about the atomic program update problem?.

Bad timing, as I'm going on 3 weeks vacation from today.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15 16:47         ` Alexei Starovoitov
  2016-07-15 17:49           ` Tom Herbert
@ 2016-07-15 19:09           ` Jesper Dangaard Brouer
  2016-07-18  4:01             ` Alexei Starovoitov
  1 sibling, 1 reply; 47+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-15 19:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann, brouer


On Fri, 15 Jul 2016 09:47:46 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
[..]
> > > We don't need extra comlexity of figuring out number of rings and
> > > struggling with lack of atomicity.  
> > 
> > We already have this problem with other per ring configuration.  
> 
> not really. without atomicity of the program change, the user space
> daemon that controls it will struggle to adjust. Consider the case
> where we're pushing new update for loadbalancer. In such case we
> want to reuse the established bpf map, since we cannot atomically
> move it from old to new, but we want to swap the program that uses
> in one go, otherwise two different programs will be accessing
> the same map. Technically it's valid, but difference in the programs
> may cause issues. Lack of atomicity is not intractable problem,
> it just makes user space quite a bit more complex for no reason.

I don't think you have a problem with updating the program on a per-queue
basis, as it will be updated atomically per RX queue (thus a CPU can
only see one program).
 Today, you already have to handle multiple CPUs running the same
program that need to access the same map.

You mention that there might be a problem if the programs differ too
much to share the map.  But that is the same problem as today.  If you
need to load a program that e.g. changes the map layout, then you
obviously cannot allow it to inherit the old map, but must feed the new
program a new map (with the new layout).


There is actually a performance advantage to knowing that a program is
only attached to a single RX queue, as only a single CPU can process an
RX ring. Thus, when e.g. accessing a map (or other lookup table) you can
avoid any locking.
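
(As a small illustration of the kind of lock-free counting this enables,
an XDP drop-counter can use a per-CPU array so the increment needs no
atomic op at all.  This is only a sketch in the style of the samples in
this series; the section and map names are mine.)

	#include <uapi/linux/bpf.h>
	#include "bpf_helpers.h"

	struct bpf_map_def SEC("maps") drop_cnt = {
		.type = BPF_MAP_TYPE_PERCPU_ARRAY,
		.key_size = sizeof(unsigned int),
		.value_size = sizeof(long),
		.max_entries = 1,
	};

	SEC("xdp_drop_count")
	int xdp_prog(struct xdp_md *ctx)
	{
		unsigned int key = 0;
		long *value = bpf_map_lookup_elem(&drop_cnt, &key);

		if (value)
			(*value)++;	/* per-CPU value, plain increment, no lock */
		return XDP_DROP;
	}

The same reasoning would apply to per-ring state if a program were bound
to a single queue: one producer, no shared-writer contention.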

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-13 15:40         ` Brenden Blanco
@ 2016-07-15 21:52           ` Brenden Blanco
       [not found]             ` <6d638467-eea6-d3e1-6984-88a1198ef303@gmail.com>
  0 siblings, 1 reply; 47+ messages in thread
From: Brenden Blanco @ 2016-07-15 21:52 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, netdev, jhs, saeedm, kafai, brouer, as754m,
	alexei.starovoitov, gerlitz.or, john.fastabend, hannes, tgraf,
	tom, daniel

On Wed, Jul 13, 2016 at 08:40:59AM -0700, Brenden Blanco wrote:
> On Wed, Jul 13, 2016 at 10:17:26AM +0300, Tariq Toukan wrote:
> > 
> > On 13/07/2016 3:54 AM, Brenden Blanco wrote:
> > >On Tue, Jul 12, 2016 at 02:18:32PM -0700, David Miller wrote:
> > >>From: Brenden Blanco <bblanco@plumgrid.com>
> > >>Date: Tue, 12 Jul 2016 00:51:29 -0700
> > >>
> > >>>+	mlx4_en_free_resources(priv);
> > >>>+
> > >>>  	old_prog = xchg(&priv->prog, prog);
> > >>>  	if (old_prog)
> > >>>  		bpf_prog_put(old_prog);
> > >>>-	return 0;
> > >>>+	err = mlx4_en_alloc_resources(priv);
> > >>>+	if (err) {
> > >>>+		en_err(priv, "Failed reallocating port resources\n");
> > >>>+		goto out;
> > >>>+	}
> > >>>+	if (port_up) {
> > >>>+		err = mlx4_en_start_port(dev);
> > >>>+		if (err)
> > >>>+			en_err(priv, "Failed starting port\n");
> > >>A failed configuration operation should _NEVER_ leave the interface in
> > >>an inoperative state like these error paths do.
> > >>
> > >>You must instead preallocate the necessary resources, and only change
> > >>the chip's configuration and commit to the new settings once you have
> > >>successfully allocated those resources.
> > >I'll see what I can do here.
> > That's exactly what we're doing in a patchset that will be submitted
> > to net very soon (this week).
> Thanks Tariq!
> As an example, I had originally tried to integrate this code into
> mlx4_en_set_channels, which seems to have the same problem.
> > It fixes/refactors these failure flows just like Dave described,
> > something like:
> > 
> >     err = mlx4_en_try_alloc_resources(priv, tmp, &new_prof);
> >     if (err)
> >         goto out;
> > 
> >     if (priv->port_up) {
> >         port_up = 1;
> >         mlx4_en_stop_port(dev, 1);
> >     }
> > 
> >     mlx4_en_safe_replace_resources(priv, tmp);
> > 
> >     if (port_up) {
> >         err = mlx4_en_start_port(dev);
> >         if (err)
> >             en_err(priv, "Failed starting port\n");
> >     }
> > 
> > I suggest you keep your code aligned with current net-next driver,
> > and later I will take it and fix it (once merged with net).
So, I took Dave's suggestion to heart, and spent the last 2 days seeing
what was possible to implement with just xdp as the focus, rather than
an overall cleanup which Tariq will be looking at.

Unfortunately, this turned out to be a bit of a rat hole.

What I wanted to do was to pre-allocate all the required pages before
reaching the point of no return. Doing this isn't all that hard, since
it should just be a few loops. However, I ended up with a bit more
duplicated code than one would like, since I had to tease out the
various sections that assume exclusive access to hardware.

But, more than that, I don't see a way to fill these pages into
the rings safely while the hardware still has the ability to write into the old
ones. There was no "pause" API that I could find besides
mlx4_en_stop_port(). That function is fairly destructive and requires
the resource allocation in mlx4_en_start_port() to succeed to recover
the port status.

One option that I considered would be to drain buffers from the rx ring,
and just let mlx4_en_recover_from_oom() do its job once we update the
page template in frag_info[]. This, however, also requires the queues to
be paused safely, so we again have to rely on mlx4_en_stop_port().

One change I can make is to avoid allocating additional tx rings, which
means that we can skip the calls to mlx4_en_free/alloc_resources().

The resulting code would then mirror what mlx4_en_change_mtu() does:

	if (port_up) {
		err = mlx4_en_start_port(dev);
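		/* if restarting the port fails, let the watchdog task retry the bring-up */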
		if (err)
			queue_work(mdev->workqueue, &priv->watchdog_task);
	}

I intend to respin the patchset with this approach, and a few other
changes as requested elsewhere. If the above is still unacceptable, feel
free to let me know and I will avoid spamming the list.
> Another option is to avoid the tx_ring_num change entirely, so as to
> keep the majority of the initialized state valid. We would only allocate
> a new set of pages and refill the rx rings once we have confirmed there
> are enough resources.
> 
> So others can follow the discussion, there are multiple reasons to
> reconfigure the rings.
> 1. The rx frags should be page-per-packet
> 2. The pages should be mapped DMA_BIDIRECTIONAL
> 3. Each rx ring should have a dedicated tx ring, which is off limits
> from the upper stack
> 4. The dedicated tx ring will have a pointer back to its rx ring for
> recycling
> 
> #1 and #2 can be done to the side ahead of time, as you are also
> suggesting.
> 
> Currently, to achieve #3, we increase tx_ring_num while keeping
> num_tx_rings_p_up the same. This precipitates a round of
> free/alloc_resources, which takes some time and has many opportunities
> for failure.
> However, we could resurrect an earlier approach that keeps the
> tx_ring_num unchanged, and instead just do a
> netif_set_real_num_tx_queues(tx_ring_num - rsv_tx_rings) to hide it from
> the stack. This would require that there be enough rings ahead of time,
> with a simple bounds check like:
> if (tx_ring_num < rsv_tx_rings + MLX4_EN_MAX_TX_RING_P_UP) {
> 	en_err(priv, "XDP requires minimum %d + %d rings\n", rsv_tx_rings,
> 		MLX4_EN_MAX_TX_RING_P_UP);
> 	return -EINVAL;
> }
> The default values for tx_ring_num and rx_ring_num will only hit this
> case when operating in a low memory environment, in which case the user
> must increase the number of channels manually. I think that is a fair
> tradeoff.
> 
> The rest of #1, #2, and #4 can be done in a guaranteed fashion once the
> buffers are allocated, since it would just be a few loops to refresh the
> rx_desc and recycle_ring.
> > 
> > Regards,
> > Tariq
Thanks,
Brenden

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15 19:09           ` Jesper Dangaard Brouer
@ 2016-07-18  4:01             ` Alexei Starovoitov
  2016-07-18  8:35               ` Daniel Borkmann
  0 siblings, 1 reply; 47+ messages in thread
From: Alexei Starovoitov @ 2016-07-18  4:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Fri, Jul 15, 2016 at 09:09:52PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Fri, 15 Jul 2016 09:47:46 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
> [..]
> > > > We don't need the extra complexity of figuring out the number of rings and
> > > > struggling with lack of atomicity.  
> > > 
> > > We already have this problem with other per ring configuration.  
> > 
> > not really. without atomicity of the program change, the user space
> > daemon that controls it will struggle to adjust. Consider the case
> > where we're pushing a new update for a load balancer. In such a case we
> > want to reuse the established bpf map, since we cannot atomically
> > move it from old to new, but we want to swap the program that uses
> > it in one go, otherwise two different programs will be accessing
> > the same map. Technically it's valid, but a difference in the programs
> > may cause issues. Lack of atomicity is not an intractable problem,
> > it just makes user space quite a bit more complex for no reason.
> 
> I don't think you have a problem with updating the program on a per-queue
> basis, as they will be updated atomically per RX queue (thus a CPU can
> only see one program).
>  Today, you already have to handle that multiple CPUs running the same
> program need to access the same map.
> 
> You mention that, there might be a problem, if the program differs too
> much to share the map.  But that is the same problem as today.  If you
> need to load a program that e.g. changes the map layout, then you
> obviously cannot allow it to inherit the old map, but must feed the new
> program a new map (with the new layout).
> 
> 
> There is actually a performance advantage of knowing that a program is
> only attached to a single RX queue, as only a single CPU can process an
> RX ring. Thus, when e.g. accessing a map (or other lookup table), you can
> avoid any locking.

rx queue is not always == cpu. We have different nics with different
numbers of queues. We'll try to keep the dataplane and control plane as generic
as possible, otherwise it's an operational headache. That's why the 'attach to all'
default makes the most sense.
I've been thinking more about atomicity and think we'll be able to
add 'prog per rx queue' while preserving atomicity.
We can do it with an extra indirection, 'struct bpf_prog **prog'. The xchg
will swap that single pointer while all rings point to it.
Anyway I think we need to table this discussion, since Jesper's email
is already bouncing with a happy vacation message :) and Tom is traveling.
I'm pretty sure we'll be able to add support for 'prog per rx ring'
while preserving the atomicity of the prog swap that this patch does.
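
Roughly something like this (all struct/function names below are only an
illustration, not the actual mlx4 code, and it assumes the xdp_buff /
bpf_prog_run_xdp pieces from patch 1):

#include <linux/bpf.h>
#include <linux/filter.h>

struct xdp_prog_slot {
        struct bpf_prog *prog;          /* the only thing xchg'ed on replace */
};

struct rx_ring_sketch {
        struct xdp_prog_slot *slot;     /* every ring points at the same slot */
};

/* device-wide replace: one xchg swaps the program for all rings at once */
static void xdp_replace_prog(struct xdp_prog_slot *slot, struct bpf_prog *new_prog)
{
        struct bpf_prog *old_prog = xchg(&slot->prog, new_prog);

        if (old_prog)
                bpf_prog_put(old_prog);
}

/* napi poll path: read the pointer once per poll, run it per packet */
static u32 xdp_run(struct rx_ring_sketch *ring, struct xdp_buff *xdp)
{
        struct bpf_prog *prog = READ_ONCE(ring->slot->prog);

        return prog ? bpf_prog_run_xdp(prog, xdp) : XDP_PASS;
}

A per-ring override could later just point an individual ring at a different
slot without changing the replace path above.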

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-18  4:01             ` Alexei Starovoitov
@ 2016-07-18  8:35               ` Daniel Borkmann
  0 siblings, 0 replies; 47+ messages in thread
From: Daniel Borkmann @ 2016-07-18  8:35 UTC (permalink / raw)
  To: Alexei Starovoitov, Jesper Dangaard Brouer
  Cc: Tom Herbert, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf

On 07/18/2016 06:01 AM, Alexei Starovoitov wrote:
> On Fri, Jul 15, 2016 at 09:09:52PM +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 15 Jul 2016 09:47:46 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>> On Fri, Jul 15, 2016 at 09:18:13AM -0700, Tom Herbert wrote:
>> [..]
>>>>> We don't need the extra complexity of figuring out the number of rings and
>>>>> struggling with lack of atomicity.
>>>>
>>>> We already have this problem with other per ring configuration.
>>>
>>> not really. without atomicity of the program change, the user space
>>> daemon that controls it will struggle to adjust. Consider the case
>>> where we're pushing a new update for a load balancer. In such a case we
>>> want to reuse the established bpf map, since we cannot atomically
>>> move it from old to new, but we want to swap the program that uses
>>> it in one go, otherwise two different programs will be accessing
>>> the same map. Technically it's valid, but a difference in the programs
>>> may cause issues. Lack of atomicity is not an intractable problem,
>>> it just makes user space quite a bit more complex for no reason.
>>
>> I don't think you have a problem with updating the program on a per-queue
>> basis, as they will be updated atomically per RX queue (thus a CPU can
>> only see one program).
>>   Today, you already have to handle that multiple CPUs running the same
>> program need to access the same map.
>>
>> You mention that, there might be a problem, if the program differs too
>> much to share the map.  But that is the same problem as today.  If you
>> need to load a program that e.g. changes the map layout, then you
>> obviously cannot allow it to inherit the old map, but must feed the new
>> program a new map (with the new layout).
>>
>> There is actually a performance advantage of knowing that a program is
>> only attached to a single RX queue, as only a single CPU can process an
>> RX ring. Thus, when e.g. accessing a map (or other lookup table), you can
>> avoid any locking.
>
> rx queue is not always == cpu. We have different nics with different
> numbers of queues. We'll try to keep the dataplane and control plane as generic
> as possible, otherwise it's an operational headache. That's why the 'attach to all'
> default makes the most sense.
> I've been thinking more about atomicity and think we'll be able to
> add 'prog per rx queue' while preserving atomicity.
> We can do it with an extra indirection, 'struct bpf_prog **prog'. The xchg
> will swap that single pointer while all rings point to it.

That makes sense to me, and it still allows for the xchg on individual
programs. You could also have a second **prog_inactive where all the
setup is done first, for cases where not all programs to be attached
are the same, and then move it atomically over to **prog to go live,
and vice versa for teardown.
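
A rough sketch of that two-slot idea (names and the per-ring table size below
are placeholders, nothing from the actual patches):

#include <linux/filter.h>

#define MAX_RX_RINGS    128     /* placeholder bound, just for the sketch */

struct xdp_prog_set {
        struct bpf_prog *prog[MAX_RX_RINGS];    /* one entry per rx ring */
};

struct xdp_state_sketch {
        struct xdp_prog_set *active;    /* what the rx path dereferences */
        struct xdp_prog_set *inactive;  /* staged off to the side */
};

/* stage everything on ->inactive first, then flip in one step; the old
 * active set becomes the new inactive one and can be torn down or reused
 */
static void xdp_commit(struct xdp_state_sketch *st)
{
        st->inactive = xchg(&st->active, st->inactive);
}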

> Anyway I think we need to table this discussion, since Jesper's email
> is already bouncing with a happy vacation message :) and Tom is traveling.
> I'm pretty sure we'll be able to add support for 'prog per rx ring'
> while preserving the atomicity of the prog swap that this patch does.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-15 17:49           ` Tom Herbert
@ 2016-07-18  9:10             ` Thomas Graf
  2016-07-18 11:39               ` Tom Herbert
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Graf @ 2016-07-18  9:10 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Daniel Borkmann

On 07/15/16 at 10:49am, Tom Herbert wrote:
> I'm really missing why having a program pointer per ring could be so
> complicated. This should just be a matter of maintaining a pointer to the
> BPF program in each RX queue. If we want to latch together all
> the rings to run the same program then just have an API that does
> that-- walk all the queues and set the pointer to the program.  If
> necessary this can be done atomically by taking the device down for
> the operation.

I think two different use cases are being discussed here. Running
individual programs on different rings vs providing guarantees for the
straightforward solo program use case.

Implementing the program per ring doesn't sound complicated and it
looks like we are only debating whether to add it now or as a second
step.

For the solo program use case: an excellent property of BPF with cls_bpf
right now is that it is possible to atomically replace a BPF program
without disruption or dropping any packets (thanks to the properties of
tc). This makes updating BPF programs simple and reliable. Even map
layout updates can be managed relatively easily right now.

It should be a goal to preserve that property in XDP. As a user, I
will not expect the same guarantees when I add different programs to
different rings whereas when I add a program on net_device level I will
expect an atomic update without taking down the device.

> To me, an XDP program is just another attribute of an RX queue, it's
>> really not special! We already have a very good infrastructure for
> managing multiqueue and pretty much everything in the receive path
> operates at the queue level not the device level-- we should follow
> that model.

I agree with that but I would like to keep the current per net_device
atomic properties.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-18  9:10             ` Thomas Graf
@ 2016-07-18 11:39               ` Tom Herbert
  2016-07-18 12:48                 ` Thomas Graf
  2016-07-18 19:03                 ` Brenden Blanco
  0 siblings, 2 replies; 47+ messages in thread
From: Tom Herbert @ 2016-07-18 11:39 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Daniel Borkmann

On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 07/15/16 at 10:49am, Tom Herbert wrote:
>> I'm really missing why having a program pointer per ring could be so
>> complicated. This should just be a matter of maintaining a pointer to the
>> BPF program in each RX queue. If we want to latch together all
>> the rings to run the same program then just have an API that does
>> that-- walk all the queues and set the pointer to the program.  If
>> necessary this can be done atomically by taking the device down for
>> the operation.
>
> I think two different use cases are being discussed here. Running
> individual programs on different rings vs providing guarantees for the
> straightforward solo program use case.
>
> Implementing the program per ring doesn't sound complicated and it
> looks like we are only debating whether to add it now or as a second
> step.
>
> For the solo program use case: an excellent property of BPF with cls_bpf
> right now is that it is possible to atomically replace a BPF program
> without disruption or dropping any packets (thanks to the properties of
> tc). This makes updating BPF programs simple and reliable. Even map
> layout updates can be managed relatively easily right now.
>
> It should be a goal to preserve that property in XDP. As a user, I
> will not expect the same guarantees when I add different programs to
> different rings whereas when I add a program on net_device level I will
> expect an atomic update without taking down the device.
>
>> To me, an XDP program is just another attribute of an RX queue, it's
>> really not special! We already have a very good infrastructure for
>> managing multiqueue and pretty much everything in the receive path
>> operates at the queue level not the device level-- we should follow
>> that model.
>
> I agree with that but I would like to keep the current per net_device
> atomic properties.

I don't see that there are any synchronization guarantees
using xchg. For instance, if the pointer is set right after being read
by a thread for one queue and right before being read by a thread for
another queue, this could result in the old and new program running
concurrently or old one running after new. If we need to synchronize
the operation across all queues then sequence
ifdown,modify-config,ifup will work.

Tom

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-18 11:39               ` Tom Herbert
@ 2016-07-18 12:48                 ` Thomas Graf
  2016-07-18 13:07                   ` Tom Herbert
  2016-07-18 19:03                 ` Brenden Blanco
  1 sibling, 1 reply; 47+ messages in thread
From: Thomas Graf @ 2016-07-18 12:48 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Daniel Borkmann

On 07/18/16 at 01:39pm, Tom Herbert wrote:
> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > I agree with that but I would like to keep the current per net_device
> > atomic properties.
> 
> I don't see that there are any synchronization guarantees
> using xchg. For instance, if the pointer is set right after being read
> by a thread for one queue and right before being read by a thread for
> another queue, this could result in the old and new program running
> concurrently or old one running after new. If we need to synchronize
> the operation across all queues then sequence
> ifdown,modify-config,ifup will work.

Right, there are no synchronization guarantees between threads and I
don't think that's needed. The guarantee that is provided is that if
I replace a BPF program, the replace either succeeds, in which case
all packets have been processed by either the old or the new program, or
the replace fails, in which case the old program is left intact and
all packets are still going through the old program.

This is a nice atomic replacement principle which would be nice to
preserve.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-18 12:48                 ` Thomas Graf
@ 2016-07-18 13:07                   ` Tom Herbert
  2016-07-19  2:45                     ` Alexei Starovoitov
  0 siblings, 1 reply; 47+ messages in thread
From: Tom Herbert @ 2016-07-18 13:07 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Daniel Borkmann

On Mon, Jul 18, 2016 at 2:48 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 07/18/16 at 01:39pm, Tom Herbert wrote:
>> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> > I agree with that but I would like to keep the current per net_device
>> > atomic properties.
>>
> > I don't see that there are any synchronization guarantees
>> using xchg. For instance, if the pointer is set right after being read
>> by a thread for one queue and right before being read by a thread for
>> another queue, this could result in the old and new program running
>> concurrently or old one running after new. If we need to synchronize
>> the operation across all queues then sequence
>> ifdown,modify-config,ifup will work.
>
> Right, there are no synchronization guarantees between threads and I
> don't think that's needed. The guarantee that is provided is that if
> > I replace a BPF program, the replace either succeeds, in which case
> > all packets have been processed by either the old or the new program, or
> > the replace fails, in which case the old program is left intact and
> > all packets are still going through the old program.
>
> This is a nice atomic replacement principle which would be nice to
> preserve.

Sure, if the replace operation fails then the old program should remain in
place. But xchg can't fail, so it seems like that part is just giving a
false sense of security that program replacement is somehow
synchronized across queues.

Tom

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-18 11:39               ` Tom Herbert
  2016-07-18 12:48                 ` Thomas Graf
@ 2016-07-18 19:03                 ` Brenden Blanco
  1 sibling, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-18 19:03 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, Alexei Starovoitov, Jesper Dangaard Brouer,
	David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Daniel Borkmann

On Mon, Jul 18, 2016 at 01:39:02PM +0200, Tom Herbert wrote:
> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > On 07/15/16 at 10:49am, Tom Herbert wrote:
[...]
> >> To me, an XDP program is just another attribute of an RX queue, it's
> >> really not special! We already have a very good infrastructure for
> >> managing multiqueue and pretty much everything in the receive path
> >> operates at the queue level not the device level-- we should follow
> >> that model.
> >
> > I agree with that but I would like to keep the current per net_device
> > atomic properties.
> 
> I don't see that there are any synchronization guarantees
> using xchg. For instance, if the pointer is set right after being read
> by a thread for one queue and right before being read by a thread for
> another queue, this could result in the old and new program running
> concurrently or old one running after new. If we need to synchronize
> the operation across all queues then sequence
> ifdown,modify-config,ifup will work.
The case you mentioned is a valid criticism. The reason I wanted to keep this
fast xchg around is that the full stop/start operation on mlx4 means a second
or longer of downtime. I think something like the following should suffice to
have a clean cut between programs without bringing the whole port down, buffers
and all:

{
        struct bpf_prog *old_prog;
        bool port_up;
        int i;

        mutex_lock(&mdev->state_lock);
        port_up = priv->port_up;
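        /* stop rx processing and wait for any napi poll still running on
         * each ring, so no ring can observe the old program once it is
         * swapped out below
         */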
        priv->port_up = false;
        for (i = 0; i < priv->rx_ring_num; i++)
                napi_synchronize(&priv->rx_cq[i]->napi);

        old_prog = xchg(&priv->prog, prog);
        if (old_prog)
                bpf_prog_put(old_prog);

        priv->port_up = port_up;
        mutex_unlock(&mdev->state_lock);
}

Thoughts?

> 
> Tom

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-18 13:07                   ` Tom Herbert
@ 2016-07-19  2:45                     ` Alexei Starovoitov
  0 siblings, 0 replies; 47+ messages in thread
From: Alexei Starovoitov @ 2016-07-19  2:45 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Daniel Borkmann

On Mon, Jul 18, 2016 at 03:07:01PM +0200, Tom Herbert wrote:
> On Mon, Jul 18, 2016 at 2:48 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > On 07/18/16 at 01:39pm, Tom Herbert wrote:
> >> On Mon, Jul 18, 2016 at 11:10 AM, Thomas Graf <tgraf@suug.ch> wrote:
> >> > I agree with that but I would like to keep the current per net_device
> >> > atomic properties.
> >>
> >> I don't see that there are any synchronization guarantees
> >> using xchg. For instance, if the pointer is set right after being read
> >> by a thread for one queue and right before being read by a thread for
> >> another queue, this could result in the old and new program running
> >> concurrently or old one running after new. If we need to synchronize
> >> the operation across all queues then sequence
> >> ifdown,modify-config,ifup will work.
> >
> > Right, there are no synchronization guarantees between threads and I
> > don't think that's needed. The guarantee that is provided is that if
> > I replace a BPF program, the replace either succeeds, in which case
> > all packets have been processed by either the old or the new program, or
> > the replace fails, in which case the old program is left intact and
> > all packets are still going through the old program.
> >
> > This is a nice atomic replacement principle which would be nice to
> > preserve.
> 
> Sure, if the replace operation fails then the old program should remain in
> place. But xchg can't fail, so it seems like that part is just giving a
> false sense of security that program replacement is somehow
> synchronized across queues.

good point. we do read_once at the beginning of napi, so we can
process a bunch of packets on other cpus even after the xchg is all done.
Then I guess we can have prog pointers in the rings and it only marginally
increases the race. Why not, if it doesn't increase the patch complexity...
btw we definitely want to avoid drain/start/stop or any slow operation
during the prog xchg. When xdp is used for DoS, the prog swap needs to be fast.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support
       [not found]             ` <6d638467-eea6-d3e1-6984-88a1198ef303@gmail.com>
@ 2016-07-19 17:41               ` Brenden Blanco
  0 siblings, 0 replies; 47+ messages in thread
From: Brenden Blanco @ 2016-07-19 17:41 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, netdev, jhs, saeedm, kafai, brouer, as754m,
	alexei.starovoitov, gerlitz.or, john.fastabend, hannes, tgraf,
	tom, daniel

On Tue, Jul 19, 2016 at 04:33:28PM +0300, Tariq Toukan wrote:
[...]
> >So, I took Dave's suggestion to heart, and spent the last 2 days seeing
> >what was possible to implement with just xdp as the focus, rather than
> >an overall cleanup which Tariq will be looking at.
> >
> >Unfortunately, this turned out to be a bit of a rat hole.
> >
> >What I wanted to do was to pre-allocate all the required pages before
> >reaching the point of no return. Doing this isn't all that hard, since
> >it should just be a few loops. However, I ended up with a bit more
> >duplicated code than one would like, since I had to tease out the
> >various sections that assume exclusive access to hardware.
> >
> >But, more than that, I don't see a way to fill these pages into
> >the rings safely while the hardware still has the ability to write into the old
> >ones. There was no "pause" API that I could find besides
> >mlx4_en_stop_port(). That function is fairly destructive and requires
> >the resource allocation in mlx4_en_start_port() to succeed to recover
> >the port status.
> >
> >One option that I considered would be to drain buffers from the rx ring,
> >and just let mlx4_en_recover_from_oom() do its job once we update the
> >page template in frag_info[]. This, however, also requires the queues to
> >be paused safely, so we again have to rely on mlx4_en_stop_port().
> >
> >One change I can make is to avoid allocating additional tx rings, which
> >means that we can skip the calls to mlx4_en_free/alloc_resources().
> I think it's more natural to manage the two types of tx rings
> without converting one type to the other dynamically, just to avoid
> re-allocation.
> I don't see the re-allocation of resources as a serious issue here,
> especially after I submitted patches that add resiliency and make it
> more safe (see them in [1]).
Yes, I saw those; they would be helpful for the v8 version, but from v9
onwards I guess they are not conflicting, since no reallocation is done.
> We just need to make sure we don't end up with an inoperative port,
> I will take care of it once net and net-next are merged.
Thanks
> 
> I'd like to keep the tx rings separation, also when it comes to
> resource allocation, so that we allocate xdp rings when needed, and
> not borrow them from the other type.
> I have two possible solutions in mind:
> 1) Based on the suggestion you have in v8, in which rsv tx rings
> grow beyond the regular tx array, while maintaining the
> accounting.
> One improvement to make the code more clear and readable is to
> explicitly define two fields for accounting the number of tx rings:
> - rename rsv_tx_rings to xdp_tx_rings.
I'll incorporate this one, since the code is changing a bit anyway to
accommodate prog-per-ring.
> - have a new priv->netdev_num_tx_rings = priv->num_tx_rings - xdp_tx_rings.
> and then modify driver rings iterators accordingly.
The ring iteration is sprinkled in too many places to incorporate
easily, so this makes sense as its own cleanup, with either (1) or (2)
from your proposal.
> 
> 2) Another solution I have in mind goes as follows.
> Expand the current tx_ring array to be two dimensional, where
> tx_ring[0] points to the regular tx array, and tx_ring[1] points to
> the xdp rings.
> We look to keep using the same napi handlers and not get into code
> duplication, and make minimal adjustments that do not harm
> performance.
> The idea is to mark CQ's types in creation time, so that we know for
> each CQ whether it's for a regular TX ring or an XDP tx ring.
> When we get a completion, we can just read the CQ type, and then
> choose the correct tx_ring according to cq type and cq->ring,
> something like tx_ring[cq->cq_type][cq->ring]; (we can do it so that
> we avoid using an if statement).
> It is possible to use the existing enum cq_type field "cq->is_tx",
> rename it (say to 'type'), and expand the enum cq_type with an
> additional value for TX_XDP.
> In addition, similarly to the previous suggestion, also here we
> would want to use two separate fields for the number of tx rings,
> one for netdev and one for xdp, and modify the rings iterators
> accordingly.
I think (2) makes sense, but the proof will be in the code, depending on
how easy it is to review.
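
Just so we are talking about the same shape, something like this is what I
have in mind (all names below are only a sketch, not the real mlx4_en
definitions):

struct mlx4_en_tx_ring;                 /* the real ring struct from mlx4_en.h */

#define MAX_TX_RINGS_SKETCH     128     /* placeholder bound for the sketch */

enum mlx4_en_tx_kind {
        TX = 0,         /* regular stack tx rings */
        TX_XDP,         /* rings reserved for XDP_TX */
        TX_KINDS,
};

struct mlx4_en_cq_sketch {
        enum mlx4_en_tx_kind type;      /* fixed at cq creation time */
        int ring;                       /* index within its ring array */
};

struct mlx4_en_priv_sketch {
        struct mlx4_en_tx_ring *tx_ring[TX_KINDS][MAX_TX_RINGS_SKETCH];
        int tx_ring_num[TX_KINDS];      /* netdev vs xdp counts kept separate */
};

/* tx completion: pick the ring purely by indexing, no if on the cq type */
static struct mlx4_en_tx_ring *
cq_to_tx_ring(struct mlx4_en_priv_sketch *priv, struct mlx4_en_cq_sketch *cq)
{
        return priv->tx_ring[cq->type][cq->ring];
}
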
> 
> [1]:
> https://patchwork.ozlabs.org/patch/649620/
> https://patchwork.ozlabs.org/patch/649619/
> 
> Another (not related) nit, as you're already working on the next V,
> I suggest we rename priv->prog to priv->bpf_prog.
I think xdp_prog is better, but point taken. Will update.
> 
> Regards,
> Tariq

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2016-07-19 17:41 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-12  7:51 [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 01/11] bpf: add XDP prog type for early driver filter Brenden Blanco
2016-07-12 13:14   ` Jesper Dangaard Brouer
2016-07-12 14:52     ` Tom Herbert
2016-07-12 16:08       ` Jakub Kicinski
2016-07-13  4:14       ` Alexei Starovoitov
2016-07-12  7:51 ` [PATCH v8 02/11] net: add ndo to setup/query xdp prog in adapter rx Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 03/11] rtnl: add option for setting link xdp prog Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 04/11] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
2016-07-12 12:02   ` Tariq Toukan
2016-07-13 11:27   ` David Laight
2016-07-13 14:08     ` Brenden Blanco
2016-07-14  7:25   ` Jesper Dangaard Brouer
2016-07-15  3:30     ` Alexei Starovoitov
2016-07-15  8:21       ` Jesper Dangaard Brouer
2016-07-15 16:56         ` Alexei Starovoitov
2016-07-15 16:18       ` Tom Herbert
2016-07-15 16:47         ` Alexei Starovoitov
2016-07-15 17:49           ` Tom Herbert
2016-07-18  9:10             ` Thomas Graf
2016-07-18 11:39               ` Tom Herbert
2016-07-18 12:48                 ` Thomas Graf
2016-07-18 13:07                   ` Tom Herbert
2016-07-19  2:45                     ` Alexei Starovoitov
2016-07-18 19:03                 ` Brenden Blanco
2016-07-15 19:09           ` Jesper Dangaard Brouer
2016-07-18  4:01             ` Alexei Starovoitov
2016-07-18  8:35               ` Daniel Borkmann
2016-07-15 18:08     ` Tom Herbert
2016-07-15 18:45       ` Jesper Dangaard Brouer
2016-07-12  7:51 ` [PATCH v8 05/11] Add sample for adding simple drop program to link Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 06/11] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
2016-07-12 12:09   ` Tariq Toukan
2016-07-12 21:18   ` David Miller
2016-07-13  0:54     ` Brenden Blanco
2016-07-13  7:17       ` Tariq Toukan
2016-07-13 15:40         ` Brenden Blanco
2016-07-15 21:52           ` Brenden Blanco
     [not found]             ` <6d638467-eea6-d3e1-6984-88a1198ef303@gmail.com>
2016-07-19 17:41               ` Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 07/11] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 08/11] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
2016-07-12 12:16   ` Tariq Toukan
2016-07-12  7:51 ` [PATCH v8 09/11] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 10/11] bpf: enable direct packet data write for xdp progs Brenden Blanco
2016-07-12  7:51 ` [PATCH v8 11/11] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
2016-07-12 14:38 ` [PATCH v8 00/11] Add driver bpf hook for early packet drop and forwarding Tariq Toukan
2016-07-13 15:00   ` Tariq Toukan
