* [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding
@ 2016-07-08  2:15 Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 01/12] bpf: add XDP prog type for early driver filter Brenden Blanco
                   ` (12 more replies)
  0 siblings, 13 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before
even an skb has been allocated.

Extend this with the ability to modify packet data and send it back out
on the same port.

Patch 1 introduces the new prog type and helpers for validating the bpf
  program. A new userspace struct is defined containing only data and
  data_end as fields, with others to follow in the future.
In patch 2, create a new ndo to pass the fd to supported drivers.
In patch 3, expose a new rtnl option to userspace.
In patch 4, enable support in mlx4 driver.
In patch 5, create a sample drop and count program. With a single core,
  achieved a ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
  packet data access, a bpf array lookup, and an increment.
In patch 6, add a page recycle facility to mlx4 rx, enabled when xdp is
  active.
In patch 7, add the XDP_TX action to bpf.h.
In patch 8, add a helper in the tx path for writing the tx_desc.
In patch 9, add support in mlx4 for packet data write and forwarding.
In patch 10, turn on packet write support in the bpf verifier.
In patch 11, add a sample program for packet write and forwarding. With
  a single core, achieved ~10 Mpps rewrite and forwarding.
In patch 12, add prefetch to mlx4 rx to bump forwarding to ~12 Mpps.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
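
For orientation, a minimal sketch of what a program of the new type looks
like, condensed from the xdp1 sample added in patch 5 (the SEC() macro and
headers come from samples/bpf; the drop-everything-but-ARP policy is purely
illustrative, not part of this series):

    /* xdp_min_kern.c -- illustrative sketch only */
    #define KBUILD_MODNAME "foo"
    #include <uapi/linux/bpf.h>
    #include <linux/in.h>
    #include <linux/if_ether.h>
    #include "bpf_helpers.h"

    SEC("xdp")
    int xdp_min_prog(struct xdp_md *ctx)
    {
        void *data_end = (void *)(long)ctx->data_end;
        void *data = (void *)(long)ctx->data;
        struct ethhdr *eth = data;

        /* The verifier requires an explicit bounds check before any
         * access to packet data.
         */
        if (data + sizeof(*eth) > data_end)
            return XDP_DROP;

        /* Let ARP through, drop everything else. */
        if (eth->h_proto == htons(ETH_P_ARP))
            return XDP_PASS;
        return XDP_DROP;
    }

    char _license[] SEC("license") = "GPL";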

v6:
  2/12: drop unnecessary netif_device_present check
  4/12, 6/12, 9/12: Reorder default case statement above drop case to
    remove some copy/paste.

v5:
  0/12: Rebase and remove previous 1/13 patch
  1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
    in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
    out of bounds action is returned by the program. Add a comment to
    bpf.h denoting the undefined nature of out of bounds returns.
  2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
    ndo_xdp_attached().
  3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
    for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
  4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
    bpf_warn_invalid_xdp_action helper.
  5/12: Adjust to using the nested netlink options.
  6/12: kbuild was complaining about overflow of u16 on tile
    architecture...bump frag_stride to u32. The page_offset member that
    is computed from this was already u32.
    
v4:
  2/12: Add inline helper for calling xdp bpf prog under rcu
  3/12: Add detail to ndo comments
  5/12: Remove mlx4_call_xdp and use inline helper instead.
  6/12: Fix checkpatch complaints
  9/12: Introduce new patch 9/12 with common helper for tx_desc write
    Refactor to use common tx_desc write helper
 11/12: Fix checkpatch complaints

v3:
  Rewrite from v2 trying to incorporate feedback from multiple sources.
  Specifically, add ability to forward packets out the same port and
    allow packet modification.
  For packet forwarding, the driver reserves a dedicated set of tx rings
    for exclusive use by xdp. Upon completion, the pages on this ring are
    recycled directly back to a small per-rx-ring page cache without
    being dma unmapped.
  Use of the percpu skb is dropped in favor of a lightweight struct
    xdp_buff. The direct packet access feature is leveraged to remove
    dependence on the skb.
  The mlx4 driver implementation allocates a page-per-packet and maps it
    in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
  Naming is converted to use "xdp" instead of "phys_dev".

v2:
  1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
    Introduce enum for return values from phys_dev hook
  2/5: Move prog->type check to just before invoking ndo
    Change ndo to take a bpf_prog * instead of fd
    Add ndo_bpf_get rather than keeping a bool in the netdev struct
  3/5: Use ndo_bpf_get to fetch bool
  4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing
    mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely
    to set a bpf prog when priv->num_frags > 1
    Rename pseudo_skb to bpf_phys_dev_md
    Implement ndo_bpf_get
    Add dma sync just before invoking prog
    Check for explicit bpf return code rather than nonzero
    Remove increment of rx_dropped
  5/5: Use explicit bpf return code in example
    Update commit log with higher pps numbers

Brenden Blanco (12):
  bpf: add XDP prog type for early driver filter
  net: add ndo to set xdp prog in adapter rx
  rtnl: add option for setting link xdp prog
  net/mlx4_en: add support for fast rx drop bpf program
  Add sample for adding simple drop program to link
  net/mlx4_en: add page recycle to prepare rx ring for tx support
  bpf: add XDP_TX xdp_action for direct forwarding
  net/mlx4_en: break out tx_desc write into separate function
  net/mlx4_en: add xdp forwarding and data write support
  bpf: enable direct packet data write for xdp progs
  bpf: add sample for xdp forwarding and rewrite
  net/mlx4_en: add prefetch in xdp rx path

 drivers/infiniband/hw/mlx4/qp.c                 |  11 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  17 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  95 ++++++++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      | 126 ++++++++++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 254 +++++++++++++++++++-----
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  31 ++-
 include/linux/filter.h                          |  18 ++
 include/linux/mlx4/qp.h                         |  18 +-
 include/linux/netdevice.h                       |  14 ++
 include/uapi/linux/bpf.h                        |  20 ++
 include/uapi/linux/if_link.h                    |  12 ++
 kernel/bpf/verifier.c                           |  18 +-
 net/core/dev.c                                  |  30 +++
 net/core/filter.c                               |  91 +++++++++
 net/core/rtnetlink.c                            |  55 +++++
 samples/bpf/Makefile                            |   9 +
 samples/bpf/bpf_load.c                          |   8 +
 samples/bpf/xdp1_kern.c                         |  93 +++++++++
 samples/bpf/xdp1_user.c                         | 181 +++++++++++++++++
 samples/bpf/xdp2_kern.c                         | 114 +++++++++++
 20 files changed, 1127 insertions(+), 88 deletions(-)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c
 create mode 100644 samples/bpf/xdp2_kern.c

-- 
2.8.2


* [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-09  8:14   ` Jesper Dangaard Brouer
                     ` (2 more replies)
  2016-07-08  2:15 ` [PATCH v6 02/12] net: add ndo to set xdp prog in adapter rx Brenden Blanco
                   ` (11 subsequent siblings)
  12 siblings, 3 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add a new bpf prog type that is intended to run in the early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far it only
exposes the packet start and end pointers, and only in read mode.

An XDP program must return one of the well-known enum values; all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. The
driver can choose to implement unknown return codes however it wants,
but must invoke the warning helper with the action value.
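
In driver terms the expected pattern is roughly the following sketch (the
real mlx4 hook is added in patch 4 of this series; "next" stands for
whatever label the driver uses to discard the current rx buffer):

    act = bpf_prog_run_xdp(prog, &xdp);
    switch (act) {
    case XDP_PASS:
        break;                          /* continue normal rx processing */
    default:
        bpf_warn_invalid_xdp_action(act);
        /* fall through: treat unknown return codes as a drop */
    case XDP_DROP:
        goto next;                      /* release the rx buffer */
    }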

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/filter.h   | 18 ++++++++++
 include/uapi/linux/bpf.h | 19 ++++++++++
 kernel/bpf/verifier.c    |  1 +
 net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 129 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fc31ef..522dbc9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -368,6 +368,11 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
+struct xdp_buff {
+	void *data;
+	void *data_end;
+};
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf and act_bpf programs
  */
@@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, skb);
 }
 
+static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+				   struct xdp_buff *xdp)
+{
+	u32 ret;
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)xdp);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
 	return max(sizeof(struct bpf_prog),
@@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
+void bpf_warn_invalid_xdp_action(int act);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c14ca1c..5b47ac3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
 	BPF_PROG_TYPE_TRACEPOINT,
+	BPF_PROG_TYPE_XDP,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -430,4 +431,22 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* User return codes for XDP prog type.
+ * A valid XDP program must return one of these defined values. All other
+ * return codes are reserved for future use. Unknown return codes will result
+ * in driver-dependent behavior.
+ */
+enum xdp_action {
+	XDP_DROP,
+	XDP_PASS,
+};
+
+/* user accessible metadata for XDP packet hook
+ * new fields must be added to the end of this structure
+ */
+struct xdp_md {
+	__u32 data;
+	__u32 data_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e206c21..a8d67d0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_XDP:
 		break;
 	default:
 		verbose("verifier is misconfigured\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 10c4a2f..4ba446f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+xdp_func_proto(enum bpf_func_id func_id)
+{
+	return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2436,6 +2442,56 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+				  enum bpf_access_type type)
+{
+	if (off < 0 || off >= sizeof(struct xdp_md))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static bool xdp_is_valid_access(int off, int size,
+				enum bpf_access_type type,
+				enum bpf_reg_type *reg_hint)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct xdp_md, data):
+		*reg_hint = PTR_TO_PACKET;
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*reg_hint = PTR_TO_PACKET_END;
+		break;
+	}
+
+	return __is_valid_xdp_access(off, size, type);
+}
+
+void bpf_warn_invalid_xdp_action(int act)
+{
+	WARN_ONCE(1, "\n"
+		     "*****************************************************\n"
+		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
+		     "**                                               **\n"
+		     "** XDP program returned unknown value %-10u **\n"
+		     "**                                               **\n"
+		     "** XDP programs must return a well-known return  **\n"
+		     "** value. Invalid return values will result in   **\n"
+		     "** undefined packet actions.                     **\n"
+		     "**                                               **\n"
+		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
+		     "*****************************************************\n",
+		  act);
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
+
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 				      int src_reg, int ctx_off,
 				      struct bpf_insn *insn_buf,
@@ -2587,6 +2643,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+				  int src_reg, int ctx_off,
+				  struct bpf_insn *insn_buf,
+				  struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct xdp_md, data):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data));
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto		= sk_filter_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
@@ -2599,6 +2678,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.convert_ctx_access	= bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops xdp_ops = {
+	.get_func_proto		= xdp_func_proto,
+	.is_valid_access	= xdp_is_valid_access,
+	.convert_ctx_access	= xdp_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2614,11 +2699,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_SCHED_ACT,
 };
 
+static struct bpf_prog_type_list xdp_type __read_mostly = {
+	.ops	= &xdp_ops,
+	.type	= BPF_PROG_TYPE_XDP,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
+	bpf_register_prog_type(&xdp_type);
 
 	return 0;
 }
-- 
2.8.2


* [PATCH v6 02/12] net: add ndo to set xdp prog in adapter rx
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 01/12] bpf: add XDP prog type for early driver filter Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-10 20:59   ` Tom Herbert
  2016-07-08  2:15 ` [PATCH v6 03/12] rtnl: add option for setting link xdp prog Brenden Blanco
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add two new set/check netdev ops for drivers implementing the
BPF_PROG_TYPE_XDP filter.
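
A minimal sketch of how a driver might implement the two ops (struct
foo_priv and its xdp_prog field are placeholders; the real mlx4
implementation follows in patch 4):

    static int foo_xdp_set(struct net_device *dev, struct bpf_prog *prog)
    {
        struct foo_priv *priv = netdev_priv(dev);
        struct bpf_prog *old_prog;

        /* Publish the new prog for the rx fast path. The core hands us a
         * reference on the new prog, but any previously stored prog must
         * be released here.
         */
        old_prog = xchg(&priv->xdp_prog, prog);
        if (old_prog)
            bpf_prog_put(old_prog);
        return 0;
    }

    static bool foo_xdp_attached(struct net_device *dev)
    {
        struct foo_priv *priv = netdev_priv(dev);

        return !!READ_ONCE(priv->xdp_prog);
    }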

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/netdevice.h | 14 ++++++++++++++
 net/core/dev.c            | 30 ++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49736a3..36ae955 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -63,6 +63,7 @@ struct wpan_dev;
 struct mpls_dev;
 /* UDP Tunnel offloads */
 struct udp_tunnel_info;
+struct bpf_prog;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
 				    const struct ethtool_ops *ops);
@@ -1087,6 +1088,15 @@ struct tc_to_netdev {
  *	appropriate rx headroom value allows avoiding skb head copy on
  *	forward. Setting a negative value resets the rx headroom to the
  *	default value.
+ * int (*ndo_xdp_set)(struct net_device *dev, struct bpf_prog *prog);
+ *	This function is used to set or clear a bpf program used in the
+ *	earliest stages of packet rx. The prog will have been loaded as
+ *	BPF_PROG_TYPE_XDP. The callee is responsible for calling bpf_prog_put
+ *	on any old progs that are stored, but not on the passed in prog.
+ * bool (*ndo_xdp_attached)(struct net_device *dev);
+ *	This function is used to check if a bpf program is set on the device.
+ *	The callee should return true if a program is currently attached and
+ *	running.
  *
  */
 struct net_device_ops {
@@ -1271,6 +1281,9 @@ struct net_device_ops {
 						       struct sk_buff *skb);
 	void			(*ndo_set_rx_headroom)(struct net_device *dev,
 						       int needed_headroom);
+	int			(*ndo_xdp_set)(struct net_device *dev,
+					       struct bpf_prog *prog);
+	bool			(*ndo_xdp_attached)(struct net_device *dev);
 };
 
 /**
@@ -3257,6 +3270,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_xdp_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index b92d63b..154b057 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -94,6 +94,7 @@
 #include <linux/ethtool.h>
 #include <linux/notifier.h>
 #include <linux/skbuff.h>
+#include <linux/bpf.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/busy_poll.h>
@@ -6615,6 +6616,35 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 EXPORT_SYMBOL(dev_change_proto_down);
 
 /**
+ *	dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ *	@dev: device
+ *	@fd: new program fd or negative value to clear
+ *
+ *	Set or clear a bpf program for a device
+ */
+int dev_change_xdp_fd(struct net_device *dev, int fd)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct bpf_prog *prog = NULL;
+	int err;
+
+	if (!ops->ndo_xdp_set)
+		return -EOPNOTSUPP;
+	if (fd >= 0) {
+		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	err = ops->ndo_xdp_set(dev, prog);
+	if (err < 0 && prog)
+		bpf_prog_put(prog);
+
+	return err;
+}
+EXPORT_SYMBOL(dev_change_xdp_fd);
+
+/**
  *	dev_new_index	-	allocate an ifindex
  *	@net: the applicable net namespace
  *
-- 
2.8.2


* [PATCH v6 03/12] rtnl: add option for setting link xdp prog
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 01/12] bpf: add XDP prog type for early driver filter Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 02/12] net: add ndo to set xdp prog in adapter rx Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Sets the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
Providing a negative value as fd clears the program. Getting the fd back
via rtnl is not possible; therefore, reading this attribute merely
provides a bool indicating whether a program is attached to the link or not.
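
From userspace the new attribute is a nested IFLA_XDP carrying an s32
IFLA_XDP_FD. A condensed sketch of building it by hand inside an
RTM_SETLINK request ("req" is the usual nlmsghdr + ifinfomsg + attribute
buffer; the complete function is set_link_xdp_fd() in patch 5's
xdp1_user.c):

    struct nlattr *nla, *nla_xdp;

    nla = (struct nlattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
    nla->nla_type = NLA_F_NESTED | IFLA_XDP;

    nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
    nla_xdp->nla_type = IFLA_XDP_FD;
    nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
    memcpy((char *)nla_xdp + NLA_HDRLEN, &prog_fd, sizeof(prog_fd));

    nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
    req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);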

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/if_link.h | 12 ++++++++++
 net/core/rtnetlink.c         | 55 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4285ac3..a1b5202 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -156,6 +156,7 @@ enum {
 	IFLA_GSO_MAX_SEGS,
 	IFLA_GSO_MAX_SIZE,
 	IFLA_PAD,
+	IFLA_XDP,
 	__IFLA_MAX
 };
 
@@ -843,4 +844,15 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* XDP section */
+
+enum {
+	IFLA_XDP_UNSPEC,
+	IFLA_XDP_FD,
+	IFLA_XDP_ATTACHED,
+	__IFLA_XDP_MAX,
+};
+
+#define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a9e3805..017780e 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -891,6 +891,16 @@ static size_t rtnl_port_size(const struct net_device *dev,
 		return port_self_size;
 }
 
+static size_t rtnl_xdp_size(const struct net_device *dev)
+{
+	size_t xdp_size = nla_total_size(1);	/* XDP_ATTACHED */
+
+	if (!dev->netdev_ops->ndo_xdp_attached)
+		return 0;
+	else
+		return xdp_size;
+}
+
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
 				     u32 ext_filter_mask)
 {
@@ -927,6 +937,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
+	       + rtnl_xdp_size(dev) /* IFLA_XDP */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1211,6 +1222,24 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	struct nlattr *xdp;
+
+	if (!dev->netdev_ops->ndo_xdp_attached)
+		return 0;
+	xdp = nla_nest_start(skb, IFLA_XDP);
+	if (!xdp)
+		return -EMSGSIZE;
+	if (nla_put_u8(skb, IFLA_XDP_ATTACHED,
+		       dev->netdev_ops->ndo_xdp_attached(dev))) {
+		nla_nest_cancel(skb, xdp);
+		return -EMSGSIZE;
+	}
+	nla_nest_end(skb, xdp);
+	return 0;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask)
@@ -1307,6 +1336,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_port_fill(skb, dev, ext_filter_mask))
 		goto nla_put_failure;
 
+	if (rtnl_xdp_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;
@@ -1392,6 +1424,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 	[IFLA_PROTO_DOWN]	= { .type = NLA_U8 },
+	[IFLA_XDP]		= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1429,6 +1462,11 @@ static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
 	[IFLA_PORT_RESPONSE]	= { .type = NLA_U16, },
 };
 
+static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
+	[IFLA_XDP_FD]		= { .type = NLA_S32 },
+	[IFLA_XDP_ATTACHED]	= { .type = NLA_U8 },
+};
+
 static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
 {
 	const struct rtnl_link_ops *ops = NULL;
@@ -2054,6 +2092,23 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_NOTIFY;
 	}
 
+	if (tb[IFLA_XDP]) {
+		struct nlattr *xdp[IFLA_XDP_MAX + 1];
+
+		err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
+				       ifla_xdp_policy);
+		if (err < 0)
+			goto errout;
+
+		if (xdp[IFLA_XDP_FD]) {
+			err = dev_change_xdp_fd(dev,
+						nla_get_s32(xdp[IFLA_XDP_FD]));
+			if (err)
+				goto errout;
+			status |= DO_SETLINK_NOTIFY;
+		}
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if (status & DO_SETLINK_NOTIFY)
-- 
2.8.2


* [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (2 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 03/12] rtnl: add option for setting link xdp prog Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-09 14:07   ` Or Gerlitz
  2016-07-09 19:58   ` Saeed Mahameed
  2016-07-08  2:15 ` [PATCH v6 05/12] Add sample for adding simple drop program to link Brenden Blanco
                   ` (8 subsequent siblings)
  12 siblings, 2 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.

In tc/socket bpf programs, helpers linearize skb fragments as needed
when the program touches the packet data. However, in the pursuit of
speed, XDP programs will not be allowed to use these slower functions,
especially if it involves allocating an skb.

Therefore, disallow MTU settings that would produce a multi-fragment
packet that XDP programs would fail to access. Future enhancements could
be done to increase the allowable MTU.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 38 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 36 +++++++++++++++++++++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 ++++
 3 files changed, 75 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6083775..5c6b1a0c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
@@ -2084,6 +2085,9 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
 	if (mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_TS)
 		mlx4_en_remove_timestamp(mdev);
 
+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 	/* Detach the netdev so tasks would not attempt to access it */
 	mutex_lock(&mdev->state_lock);
 	mdev->pndev[priv->port] = NULL;
@@ -2112,6 +2116,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 		en_err(priv, "Bad MTU size:%d.\n", new_mtu);
 		return -EPERM;
 	}
+	if (priv->prog && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
+		en_err(priv, "MTU size:%d requires frags but bpf prog running",
+		       new_mtu);
+		return -EOPNOTSUPP;
+	}
 	dev->mtu = new_mtu;
 
 	if (netif_running(dev)) {
@@ -2520,6 +2529,31 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+
+	if (priv->num_frags > 1)
+		return -EOPNOTSUPP;
+
+	/* This xchg is paired with READ_ONCE in the fast path, but is
+	 * also protected from itself via rtnl lock
+	 */
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+static bool mlx4_xdp_attached(struct net_device *dev)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	return !!READ_ONCE(priv->prog);
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2548,6 +2582,8 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp_set		= mlx4_xdp_set,
+	.ndo_xdp_attached	= mlx4_xdp_attached,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2584,6 +2620,8 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp_set		= mlx4_xdp_set,
+	.ndo_xdp_attached	= mlx4_xdp_attached,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c1b3a9c..2bf3d62 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -743,6 +743,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -759,6 +760,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;
 
+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -835,6 +838,34 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag.
+		 */
+		if (prog) {
+			struct xdp_buff xdp;
+			dma_addr_t dma;
+			u32 act;
+
+			dma = be64_to_cpu(rx_desc->data[0].addr);
+			dma_sync_single_for_cpu(priv->ddev, dma,
+						priv->frag_info[0].frag_size,
+						DMA_FROM_DEVICE);
+
+			xdp.data = page_address(frags[0].page) +
+							frags[0].page_offset;
+			xdp.data_end = xdp.data + length;
+
+			act = bpf_prog_run_xdp(prog, &xdp);
+			switch (act) {
+			case XDP_PASS:
+				break;
+			default:
+				bpf_warn_invalid_xdp_action(act);
+			case XDP_DROP:
+				goto next;
+			}
+		}
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
@@ -1062,10 +1093,7 @@ static const int frag_sizes[] = {
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
-	 * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
-	 */
-	int eff_mtu = dev->mtu + ETH_HLEN + (2 * VLAN_HLEN);
+	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
 	int buf_size = 0;
 	int i = 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d39bf59..35ecfa2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -164,6 +164,10 @@ enum {
 #define MLX4_LOOPBACK_TEST_PAYLOAD (HEADER_COPY_SIZE - ETH_HLEN)
 
 #define MLX4_EN_MIN_MTU		46
+/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
+ * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
+ */
+#define MLX4_EN_EFF_MTU(mtu)	((mtu) + ETH_HLEN + (2 * VLAN_HLEN))
 #define ETH_BCAST		0xffffffffffffULL
 
 #define MLX4_EN_LOOPBACK_RETRIES	5
@@ -590,6 +594,7 @@ struct mlx4_en_priv {
 	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
+	struct bpf_prog *prog;
 
 #ifdef CONFIG_MLX4_EN_DCB
 #define MLX4_EN_DCB_ENABLED	0x3
-- 
2.8.2


* [PATCH v6 05/12] Add sample for adding simple drop program to link
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (3 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-09 20:21   ` Saeed Mahameed
  2016-07-11 11:09   ` Jamal Hadi Salim
  2016-07-08  2:15 ` [PATCH v6 06/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP
hook of a link. With the drop-only program, the observed single-core rate
is ~20 Mpps.

Other tests were run, for instance without the dropcnt increment or
without reading from the packet header, the packet rate was mostly
unchanged.

$ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
proto 17:   20403027 drops/s

./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
Running... ctrl^C to stop
Device: eth4@0
Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
  5056638pps 2427Mb/sec (2427186240bps) errors: 0
Device: eth4@1
Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
  5133311pps 2463Mb/sec (2463989280bps) errors: 0
Device: eth4@2
Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
  5077431pps 2437Mb/sec (2437166880bps) errors: 0
Device: eth4@3
Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
  5043067pps 2420Mb/sec (2420672160bps) errors: 0

perf report --no-children:
 26.05%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 17.84%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
  5.52%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
  4.90%  swapper      [kernel.vmlinux]  [k] poll_idle
  4.14%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  2.78%  ksoftirqd/0  [kernel.vmlinux]  [k] __free_pages_ok
  2.57%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.51%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
  1.94%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.45%  swapper      [mlx4_en]         [k] mlx4_en_alloc_frags
  1.35%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
  1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.04%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5c5
  0.96%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c58d
  0.93%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6ee
  0.92%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6b9
  0.89%  ksoftirqd/0  [kernel.vmlinux]  [k] __alloc_pages_nodemask
  0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c686
  0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5d5
  0.78%  ksoftirqd/0  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
  0.77%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5b4
  0.77%  ksoftirqd/0  [kernel.vmlinux]  [k] net_rx_action

machine specs:
 receiver - Intel E5-1630 v3 @ 3.70GHz
 sender - Intel E5645 @ 2.40GHz
 Mellanox ConnectX-3 @40G

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile    |   4 ++
 samples/bpf/bpf_load.c  |   8 +++
 samples/bpf/xdp1_kern.c |  93 +++++++++++++++++++++++++
 samples/bpf/xdp1_user.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 286 insertions(+)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a98b780..0e4ab3a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -21,6 +21,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += xdp1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -42,6 +43,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -64,6 +66,7 @@ always += test_overhead_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
+always += xdp1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -84,6 +87,7 @@ HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
+HOSTLOADLIBES_xdp1 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 022af71..0cfda23 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -50,6 +50,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
 	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+	bool is_xdp = strncmp(event, "xdp", 3) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -66,6 +67,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_KPROBE;
 	} else if (is_tracepoint) {
 		prog_type = BPF_PROG_TYPE_TRACEPOINT;
+	} else if (is_xdp) {
+		prog_type = BPF_PROG_TYPE_XDP;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -79,6 +82,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	prog_fd[prog_cnt++] = fd;
 
+	if (is_xdp)
+		return 0;
+
 	if (is_socket) {
 		event += 6;
 		if (*event != '/')
@@ -319,6 +325,7 @@ int load_bpf_file(char *path)
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
 			    memcmp(shname_prog, "tracepoint/", 11) == 0 ||
+			    memcmp(shname_prog, "xdp", 3) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -336,6 +343,7 @@ int load_bpf_file(char *path)
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
 		    memcmp(shname, "tracepoint/", 11) == 0 ||
+		    memcmp(shname, "xdp", 3) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
new file mode 100644
index 0000000..e7dd8ac
--- /dev/null
+++ b/samples/bpf/xdp1_kern.c
@@ -0,0 +1,93 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+	struct iphdr *iph = data + nh_off;
+
+	if (iph + 1 > data_end)
+		return 0;
+	return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+	struct ipv6hdr *ip6h = data + nh_off;
+
+	if (ip6h + 1 > data_end)
+		return 0;
+	return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	u64 nh_off;
+	u32 index;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+
+	if (h_proto == htons(ETH_P_IP))
+		index = parse_ipv4(data, nh_off, data_end);
+	else if (h_proto == htons(ETH_P_IPV6))
+		index = parse_ipv6(data, nh_off, data_end);
+	else
+		index = 0;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	return rc;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
new file mode 100644
index 0000000..a5e109e
--- /dev/null
+++ b/samples/bpf/xdp1_user.c
@@ -0,0 +1,181 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+
+static int set_link_xdp_fd(int ifindex, int fd)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+
+	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
+
+static int ifindex;
+
+static void int_exit(int sig)
+{
+	set_link_xdp_fd(ifindex, -1);
+	exit(0);
+}
+
+/* simple per-protocol drop counter
+ */
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+	const unsigned int nr_keys = 256;
+	__u64 values[nr_cpus], prev[nr_keys][nr_cpus];
+	__u32 key;
+	int i;
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		sleep(interval);
+
+		for (key = 0; key < nr_keys; key++) {
+			__u64 sum = 0;
+
+			assert(bpf_lookup_elem(map_fd[0], &key, values) == 0);
+			for (i = 0; i < nr_cpus; i++)
+				sum += (values[i] - prev[key][i]);
+			if (sum)
+				printf("proto %u: %10llu pkt/s\n",
+				       key, sum / interval);
+			memcpy(prev[key], values, sizeof(values));
+		}
+	}
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (ac != 2) {
+		printf("usage: %s IFINDEX\n", argv[0]);
+		return 1;
+	}
+
+	ifindex = strtoul(argv[1], NULL, 0);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+
+	if (set_link_xdp_fd(ifindex, prog_fd[0]) < 0) {
+		printf("link set xdp fd failed\n");
+		return 1;
+	}
+
+	poll_stats(2);
+
+	return 0;
+}
-- 
2.8.2


* [PATCH v6 06/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (4 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 05/12] Add sample for adding simple drop program to link Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 07/12] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

The mlx4 driver by default allocates order-3 pages for the ring to
consume in multiple fragments. When the device has an xdp program, this
behavior will prevent tx actions since the page must be re-mapped in
TODEVICE mode, which cannot be done if the page is still shared.

Start by making the allocator configurable based on whether xdp is
running, such that order-0 pages are always used and never shared.

Since this will stress the page allocator, add a simple page cache to
each rx ring. Pages in the cache are left dma-mapped, and in drop-only
stress tests the page allocator is eliminated from the perf report.

Note that setting an xdp program will now require the rings to be
reconfigured.

Before:
 26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
  6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
  4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
  2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq

After:
 31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
  8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
  7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
  6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
  4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
  4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
  3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
  2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
  1.37%  swapper      [kernel.vmlinux]       [k] menu_select
  1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem
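
The cache itself is intentionally simple; in sketch form, the two halves
added below look like this (condensed from mlx4_en_rx_recycle() and
mlx4_en_prepare_rx_desc() in the diff that follows):

    /* rx completion (drop path): park the still-mapped page */
    if (cache->index < MLX4_EN_CACHE_SIZE) {
        cache->buf[cache->index++] = *frame;    /* dma mapping is kept */
        /* page recycled, nothing returned to the allocator */
    }

    /* rx refill: prefer a cached page over the page allocator */
    if (ring->page_cache.index > 0) {
        frags[0] = ring->page_cache.buf[--ring->page_cache.index];
        rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
        /* no dma_map_page() or alloc_pages() needed on this path */
    }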

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 46 +++++++++++++++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      | 69 ++++++++++++++++++++++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    | 12 ++++-
 4 files changed, 115 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 51a2e82..d3d51fa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -47,7 +47,7 @@
 #define EN_ETHTOOL_SHORT_MASK cpu_to_be16(0xffff)
 #define EN_ETHTOOL_WORD_MASK  cpu_to_be32(0xffffffff)
 
-static int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
+int mlx4_en_moderation_update(struct mlx4_en_priv *priv)
 {
 	int i;
 	int err = 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 5c6b1a0c..2883315 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2532,19 +2532,57 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
 	struct bpf_prog *old_prog;
+	int port_up = 0;
+	int err;
+
+	/* No need to reconfigure buffers when simply swapping the
+	 * program for a new one.
+	 */
+	if (READ_ONCE(priv->prog) && prog) {
+		/* This xchg is paired with READ_ONCE in the fast path, but is
+		 * also protected from itself via rtnl lock
+		 */
+		old_prog = xchg(&priv->prog, prog);
+		if (old_prog)
+			bpf_prog_put(old_prog);
+		return 0;
+	}
 
 	if (priv->num_frags > 1)
 		return -EOPNOTSUPP;
 
-	/* This xchg is paired with READ_ONCE in the fast path, but is
-	 * also protected from itself via rtnl lock
-	 */
+	mutex_lock(&mdev->state_lock);
+	if (priv->port_up) {
+		port_up = 1;
+		mlx4_en_stop_port(dev, 1);
+	}
+
+	mlx4_en_free_resources(priv);
+
 	old_prog = xchg(&priv->prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
-	return 0;
+	err = mlx4_en_alloc_resources(priv);
+	if (err) {
+		en_err(priv, "Failed reallocating port resources\n");
+		goto out;
+	}
+	if (port_up) {
+		err = mlx4_en_start_port(dev);
+		if (err)
+			en_err(priv, "Failed starting port\n");
+	}
+
+	err = mlx4_en_moderation_update(priv);
+
+out:
+	if (err)
+		priv->prog = NULL;
+	mutex_unlock(&mdev->state_lock);
+	return err;
 }
 
 static bool mlx4_xdp_attached(struct net_device *dev)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 2bf3d62..02d63a0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -57,7 +57,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 	struct page *page;
 	dma_addr_t dma;
 
-	for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
+	for (order = frag_info->order; ;) {
 		gfp_t gfp = _gfp;
 
 		if (order)
@@ -70,7 +70,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 			return -ENOMEM;
 	}
 	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
-			   PCI_DMA_FROMDEVICE);
+			   frag_info->dma_dir);
 	if (dma_mapping_error(priv->ddev, dma)) {
 		put_page(page);
 		return -ENOMEM;
@@ -124,7 +124,8 @@ out:
 	while (i--) {
 		if (page_alloc[i].page != ring_alloc[i].page) {
 			dma_unmap_page(priv->ddev, page_alloc[i].dma,
-				page_alloc[i].page_size, PCI_DMA_FROMDEVICE);
+				page_alloc[i].page_size,
+				priv->frag_info[i].dma_dir);
 			page = page_alloc[i].page;
 			/* Revert changes done by mlx4_alloc_pages */
 			page_ref_sub(page, page_alloc[i].page_size /
@@ -145,7 +146,7 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
 
 	if (next_frag_end > frags[i].page_size)
 		dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
-			       PCI_DMA_FROMDEVICE);
+			       frag_info->dma_dir);
 
 	if (frags[i].page)
 		put_page(frags[i].page);
@@ -176,7 +177,8 @@ out:
 
 		page_alloc = &ring->page_alloc[i];
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-			       page_alloc->page_size, PCI_DMA_FROMDEVICE);
+			       page_alloc->page_size,
+			       priv->frag_info[i].dma_dir);
 		page = page_alloc->page;
 		/* Revert changes done by mlx4_alloc_pages */
 		page_ref_sub(page, page_alloc->page_size /
@@ -201,7 +203,7 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
 		       i, page_count(page_alloc->page));
 
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-				page_alloc->page_size, PCI_DMA_FROMDEVICE);
+				page_alloc->page_size, frag_info->dma_dir);
 		while (page_alloc->page_offset + frag_info->frag_stride <
 		       page_alloc->page_size) {
 			put_page(page_alloc->page);
@@ -244,6 +246,12 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc *frags = ring->rx_info +
 					(index << priv->log_rx_info);
 
+	if (ring->page_cache.index > 0) {
+		frags[0] = ring->page_cache.buf[--ring->page_cache.index];
+		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
+		return 0;
+	}
+
 	return mlx4_en_alloc_frags(priv, rx_desc, frags, ring->page_alloc, gfp);
 }
 
@@ -502,12 +510,39 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 	}
 }
 
+/* When the rx ring is running in page-per-packet mode, a released frame can go
+ * directly into a small cache, to avoid unmapping or touching the page
+ * allocator. In bpf prog performance scenarios, buffers are either forwarded
+ * or dropped, never converted to skbs, so every page can come directly from
+ * this cache when it is sized to be a multiple of the napi budget.
+ */
+bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
+			struct mlx4_en_rx_alloc *frame)
+{
+	struct mlx4_en_page_cache *cache = &ring->page_cache;
+
+	if (cache->index >= MLX4_EN_CACHE_SIZE)
+		return false;
+
+	cache->buf[cache->index++] = *frame;
+	return true;
+}
+
 void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 			     struct mlx4_en_rx_ring **pring,
 			     u32 size, u16 stride)
 {
 	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_rx_ring *ring = *pring;
+	int i;
+
+	for (i = 0; i < ring->page_cache.index; i++) {
+		struct mlx4_en_rx_alloc *frame = &ring->page_cache.buf[i];
+
+		dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
+			       priv->frag_info[0].dma_dir);
+		put_page(frame->page);
+	}
 
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
 	vfree(ring->rx_info);
@@ -862,6 +897,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			default:
 				bpf_warn_invalid_xdp_action(act);
 			case XDP_DROP:
+				if (mlx4_en_rx_recycle(ring, frags))
+					goto consumed;
 				goto next;
 			}
 		}
@@ -1017,6 +1054,7 @@ next:
 		for (nr = 0; nr < priv->num_frags; nr++)
 			mlx4_en_free_frag(priv, frags, nr);
 
+consumed:
 		++cq->mcq.cons_index;
 		index = (cq->mcq.cons_index) & ring->size_mask;
 		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
@@ -1092,19 +1130,34 @@ static const int frag_sizes[] = {
 
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
+	enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
+	int order = MLX4_EN_ALLOC_PREFER_ORDER;
+	u32 align = SMP_CACHE_BYTES;
 	int buf_size = 0;
 	int i = 0;
 
+	/* bpf requires buffers to be set up as 1 packet per page.
+	 * This only works when num_frags == 1.
+	 */
+	if (priv->prog) {
+		/* This will gain efficient xdp frame recycling at the expense
+		 * of more costly truesize accounting
+		 */
+		align = PAGE_SIZE;
+		order = 0;
+	}
+
 	while (buf_size < eff_mtu) {
+		priv->frag_info[i].order = order;
 		priv->frag_info[i].frag_size =
 			(eff_mtu > buf_size + frag_sizes[i]) ?
 				frag_sizes[i] : eff_mtu - buf_size;
 		priv->frag_info[i].frag_prefix_size = buf_size;
 		priv->frag_info[i].frag_stride =
-				ALIGN(priv->frag_info[i].frag_size,
-				      SMP_CACHE_BYTES);
+				ALIGN(priv->frag_info[i].frag_size, align);
+		priv->frag_info[i].dma_dir = dma_dir;
 		buf_size += priv->frag_info[i].frag_size;
 		i++;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 35ecfa2..0e0ecd1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -259,6 +259,12 @@ struct mlx4_en_rx_alloc {
 	u32		page_size;
 };
 
+#define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
+struct mlx4_en_page_cache {
+	u32 index;
+	struct mlx4_en_rx_alloc buf[MLX4_EN_CACHE_SIZE];
+};
+
 struct mlx4_en_tx_ring {
 	/* cache line used and dirtied in tx completion
 	 * (mlx4_en_free_tx_buf())
@@ -323,6 +329,7 @@ struct mlx4_en_rx_ring {
 	u8  fcs_del;
 	void *buf;
 	void *rx_info;
+	struct mlx4_en_page_cache page_cache;
 	unsigned long bytes;
 	unsigned long packets;
 	unsigned long csum_ok;
@@ -442,7 +449,9 @@ struct mlx4_en_mc_list {
 struct mlx4_en_frag_info {
 	u16 frag_size;
 	u16 frag_prefix_size;
-	u16 frag_stride;
+	u32 frag_stride;
+	enum dma_data_direction dma_dir;
+	int order;
 };
 
 #ifdef CONFIG_MLX4_EN_DCB
@@ -654,6 +663,7 @@ void mlx4_en_set_stats_bitmap(struct mlx4_dev *dev,
 
 void mlx4_en_free_resources(struct mlx4_en_priv *priv);
 int mlx4_en_alloc_resources(struct mlx4_en_priv *priv);
+int mlx4_en_moderation_update(struct mlx4_en_priv *priv);
 
 int mlx4_en_create_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq **pcq,
 		      int entries, int ring, enum cq_type mode, int node);
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v6 07/12] bpf: add XDP_TX xdp_action for direct forwarding
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (5 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 06/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 08/12] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

XDP enabled drivers must transmit received packets back out on the same
port they were received on when a program returns this action.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5b47ac3..e3c3b92 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -439,6 +439,7 @@ struct bpf_tunnel_key {
 enum xdp_action {
 	XDP_DROP,
 	XDP_PASS,
+	XDP_TX,
 };
 
 /* user accessible metadata for XDP packet hook
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v6 08/12] net/mlx4_en: break out tx_desc write into separate function
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (6 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 07/12] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 09/12] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

In preparation for writing the tx descriptor from multiple functions,
create a helper for both normal and blueflame access.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/infiniband/hw/mlx4/qp.c            |  11 +--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 127 +++++++++++++++++------------
 include/linux/mlx4/qp.h                    |  18 ++--
 3 files changed, 90 insertions(+), 66 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 8db8405..768085f 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -232,7 +232,7 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size)
 		}
 	} else {
 		ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1));
-		s = (ctrl->fence_size & 0x3f) << 4;
+		s = (ctrl->qpn_vlan.fence_size & 0x3f) << 4;
 		for (i = 64; i < s; i += 64) {
 			wqe = buf + i;
 			*wqe = cpu_to_be32(0xffffffff);
@@ -264,7 +264,7 @@ static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size)
 		inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl));
 	}
 	ctrl->srcrb_flags = 0;
-	ctrl->fence_size = size / 16;
+	ctrl->qpn_vlan.fence_size = size / 16;
 	/*
 	 * Make sure descriptor is fully written before setting ownership bit
 	 * (because HW can start executing as soon as we do).
@@ -1992,7 +1992,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 			ctrl = get_send_wqe(qp, i);
 			ctrl->owner_opcode = cpu_to_be32(1 << 31);
 			if (qp->sq_max_wqes_per_wr == 1)
-				ctrl->fence_size = 1 << (qp->sq.wqe_shift - 4);
+				ctrl->qpn_vlan.fence_size =
+						1 << (qp->sq.wqe_shift - 4);
 
 			stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift);
 		}
@@ -3169,8 +3170,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		wmb();
 		*lso_wqe = lso_hdr_sz;
 
-		ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ?
-				    MLX4_WQE_CTRL_FENCE : 0) | size;
+		ctrl->qpn_vlan.fence_size = (wr->send_flags & IB_SEND_FENCE ?
+					     MLX4_WQE_CTRL_FENCE : 0) | size;
 
 		/*
 		 * Make sure descriptor is fully written before
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 76aa4d2..c29191e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -700,10 +700,66 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
 
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
+{
+	wmb();
+	/* Since there is no iowrite*_native() that writes the
+	 * value as is, without byteswapping - using the one
+	 * the doesn't do byteswapping in the relevant arch
+	 * endianness.
+	 */
+#if defined(__LITTLE_ENDIAN)
+	iowrite32(
+#else
+	iowrite32be(
+#endif
+		  ring->doorbell_qpn,
+		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
+}
+
+static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
+				  struct mlx4_en_tx_desc *tx_desc,
+				  union mlx4_wqe_qpn_vlan qpn_vlan,
+				  int desc_size, int bf_index,
+				  __be32 op_own, bool bf_ok,
+				  bool send_doorbell)
+{
+	tx_desc->ctrl.qpn_vlan = qpn_vlan;
+
+	if (bf_ok) {
+		op_own |= htonl((bf_index & 0xffff) << 8);
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
+		dma_wmb();
+		tx_desc->ctrl.owner_opcode = op_own;
+
+		wmb();
+
+		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+			     desc_size);
+
+		wmb();
+
+		ring->bf.offset ^= ring->bf.buf_size;
+	} else {
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
+		dma_wmb();
+		tx_desc->ctrl.owner_opcode = op_own;
+		if (send_doorbell)
+			mlx4_en_xmit_doorbell(ring);
+		else
+			ring->xmit_more++;
+	}
+}
+
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
 	struct device *ddev = priv->ddev;
 	struct mlx4_en_tx_ring *ring;
 	struct mlx4_en_tx_desc *tx_desc;
@@ -725,6 +781,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool stop_queue;
 	bool inline_ok;
 	u32 ring_cons;
+	bool bf_ok;
 
 	tx_ind = skb_get_queue_mapping(skb);
 	ring = priv->tx_ring[tx_ind];
@@ -749,9 +806,17 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 	}
 
+	bf_ok = ring->bf_enabled;
 	if (skb_vlan_tag_present(skb)) {
-		vlan_tag = skb_vlan_tag_get(skb);
+		qpn_vlan.vlan_tag = skb_vlan_tag_get(skb);
 		vlan_proto = be16_to_cpu(skb->vlan_proto);
+		if (vlan_proto == ETH_P_8021AD)
+			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
+		else if (vlan_proto == ETH_P_8021Q)
+			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
+		else
+			qpn_vlan.ins_vlan = 0;
+		bf_ok = false;
 	}
 
 	netdev_txq_bql_enqueue_prefetchw(ring->tx_queue);
@@ -771,6 +836,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	else {
 		tx_desc = (struct mlx4_en_tx_desc *) ring->bounce_buf;
 		bounce = true;
+		bf_ok = false;
 	}
 
 	/* Save skb in tx_info ring */
@@ -946,60 +1012,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	real_size = (real_size / 16) & 0x3f;
 
-	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
-	    !skb_vlan_tag_present(skb) && send_doorbell) {
-		tx_desc->ctrl.bf_qpn = ring->doorbell_qpn |
-				       cpu_to_be32(real_size);
-
-		op_own |= htonl((bf_index & 0xffff) << 8);
-		/* Ensure new descriptor hits memory
-		 * before setting ownership of this descriptor to HW
-		 */
-		dma_wmb();
-		tx_desc->ctrl.owner_opcode = op_own;
-
-		wmb();
-
-		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
-			     desc_size);
-
-		wmb();
-
-		ring->bf.offset ^= ring->bf.buf_size;
-	} else {
-		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
-		if (vlan_proto == ETH_P_8021AD)
-			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
-		else if (vlan_proto == ETH_P_8021Q)
-			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
-		else
-			tx_desc->ctrl.ins_vlan = 0;
+	bf_ok &= desc_size <= MAX_BF && send_doorbell;
 
-		tx_desc->ctrl.fence_size = real_size;
+	if (bf_ok)
+		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
+	else
+		qpn_vlan.fence_size = real_size;
 
-		/* Ensure new descriptor hits memory
-		 * before setting ownership of this descriptor to HW
-		 */
-		dma_wmb();
-		tx_desc->ctrl.owner_opcode = op_own;
-		if (send_doorbell) {
-			wmb();
-			/* Since there is no iowrite*_native() that writes the
-			 * value as is, without byteswapping - using the one
-			 * the doesn't do byteswapping in the relevant arch
-			 * endianness.
-			 */
-#if defined(__LITTLE_ENDIAN)
-			iowrite32(
-#else
-			iowrite32be(
-#endif
-				  ring->doorbell_qpn,
-				  ring->bf.uar->map + MLX4_SEND_DOORBELL);
-		} else {
-			ring->xmit_more++;
-		}
-	}
+	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
+			      op_own, bf_ok, send_doorbell);
 
 	if (unlikely(stop_queue)) {
 		/* If queue was emptied after the if (stop_queue) , and before
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index 587cdf9..deaa221 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -291,16 +291,18 @@ enum {
 	MLX4_WQE_CTRL_FORCE_LOOPBACK	= 1 << 0,
 };
 
+union mlx4_wqe_qpn_vlan {
+	struct {
+		__be16	vlan_tag;
+		u8	ins_vlan;
+		u8	fence_size;
+	};
+	__be32		bf_qpn;
+};
+
 struct mlx4_wqe_ctrl_seg {
 	__be32			owner_opcode;
-	union {
-		struct {
-			__be16			vlan_tag;
-			u8			ins_vlan;
-			u8			fence_size;
-		};
-		__be32			bf_qpn;
-	};
+	union mlx4_wqe_qpn_vlan	qpn_vlan;
 	/*
 	 * High 24 bits are SRC remote buffer; low 8 bits are flags:
 	 * [7]   SO (strong ordering)
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v6 09/12] net/mlx4_en: add xdp forwarding and data write support
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (7 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 08/12] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 10/12] bpf: enable direct packet data write for xdp progs Brenden Blanco
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

A user will now be able to loop packets back out of the same port using
a bpf program attached to the xdp hook. Updates to the packet contents
from the bpf program are also supported.

For the packet write feature to work, the rx buffers are now mapped as
bidirectional when the page is allocated. This occurs only when the xdp
hook is active.

When the program returns a TX action, enqueue the packet directly to a
dedicated tx ring, so as to completely avoid any locking. This requires
the tx rings to be allocated 1:1 for each rx ring, as well as the tx
completion to run in the same softirq.

Upon tx completion, this dedicated tx ring recycles pages directly back
to the original rx ring without unmapping. In a steady state tx/drop
workload, effectively 0 page allocs/frees will occur.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |  15 ++-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  19 +++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      |  14 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 127 +++++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  14 ++-
 5 files changed, 182 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index d3d51fa..10642b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1694,6 +1694,11 @@ static int mlx4_en_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd)
 	return err;
 }
 
+static int mlx4_en_max_tx_channels(struct mlx4_en_priv *priv)
+{
+	return (MAX_TX_RINGS - priv->rsv_tx_rings) / MLX4_EN_NUM_UP;
+}
+
 static void mlx4_en_get_channels(struct net_device *dev,
 				 struct ethtool_channels *channel)
 {
@@ -1705,7 +1710,7 @@ static void mlx4_en_get_channels(struct net_device *dev,
 	channel->max_tx = MLX4_EN_MAX_TX_RING_P_UP;
 
 	channel->rx_count = priv->rx_ring_num;
-	channel->tx_count = priv->tx_ring_num / MLX4_EN_NUM_UP;
+	channel->tx_count = priv->num_tx_rings_p_up;
 }
 
 static int mlx4_en_set_channels(struct net_device *dev,
@@ -1717,7 +1722,7 @@ static int mlx4_en_set_channels(struct net_device *dev,
 	int err = 0;
 
 	if (channel->other_count || channel->combined_count ||
-	    channel->tx_count > MLX4_EN_MAX_TX_RING_P_UP ||
+	    channel->tx_count > mlx4_en_max_tx_channels(priv) ||
 	    channel->rx_count > MAX_RX_RINGS ||
 	    !channel->tx_count || !channel->rx_count)
 		return -EINVAL;
@@ -1731,7 +1736,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
 	mlx4_en_free_resources(priv);
 
 	priv->num_tx_rings_p_up = channel->tx_count;
-	priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP;
+	priv->tx_ring_num = channel->tx_count * MLX4_EN_NUM_UP +
+							priv->rsv_tx_rings;
 	priv->rx_ring_num = channel->rx_count;
 
 	err = mlx4_en_alloc_resources(priv);
@@ -1740,7 +1746,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
 		goto out;
 	}
 
-	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
+	netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
+							priv->rsv_tx_rings);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
 	if (dev->num_tc)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 2883315..7b4cb5a6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1636,7 +1636,7 @@ int mlx4_en_start_port(struct net_device *dev)
 		/* Configure ring */
 		tx_ring = priv->tx_ring[i];
 		err = mlx4_en_activate_tx_ring(priv, tx_ring, cq->mcq.cqn,
-			i / priv->num_tx_rings_p_up);
+			i / (priv->tx_ring_num / MLX4_EN_NUM_UP));
 		if (err) {
 			en_err(priv, "Failed allocating Tx ring\n");
 			mlx4_en_deactivate_cq(priv, cq);
@@ -2022,6 +2022,16 @@ int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
 			goto err;
 	}
 
+	/* When rsv_tx_rings is non-zero, each rx ring will have a
+	 * corresponding tx ring, with the tx completion event for that ring
+	 * recycling buffers into the cache.
+	 */
+	for (i = 0; i < priv->rsv_tx_rings; i++) {
+		int j = (priv->tx_ring_num - priv->rsv_tx_rings) + i;
+
+		priv->tx_ring[j]->recycle_ring = priv->rx_ring[i];
+	}
+
 #ifdef CONFIG_RFS_ACCEL
 	priv->dev->rx_cpu_rmap = mlx4_get_cpu_rmap(priv->mdev->dev, priv->port);
 #endif
@@ -2534,9 +2544,12 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_en_dev *mdev = priv->mdev;
 	struct bpf_prog *old_prog;
+	int rsv_tx_rings;
 	int port_up = 0;
 	int err;
 
+	rsv_tx_rings = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;
+
 	/* No need to reconfigure buffers when simply swapping the
 	 * program for a new one.
 	 */
@@ -2561,6 +2574,10 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 
 	mlx4_en_free_resources(priv);
 
+	priv->rsv_tx_rings = rsv_tx_rings;
+	priv->tx_ring_num = priv->num_tx_rings_p_up * MLX4_EN_NUM_UP +
+							priv->rsv_tx_rings;
+
 	old_prog = xchg(&priv->prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 02d63a0..41c76fe 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -779,7 +779,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
 	struct bpf_prog *prog;
+	int doorbell_pending;
 	struct sk_buff *skb;
+	int tx_index;
 	int index;
 	int nr;
 	unsigned int length;
@@ -796,6 +798,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		return polled;
 
 	prog = READ_ONCE(priv->prog);
+	doorbell_pending = 0;
+	tx_index = (priv->tx_ring_num - priv->rsv_tx_rings) + cq->ring;
 
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
@@ -894,6 +898,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			switch (act) {
 			case XDP_PASS:
 				break;
+			case XDP_TX:
+				if (!mlx4_en_xmit_frame(frags, dev,
+							length, tx_index,
+							&doorbell_pending))
+					goto consumed;
+				break;
 			default:
 				bpf_warn_invalid_xdp_action(act);
 			case XDP_DROP:
@@ -1063,6 +1073,9 @@ consumed:
 	}
 
 out:
+	if (doorbell_pending)
+		mlx4_en_xmit_doorbell(priv->tx_ring[tx_index]);
+
 	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
 	mlx4_cq_set_ci(&cq->mcq);
 	wmb(); /* ensure HW sees CQ consumer before we post new buffers */
@@ -1142,6 +1155,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	 * This only works when num_frags == 1.
 	 */
 	if (priv->prog) {
+		dma_dir = PCI_DMA_BIDIRECTIONAL;
 		/* This will gain efficient xdp frame recycling at the expense
 		 * of more costly truesize accounting
 		 */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index c29191e..3dcfed9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -274,10 +274,28 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
 	struct mlx4_wqe_data_seg *data = (void *) tx_desc + tx_info->data_offset;
 	void *end = ring->buf + ring->buf_size;
-	struct sk_buff *skb = tx_info->skb;
 	int nr_maps = tx_info->nr_maps;
+	struct sk_buff *skb;
 	int i;
 
+	if (ring->recycle_ring) {
+		struct mlx4_en_rx_alloc frame = {
+			.page = tx_info->page,
+			.dma = tx_info->map0_dma,
+			.page_offset = 0,
+			.page_size = PAGE_SIZE,
+		};
+
+		if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
+			dma_unmap_page(priv->ddev, tx_info->map0_dma,
+				       PAGE_SIZE, priv->frag_info[0].dma_dir);
+			put_page(tx_info->page);
+		}
+		return tx_info->nr_txbb;
+	}
+
+	skb = tx_info->skb;
+
 	/* We do not touch skb here, so prefetch skb->users location
 	 * to speedup consume_skb()
 	 */
@@ -476,6 +494,9 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
 	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
 
+	if (ring->recycle_ring)
+		return done < budget;
+
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
 	/* Wakeup Tx queue if this stopped, and ring is not full.
@@ -1055,3 +1076,107 @@ tx_drop:
 	return NETDEV_TX_OK;
 }
 
+netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
+			       struct net_device *dev, unsigned int length,
+			       int tx_ind, int *doorbell_pending)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
+	struct mlx4_en_tx_ring *ring;
+	struct mlx4_en_tx_desc *tx_desc;
+	struct mlx4_wqe_data_seg *data;
+	struct mlx4_en_tx_info *tx_info;
+	int index, bf_index;
+	bool send_doorbell;
+	int nr_txbb = 1;
+	bool stop_queue;
+	dma_addr_t dma;
+	int real_size;
+	__be32 op_own;
+	u32 ring_cons;
+	bool bf_ok;
+
+	BUILD_BUG_ON_MSG(ALIGN(CTRL_SIZE + DS_SIZE, TXBB_SIZE) != TXBB_SIZE,
+			 "mlx4_en_xmit_frame requires minimum size tx desc");
+
+	ring = priv->tx_ring[tx_ind];
+
+	if (!priv->port_up)
+		goto tx_drop;
+
+	if (mlx4_en_is_tx_ring_full(ring))
+		goto tx_drop;
+
+	/* fetch ring->cons far ahead before needing it to avoid stall */
+	ring_cons = READ_ONCE(ring->cons);
+
+	index = ring->prod & ring->size_mask;
+	tx_info = &ring->tx_info[index];
+
+	bf_ok = ring->bf_enabled;
+
+	/* Track current inflight packets for performance analysis */
+	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
+			 (u32)(ring->prod - ring_cons - 1));
+
+	bf_index = ring->prod;
+	tx_desc = ring->buf + index * TXBB_SIZE;
+	data = &tx_desc->data;
+
+	dma = frame->dma;
+
+	tx_info->page = frame->page;
+	frame->page = NULL;
+	tx_info->map0_dma = dma;
+	tx_info->map0_byte_count = length;
+	tx_info->nr_txbb = nr_txbb;
+	tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
+	tx_info->data_offset = (void *)data - (void *)tx_desc;
+	tx_info->ts_requested = 0;
+	tx_info->nr_maps = 1;
+	tx_info->linear = 1;
+	tx_info->inl = 0;
+
+	dma_sync_single_for_device(priv->ddev, dma, length, PCI_DMA_TODEVICE);
+
+	data->addr = cpu_to_be64(dma);
+	data->lkey = ring->mr_key;
+	dma_wmb();
+	data->byte_count = cpu_to_be32(length);
+
+	/* tx completion can avoid cache line miss for common cases */
+	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
+
+	op_own = cpu_to_be32(MLX4_OPCODE_SEND) |
+		((ring->prod & ring->size) ?
+		 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
+
+	ring->packets++;
+	ring->bytes += tx_info->nr_bytes;
+	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
+
+	ring->prod += nr_txbb;
+
+	stop_queue = mlx4_en_is_tx_ring_full(ring);
+	send_doorbell = stop_queue ||
+				*doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
+	bf_ok &= send_doorbell;
+
+	real_size = ((CTRL_SIZE + nr_txbb * DS_SIZE) / 16) & 0x3f;
+
+	if (bf_ok)
+		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
+	else
+		qpn_vlan.fence_size = real_size;
+
+	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
+			      op_own, bf_ok, send_doorbell);
+	*doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
+
+	return NETDEV_TX_OK;
+
+tx_drop:
+	ring->tx_dropped++;
+	return NETDEV_TX_BUSY;
+}
+
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 0e0ecd1..7309308 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -132,6 +132,7 @@ enum {
 					 MLX4_EN_NUM_UP)
 
 #define MLX4_EN_DEFAULT_TX_WORK		256
+#define MLX4_EN_DOORBELL_BUDGET		8
 
 /* Target number of packets to coalesce with interrupt moderation */
 #define MLX4_EN_RX_COAL_TARGET	44
@@ -219,7 +220,10 @@ enum cq_type {
 
 
 struct mlx4_en_tx_info {
-	struct sk_buff *skb;
+	union {
+		struct sk_buff *skb;
+		struct page *page;
+	};
 	dma_addr_t	map0_dma;
 	u32		map0_byte_count;
 	u32		nr_txbb;
@@ -298,6 +302,7 @@ struct mlx4_en_tx_ring {
 	__be32			mr_key;
 	void			*buf;
 	struct mlx4_en_tx_info	*tx_info;
+	struct mlx4_en_rx_ring	*recycle_ring;
 	u8			*bounce_buf;
 	struct mlx4_qp_context	context;
 	int			qpn;
@@ -604,6 +609,7 @@ struct mlx4_en_priv {
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
 	struct bpf_prog *prog;
+	int rsv_tx_rings;
 
 #ifdef CONFIG_MLX4_EN_DCB
 #define MLX4_EN_DCB_ENABLED	0x3
@@ -678,6 +684,12 @@ void mlx4_en_tx_irq(struct mlx4_cq *mcq);
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 			 void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
+			       struct net_device *dev, unsigned int length,
+			       int tx_ind, int *doorbell_pending);
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
+bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
+			struct mlx4_en_rx_alloc *frame);
 
 int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_tx_ring **pring,
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v6 10/12] bpf: enable direct packet data write for xdp progs
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (8 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 09/12] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 11/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

For forwarding to be effective, XDP programs should be allowed to
rewrite packet data.

This requires that drivers supporting XDP map the packet memory as
TODEVICE or BIDIRECTIONAL before invoking the program.
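
As an illustration only (not part of this patch), a minimal XDP program
that performs a packet data write and should be accepted by the verifier
after this change could look like the sketch below. The section name and
program are hypothetical, the usual sample includes are assumed, and the
IP checksum update is deliberately omitted for brevity:

	SEC("xdp_ttl")
	int xdp_dec_ttl(struct xdp_md *ctx)
	{
		void *data_end = (void *)(long)ctx->data_end;
		void *data = (void *)(long)ctx->data;
		struct ethhdr *eth = data;
		struct iphdr *iph = data + sizeof(*eth);

		/* every access, read or write, must be bounds checked
		 * against data_end or the verifier rejects the program
		 */
		if (iph + 1 > data_end)
			return XDP_PASS;
		if (eth->h_proto != htons(ETH_P_IP))
			return XDP_PASS;

		iph->ttl--;	/* direct write into packet data */

		return XDP_PASS;
	}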

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 kernel/bpf/verifier.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8d67d0..f72f23b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -653,6 +653,16 @@ static int check_map_access(struct verifier_env *env, u32 regno, int off,
 
 #define MAX_PACKET_OFF 0xffff
 
+static bool may_write_pkt_data(enum bpf_prog_type type)
+{
+	switch (type) {
+	case BPF_PROG_TYPE_XDP:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int check_packet_access(struct verifier_env *env, u32 regno, int off,
 			       int size)
 {
@@ -806,10 +816,15 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
 			err = check_stack_read(state, off, size, value_regno);
 		}
 	} else if (state->regs[regno].type == PTR_TO_PACKET) {
-		if (t == BPF_WRITE) {
+		if (t == BPF_WRITE && !may_write_pkt_data(env->prog->type)) {
 			verbose("cannot write into packet\n");
 			return -EACCES;
 		}
+		if (t == BPF_WRITE && value_regno >= 0 &&
+		    is_pointer_value(env, value_regno)) {
+			verbose("R%d leaks addr into packet\n", value_regno);
+			return -EACCES;
+		}
 		err = check_packet_access(env, regno, off, size);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown_value(state->regs, value_regno);
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v6 11/12] bpf: add sample for xdp forwarding and rewrite
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (9 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 10/12] bpf: enable direct packet data write for xdp progs Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  2:15 ` [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path Brenden Blanco
  2016-07-10 16:14 ` [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Tariq Toukan
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

Add a sample that rewrites and forwards packets out on the same
interface. Observed single core forwarding performance of ~10Mpps.

Since the mlx4 driver under test recycles every single packet page, the
perf output shows almost exclusively the ring management and bpf program
work. The remaining slowdowns are likely due to cache misses.
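
For reference, since xdp2 reuses the xdp1 loader, a typical invocation
(assuming the shared loader takes the target interface index as its only
argument, as in the earlier xdp1 sample) would look something like:

	# ./samples/bpf/xdp2 $(cat /sys/class/net/eth0/ifindex)

The per-protocol counters printed by the loader then give the packet
rate seen by the program.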

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile    |   5 +++
 samples/bpf/xdp2_kern.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)
 create mode 100644 samples/bpf/xdp2_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0e4ab3a..d2d2b35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
+hostprogs-y += xdp2
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
+# reuse xdp1 source intentionally
+xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
+always += xdp2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
+HOSTLOADLIBES_xdp2 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
new file mode 100644
index 0000000..38fe7e1
--- /dev/null
+++ b/samples/bpf/xdp2_kern.c
@@ -0,0 +1,114 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+static void swap_src_dst_mac(void *data)
+{
+	unsigned short *p = data;
+	unsigned short dst[3];
+
+	dst[0] = p[0];
+	dst[1] = p[1];
+	dst[2] = p[2];
+	p[0] = p[3];
+	p[1] = p[4];
+	p[2] = p[5];
+	p[3] = dst[0];
+	p[4] = dst[1];
+	p[5] = dst[2];
+}
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+	struct iphdr *iph = data + nh_off;
+
+	if (iph + 1 > data_end)
+		return 0;
+	return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+	struct ipv6hdr *ip6h = data + nh_off;
+
+	if (ip6h + 1 > data_end)
+		return 0;
+	return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	u64 nh_off;
+	u32 index;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+
+	if (h_proto == htons(ETH_P_IP))
+		index = parse_ipv4(data, nh_off, data_end);
+	else if (h_proto == htons(ETH_P_IPV6))
+		index = parse_ipv6(data, nh_off, data_end);
+	else
+		index = 0;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	if (index == 17) {
+		swap_src_dst_mac(data);
+		rc = XDP_TX;
+	}
+
+	return rc;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (10 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 11/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
@ 2016-07-08  2:15 ` Brenden Blanco
  2016-07-08  3:56   ` Eric Dumazet
  2016-07-10 16:14 ` [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Tariq Toukan
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08  2:15 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

XDP programs read and/or write packet data very early, and cache miss is
seen to be a bottleneck.

Add prefetch logic in the xdp case 3 packets in the future. Throughput
improved from 10Mpps to 12.5Mpps.  LLC misses as reported by perf stat
reduced from ~14% to ~7%.  Prefetch values of 0 through 5 were compared
with >3 showing diminishing returns.

Before:
 21.94%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001d6e4
 12.96%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 12.28%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 11.93%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  4.77%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  3.13%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.68%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.22%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  2.02%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.92%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle

After:
 20.70%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_xmit_frame
 18.14%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 16.30%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_poll_tx_cq
  6.49%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_prepare_rx_desc
  4.06%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_tx_desc.isra.30
  2.76%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_rx_recycle
  2.37%  ksoftirqd/0  [mlx4_core]       [k] mlx4_eq_int
  1.44%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  1.43%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.20%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.19%  ksoftirqd/0  [mlx4_core]       [k] 0x0000000000049eb8

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 41c76fe..65e93f7 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -881,10 +881,17 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		 * read bytes but not past the end of the frag.
 		 */
 		if (prog) {
+			struct mlx4_en_rx_alloc *pref;
 			struct xdp_buff xdp;
+			int pref_index;
 			dma_addr_t dma;
 			u32 act;
 
+			pref_index = (index + 3) & ring->size_mask;
+			pref = ring->rx_info +
+					(pref_index << priv->log_rx_info);
+			prefetch(page_address(pref->page) + pref->page_offset);
+
 			dma = be64_to_cpu(rx_desc->data[0].addr);
 			dma_sync_single_for_cpu(priv->ddev, dma,
 						priv->frag_info[0].frag_size,
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08  2:15 ` [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path Brenden Blanco
@ 2016-07-08  3:56   ` Eric Dumazet
  2016-07-08  4:16     ` Alexei Starovoitov
  2016-07-08 15:20     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-07-08  3:56 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> XDP programs read and/or write packet data very early, and cache miss is
> seen to be a bottleneck.
> 
> Add prefetch logic in the xdp case 3 packets in the future. Throughput
> improved from 10Mpps to 12.5Mpps.  LLC misses as reported by perf stat
> reduced from ~14% to ~7%.  Prefetch values of 0 through 5 were compared
> with >3 showing diminishing returns.

This is what I feared with XDP.

Instead of making generic changes in the driver(s), we now have 'patches
that improve XDP numbers'

Careful prefetches make sense in NIC drivers, regardless of XDP being
used or not.

On mlx4, prefetching next cqe could probably help as well.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08  3:56   ` Eric Dumazet
@ 2016-07-08  4:16     ` Alexei Starovoitov
  2016-07-08  6:56       ` Eric Dumazet
  2016-07-08 15:20     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-08  4:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Brenden Blanco, davem, netdev, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john.fastabend,
	hannes, Thomas Graf, Tom Herbert, Daniel Borkmann

On Fri, Jul 08, 2016 at 05:56:31AM +0200, Eric Dumazet wrote:
> On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> > XDP programs read and/or write packet data very early, and cache miss is
> > seen to be a bottleneck.
> > 
> > Add prefetch logic in the xdp case 3 packets in the future. Throughput
> > improved from 10Mpps to 12.5Mpps.  LLC misses as reported by perf stat
> > reduced from ~14% to ~7%.  Prefetch values of 0 through 5 were compared
> > with >3 showing diminishing returns.
> 
> This is what I feared with XDP.
> 
> Instead of making generic changes in the driver(s), we now have 'patches
> that improve XDP numbers'
> 
> Careful prefetches make sense in NIC drivers, regardless of XDP being
> used or not.
> 
> On mlx4, prefetching next cqe could probably help as well.

I've tried this style of prefetching in the past for normal stack
and it didn't help at all.
It helps XDP because the inner processing loop is short with a small number
of memory accesses, so prefetching the Nth packet in advance helps.
Prefetching the next packet doesn't help as much, since the bpf prog is
too short and hw prefetch logic doesn't have time to actually pull
the data in.
The effectiveness of this patch depends on the size of the bpf program
and the amount of work it does. For small and medium sizes it works well.
For large programs probably not so much, but we didn't get to
that point yet. I think eventually the prefetch distance should
be calculated dynamically based on the size of the prog and the amount
of work it does, or configured via a knob (which would be unfortunate).
The performance gain is sizable, so I think it makes sense to
keep it... even to demonstrate the prefetch logic.
Also note this is a ddio capable cpu. On desktop class cpus
the prefetch is mandatory for all bpf programs to have good performance.

Another alternative we considered is to allow bpf programs to
indicate to the xdp infra how much in advance to prefetch, so the
xdp side will prefetch only when the program gives a hint.
But that would be the job of future patches.
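
To make the idea concrete, a purely hypothetical shape for such a knob
in the mlx4 rx path could be a per-ring prefetch distance instead of a
hard-coded +3. 'prefetch_dist' below is not an existing field, just a
sketch of what the rx loop would do with it:

	if (prog && ring->prefetch_dist) {
		int pref_index = (index + ring->prefetch_dist) &
				 ring->size_mask;
		struct mlx4_en_rx_alloc *pref = ring->rx_info +
				(pref_index << priv->log_rx_info);

		/* warm up the packet data of a descriptor N slots ahead */
		prefetch(page_address(pref->page) + pref->page_offset);
	}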

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08  4:16     ` Alexei Starovoitov
@ 2016-07-08  6:56       ` Eric Dumazet
  2016-07-08 16:49         ` Brenden Blanco
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2016-07-08  6:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, davem, netdev, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john.fastabend,
	hannes, Thomas Graf, Tom Herbert, Daniel Borkmann

On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:

> I've tried this style of prefetching in the past for normal stack
> and it didn't help at all.

This is very nice, but my experience showed opposite numbers.
So I guess you did not choose the proper prefetch strategy.

prefetching in mlx4 gave me good results, once I made sure our compiler
was not moving the actual prefetch operations on x86_64 (ie forcing use
of asm volatile as in x86_32 instead of the builtin prefetch). You might
check if your compiler does the proper thing because this really hurt me
in the past.

In my case, I was using 40Gbit NIC, and prefetching 128 bytes instead of
64 bytes allowed to remove one stall in GRO engine when using TCP with
TS (total header size : 66 bytes), or tunnels.
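
(For illustration, with 'data' standing for the start of the packet
headers, prefetching 128 bytes on a machine with 64-byte cache lines is
simply two prefetches per packet instead of one:

	prefetch(data);		/* first cache line of headers */
	prefetch(data + 64);	/* second line, covers the 66 byte headers */
)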

The problem with prefetch is that it works well assuming a given rate
(in pps), and given cpus, as prefetch behavior varies among cpu flavors.

Brenden chose to prefetch N+3, based on some experiments, on some
hardware,

prefetch N+3 can actually slow down if you receive a moderate load,
which is the case 99% of the time in typical workloads on modern servers
with multi queue NIC.

This is why it was hard to upstream such changes, because they focus on
max throughput instead of low latencies.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08  3:56   ` Eric Dumazet
  2016-07-08  4:16     ` Alexei Starovoitov
@ 2016-07-08 15:20     ` Jesper Dangaard Brouer
  2016-07-08 16:02       ` [net-next PATCH RFC] mlx4: RX prefetch loop Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-08 15:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Brenden Blanco, davem, netdev, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, brouer, Rana Shahout

On Fri, 08 Jul 2016 05:56:31 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2016-07-07 at 19:15 -0700, Brenden Blanco wrote:
> > XDP programs read and/or write packet data very early, and cache miss is
> > seen to be a bottleneck.
> > 
> > Add prefetch logic in the xdp case 3 packets in the future. Throughput
> > improved from 10Mpps to 12.5Mpps.  LLC misses as reported by perf stat
> > reduced from ~14% to ~7%.  Prefetch values of 0 through 5 were compared
> > with >3 showing diminishing returns.  
> 
> This is what I feared with XDP.
> 
> Instead of making generic changes in the driver(s), we now have 'patches
> that improve XDP numbers'
> 
> Careful prefetches make sense in NIC drivers, regardless of XDP being
> used or not.

I feel the same way.  Our work on XDP should also benefit the normal
driver usage. Yes, I know that is much much harder to achieve, but it
will be worth it.


I've been playing with prefetch changes to mlx4 (and mlx5) which are generic.
I'll post my patch as RFC.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-08 15:20     ` Jesper Dangaard Brouer
@ 2016-07-08 16:02       ` Jesper Dangaard Brouer
  2016-07-11 11:09         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-08 16:02 UTC (permalink / raw)
  To: netdev
  Cc: kafai, daniel, tom, bblanco, Jesper Dangaard Brouer,
	john.fastabend, gerlitz.or, hannes, rana.shahot, tgraf,
	David S. Miller, as754m

This patch is about prefetching without being opportunistic.
The idea is only to start prefetching on packets that are marked as
ready/completed in the RX ring.

This is achieved by splitting the napi_poll call mlx4_en_process_rx_cq()
loop into two.  The first loop extracts completed CQEs and starts
prefetching on data and RX descriptors. The second loop processes the
real packets.

Details: The batching of CQEs is limited to 8 in order to avoid
stressing the LFB (Line Fill Buffer) and cache usage.

I've left some opportunities for prefetching CQE descriptors.


The performance improvements on my platform are huge, as I tested this
on a CPU without DDIO.  The performance for XDP is the same as with
Brenden's prefetch hack.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   70 +++++++++++++++++++++++++---
 1 file changed, 62 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 41c76fe00a7f..c5efe03e31ce 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -782,7 +782,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	int doorbell_pending;
 	struct sk_buff *skb;
 	int tx_index;
-	int index;
+	int index, saved_index, i;
 	int nr;
 	unsigned int length;
 	int polled = 0;
@@ -790,6 +790,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	int factor = priv->cqe_factor;
 	u64 timestamp;
 	bool l2_tunnel;
+#define PREFETCH_BATCH 8
+	struct mlx4_cqe *cqe_array[PREFETCH_BATCH];
+	int cqe_idx;
+	bool cqe_more;
 
 	if (!priv->port_up)
 		return 0;
@@ -801,24 +805,75 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	doorbell_pending = 0;
 	tx_index = (priv->tx_ring_num - priv->rsv_tx_rings) + cq->ring;
 
+next_prefetch_batch:
+	cqe_idx = 0;
+	cqe_more = false;
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
 	index = cq->mcq.cons_index & ring->size_mask;
+	saved_index = index;
 	cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
 
-	/* Process all completed CQEs */
+	/* Extract and prefetch completed CQEs */
 	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
 		    cq->mcq.cons_index & cq->size)) {
+		void *data;
 
 		frags = ring->rx_info + (index << priv->log_rx_info);
 		rx_desc = ring->buf + (index << ring->log_stride);
+		prefetch(rx_desc);
 
 		/*
 		 * make sure we read the CQE after we read the ownership bit
 		 */
 		dma_rmb();
 
+		cqe_array[cqe_idx++] = cqe;
+
+		/* Base error handling here, free handled in next loop */
+		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
+			     MLX4_CQE_OPCODE_ERROR))
+			goto skip;
+
+		data = page_address(frags[0].page) + frags[0].page_offset;
+		prefetch(data);
+	skip:
+		++cq->mcq.cons_index;
+		index = (cq->mcq.cons_index) & ring->size_mask;
+		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
+		/* likely too slow prefetching CQE here ... do look-a-head ? */
+		//prefetch(cqe + priv->cqe_size * 3);
+
+		if (++polled == budget) {
+			cqe_more = false;
+			break;
+		}
+		if (cqe_idx == PREFETCH_BATCH) {
+			cqe_more = true;
+			// IDEA: Opportunistic prefetch CQEs for next_prefetch_batch?
+			//for (i = 0; i < PREFETCH_BATCH; i++) {
+			//	prefetch(cqe + priv->cqe_size * i);
+			//}
+			break;
+		}
+	}
+	/* Hint: The cqe_idx will be number of packets, it can be used
+	 * for bulk allocating SKBs
+	 */
+
+	/* Now, index function as index for rx_desc */
+	index = saved_index;
+
+	/* Process completed CQEs in cqe_array */
+	for (i = 0; i < cqe_idx; i++) {
+
+		cqe = cqe_array[i];
+
+		frags = ring->rx_info + (index << priv->log_rx_info);
+		rx_desc = ring->buf + (index << ring->log_stride);
+
 		/* Drop packet on bad receive or bad checksum */
 		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
 						MLX4_CQE_OPCODE_ERROR)) {
@@ -1065,14 +1120,13 @@ next:
 			mlx4_en_free_frag(priv, frags, nr);
 
 consumed:
-		++cq->mcq.cons_index;
-		index = (cq->mcq.cons_index) & ring->size_mask;
-		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
-		if (++polled == budget)
-			goto out;
+		++index;
+		index = index & ring->size_mask;
 	}
+	/* Check for more completed CQEs */
+	if (cqe_more)
+		goto next_prefetch_batch;
 
-out:
 	if (doorbell_pending)
 		mlx4_en_xmit_doorbell(priv->tx_ring[tx_index]);
 

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08  6:56       ` Eric Dumazet
@ 2016-07-08 16:49         ` Brenden Blanco
  2016-07-10 20:48           ` Tom Herbert
  2016-07-10 20:50           ` Tom Herbert
  0 siblings, 2 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-08 16:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, davem, netdev, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john.fastabend,
	hannes, Thomas Graf, Tom Herbert, Daniel Borkmann

On Fri, Jul 08, 2016 at 08:56:45AM +0200, Eric Dumazet wrote:
> On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
> 
> > I've tried this style of prefetching in the past for normal stack
> > and it didn't help at all.
> 
> This is very nice, but my experience showed opposite numbers.
> So I guess you did not choose the proper prefetch strategy.
> 
> prefetching in mlx4 gave me good results, once I made sure our compiler
> was not moving the actual prefetch operations on x86_64 (ie forcing use
> of asm volatile as in x86_32 instead of the builtin prefetch). You might
> check if your compiler does the proper thing because this really hurt me
> in the past.
> 
> In my case, I was using 40Gbit NIC, and prefetching 128 bytes instead of
> 64 bytes allowed to remove one stall in GRO engine when using TCP with
> TS (total header size : 66 bytes), or tunnels.
> 
> The problem with prefetch is that it works well assuming a given rate
> (in pps), and given cpus, as prefetch behavior is varying among flavors.
> 
> Brenden chose to prefetch N+3, based on some experiments, on some
> hardware,
> 
> prefetch N+3 can actually slow down if you receive a moderate load,
> which is the case 99% of the time in typical workloads on modern servers
> with multi queue NIC.
Thanks for the feedback Eric!

This particular patch in the series is meant to be standalone exactly
for this reason. I don't pretend to assert that this optimization will
work for everybody, or even for a future version of me with different
hardware. But, it passes my internal criteria for usefulness:
1. It provides a measurable gain in the experiments that I have at hand
2. The code is easy to review
3. The change does not negatively impact non-XDP users

I would love to have a solution for all mlx4 driver users, but this
patch set is focused on a different goal. So, without munging a
different set of changes for the universal use case, and probably
violating criteria #2 or #3, I went with what you see.

In hopes of not derailing the whole patch series, what is an actionable
next step for this patch #12?
Ideas:
Pick a safer N? (I saw improvements with N=1 as well)
Drop this patch?

One thing I definitely don't want to do is go into the weeds trying to
get a universal prefetch logic in order to merge the XDP framework, even
though I agree the net result would benefit everybody.
> 
> This is why it was hard to upstream such changes, because they focus on
> max throughput instead of low latencies.
> 
> 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-08  2:15 ` [PATCH v6 01/12] bpf: add XDP prog type for early driver filter Brenden Blanco
@ 2016-07-09  8:14   ` Jesper Dangaard Brouer
  2016-07-09 13:47     ` Tom Herbert
  2016-07-10 20:56   ` Tom Herbert
  2016-07-10 21:04   ` Tom Herbert
  2 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-09  8:14 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Martin KaFai Lau, Ari Saha, Alexei Starovoitov,
	Or Gerlitz, john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, brouer

On Thu,  7 Jul 2016 19:15:13 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
> 
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. The
> driver can choose to implement unknown return codes however it wants,
> but must invoke the warning helper with the action value.

I believe we should define stronger semantics for unknown/future
return codes than the one stated above:
 "driver can choose to implement unknown return codes however it wants"

The mlx4 driver implementation in:
 [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program

On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:

> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag.
> +		 */
> +		if (prog) {
> +			struct xdp_buff xdp;
> +			dma_addr_t dma;
> +			u32 act;
> +
> +			dma = be64_to_cpu(rx_desc->data[0].addr);
> +			dma_sync_single_for_cpu(priv->ddev, dma,
> +						priv->frag_info[0].frag_size,
> +						DMA_FROM_DEVICE);
> +
> +			xdp.data = page_address(frags[0].page) +
> +							frags[0].page_offset;
> +			xdp.data_end = xdp.data + length;
> +
> +			act = bpf_prog_run_xdp(prog, &xdp);
> +			switch (act) {
> +			case XDP_PASS:
> +				break;
> +			default:
> +				bpf_warn_invalid_xdp_action(act);
> +			case XDP_DROP:
> +				goto next;
> +			}
> +		}

Thus, mlx4's choice is to drop packets for unknown/future return codes.

I think this is the wrong choice.  I think the choice should be
XDP_PASS, to pass the packet up the stack.

I find "XDP_DROP" problematic because it happen so early in the driver,
that we lost all possibilities to debug what packets gets dropped.  We
get a single kernel log warning, but we cannot inspect the packets any
longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
tcpdump) is available.
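
A minimal sketch of what that alternative dispatch could look like in the
mlx4 hunk quoted above (illustrative only, not code from this series):
unknown actions are warned about and then treated as XDP_PASS.

	act = bpf_prog_run_xdp(prog, &xdp);
	switch (act) {
	case XDP_DROP:
		goto next;
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fall through: treat unknown actions as XDP_PASS */
	case XDP_PASS:
		break;
	}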


I can also imagine that defaulting to XDP_PASS can be an important
feature in the future.

In the future we will likely have features where XDP can "offload"
packet delivery from the normal stack (e.g. delivery into a VM).  On a
running production system you can then load your XDP program.  If the
driver was too old and defaulted to XDP_DROP, you would lose your
service; if it defaulted to XDP_PASS instead, your service would
survive, falling back to normal delivery.

(For the VM delivery use-case, there will likely be a need for having a
fallback delivery method in place, when the XDP program is not active,
in-order to support VM migration).



> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c14ca1c..5b47ac3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
[...]
>  
> +/* User return codes for XDP prog type.
> + * A valid XDP program must return one of these defined values. All other
> + * return codes are reserved for future use. Unknown return codes will result
> + * in driver-dependent behavior.
> + */
> +enum xdp_action {
> +	XDP_DROP,
> +	XDP_PASS,
> +};
> +
[...]
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e206c21..a8d67d0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[...]
> +void bpf_warn_invalid_xdp_action(int act)
> +{
> +	WARN_ONCE(1, "\n"
> +		     "*****************************************************\n"
> +		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +		     "**                                               **\n"
> +		     "** XDP program returned unknown value %-10u **\n"
> +		     "**                                               **\n"
> +		     "** XDP programs must return a well-known return  **\n"
> +		     "** value. Invalid return values will result in   **\n"
> +		     "** undefined packet actions.                     **\n"
> +		     "**                                               **\n"
> +		     "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +		     "*****************************************************\n",
> +		  act);
> +}
> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> +


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-09  8:14   ` Jesper Dangaard Brouer
@ 2016-07-09 13:47     ` Tom Herbert
  2016-07-10 13:37       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-09 13:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Thu,  7 Jul 2016 19:15:13 -0700
> Brenden Blanco <bblanco@plumgrid.com> wrote:
>
>> Add a new bpf prog type that is intended to run in early stages of the
>> packet rx path. Only minimal packet metadata will be available, hence a
>> new context type, struct xdp_md, is exposed to userspace. So far only
>> expose the packet start and end pointers, and only in read mode.
>>
>> An XDP program must return one of the well known enum values, all other
>> return codes are reserved for future use. Unfortunately, this
>> restriction is hard to enforce at verification time, so take the
>> approach of warning at runtime when such programs are encountered. The
>> driver can choose to implement unknown return codes however it wants,
>> but must invoke the warning helper with the action value.
>
> I believe we should define a stronger semantics for unknown/future
> return codes than the once stated above:
>  "driver can choose to implement unknown return codes however it wants"
>
> The mlx4 driver implementation in:
>  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
>
> On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>
>> +             /* A bpf program gets first chance to drop the packet. It may
>> +              * read bytes but not past the end of the frag.
>> +              */
>> +             if (prog) {
>> +                     struct xdp_buff xdp;
>> +                     dma_addr_t dma;
>> +                     u32 act;
>> +
>> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
>> +                     dma_sync_single_for_cpu(priv->ddev, dma,
>> +                                             priv->frag_info[0].frag_size,
>> +                                             DMA_FROM_DEVICE);
>> +
>> +                     xdp.data = page_address(frags[0].page) +
>> +                                                     frags[0].page_offset;
>> +                     xdp.data_end = xdp.data + length;
>> +
>> +                     act = bpf_prog_run_xdp(prog, &xdp);
>> +                     switch (act) {
>> +                     case XDP_PASS:
>> +                             break;
>> +                     default:
>> +                             bpf_warn_invalid_xdp_action(act);
>> +                     case XDP_DROP:
>> +                             goto next;
>> +                     }
>> +             }
>
> Thus, mlx4 choice is to drop packets for unknown/future return codes.
>
> I think this is the wrong choice.  I think the choice should be
> XDP_PASS, to pass the packet up the stack.
>
> I find "XDP_DROP" problematic because it happen so early in the driver,
> that we lost all possibilities to debug what packets gets dropped.  We
> get a single kernel log warning, but we cannot inspect the packets any
> longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> tcpdump) is available.
>
It's an API issue, though, not a problem with the packet. Allowing
unknown return codes to pass also seems like a major security problem.

Tom

>
> I can also imagine that, defaulting to XDP_PASS, can be an important
> feature in the future.
>
> In the future we will likely have features, where XDP can "offload"
> packet delivery from the normal stack (e.g. delivery into a VM).  On a
> running production system you can then load your XDP program.  If the
> driver was too old defaulting to XDP_DROP, then you lost your service,
> instead if defaulting to XDP_PASS, your service would survive, falling
> back to normal delivery.
>
> (For the VM delivery use-case, there will likely be a need for having a
> fallback delivery method in place, when the XDP program is not active,
> in-order to support VM migration).
>
>
>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c14ca1c..5b47ac3 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
> [...]
>>
>> +/* User return codes for XDP prog type.
>> + * A valid XDP program must return one of these defined values. All other
>> + * return codes are reserved for future use. Unknown return codes will result
>> + * in driver-dependent behavior.
>> + */
>> +enum xdp_action {
>> +     XDP_DROP,
>> +     XDP_PASS,
>> +};
>> +
> [...]
>>  #endif /* _UAPI__LINUX_BPF_H__ */
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index e206c21..a8d67d0 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
> [...]
>> +void bpf_warn_invalid_xdp_action(int act)
>> +{
>> +     WARN_ONCE(1, "\n"
>> +                  "*****************************************************\n"
>> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
>> +                  "**                                               **\n"
>> +                  "** XDP program returned unknown value %-10u **\n"
>> +                  "**                                               **\n"
>> +                  "** XDP programs must return a well-known return  **\n"
>> +                  "** value. Invalid return values will result in   **\n"
>> +                  "** undefined packet actions.                     **\n"
>> +                  "**                                               **\n"
>> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
>> +                  "*****************************************************\n",
>> +               act);
>> +}
>> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>> +
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-08  2:15 ` [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
@ 2016-07-09 14:07   ` Or Gerlitz
  2016-07-10 15:40     ` Brenden Blanco
  2016-07-09 19:58   ` Saeed Mahameed
  1 sibling, 1 reply; 59+ messages in thread
From: Or Gerlitz @ 2016-07-09 14:07 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David Miller, Linux Netdev List, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov,
	john fastabend, Hannes Frederic Sowa, Thomas Graf, Tom Herbert,
	Daniel Borkmann

On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touchs the packet data. However, in the pursuit of

nit, for the next version touchs --> touches

> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.


[...]

> @@ -835,6 +838,34 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                 l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>                         (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +               /* A bpf program gets first chance to drop the packet. It may
> +                * read bytes but not past the end of the frag.
> +                */
> +               if (prog) {
> +                       struct xdp_buff xdp;
> +                       dma_addr_t dma;
> +                       u32 act;
> +
> +                       dma = be64_to_cpu(rx_desc->data[0].addr);
> +                       dma_sync_single_for_cpu(priv->ddev, dma,
> +                                               priv->frag_info[0].frag_size,
> +                                               DMA_FROM_DEVICE);
> +
> +                       xdp.data = page_address(frags[0].page) +
> +                                                       frags[0].page_offset;
> +                       xdp.data_end = xdp.data + length;
> +
> +                       act = bpf_prog_run_xdp(prog, &xdp);
> +                       switch (act) {
> +                       case XDP_PASS:
> +                               break;
> +                       default:
> +                               bpf_warn_invalid_xdp_action(act);
> +                       case XDP_DROP:
> +                               goto next;
> +                       }
> +               }


(probably a nit too, but wanted to make sure we don't miss something
here) is the default case preceding the DROP one on purpose? Any
special reason to do that?

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-08  2:15 ` [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
  2016-07-09 14:07   ` Or Gerlitz
@ 2016-07-09 19:58   ` Saeed Mahameed
  2016-07-09 21:37     ` Or Gerlitz
  2016-07-10 15:25     ` Tariq Toukan
  1 sibling, 2 replies; 59+ messages in thread
From: Saeed Mahameed @ 2016-07-09 19:58 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Netdev List, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann

On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touchs the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
>
> Therefore, disallow MTU settings that would produce a multi-fragment
> packet that XDP programs would fail to access. Future enhancements could
> be done to increase the allowable MTU.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 38 ++++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 36 +++++++++++++++++++++---
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 ++++
>  3 files changed, 75 insertions(+), 4 deletions(-)
>
[...]
> +               /* A bpf program gets first chance to drop the packet. It may
> +                * read bytes but not past the end of the frag.
> +                */
> +               if (prog) {
> +                       struct xdp_buff xdp;
> +                       dma_addr_t dma;
> +                       u32 act;
> +
> +                       dma = be64_to_cpu(rx_desc->data[0].addr);
> +                       dma_sync_single_for_cpu(priv->ddev, dma,
> +                                               priv->frag_info[0].frag_size,
> +                                               DMA_FROM_DEVICE);

In case of XDP_PASS we will dma_sync again in the normal path; this
can be improved by doing the dma_sync as early as possible, once and
for all, regardless of the path the packet is going to take
(XDP_DROP/mlx4_en_complete_rx_desc/mlx4_en_rx_skb).
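
Roughly, the suggestion would look like this (sketch only, not a tested
change): hoist the sync above the prog check so every path shares it.

	dma = be64_to_cpu(rx_desc->data[0].addr);
	dma_sync_single_for_cpu(priv->ddev, dma,
				priv->frag_info[0].frag_size,
				DMA_FROM_DEVICE);

	if (prog) {
		/* build the xdp_buff and call bpf_prog_run_xdp() as before,
		 * but without its own dma_sync
		 */
	}

	/* the mlx4_en_complete_rx_desc()/mlx4_en_rx_skb() path would then
	 * skip re-syncing this fragment as well
	 */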

> +
> +                       xdp.data = page_address(frags[0].page) +
> +                                                       frags[0].page_offset;
> +                       xdp.data_end = xdp.data + length;
> +
> +                       act = bpf_prog_run_xdp(prog, &xdp);
> +                       switch (act) {
> +                       case XDP_PASS:
> +                               break;
> +                       default:
> +                               bpf_warn_invalid_xdp_action(act);
> +                       case XDP_DROP:
> +                               goto next;

The drop action here (goto next) will release the current rx_desc
buffers and refill with new ones. I know that the mlx4 rx scheme
only releases/allocates new pages once every ~32 packets, but one
improvement that could really help here, especially for XDP_DROP
benchmarks, is to reuse the current rx_desc buffers when the packet
is going to be dropped, as sketched below.

Considering that the mlx4 rx buffer scheme doesn't allow gaps, maybe
this can be added later as a future improvement for the whole mlx4 rx
data path's drop decisions.
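
For illustration, something along these lines (hypothetical sketch; a
recycle helper of this shape, mlx4_en_rx_recycle, only appears later in
the series with the page-recycle patch):

	case XDP_DROP:
		/* hand the same frags back to the rx ring instead of
		 * releasing them and refilling from fresh pages
		 */
		if (mlx4_en_rx_recycle(ring, frags))
			goto consumed;
		goto next;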

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 05/12] Add sample for adding simple drop program to link
  2016-07-08  2:15 ` [PATCH v6 05/12] Add sample for adding simple drop program to link Brenden Blanco
@ 2016-07-09 20:21   ` Saeed Mahameed
  2016-07-11 11:09   ` Jamal Hadi Salim
  1 sibling, 0 replies; 59+ messages in thread
From: Saeed Mahameed @ 2016-07-09 20:21 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Netdev List, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann

On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
> hook of a link. With the drop-only program, observed single core rate is
> ~20Mpps.
>
> Other tests were run, for instance without the dropcnt increment or
> without reading from the packet header, the packet rate was mostly
> unchanged.
>
> $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
> proto 17:   20403027 drops/s
>
> ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> Running... ctrl^C to stop
> Device: eth4@0
> Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
>   5056638pps 2427Mb/sec (2427186240bps) errors: 0
> Device: eth4@1
> Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
>   5133311pps 2463Mb/sec (2463989280bps) errors: 0
> Device: eth4@2
> Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
>   5077431pps 2437Mb/sec (2437166880bps) errors: 0
> Device: eth4@3
> Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
>   5043067pps 2420Mb/sec (2420672160bps) errors: 0
>
> perf report --no-children:
>  26.05%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
>  17.84%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
>   5.52%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag

This just proves my point on the previous patch: reusing the rx_desc
buffers we are going to drop would save the ~23% of CPU wasted here on
alloc_frags & free_frags, and this can improve some benchmark results
where the CPU is the bottleneck.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-09 19:58   ` Saeed Mahameed
@ 2016-07-09 21:37     ` Or Gerlitz
  2016-07-10 15:25     ` Tariq Toukan
  1 sibling, 0 replies; 59+ messages in thread
From: Or Gerlitz @ 2016-07-09 21:37 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Brenden Blanco, David S. Miller, Linux Netdev List,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, john fastabend, Hannes Frederic Sowa,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Sat, Jul 9, 2016 at 10:58 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
>> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>>
>> In tc/socket bpf programs, helpers linearize skb fragments as needed
>> when the program touchs the packet data. However, in the pursuit of
>> speed, XDP programs will not be allowed to use these slower functions,
>> especially if it involves allocating an skb.
>>
>> Therefore, disallow MTU settings that would produce a multi-fragment
>> packet that XDP programs would fail to access. Future enhancements could
>> be done to increase the allowable MTU.
>>
>> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 38 ++++++++++++++++++++++++++
>>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 36 +++++++++++++++++++++---
>>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 ++++
>>  3 files changed, 75 insertions(+), 4 deletions(-)
>>
> [...]
>> +               /* A bpf program gets first chance to drop the packet. It may
>> +                * read bytes but not past the end of the frag.
>> +                */
>> +               if (prog) {
>> +                       struct xdp_buff xdp;
>> +                       dma_addr_t dma;
>> +                       u32 act;
>> +
>> +                       dma = be64_to_cpu(rx_desc->data[0].addr);
>> +                       dma_sync_single_for_cpu(priv->ddev, dma,
>> +                                               priv->frag_info[0].frag_size,
>> +                                               DMA_FROM_DEVICE);
>
> In case of XDP_PASS we will dma_sync again in the normal path,

yep, correct

> this can be improved by doing the dma_sync as soon as we can and once and
> for all, regardless of the path the packet is going to take
> (XDP_DROP/mlx4_en_complete_rx_desc/mlx4_en_rx_skb).

how would you envision this being done in a not-very-ugly way?


>> +
>> +                       xdp.data = page_address(frags[0].page) +
>> +                                                       frags[0].page_offset;
>> +                       xdp.data_end = xdp.data + length;
>> +
>> +                       act = bpf_prog_run_xdp(prog, &xdp);
>> +                       switch (act) {
>> +                       case XDP_PASS:
>> +                               break;
>> +                       default:
>> +                               bpf_warn_invalid_xdp_action(act);
>> +                       case XDP_DROP:
>> +                               goto next;
>
> The drop action here (goto next) will release the current rx_desc
> buffers and use new ones to refill, I know that the mlx4 rx scheme
> will release/allocate new pages once every ~32 packet, but one
> improvement can really help here especially for XDP_DROP benchmarks is
> to reuse the current rx_desc buffers in case it is going to be
> dropped.

> Considering if mlx4 rx buffer scheme doesn't allow gaps, Maybe this
> can be added later as future improvement for the whole mlx4 rx data
> path drop decisions.

yes, I think it makes sense to look at this as a future improvement.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-09 13:47     ` Tom Herbert
@ 2016-07-10 13:37       ` Jesper Dangaard Brouer
  2016-07-10 17:09         ` Brenden Blanco
  2016-07-10 20:27         ` Tom Herbert
  0 siblings, 2 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-10 13:37 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann, brouer

On Sat, 9 Jul 2016 08:47:52 -0500
Tom Herbert <tom@herbertland.com> wrote:

> On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Thu,  7 Jul 2016 19:15:13 -0700
> > Brenden Blanco <bblanco@plumgrid.com> wrote:
> >  
> >> Add a new bpf prog type that is intended to run in early stages of the
> >> packet rx path. Only minimal packet metadata will be available, hence a
> >> new context type, struct xdp_md, is exposed to userspace. So far only
> >> expose the packet start and end pointers, and only in read mode.
> >>
> >> An XDP program must return one of the well known enum values, all other
> >> return codes are reserved for future use. Unfortunately, this
> >> restriction is hard to enforce at verification time, so take the
> >> approach of warning at runtime when such programs are encountered. The
> >> driver can choose to implement unknown return codes however it wants,
> >> but must invoke the warning helper with the action value.  
> >
> > I believe we should define a stronger semantics for unknown/future
> > return codes than the once stated above:
> >  "driver can choose to implement unknown return codes however it wants"
> >
> > The mlx4 driver implementation in:
> >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
> >
> > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> >  
> >> +             /* A bpf program gets first chance to drop the packet. It may
> >> +              * read bytes but not past the end of the frag.
> >> +              */
> >> +             if (prog) {
> >> +                     struct xdp_buff xdp;
> >> +                     dma_addr_t dma;
> >> +                     u32 act;
> >> +
> >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
> >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
> >> +                                             priv->frag_info[0].frag_size,
> >> +                                             DMA_FROM_DEVICE);
> >> +
> >> +                     xdp.data = page_address(frags[0].page) +
> >> +                                                     frags[0].page_offset;
> >> +                     xdp.data_end = xdp.data + length;
> >> +
> >> +                     act = bpf_prog_run_xdp(prog, &xdp);
> >> +                     switch (act) {
> >> +                     case XDP_PASS:
> >> +                             break;
> >> +                     default:
> >> +                             bpf_warn_invalid_xdp_action(act);
> >> +                     case XDP_DROP:
> >> +                             goto next;
> >> +                     }
> >> +             }  
> >
> > Thus, mlx4 choice is to drop packets for unknown/future return codes.
> >
> > I think this is the wrong choice.  I think the choice should be
> > XDP_PASS, to pass the packet up the stack.
> >
> > I find "XDP_DROP" problematic because it happen so early in the driver,
> > that we lost all possibilities to debug what packets gets dropped.  We
> > get a single kernel log warning, but we cannot inspect the packets any
> > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> > tcpdump) is available.
> >  
>
> It's an API issue though not a problem with the packet. Allowing
> unknown return codes to pass seems like a major security problem also.

We have the full power and flexibility of the normal Linux stack to
drop these packets.  And from a usability perspective it gives insight
into what is wrong, plus counters and metrics.  Would you rather blindly
drop e.g. 0.01% of the packets in your data-centers without knowing?

We already talk about XDP as an offload mechanism.  Normally, when
loading an (XDP) "offload" program it should be rejected, e.g. by the
validator.  BUT we cannot validate all eBPF return codes, because they
can originate from a table lookup.  Thus, we _do_ allow programs to be
loaded with future, unknown return codes.
 This then corresponds to only part of the program being offloadable;
thus the natural response is to fall back, handling this in the
non-offloaded slower path.
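
To make the table-lookup point concrete, a toy XDP program of roughly
this shape returns its action from a map, so the verifier cannot bound
the value at load time (sketch in the style of the samples/bpf programs;
the exact section name and helper declarations depend on bpf_helpers.h
and the loader):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") action_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u32),
	.max_entries = 1,
};

SEC("prog")
int xdp_action_from_map(struct xdp_md *ctx)
{
	__u32 key = 0;
	__u32 *act = bpf_map_lookup_elem(&action_map, &key);

	if (!act)
		return XDP_PASS;
	return *act;	/* may be any value userspace wrote to the map */
}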

I see the XDP_PASS fallback as a natural way of supporting loading
newer/future programs on older "versions" of XDP.
  E.g. I can have an XDP program that has a valid filter protection
mechanism but also uses a newer mechanism, and my server fleet contains
different NIC vendors where some NICs only support the filter part.
Then I want to avoid having to compile and maintain different XDP/eBPF
programs per NIC vendor.  (Instead I prefer having a Linux stack
fallback mechanism, and transparently XDP-offloading as much as the NIC
driver supports).


> > I can also imagine that, defaulting to XDP_PASS, can be an important
> > feature in the future.
> >
> > In the future we will likely have features, where XDP can "offload"
> > packet delivery from the normal stack (e.g. delivery into a VM).  On a
> > running production system you can then load your XDP program.  If the
> > driver was too old defaulting to XDP_DROP, then you lost your service,
> > instead if defaulting to XDP_PASS, your service would survive, falling
> > back to normal delivery.
> >
> > (For the VM delivery use-case, there will likely be a need for having a
> > fallback delivery method in place, when the XDP program is not active,
> > in-order to support VM migration).
> >
> >
> >  
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index c14ca1c..5b47ac3 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h  
> > [...]  
> >>
> >> +/* User return codes for XDP prog type.
> >> + * A valid XDP program must return one of these defined values. All other
> >> + * return codes are reserved for future use. Unknown return codes will result
> >> + * in driver-dependent behavior.
> >> + */
> >> +enum xdp_action {
> >> +     XDP_DROP,
> >> +     XDP_PASS,
> >> +};
> >> +  
> > [...]  
> >>  #endif /* _UAPI__LINUX_BPF_H__ */
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index e206c21..a8d67d0 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c  
> > [...]  
> >> +void bpf_warn_invalid_xdp_action(int act)
> >> +{
> >> +     WARN_ONCE(1, "\n"
> >> +                  "*****************************************************\n"
> >> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> >> +                  "**                                               **\n"
> >> +                  "** XDP program returned unknown value %-10u **\n"
> >> +                  "**                                               **\n"
> >> +                  "** XDP programs must return a well-known return  **\n"
> >> +                  "** value. Invalid return values will result in   **\n"
> >> +                  "** undefined packet actions.                     **\n"
> >> +                  "**                                               **\n"
> >> +                  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> >> +                  "*****************************************************\n",
> >> +               act);
> >> +}
> >> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> >> +  
> >
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   Author of http://www.iptv-analyzer.org
> >   LinkedIn: http://www.linkedin.com/in/brouer  



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-09 19:58   ` Saeed Mahameed
  2016-07-09 21:37     ` Or Gerlitz
@ 2016-07-10 15:25     ` Tariq Toukan
  2016-07-10 16:05       ` Brenden Blanco
  1 sibling, 1 reply; 59+ messages in thread
From: Tariq Toukan @ 2016-07-10 15:25 UTC (permalink / raw)
  To: Saeed Mahameed, Brenden Blanco
  Cc: David S. Miller, Linux Netdev List, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann


On 09/07/2016 10:58 PM, Saeed Mahameed wrote:
> On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
>> +               /* A bpf program gets first chance to drop the packet. It may
>> +                * read bytes but not past the end of the frag.
>> +                */
>> +               if (prog) {
>> +                       struct xdp_buff xdp;
>> +                       dma_addr_t dma;
>> +                       u32 act;
>> +
>> +                       dma = be64_to_cpu(rx_desc->data[0].addr);
>> +                       dma_sync_single_for_cpu(priv->ddev, dma,
>> +                                               priv->frag_info[0].frag_size,
>> +                                               DMA_FROM_DEVICE);
> In case of XDP_PASS we will dma_sync again in the normal path, this
> can be improved by doing the dma_sync as soon as we can and once and
> for all, regardless of the path the packet is going to take
> (XDP_DROP/mlx4_en_complete_rx_desc/mlx4_en_rx_skb).
I agree with Saeed; dma_sync is a heavy operation that is now done twice
for all packets with XDP_PASS.
We should try our best to avoid performance degradation in the flow of
unfiltered packets.

Regards,
Tariq

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-09 14:07   ` Or Gerlitz
@ 2016-07-10 15:40     ` Brenden Blanco
  2016-07-10 16:38       ` Tariq Toukan
  0 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-10 15:40 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David Miller, Linux Netdev List, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov,
	john fastabend, Hannes Frederic Sowa, Thomas Graf, Tom Herbert,
	Daniel Borkmann

On Sat, Jul 09, 2016 at 05:07:36PM +0300, Or Gerlitz wrote:
> On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> > Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> >
> > In tc/socket bpf programs, helpers linearize skb fragments as needed
> > when the program touchs the packet data. However, in the pursuit of
> 
> nit, for the next version touchs --> touches
Will fix.
> 
[...]
> > +                       switch (act) {
> > +                       case XDP_PASS:
> > +                               break;
> > +                       default:
> > +                               bpf_warn_invalid_xdp_action(act);
> > +                       case XDP_DROP:
> > +                               goto next;
> > +                       }
> > +               }
> 
> 
> (probably a nit too, but wanted to make sure we don't miss something
> here) is the default case preceding the DROP one in purpose? any
> special reason to do that?
This is intentional, and legal though unconventional C. Without this
order, the later patches end up with a bit too much copy/paste for my
liking, as in:

case XDP_DROP:
        if (mlx4_en_rx_recycle(ring, frags))
                goto consumed;
        goto next;
default:
        bpf_warn_invalid_xdp_action(act);
        if (mlx4_en_rx_recycle(ring, frags))
                goto consumed;
        goto next;
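
For contrast, with the default case kept ahead of XDP_DROP as in this
series, the recycle/drop handling is shared; presumably the later patch
then reads roughly like this (sketch only):

	switch (act) {
	case XDP_PASS:
		break;
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fall through */
	case XDP_DROP:
		if (mlx4_en_rx_recycle(ring, frags))
			goto consumed;
		goto next;
	}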

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-10 15:25     ` Tariq Toukan
@ 2016-07-10 16:05       ` Brenden Blanco
  2016-07-11 11:48         ` Saeed Mahameed
  0 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-10 16:05 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Saeed Mahameed, David S. Miller, Linux Netdev List,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Sun, Jul 10, 2016 at 06:25:40PM +0300, Tariq Toukan wrote:
> 
> On 09/07/2016 10:58 PM, Saeed Mahameed wrote:
> >On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> >>+               /* A bpf program gets first chance to drop the packet. It may
> >>+                * read bytes but not past the end of the frag.
> >>+                */
> >>+               if (prog) {
> >>+                       struct xdp_buff xdp;
> >>+                       dma_addr_t dma;
> >>+                       u32 act;
> >>+
> >>+                       dma = be64_to_cpu(rx_desc->data[0].addr);
> >>+                       dma_sync_single_for_cpu(priv->ddev, dma,
> >>+                                               priv->frag_info[0].frag_size,
> >>+                                               DMA_FROM_DEVICE);
> >In case of XDP_PASS we will dma_sync again in the normal path, this
> >can be improved by doing the dma_sync as soon as we can and once and
> >for all, regardless of the path the packet is going to take
> >(XDP_DROP/mlx4_en_complete_rx_desc/mlx4_en_rx_skb).
> I agree with Saeed, dma_sync is a heavy operation that is now done
> twice for all packets with XDP_PASS.
> We should try our best to avoid performance degradation in the flow
> of unfiltered packets.
Makes sense, do folks here see a way to do this cleanly?
> 
> Regards,
> Tariq

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding
  2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (11 preceding siblings ...)
  2016-07-08  2:15 ` [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path Brenden Blanco
@ 2016-07-10 16:14 ` Tariq Toukan
  12 siblings, 0 replies; 59+ messages in thread
From: Tariq Toukan @ 2016-07-10 16:14 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann


On 08/07/2016 5:15 AM, Brenden Blanco wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling eXpress Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before
> even an skb has been allocated.
>
> Extend on this with the ability to modify packet data and send back out
> on the same port.
>
> Patch 1 introduces the new prog type and helpers for validating the bpf
>    program. A new userspace struct is defined containing only data and
>    data_end as fields, with others to follow in the future.
> In patch 2, create a new ndo to pass the fd to supported drivers.
> In patch 3, expose a new rtnl option to userspace.
> In patch 4, enable support in mlx4 driver.
> In patch 5, create a sample drop and count program. With single core,
>    achieved ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
>    packet data access, bpf array lookup, and increment.
> In patch 6, add a page recycle facility to mlx4 rx, enabled when xdp is
>    active.
> In patch 7, add the XDP_TX type to bpf.h
> In patch 8, add helper in tx patch for writing tx_desc
> In patch 9, add support in mlx4 for packet data write and forwarding
> In patch 10, turn on packet write support in the bpf verifier
> In patch 11, add a sample program for packet write and forwarding. With
>    single core, achieved ~10 Mpps rewrite and forwarding.
> In patch 12, add prefetch to mlx4 rx to bump forwarding to 12 Mpps
In general, I need some time (2-3 days) to run regression tests over
this updated patch-set, as it makes non-trivial changes to our mlx4_en
data-path flows.
Other specific comments will be addressed separately.

Regards,
Tariq

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-10 15:40     ` Brenden Blanco
@ 2016-07-10 16:38       ` Tariq Toukan
  0 siblings, 0 replies; 59+ messages in thread
From: Tariq Toukan @ 2016-07-10 16:38 UTC (permalink / raw)
  To: Brenden Blanco, Or Gerlitz, Jesper Dangaard Brouer
  Cc: David Miller, Linux Netdev List, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov,
	john fastabend, Hannes Frederic Sowa, Thomas Graf, Tom Herbert,
	Daniel Borkmann


>>> +                       switch (act) {
>>> +                       case XDP_PASS:
>>> +                               break;
>>> +                       default:
>>> +                               bpf_warn_invalid_xdp_action(act);
>>> +                       case XDP_DROP:
>>> +                               goto next;
>>> +                       }
>>> +               }
>>
>> (probably a nit too, but wanted to make sure we don't miss something
>> here) is the default case preceding the DROP one in purpose? any
>> special reason to do that?
> This is intentional, and legal though unconventional C. Without this
> order, the later patches end up with a bit too much copy/paste for my
> liking, as in:
>
> case XDP_DROP:
>          if (mlx4_en_rx_recycle(ring, frags))
>                  goto consumed;
>          goto next;
> default:
>          bpf_warn_invalid_xdp_action(act);
>          if (mlx4_en_rx_recycle(ring, frags))
>                  goto consumed;
>          goto next;
The more critical issue here is the default action. All packets that get
an unknown/unsupported filter classification will be dropped.
I think XDP_PASS is the better choice for the default action here; there
are good reasons why it should be, as Jesper already explained in
replies to another patch in the series.

Regards,
Tariq

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-10 13:37       ` Jesper Dangaard Brouer
@ 2016-07-10 17:09         ` Brenden Blanco
  2016-07-10 20:30           ` Tom Herbert
  2016-07-10 20:27         ` Tom Herbert
  1 sibling, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-10 17:09 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Sun, Jul 10, 2016 at 03:37:31PM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 9 Jul 2016 08:47:52 -0500
> Tom Herbert <tom@herbertland.com> wrote:
> 
> > On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> > > On Thu,  7 Jul 2016 19:15:13 -0700
> > > Brenden Blanco <bblanco@plumgrid.com> wrote:
> > >  
> > >> Add a new bpf prog type that is intended to run in early stages of the
> > >> packet rx path. Only minimal packet metadata will be available, hence a
> > >> new context type, struct xdp_md, is exposed to userspace. So far only
> > >> expose the packet start and end pointers, and only in read mode.
> > >>
> > >> An XDP program must return one of the well known enum values, all other
> > >> return codes are reserved for future use. Unfortunately, this
> > >> restriction is hard to enforce at verification time, so take the
> > >> approach of warning at runtime when such programs are encountered. The
> > >> driver can choose to implement unknown return codes however it wants,
> > >> but must invoke the warning helper with the action value.  
> > >
> > > I believe we should define a stronger semantics for unknown/future
> > > return codes than the once stated above:
> > >  "driver can choose to implement unknown return codes however it wants"
> > >
> > > The mlx4 driver implementation in:
> > >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
> > >
> > > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> > >  
> > >> +             /* A bpf program gets first chance to drop the packet. It may
> > >> +              * read bytes but not past the end of the frag.
> > >> +              */
> > >> +             if (prog) {
> > >> +                     struct xdp_buff xdp;
> > >> +                     dma_addr_t dma;
> > >> +                     u32 act;
> > >> +
> > >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
> > >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
> > >> +                                             priv->frag_info[0].frag_size,
> > >> +                                             DMA_FROM_DEVICE);
> > >> +
> > >> +                     xdp.data = page_address(frags[0].page) +
> > >> +                                                     frags[0].page_offset;
> > >> +                     xdp.data_end = xdp.data + length;
> > >> +
> > >> +                     act = bpf_prog_run_xdp(prog, &xdp);
> > >> +                     switch (act) {
> > >> +                     case XDP_PASS:
> > >> +                             break;
> > >> +                     default:
> > >> +                             bpf_warn_invalid_xdp_action(act);
> > >> +                     case XDP_DROP:
> > >> +                             goto next;
> > >> +                     }
> > >> +             }  
> > >
> > > Thus, mlx4 choice is to drop packets for unknown/future return codes.
> > >
> > > I think this is the wrong choice.  I think the choice should be
> > > XDP_PASS, to pass the packet up the stack.
> > >
> > > I find "XDP_DROP" problematic because it happen so early in the driver,
> > > that we lost all possibilities to debug what packets gets dropped.  We
> > > get a single kernel log warning, but we cannot inspect the packets any
> > > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> > > tcpdump) is available.
The goal of XDP is performance, and therefore the driver-specific choice
I am making here is to drop, because it preserves resources to do so. I
cannot say for all drivers whether this is the right choice or not.
Therefore, in the user-facing API I leave it undefined, so that future
drivers can have flexibility to choose the most performant
implementation for themselves.

Consider the UDP DDoS use case that we have mentioned before. Suppose an
attacker has knowledge of the particular XDP program that is being used
to filter their packets, and can somehow overflow the return code of the
program. The attacker would prefer that the overflow case consumes
time/memory/both, which if the mlx4 driver were to pass to stack it
would certainly do, and so we must choose the opposite if we have
network security in mind (we do!).
> > >  
> >
> > It's an API issue though not a problem with the packet. Allowing
> > unknown return codes to pass seems like a major security problem also.
> 
> We have the full power and flexibility of the normal Linux stack to
> drop these packets.  And from a usability perspective it gives insight
> into what is wrong and counters metrics.  Would you rather blindly drop
> e.g. 0.01% of the packets in your data-centers without knowing.
Full power, but not full speed, and in the case of DDoS mitigation this
is a strong enough argument IMHO.
> 
> We already talk about XDP as an offload mechanism.  Normally when
> loading a (XDP) "offload" program it should be rejected, e.g. by the
> validator.  BUT we cannot validate all return eBPF codes, because they
> can originate from a table lookup.  Thus, we _do_ allow programs to be
> loaded, with future unknown return code.
>  This then corresponds to only part of the program can be offloaded,
> thus the natural response is to fallback, handling this is the
> non-offloaded slower-path.
> 
> I see the XDP_PASS fallback as a natural way of supporting loading
> newer/future programs on older "versions" of XDP.
>   E.g. I can have a XDP program that have a valid filter protection
> mechanism, but also use a newer mechanism, and my server fleet contains
> different NIC vendors, some NICs only support the filter part.  Then I
> want to avoid having to compile and maintain different XDP/eBPF
> programs per NIC vendor. (Instead I prefer having a Linux stack
> fallback mechanism, and transparently XDP offload as much as the NIC
> driver supports).
I would then argue to only support offloading of XDP programs with
verifiable return codes. We're not at that stage yet, and I think we can
choose different defaults for these two cases.

We have conflicting examples here, which lead to different conclusions.
Reiterating an earlier argument that I made for others on the list to
consider:
"""
Besides, I don't see how PASS is any more correct than DROP. Consider a
future program that is intended to rewrite a packet and forward it out
another port (with some TX_OTHER return code or whatever). If the driver
PASSes the packet, it will still not be interpreted by the stack, since
it may have been destined for some other machine.
"""
So, IMHO there is not a clear right or wrong, and I still fall back to
the security argument to resolve the dilemma. The point there is not
drop/pass, but resource preservation.

> 
> 
> > > I can also imagine that, defaulting to XDP_PASS, can be an important
> > > feature in the future.
> > >
> > > In the future we will likely have features, where XDP can "offload"
> > > packet delivery from the normal stack (e.g. delivery into a VM).  On a
> > > running production system you can then load your XDP program.  If the
> > > driver was too old defaulting to XDP_DROP, then you lost your service,
> > > instead if defaulting to XDP_PASS, your service would survive, falling
> > > back to normal delivery.
> > >
> > > (For the VM delivery use-case, there will likely be a need for having a
> > > fallback delivery method in place, when the XDP program is not active,
> > > in-order to support VM migration).
> > >
> > >
> > >  
[...]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-10 13:37       ` Jesper Dangaard Brouer
  2016-07-10 17:09         ` Brenden Blanco
@ 2016-07-10 20:27         ` Tom Herbert
  2016-07-11 11:36           ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-10 20:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann

On Sun, Jul 10, 2016 at 8:37 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Sat, 9 Jul 2016 08:47:52 -0500
> Tom Herbert <tom@herbertland.com> wrote:
>
>> On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > On Thu,  7 Jul 2016 19:15:13 -0700
>> > Brenden Blanco <bblanco@plumgrid.com> wrote:
>> >
>> >> Add a new bpf prog type that is intended to run in early stages of the
>> >> packet rx path. Only minimal packet metadata will be available, hence a
>> >> new context type, struct xdp_md, is exposed to userspace. So far only
>> >> expose the packet start and end pointers, and only in read mode.
>> >>
>> >> An XDP program must return one of the well known enum values, all other
>> >> return codes are reserved for future use. Unfortunately, this
>> >> restriction is hard to enforce at verification time, so take the
>> >> approach of warning at runtime when such programs are encountered. The
>> >> driver can choose to implement unknown return codes however it wants,
>> >> but must invoke the warning helper with the action value.
>> >
>> > I believe we should define a stronger semantics for unknown/future
>> > return codes than the once stated above:
>> >  "driver can choose to implement unknown return codes however it wants"
>> >
>> > The mlx4 driver implementation in:
>> >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
>> >
>> > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>> >
>> >> +             /* A bpf program gets first chance to drop the packet. It may
>> >> +              * read bytes but not past the end of the frag.
>> >> +              */
>> >> +             if (prog) {
>> >> +                     struct xdp_buff xdp;
>> >> +                     dma_addr_t dma;
>> >> +                     u32 act;
>> >> +
>> >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
>> >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
>> >> +                                             priv->frag_info[0].frag_size,
>> >> +                                             DMA_FROM_DEVICE);
>> >> +
>> >> +                     xdp.data = page_address(frags[0].page) +
>> >> +                                                     frags[0].page_offset;
>> >> +                     xdp.data_end = xdp.data + length;
>> >> +
>> >> +                     act = bpf_prog_run_xdp(prog, &xdp);
>> >> +                     switch (act) {
>> >> +                     case XDP_PASS:
>> >> +                             break;
>> >> +                     default:
>> >> +                             bpf_warn_invalid_xdp_action(act);
>> >> +                     case XDP_DROP:
>> >> +                             goto next;
>> >> +                     }
>> >> +             }
>> >
>> > Thus, mlx4 choice is to drop packets for unknown/future return codes.
>> >
>> > I think this is the wrong choice.  I think the choice should be
>> > XDP_PASS, to pass the packet up the stack.
>> >
>> > I find "XDP_DROP" problematic because it happen so early in the driver,
>> > that we lost all possibilities to debug what packets gets dropped.  We
>> > get a single kernel log warning, but we cannot inspect the packets any
>> > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
>> > tcpdump) is available.
>> >
>>
>> It's an API issue though not a problem with the packet. Allowing
>> unknown return codes to pass seems like a major security problem also.
>
> We have the full power and flexibility of the normal Linux stack to
> drop these packets.  And from a usability perspective it gives insight
> into what is wrong and counters metrics.  Would you rather blindly drop
> e.g. 0.01% of the packets in your data-centers without knowing.
>
This is not blindly dropping packets; the bad action should be logged,
counters incremented, and the packet could be passed to the stack as an
error if deeper inspection is needed. IMO, I would rather drop
something not understood than accept it; determinism is a goal also.

> We already talk about XDP as an offload mechanism.  Normally when
> loading a (XDP) "offload" program it should be rejected, e.g. by the
> validator.  BUT we cannot validate all return eBPF codes, because they
> can originate from a table lookup.  Thus, we _do_ allow programs to be
> loaded, with future unknown return code.
>  This then corresponds to only part of the program can be offloaded,
> thus the natural response is to fallback, handling this is the
> non-offloaded slower-path.
>
> I see the XDP_PASS fallback as a natural way of supporting loading
> newer/future programs on older "versions" of XDP.

Then in this model we could only add codes that allow passing packets.
For instance, what if a new return code means "Drop this packet and
log it as critical because if you receive it the stack will crash"?
;-) IMO ignoring something not understood for the sake of
extensibility is a red herring. In the long run doing this actually
limits our ability to extend things for both APIs and protocols (a
great example of this is VXLAN, which mandates that unknown flags are
ignored in RX, so VXLAN-GPE had to be a new, incompatible protocol to
get a next-protocol field).

>   E.g. I can have a XDP program that have a valid filter protection
> mechanism, but also use a newer mechanism, and my server fleet contains
> different NIC vendors, some NICs only support the filter part.  Then I
> want to avoid having to compile and maintain different XDP/eBPF
> programs per NIC vendor. (Instead I prefer having a Linux stack
> fallback mechanism, and transparently XDP offload as much as the NIC
> driver supports).
>
As Brenden points out, fallbacks easily become DOS vectors.

Tom

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-10 17:09         ` Brenden Blanco
@ 2016-07-10 20:30           ` Tom Herbert
  2016-07-11 10:15             ` Daniel Borkmann
  0 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-10 20:30 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Jesper Dangaard Brouer, David S. Miller,
	Linux Kernel Network Developers, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Sun, Jul 10, 2016 at 12:09 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Sun, Jul 10, 2016 at 03:37:31PM +0200, Jesper Dangaard Brouer wrote:
>> On Sat, 9 Jul 2016 08:47:52 -0500
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>> > On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
>> > <brouer@redhat.com> wrote:
>> > > On Thu,  7 Jul 2016 19:15:13 -0700
>> > > Brenden Blanco <bblanco@plumgrid.com> wrote:
>> > >
>> > >> Add a new bpf prog type that is intended to run in early stages of the
>> > >> packet rx path. Only minimal packet metadata will be available, hence a
>> > >> new context type, struct xdp_md, is exposed to userspace. So far only
>> > >> expose the packet start and end pointers, and only in read mode.
>> > >>
>> > >> An XDP program must return one of the well known enum values, all other
>> > >> return codes are reserved for future use. Unfortunately, this
>> > >> restriction is hard to enforce at verification time, so take the
>> > >> approach of warning at runtime when such programs are encountered. The
>> > >> driver can choose to implement unknown return codes however it wants,
>> > >> but must invoke the warning helper with the action value.
>> > >
>> > > I believe we should define a stronger semantics for unknown/future
>> > > return codes than the once stated above:
>> > >  "driver can choose to implement unknown return codes however it wants"
>> > >
>> > > The mlx4 driver implementation in:
>> > >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
>> > >
>> > > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>> > >
>> > >> +             /* A bpf program gets first chance to drop the packet. It may
>> > >> +              * read bytes but not past the end of the frag.
>> > >> +              */
>> > >> +             if (prog) {
>> > >> +                     struct xdp_buff xdp;
>> > >> +                     dma_addr_t dma;
>> > >> +                     u32 act;
>> > >> +
>> > >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
>> > >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
>> > >> +                                             priv->frag_info[0].frag_size,
>> > >> +                                             DMA_FROM_DEVICE);
>> > >> +
>> > >> +                     xdp.data = page_address(frags[0].page) +
>> > >> +                                                     frags[0].page_offset;
>> > >> +                     xdp.data_end = xdp.data + length;
>> > >> +
>> > >> +                     act = bpf_prog_run_xdp(prog, &xdp);
>> > >> +                     switch (act) {
>> > >> +                     case XDP_PASS:
>> > >> +                             break;
>> > >> +                     default:
>> > >> +                             bpf_warn_invalid_xdp_action(act);
>> > >> +                     case XDP_DROP:
>> > >> +                             goto next;
>> > >> +                     }
>> > >> +             }
>> > >
>> > > Thus, mlx4 choice is to drop packets for unknown/future return codes.
>> > >
>> > > I think this is the wrong choice.  I think the choice should be
>> > > XDP_PASS, to pass the packet up the stack.
>> > >
>> > > I find "XDP_DROP" problematic because it happens so early in the driver
>> > > that we lose all possibility to debug which packets get dropped.  We
>> > > get a single kernel log warning, but we cannot inspect the packets any
>> > > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
>> > > tcpdump) are available.
> The goal of XDP is performance, and therefore the driver-specific choice
> I am making here is to drop, because it preserves resources to do so. I
> cannot say for all drivers whether this is the right choice or not.
> Therefore, in the user-facing API I leave it undefined, so that future
> drivers can have flexibility to choose the most performant
> implementation for themselves.
>
I don't think this should be undefined. If a driver receives a code
from XDP that it doesn't understand, that is an API mismatch and a bug
somewhere.

> Consider the UDP DDoS use case that we have mentioned before. Suppose an
> attacker has knowledge of the particular XDP program that is being used
> to filter their packets, and can somehow overflow the return code of the
> program. The attacker would prefer that the overflow case consume
> time/memory/both, which it certainly would if the mlx4 driver were to
> pass the packet to the stack, and so we must choose the opposite if we
> have network security in mind (we do!).

Yes.

>> > >
>> >
>> > It's an API issue though not a problem with the packet. Allowing
>> > unknown return codes to pass seems like a major security problem also.
>>
>> We have the full power and flexibility of the normal Linux stack to
>> drop these packets.  And from a usability perspective it gives insight
>> into what is wrong and counters metrics.  Would you rather blindly drop
>> e.g. 0.01% of the packets in your data-centers without knowing.
> Full power, but not full speed, and in the case of DDoS mitigation this
> is a strong enough argument IMHO.
>>
>> We already talk about XDP as an offload mechanism.  Normally when
>> loading a (XDP) "offload" program it should be rejected, e.g. by the
>> validator.  BUT we cannot validate all return eBPF codes, because they
>> can originate from a table lookup.  Thus, we _do_ allow programs to be
>> loaded, with future unknown return codes.
>>  This then corresponds to only part of the program being offloadable,
>> thus the natural response is to fall back, handling this in the
>> non-offloaded slower-path.
>>
>> I see the XDP_PASS fallback as a natural way of supporting loading
>> newer/future programs on older "versions" of XDP.
>>   E.g. I can have an XDP program that has a valid filter protection
>> mechanism, but also uses a newer mechanism, and my server fleet contains
>> different NIC vendors, some NICs only support the filter part.  Then I
>> want to avoid having to compile and maintain different XDP/eBPF
>> programs per NIC vendor. (Instead I prefer having a Linux stack
>> fallback mechanism, and transparently XDP offload as much as the NIC
>> driver supports).
> I would then argue to only support offloading of XDP programs with
> verifiable return codes. We're not at that stage yet, and I think we can
> choose different defaults for these two cases.
>
> We have conflicting examples here, which lead to different conclusions.
> Reiterating an earlier argument that I made for others on the list to
> consider:
> """
> Besides, I don't see how PASS is any more correct than DROP. Consider a
> future program that is intended to rewrite a packet and forward it out
> another port (with some TX_OTHER return code or whatever). If the driver
> PASSes the packet, it will still not be interpreted by the stack, since
> it may have been destined for some other machine.
> """
> So, IMHO there is not a clear right or wrong, and I still fall back to
> the security argument to resolve the dilemma. The point there is not
> drop/pass, but resource preservation.
>
Blind pass is a security risk, drop is always a correct action in that sense.

Tom

>>
>>
>> > > I can also imagine that, defaulting to XDP_PASS, can be an important
>> > > feature in the future.
>> > >
>> > > In the future we will likely have features, where XDP can "offload"
>> > > packet delivery from the normal stack (e.g. delivery into a VM).  On a
>> > > running production system you can then load your XDP program.  If the
>> > > driver was too old defaulting to XDP_DROP, then you lost your service,
>> > > instead if defaulting to XDP_PASS, your service would survive, falling
>> > > back to normal delivery.
>> > >
>> > > (For the VM delivery use-case, there will likely be a need for having a
>> > > fallback delivery method in place, when the XDP program is not active,
>> > > in-order to support VM migration).
>> > >
>> > >
>> > >
> [...]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08 16:49         ` Brenden Blanco
@ 2016-07-10 20:48           ` Tom Herbert
  2016-07-10 20:50           ` Tom Herbert
  1 sibling, 0 replies; 59+ messages in thread
From: Tom Herbert @ 2016-07-10 20:48 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Eric Dumazet, Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Fri, Jul 8, 2016 at 11:49 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Fri, Jul 08, 2016 at 08:56:45AM +0200, Eric Dumazet wrote:
>> On Thu, 2016-07-07 at 21:16 -0700, Alexei Starovoitov wrote:
>>
>> > I've tried this style of prefetching in the past for normal stack
>> > and it didn't help at all.
>>
>> This is very nice, but my experience showed opposite numbers.
>> So I guess you did not choose the proper prefetch strategy.
>>
>> prefetching in mlx4 gave me good results, once I made sure our compiler
>> was not moving the actual prefetch operations on x86_64 (ie forcing use
>> of asm volatile as in x86_32 instead of the builtin prefetch). You might
>> check if your compiler does the proper thing because this really hurt me
>> in the past.
>>
>> In my case, I was using 40Gbit NIC, and prefetching 128 bytes instead of
>> 64 bytes allowed to remove one stall in GRO engine when using TCP with
>> TS (total header size : 66 bytes), or tunnels.
>>
>> The problem with prefetch is that it works well assuming a given rate
>> (in pps), and given cpus, as prefetch behavior is varying among flavors.
>>
>> Brenden chose to prefetch N+3, based on some experiments, on some
>> hardware,
>>
>> prefetch N+3 can actually slow down if you receive a moderate load,
>> which is the case 99% of the time in typical workloads on modern servers
>> with multi queue NIC.
> Thanks for the feedback Eric!
>
> This particular patch in the series is meant to be standalone exactly
> for this reason. I don't pretend to assert that this optimization will
> work for everybody, or even for a future version of me with different
> hardware. But, it passes my internal criteria for usefulness:
> 1. It provides a measurable gain in the experiments that I have at hand
> 2. The code is easy to review
> 3. The change does not negatively impact non-XDP users
>
> I would love to have a solution for all mlx4 driver users, but this
> patch set is focused on a different goal. So, without munging a
> different set of changes for the universal use case, and probably
> violating criteria #2 or #3, I went with what you see.
>
> In hopes of not derailing the whole patch series, what is an actionable
> next step for this patch #12?
> Ideas:
> Pick a safer N? (I saw improvements with N=1 as well)
> Drop this patch?
>
As Alexei mentioned, prefetch may be dependent on workload. The XDP
program for an ILA router is a far shorter code path than packets going
through TCP, so it makes sense that we would want different prefetch
characteristics to optimize for each case. Can we make this a
configurable value for each RX queue?

> One thing I definitely don't want to do is go into the weeds trying to
> get a universal prefetch logic in order to merge the XDP framework, even
> though I agree the net result would benefit everybody.

Agreed, a salient point of XDP is that it's _not_ a generic mechanism.
The performance comparison for XDP should be against the HW solutions
that we're trying to replace with commodity HW, not the full general
purpose SW stack.
>>
>> This is why it was hard to upstream such changes, because they focus on
>> max throughput instead of low latencies.
>>
>>
>>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-08 16:49         ` Brenden Blanco
  2016-07-10 20:48           ` Tom Herbert
@ 2016-07-10 20:50           ` Tom Herbert
  2016-07-11 14:54             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-10 20:50 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Eric Dumazet, Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

> This particular patch in the series is meant to be standalone exactly
> for this reason. I don't pretend to assert that this optimization will
> work for everybody, or even for a future version of me with different
> hardware. But, it passes my internal criteria for usefulness:
> 1. It provides a measurable gain in the experiments that I have at hand
> 2. The code is easy to review
> 3. The change does not negatively impact non-XDP users
>
> I would love to have a solution for all mlx4 driver users, but this
> patch set is focused on a different goal. So, without munging a
> different set of changes for the universal use case, and probably
> violating criteria #2 or #3, I went with what you see.
>
> In hopes of not derailing the whole patch series, what is an actionable
> next step for this patch #12?
> Ideas:
> Pick a safer N? (I saw improvements with N=1 as well)
> Drop this patch?
>
As Alexei mentioned, the right prefetch model may be dependent on
workload. For instance, the XDP program for an ILA router is a far
shorter code path than packets going through TCP, so it makes sense
that we would want different prefetch characteristics to optimize for
each case. Can we make this a configurable knob for each RX queue to
allow that?
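
Something along these lines is what I have in mind. This is only a
sketch; the prefetch_lookahead field and its plumbing are hypothetical,
not existing mlx4 code, but the ring fields it reads (buf, size_mask,
log_stride) are the ones the RX loop already uses:

struct mlx4_en_rx_ring {
	/* ... existing members ... */
	u8 prefetch_lookahead;	/* 0 = off, N = prefetch descriptor index + N */
};

static inline void mlx4_en_rx_prefetch(struct mlx4_en_rx_ring *ring, int index)
{
	int ahead = ring->prefetch_lookahead;

	if (ahead)
		prefetch(ring->buf +
			 (((index + ahead) & ring->size_mask) << ring->log_stride));
}

How the knob is exposed (ethtool, sysfs, module param) is a separate
question; the point is just that the RX loop reads a per-ring value
instead of a compile-time constant.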

> One thing I definitely don't want to do is go into the weeds trying to
> get a universal prefetch logic in order to merge the XDP framework, even
> though I agree the net result would benefit everybody.

Agreed, a salient point of XDP is that it's _not_ a generic mechanism
meant for all applications. We don't want to sacrifice performance for
generality.

Tom

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-08  2:15 ` [PATCH v6 01/12] bpf: add XDP prog type for early driver filter Brenden Blanco
  2016-07-09  8:14   ` Jesper Dangaard Brouer
@ 2016-07-10 20:56   ` Tom Herbert
  2016-07-11 16:51     ` Brenden Blanco
  2016-07-10 21:04   ` Tom Herbert
  2 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-10 20:56 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
>
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. The
> driver can choose to implement unknown return codes however it wants,
> but must invoke the warning helper with the action value.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/linux/filter.h   | 18 ++++++++++
>  include/uapi/linux/bpf.h | 19 ++++++++++
>  kernel/bpf/verifier.c    |  1 +
>  net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 129 insertions(+)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 6fc31ef..522dbc9 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -368,6 +368,11 @@ struct bpf_skb_data_end {
>         void *data_end;
>  };
>
> +struct xdp_buff {
> +       void *data;
> +       void *data_end;
> +};
> +
>  /* compute the linear packet data range [data, data_end) which
>   * will be accessed by cls_bpf and act_bpf programs
>   */
> @@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>         return BPF_PROG_RUN(prog, skb);
>  }
>
> +static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
> +                                  struct xdp_buff *xdp)
> +{
> +       u32 ret;
> +
> +       rcu_read_lock();
> +       ret = BPF_PROG_RUN(prog, (void *)xdp);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
>  static inline unsigned int bpf_prog_size(unsigned int proglen)
>  {
>         return max(sizeof(struct bpf_prog),
> @@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
>
>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>                                        const struct bpf_insn *patch, u32 len);
> +void bpf_warn_invalid_xdp_action(int act);
>
>  #ifdef CONFIG_BPF_JIT
>  extern int bpf_jit_enable;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c14ca1c..5b47ac3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
>         BPF_PROG_TYPE_TRACEPOINT,
> +       BPF_PROG_TYPE_XDP,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -430,4 +431,22 @@ struct bpf_tunnel_key {
>         __u32 tunnel_label;
>  };
>
> +/* User return codes for XDP prog type.
> + * A valid XDP program must return one of these defined values. All other
> + * return codes are reserved for future use. Unknown return codes will result
> + * in driver-dependent behavior.
> + */
> +enum xdp_action {
> +       XDP_DROP,
> +       XDP_PASS,
> +};
> +
> +/* user accessible metadata for XDP packet hook
> + * new fields must be added to the end of this structure
> + */
> +struct xdp_md {
> +       __u32 data;
> +       __u32 data_end;
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e206c21..a8d67d0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
>         switch (env->prog->type) {
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_XDP:
>                 break;
>         default:
>                 verbose("verifier is misconfigured\n");
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 10c4a2f..4ba446f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +xdp_func_proto(enum bpf_func_id func_id)
> +{
> +       return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         if (off < 0 || off >= sizeof(struct __sk_buff))

Can off be size_t to eliminate <0 check?

> @@ -2436,6 +2442,56 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>         return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_xdp_access(int off, int size,
> +                                 enum bpf_access_type type)
> +{
> +       if (off < 0 || off >= sizeof(struct xdp_md))
> +               return false;
> +       if (off % size != 0)

off & 3 != 0

> +               return false;
> +       if (size != 4)
> +               return false;

If size must always be 4 why is it even an argument?

> +
> +       return true;
> +}
> +
> +static bool xdp_is_valid_access(int off, int size,
> +                               enum bpf_access_type type,
> +                               enum bpf_reg_type *reg_hint)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct xdp_md, data):
> +               *reg_hint = PTR_TO_PACKET;
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *reg_hint = PTR_TO_PACKET_END;
> +               break;
case sizeof(int) for below?

> +       }
> +
> +       return __is_valid_xdp_access(off, size, type);
> +}
> +
> +void bpf_warn_invalid_xdp_action(int act)
> +{
> +       WARN_ONCE(1, "\n"
> +                    "*****************************************************\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "**                                               **\n"
> +                    "** XDP program returned unknown value %-10u **\n"
> +                    "**                                               **\n"
> +                    "** XDP programs must return a well-known return  **\n"
> +                    "** value. Invalid return values will result in   **\n"
> +                    "** undefined packet actions.                     **\n"
> +                    "**                                               **\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "*****************************************************\n",
> +                 act);

Seems a little verbose to me. Just do a simple WARN_ONCE; it's probably
more important to bump a counter. Also, the function should take the skb in
question as an argument in case we want to do more inspection or
logging in the future.
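
Something along these lines would be enough IMO (sketch only, not
tested):

void bpf_warn_invalid_xdp_action(int act)
{
	WARN_ONCE(1, "XDP program returned unknown/reserved action %u\n", act);
}

The counter bump itself is probably better done at the call site, where
the driver has the ring/stats context anyway.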

> +}
> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>                                       int src_reg, int ctx_off,
>                                       struct bpf_insn *insn_buf,
> @@ -2587,6 +2643,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>         return insn - insn_buf;
>  }
>
> +static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> +                                 int src_reg, int ctx_off,
> +                                 struct bpf_insn *insn_buf,
> +                                 struct bpf_prog *prog)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (ctx_off) {
> +       case offsetof(struct xdp_md, data):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data));
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data_end));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
>         .get_func_proto         = sk_filter_func_proto,
>         .is_valid_access        = sk_filter_is_valid_access,
> @@ -2599,6 +2678,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
>         .convert_ctx_access     = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops xdp_ops = {
> +       .get_func_proto         = xdp_func_proto,
> +       .is_valid_access        = xdp_is_valid_access,
> +       .convert_ctx_access     = xdp_convert_ctx_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops    = &sk_filter_ops,
>         .type   = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2614,11 +2699,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
>         .type   = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list xdp_type __read_mostly = {
> +       .ops    = &xdp_ops,
> +       .type   = BPF_PROG_TYPE_XDP,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
> +       bpf_register_prog_type(&xdp_type);
>
>         return 0;
>  }
> --
> 2.8.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 02/12] net: add ndo to set xdp prog in adapter rx
  2016-07-08  2:15 ` [PATCH v6 02/12] net: add ndo to set xdp prog in adapter rx Brenden Blanco
@ 2016-07-10 20:59   ` Tom Herbert
  2016-07-11 10:35     ` Daniel Borkmann
  0 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-10 20:59 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add two new set/check netdev ops for drivers implementing the
> BPF_PROG_TYPE_XDP filter.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/linux/netdevice.h | 14 ++++++++++++++
>  net/core/dev.c            | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 44 insertions(+)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 49736a3..36ae955 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -63,6 +63,7 @@ struct wpan_dev;
>  struct mpls_dev;
>  /* UDP Tunnel offloads */
>  struct udp_tunnel_info;
> +struct bpf_prog;
>
>  void netdev_set_default_ethtool_ops(struct net_device *dev,
>                                     const struct ethtool_ops *ops);
> @@ -1087,6 +1088,15 @@ struct tc_to_netdev {
>   *     appropriate rx headroom value allows avoiding skb head copy on
>   *     forward. Setting a negative value resets the rx headroom to the
>   *     default value.
> + * int (*ndo_xdp_set)(struct net_device *dev, struct bpf_prog *prog);
> + *     This function is used to set or clear a bpf program used in the
> + *     earliest stages of packet rx. The prog will have been loaded as
> + *     BPF_PROG_TYPE_XDP. The callee is responsible for calling bpf_prog_put
> + *     on any old progs that are stored, but not on the passed in prog.
> + * bool (*ndo_xdp_attached)(struct net_device *dev);
> + *     This function is used to check if a bpf program is set on the device.
> + *     The callee should return true if a program is currently attached and
> + *     running.
>   *
>   */
>  struct net_device_ops {
> @@ -1271,6 +1281,9 @@ struct net_device_ops {
>                                                        struct sk_buff *skb);
>         void                    (*ndo_set_rx_headroom)(struct net_device *dev,
>                                                        int needed_headroom);
> +       int                     (*ndo_xdp_set)(struct net_device *dev,
> +                                              struct bpf_prog *prog);
> +       bool                    (*ndo_xdp_attached)(struct net_device *dev);

It might be nice if everything could be accomplished with one ndo
function (just too many ndos flying around). Also, we may want to
consider the future, like maybe we have an XDP function in the output
path, or multiple programs pipelined together somehow.

>  };
>
>  /**
> @@ -3257,6 +3270,7 @@ int dev_get_phys_port_id(struct net_device *dev,
>  int dev_get_phys_port_name(struct net_device *dev,
>                            char *name, size_t len);
>  int dev_change_proto_down(struct net_device *dev, bool proto_down);
> +int dev_change_xdp_fd(struct net_device *dev, int fd);
>  struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
>  struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
>                                     struct netdev_queue *txq, int *ret);
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b92d63b..154b057 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -94,6 +94,7 @@
>  #include <linux/ethtool.h>
>  #include <linux/notifier.h>
>  #include <linux/skbuff.h>
> +#include <linux/bpf.h>
>  #include <net/net_namespace.h>
>  #include <net/sock.h>
>  #include <net/busy_poll.h>
> @@ -6615,6 +6616,35 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
>  EXPORT_SYMBOL(dev_change_proto_down);
>
>  /**
> + *     dev_change_xdp_fd - set or clear a bpf program for a device rx path
> + *     @dev: device
> + *     @fd: new program fd or negative value to clear
> + *
> + *     Set or clear a bpf program for a device
> + */
> +int dev_change_xdp_fd(struct net_device *dev, int fd)
> +{
> +       const struct net_device_ops *ops = dev->netdev_ops;
> +       struct bpf_prog *prog = NULL;
> +       int err;
> +
> +       if (!ops->ndo_xdp_set)
> +               return -EOPNOTSUPP;
> +       if (fd >= 0) {
> +               prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
> +               if (IS_ERR(prog))
> +                       return PTR_ERR(prog);
> +       }
> +
> +       err = ops->ndo_xdp_set(dev, prog);
> +       if (err < 0 && prog)
> +               bpf_prog_put(prog);
> +
> +       return err;
> +}
> +EXPORT_SYMBOL(dev_change_xdp_fd);
> +
> +/**
>   *     dev_new_index   -       allocate an ifindex
>   *     @net: the applicable net namespace
>   *
> --
> 2.8.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-08  2:15 ` [PATCH v6 01/12] bpf: add XDP prog type for early driver filter Brenden Blanco
  2016-07-09  8:14   ` Jesper Dangaard Brouer
  2016-07-10 20:56   ` Tom Herbert
@ 2016-07-10 21:04   ` Tom Herbert
  2016-07-11 13:53     ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-10 21:04 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
>
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. The
> driver can choose to implement unknown return codes however it wants,
> but must invoke the warning helper with the action value.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/linux/filter.h   | 18 ++++++++++
>  include/uapi/linux/bpf.h | 19 ++++++++++
>  kernel/bpf/verifier.c    |  1 +
>  net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 129 insertions(+)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 6fc31ef..522dbc9 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -368,6 +368,11 @@ struct bpf_skb_data_end {
>         void *data_end;
>  };
>
> +struct xdp_buff {
> +       void *data;
> +       void *data_end;
> +};
> +
>  /* compute the linear packet data range [data, data_end) which
>   * will be accessed by cls_bpf and act_bpf programs
>   */
> @@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
>         return BPF_PROG_RUN(prog, skb);
>  }
>
> +static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
> +                                  struct xdp_buff *xdp)
> +{
> +       u32 ret;
> +
> +       rcu_read_lock();
> +       ret = BPF_PROG_RUN(prog, (void *)xdp);
> +       rcu_read_unlock();
> +
> +       return ret;
> +}
> +
>  static inline unsigned int bpf_prog_size(unsigned int proglen)
>  {
>         return max(sizeof(struct bpf_prog),
> @@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
>
>  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
>                                        const struct bpf_insn *patch, u32 len);
> +void bpf_warn_invalid_xdp_action(int act);
>
>  #ifdef CONFIG_BPF_JIT
>  extern int bpf_jit_enable;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c14ca1c..5b47ac3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -94,6 +94,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
>         BPF_PROG_TYPE_TRACEPOINT,
> +       BPF_PROG_TYPE_XDP,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -430,4 +431,22 @@ struct bpf_tunnel_key {
>         __u32 tunnel_label;
>  };
>
> +/* User return codes for XDP prog type.
> + * A valid XDP program must return one of these defined values. All other
> + * return codes are reserved for future use. Unknown return codes will result
> + * in driver-dependent behavior.
> + */
> +enum xdp_action {
> +       XDP_DROP,
> +       XDP_PASS,

I think that we should be able to distinguish an abort in the BPF
program from a normal programmatic drop, e.g.:

enum xdp_action {
      XDP_ABORTED = 0,
      XDP_DROP,
      XDP_PASS,
};

> +};
> +
> +/* user accessible metadata for XDP packet hook
> + * new fields must be added to the end of this structure
> + */
> +struct xdp_md {
> +       __u32 data;
> +       __u32 data_end;
> +};
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e206c21..a8d67d0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
>         switch (env->prog->type) {
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_XDP:
>                 break;
>         default:
>                 verbose("verifier is misconfigured\n");
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 10c4a2f..4ba446f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +xdp_func_proto(enum bpf_func_id func_id)
> +{
> +       return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         if (off < 0 || off >= sizeof(struct __sk_buff))
> @@ -2436,6 +2442,56 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>         return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_xdp_access(int off, int size,
> +                                 enum bpf_access_type type)
> +{
> +       if (off < 0 || off >= sizeof(struct xdp_md))
> +               return false;
> +       if (off % size != 0)
> +               return false;
> +       if (size != 4)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool xdp_is_valid_access(int off, int size,
> +                               enum bpf_access_type type,
> +                               enum bpf_reg_type *reg_hint)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct xdp_md, data):
> +               *reg_hint = PTR_TO_PACKET;
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *reg_hint = PTR_TO_PACKET_END;
> +               break;
> +       }
> +
> +       return __is_valid_xdp_access(off, size, type);
> +}
> +
> +void bpf_warn_invalid_xdp_action(int act)
> +{
> +       WARN_ONCE(1, "\n"
> +                    "*****************************************************\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "**                                               **\n"
> +                    "** XDP program returned unknown value %-10u **\n"
> +                    "**                                               **\n"
> +                    "** XDP programs must return a well-known return  **\n"
> +                    "** value. Invalid return values will result in   **\n"
> +                    "** undefined packet actions.                     **\n"
> +                    "**                                               **\n"
> +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +                    "*****************************************************\n",
> +                 act);
> +}
> +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>                                       int src_reg, int ctx_off,
>                                       struct bpf_insn *insn_buf,
> @@ -2587,6 +2643,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>         return insn - insn_buf;
>  }
>
> +static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> +                                 int src_reg, int ctx_off,
> +                                 struct bpf_insn *insn_buf,
> +                                 struct bpf_prog *prog)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (ctx_off) {
> +       case offsetof(struct xdp_md, data):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data));
> +               break;
> +       case offsetof(struct xdp_md, data_end):
> +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
> +                                     dst_reg, src_reg,
> +                                     offsetof(struct xdp_buff, data_end));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
>         .get_func_proto         = sk_filter_func_proto,
>         .is_valid_access        = sk_filter_is_valid_access,
> @@ -2599,6 +2678,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
>         .convert_ctx_access     = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops xdp_ops = {
> +       .get_func_proto         = xdp_func_proto,
> +       .is_valid_access        = xdp_is_valid_access,
> +       .convert_ctx_access     = xdp_convert_ctx_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops    = &sk_filter_ops,
>         .type   = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2614,11 +2699,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
>         .type   = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list xdp_type __read_mostly = {
> +       .ops    = &xdp_ops,
> +       .type   = BPF_PROG_TYPE_XDP,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
> +       bpf_register_prog_type(&xdp_type);
>
>         return 0;
>  }
> --
> 2.8.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-10 20:30           ` Tom Herbert
@ 2016-07-11 10:15             ` Daniel Borkmann
  2016-07-11 12:58               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Daniel Borkmann @ 2016-07-11 10:15 UTC (permalink / raw)
  To: Tom Herbert, Brenden Blanco
  Cc: Jesper Dangaard Brouer, David S. Miller,
	Linux Kernel Network Developers, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf

On 07/10/2016 10:30 PM, Tom Herbert wrote:
> On Sun, Jul 10, 2016 at 12:09 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
[...]
>> I would then argue to only support offloading of XDP programs with
>> verifiable return codes. We're not at that stage yet, and I think we can
>> choose different defaults for these two cases.

It's also not really verifiable in the sense that such a verdict could be
part of a struct member coming from a policy map and such. You'd lose
this flexibility if you only allowed return codes encoded into immediate
values.
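
For example, in restricted C this could look roughly like the sketch
below (map layout and names are invented for illustration, helpers as
in samples/bpf); the verifier cannot know at load time which actions
user space will ever put into the map:

struct bpf_map_def SEC("maps") policy_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u32),	/* the verdict itself */
	.max_entries = 1,
};

SEC("xdp")
int xdp_policy(struct xdp_md *ctx)
{
	u32 key = 0;
	u32 *action = bpf_map_lookup_elem(&policy_map, &key);

	/* verdict comes from map data, not an immediate value */
	return action ? *action : XDP_PASS;
}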

>> We have conflicting examples here, which lead to different conclusions.
>> Reiterating an earlier argument that I made for others on the list to
>> consider:
>> """
>> Besides, I don't see how PASS is any more correct than DROP. Consider a
>> future program that is intended to rewrite a packet and forward it out
>> another port (with some TX_OTHER return code or whatever). If the driver
>> PASSes the packet, it will still not be interpreted by the stack, since
>> it may have been destined for some other machine.
>> """
>> So, IMHO there is not a clear right or wrong, and I still fall back to
>> the security argument to resolve the dilemma. The point there is not
>> drop/pass, but resource preservation.
>>
> Blind pass is a security risk, drop is always a correct action in that sense.

I agree here that drop would be better. If there's a good reason/use-case
to make the default configurable, as in i) drop or ii) fall back to stack,
then this could be another option to leave the admin the choice, but I'm
not seeing it thus far. But hitting the default case could certainly inc a
per-cpu error counter visible for ethtool et al, to have some more insight.
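
In the mlx4 hunk quoted earlier that could look roughly like the
following (sketch; the rx_xdp_unknown field is made up and would still
need to be wired up to the ethtool stats, per-ring so the NAPI context
can bump it without atomics):

	switch (act) {
	case XDP_PASS:
		break;
	default:
		bpf_warn_invalid_xdp_action(act);
		ring->rx_xdp_unknown++;	/* hypothetical per-ring counter */
		/* fall through */
	case XDP_DROP:
		goto next;
	}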

Additionally, a WARN_ON_ONCE() should be fine for telling that the program
for this given configuration is buggy. I'm not sure there will be much
support for taking an XDP program tailored for a specific kernel and
expecting it to run on a, say, 1 year old kernel with XDP there. To make it
work properly you need so much insight into the program anyway, in order to
configure the stack to make up for the non-functioning parts (iff possible),
that you could just as well rewrite/change the affected parts of the XDP
program.

Otoh, it should be reasonable to assume that older XDP programs written in
the past for driver xyz can run fine on newer kernels for driver xyz as well,
so that part should be expected.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 02/12] net: add ndo to set xdp prog in adapter rx
  2016-07-10 20:59   ` Tom Herbert
@ 2016-07-11 10:35     ` Daniel Borkmann
  0 siblings, 0 replies; 59+ messages in thread
From: Daniel Borkmann @ 2016-07-11 10:35 UTC (permalink / raw)
  To: Tom Herbert, Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf

On 07/10/2016 10:59 PM, Tom Herbert wrote:
> On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
>> Add two new set/check netdev ops for drivers implementing the
>> BPF_PROG_TYPE_XDP filter.
>>
>> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
>> ---
>>   include/linux/netdevice.h | 14 ++++++++++++++
>>   net/core/dev.c            | 30 ++++++++++++++++++++++++++++++
>>   2 files changed, 44 insertions(+)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 49736a3..36ae955 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -63,6 +63,7 @@ struct wpan_dev;
>>   struct mpls_dev;
>>   /* UDP Tunnel offloads */
>>   struct udp_tunnel_info;
>> +struct bpf_prog;
>>
>>   void netdev_set_default_ethtool_ops(struct net_device *dev,
>>                                      const struct ethtool_ops *ops);
>> @@ -1087,6 +1088,15 @@ struct tc_to_netdev {
>>    *     appropriate rx headroom value allows avoiding skb head copy on
>>    *     forward. Setting a negative value resets the rx headroom to the
>>    *     default value.
>> + * int (*ndo_xdp_set)(struct net_device *dev, struct bpf_prog *prog);
>> + *     This function is used to set or clear a bpf program used in the
>> + *     earliest stages of packet rx. The prog will have been loaded as
>> + *     BPF_PROG_TYPE_XDP. The callee is responsible for calling bpf_prog_put
>> + *     on any old progs that are stored, but not on the passed in prog.
>> + * bool (*ndo_xdp_attached)(struct net_device *dev);
>> + *     This function is used to check if a bpf program is set on the device.
>> + *     The callee should return true if a program is currently attached and
>> + *     running.
>>    *
>>    */
>>   struct net_device_ops {
>> @@ -1271,6 +1281,9 @@ struct net_device_ops {
>>                                                         struct sk_buff *skb);
>>          void                    (*ndo_set_rx_headroom)(struct net_device *dev,
>>                                                         int needed_headroom);
>> +       int                     (*ndo_xdp_set)(struct net_device *dev,
>> +                                              struct bpf_prog *prog);
>> +       bool                    (*ndo_xdp_attached)(struct net_device *dev);
>
> It might be nice if everything could be accomplished with one ndo
> function (just too many ndos flying around). Also, we may want to
> consider the future, like maybe we have an XDP function in the output
> path, or multiple programs pipelined together somehow.

You could probably have it roughly similar to ndo_setup_tc, where you pass
commands down to the driver, if it should just be one central ndo; the good
thing is that this is not set in stone anyway.
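
Roughly this kind of shape, i.e. a single ndo taking a command struct
(sketch only, the command enum and struct names are invented here):

enum netdev_xdp_command {
	XDP_SETUP_PROG,
	XDP_QUERY_PROG,
};

struct netdev_xdp {
	enum netdev_xdp_command command;
	union {
		struct bpf_prog *prog;	/* XDP_SETUP_PROG */
		bool prog_attached;	/* XDP_QUERY_PROG */
	};
};

/* would replace ndo_xdp_set / ndo_xdp_attached */
int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);

New commands (say, a TX-side hook later on) could then be added without
growing the ndo table again.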

For pipelining, you'd most likely use tail calls, so you just have the root
program passed here, which is fine already as-is, since the rest of it is
handled by bpf(2).
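
I.e. the root program only dispatches into a prog array and the stages
tail-call each other, along these lines (sketch in samples/bpf style;
the map name and slot numbering are invented):

struct bpf_map_def SEC("maps") xdp_stages = {
	.type = BPF_MAP_TYPE_PROG_ARRAY,
	.key_size = sizeof(u32),
	.value_size = sizeof(u32),
	.max_entries = 8,
};

SEC("xdp")
int xdp_root(struct xdp_md *ctx)
{
	/* jump to stage 0; if the slot is empty the tail call is a no-op
	 * and we fall through to the default verdict below
	 */
	bpf_tail_call(ctx, &xdp_stages, 0);
	return XDP_PASS;
}

User space then installs or swaps the individual stages with bpf(2) map
updates on the prog array, without touching the ndo at all.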

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-08 16:02       ` [net-next PATCH RFC] mlx4: RX prefetch loop Jesper Dangaard Brouer
@ 2016-07-11 11:09         ` Jesper Dangaard Brouer
  2016-07-11 16:00           ` Brenden Blanco
  2016-07-11 23:05           ` Alexei Starovoitov
  0 siblings, 2 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-11 11:09 UTC (permalink / raw)
  To: netdev
  Cc: kafai, daniel, tom, bblanco, john.fastabend, gerlitz.or, hannes,
	rana.shahot, tgraf, David S. Miller, as754m, brouer, saeedm,
	amira, tzahio, Eric Dumazet

On Fri, 08 Jul 2016 18:02:20 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> This patch is about prefetching without being opportunistic.
> The idea is only to start prefetching on packets that are marked as
> ready/completed in the RX ring.
> 
> This is achieved by splitting the napi_poll call mlx4_en_process_rx_cq()
> loop into two.  The first loop extract completed CQEs and start
> prefetching on data and RX descriptors. The second loop process the
> real packets.
> 
> Details: The batching of CQEs is limited to 8 in order to avoid
> stressing the LFB (Line Fill Buffer) and cache usage.
> 
> I've left some opportunities for prefetching CQE descriptors.
> 
> 
> The performance improvements on my platform are huge, as I tested this
> on a CPU without DDIO.  The performance for XDP is the same as with
> Brenden's prefetch hack.

This patch is based on top of Brenden's patch 11/12, and is meant to
replace patch 12/12.

Prefetching is very important for XDP, especially when using a CPU
without DDIO (here i7-4790K CPU @ 4.00GHz).

Program xdp1: touching-data and dropping packets:
 * 11,363,925 pkt/s == no-prefetch
 * 21,031,096 pkt/s == brenden's-prefetch
 * 21,062,728 pkt/s == this-prefetch-patch

Program xdp2: write-data (swap src_dst_mac) TX-bounce out same interface:
 *  6,726,482 pkt/s == no-prefetch
 * 10,378,163 pkt/s == brenden's-prefetch
 * 10,622,350 pkt/s == this-prefetch-patch

This patch also benefits the normal network stack (the XDP specific
prefetch patch does not).

Dropping packets in iptables -t raw:
 * 4,432,519 pps drop == no-prefetch
 * 5,919,690 pps drop == this-prefetch-patch

Dropping packets in iptables -t filter:
 * 2,768,053 pps drop == no-prefetch
 * 4,038,247 pps drop == this-prefetch-patch


To please Eric, I also ran many different variations of netperf and
didn't see any regressions, only small improvements.  The variation
between runs for netperf is too high to be statistically significant.

The worst-case test for this patchset should be netperf TCP_RR, as it
should only have a single packet in the queue.  When running 32 parallel
TCP_RR (the netserver sink has 8 cores), I actually saw a small 2%
improvement (again with high variation, as we also test CPU sched).

I investigated the TCP_RR case, as the patch is constructed not to affect
the case of a single packet in the RX queue.  Using my recent
tracepoint change, we can see that with 32 parallel TCP_RR we do have
situations where napi_poll had several packets in the RX ring:

 # perf record -a -e napi:napi_poll sleep 3
 # perf script | awk '{print $5,$14,$15,$16,$17,$18}' | sort -k3n | uniq -c

 521655 napi:napi_poll: mlx4p1 work 0 budget 64
1477872 napi:napi_poll: mlx4p1 work 1 budget 64
 189081 napi:napi_poll: mlx4p1 work 2 budget 64
  12552 napi:napi_poll: mlx4p1 work 3 budget 64
    464 napi:napi_poll: mlx4p1 work 4 budget 64
     16 napi:napi_poll: mlx4p1 work 5 budget 64
      4 napi:napi_poll: mlx4p1 work 6 budget 64

I do find the "work 0" case a little strange... what causes that?


 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |   70 +++++++++++++++++++++++++---
>  1 file changed, 62 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 41c76fe00a7f..c5efe03e31ce 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -782,7 +782,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	int doorbell_pending;
>  	struct sk_buff *skb;
>  	int tx_index;
> -	int index;
> +	int index, saved_index, i;
>  	int nr;
>  	unsigned int length;
>  	int polled = 0;
> @@ -790,6 +790,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	int factor = priv->cqe_factor;
>  	u64 timestamp;
>  	bool l2_tunnel;
> +#define PREFETCH_BATCH 8
> +	struct mlx4_cqe *cqe_array[PREFETCH_BATCH];
> +	int cqe_idx;
> +	bool cqe_more;
>  
>  	if (!priv->port_up)
>  		return 0;
> @@ -801,24 +805,75 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	doorbell_pending = 0;
>  	tx_index = (priv->tx_ring_num - priv->rsv_tx_rings) + cq->ring;
>  
> +next_prefetch_batch:
> +	cqe_idx = 0;
> +	cqe_more = false;
> +
>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
>  	index = cq->mcq.cons_index & ring->size_mask;
> +	saved_index = index;
>  	cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
>  
> -	/* Process all completed CQEs */
> +	/* Extract and prefetch completed CQEs */
>  	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
>  		    cq->mcq.cons_index & cq->size)) {
> +		void *data;
>  
>  		frags = ring->rx_info + (index << priv->log_rx_info);
>  		rx_desc = ring->buf + (index << ring->log_stride);
> +		prefetch(rx_desc);
>  
>  		/*
>  		 * make sure we read the CQE after we read the ownership bit
>  		 */
>  		dma_rmb();
>  
> +		cqe_array[cqe_idx++] = cqe;
> +
> +		/* Base error handling here, free handled in next loop */
> +		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> +			     MLX4_CQE_OPCODE_ERROR))
> +			goto skip;
> +
> +		data = page_address(frags[0].page) + frags[0].page_offset;
> +		prefetch(data);
> +	skip:
> +		++cq->mcq.cons_index;
> +		index = (cq->mcq.cons_index) & ring->size_mask;
> +		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> +		/* likely too slow prefetching CQE here ... do look-a-head ? */
> +		//prefetch(cqe + priv->cqe_size * 3);
> +
> +		if (++polled == budget) {
> +			cqe_more = false;
> +			break;
> +		}
> +		if (cqe_idx == PREFETCH_BATCH) {
> +			cqe_more = true;
> +			// IDEA: Opportunistic prefetch CQEs for next_prefetch_batch?
> +			//for (i = 0; i < PREFETCH_BATCH; i++) {
> +			//	prefetch(cqe + priv->cqe_size * i);
> +			//}
> +			break;
> +		}
> +	}
> +	/* Hint: The cqe_idx will be number of packets, it can be used
> +	 * for bulk allocating SKBs
> +	 */
> +
> +	/* Now, index function as index for rx_desc */
> +	index = saved_index;
> +
> +	/* Process completed CQEs in cqe_array */
> +	for (i = 0; i < cqe_idx; i++) {
> +
> +		cqe = cqe_array[i];
> +
> +		frags = ring->rx_info + (index << priv->log_rx_info);
> +		rx_desc = ring->buf + (index << ring->log_stride);
> +
>  		/* Drop packet on bad receive or bad checksum */
>  		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
>  						MLX4_CQE_OPCODE_ERROR)) {
> @@ -1065,14 +1120,13 @@ next:
>  			mlx4_en_free_frag(priv, frags, nr);
>  
>  consumed:
> -		++cq->mcq.cons_index;
> -		index = (cq->mcq.cons_index) & ring->size_mask;
> -		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
> -		if (++polled == budget)
> -			goto out;
> +		++index;
> +		index = index & ring->size_mask;
>  	}
> +	/* Check for more completed CQEs */
> +	if (cqe_more)
> +		goto next_prefetch_batch;
>  
> -out:
>  	if (doorbell_pending)
>  		mlx4_en_xmit_doorbell(priv->tx_ring[tx_index]);
>  
> 

p.s. for achieving the 21Mpps drop rate, mlx4_core needs param tuning:
 /etc/modprobe.d/mlx4.conf
 options mlx4_core log_num_mgm_entry_size=-2

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 05/12] Add sample for adding simple drop program to link
  2016-07-08  2:15 ` [PATCH v6 05/12] Add sample for adding simple drop program to link Brenden Blanco
  2016-07-09 20:21   ` Saeed Mahameed
@ 2016-07-11 11:09   ` Jamal Hadi Salim
  2016-07-11 13:37     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Jamal Hadi Salim @ 2016-07-11 11:09 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On 16-07-07 10:15 PM, Brenden Blanco wrote:
> Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
> hook of a link. With the drop-only program, observed single core rate is
> ~20Mpps.
>
> Other tests were run, for instance without the dropcnt increment or
> without reading from the packet header, the packet rate was mostly
> unchanged.
>
> $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
> proto 17:   20403027 drops/s
>


So - devil's advocate speaking:
I can filter and drop 10x as fast in hardware with this very specific
NIC, correct?
Would a different NIC (pick something like e1000) have served as a
better example?
BTW: Brenden, now that I looked closer here, you really don't have an
apples-to-apples comparison with dropping at tc ingress. You have a
tweaked prefetch and are intentionally running things on a single
core. Note: we are able to do 20Mpps drops with tc on a single
core (as shown at netdev11) on a NUC when removing driver overhead.

cheers,
jamal

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-10 20:27         ` Tom Herbert
@ 2016-07-11 11:36           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-11 11:36 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann, brouer

On Sun, 10 Jul 2016 15:27:38 -0500
Tom Herbert <tom@herbertland.com> wrote:

> On Sun, Jul 10, 2016 at 8:37 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Sat, 9 Jul 2016 08:47:52 -0500
> > Tom Herbert <tom@herbertland.com> wrote:
> >  
> >> On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
> >> <brouer@redhat.com> wrote:  
> >> > On Thu,  7 Jul 2016 19:15:13 -0700
> >> > Brenden Blanco <bblanco@plumgrid.com> wrote:
> >> >  
> >> >> Add a new bpf prog type that is intended to run in early stages of the
> >> >> packet rx path. Only minimal packet metadata will be available, hence a
> >> >> new context type, struct xdp_md, is exposed to userspace. So far only
> >> >> expose the packet start and end pointers, and only in read mode.
> >> >>
> >> >> An XDP program must return one of the well known enum values, all other
> >> >> return codes are reserved for future use. Unfortunately, this
> >> >> restriction is hard to enforce at verification time, so take the
> >> >> approach of warning at runtime when such programs are encountered. The
> >> >> driver can choose to implement unknown return codes however it wants,
> >> >> but must invoke the warning helper with the action value.  
> >> >
> >> > I believe we should define a stronger semantics for unknown/future
> >> > return codes than the one stated above:
> >> >  "driver can choose to implement unknown return codes however it wants"
> >> >
> >> > The mlx4 driver implementation in:
> >> >  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
> >> >
> >> > On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> >> >  
> >> >> +             /* A bpf program gets first chance to drop the packet. It may
> >> >> +              * read bytes but not past the end of the frag.
> >> >> +              */
> >> >> +             if (prog) {
> >> >> +                     struct xdp_buff xdp;
> >> >> +                     dma_addr_t dma;
> >> >> +                     u32 act;
> >> >> +
> >> >> +                     dma = be64_to_cpu(rx_desc->data[0].addr);
> >> >> +                     dma_sync_single_for_cpu(priv->ddev, dma,
> >> >> +                                             priv->frag_info[0].frag_size,
> >> >> +                                             DMA_FROM_DEVICE);
> >> >> +
> >> >> +                     xdp.data = page_address(frags[0].page) +
> >> >> +                                                     frags[0].page_offset;
> >> >> +                     xdp.data_end = xdp.data + length;
> >> >> +
> >> >> +                     act = bpf_prog_run_xdp(prog, &xdp);
> >> >> +                     switch (act) {
> >> >> +                     case XDP_PASS:
> >> >> +                             break;
> >> >> +                     default:
> >> >> +                             bpf_warn_invalid_xdp_action(act);
> >> >> +                     case XDP_DROP:
> >> >> +                             goto next;
> >> >> +                     }
> >> >> +             }  
> >> >
> >> > Thus, mlx4 choice is to drop packets for unknown/future return codes.
> >> >
> >> > I think this is the wrong choice.  I think the choice should be
> >> > XDP_PASS, to pass the packet up the stack.
> >> >
> >> > I find "XDP_DROP" problematic because it happens so early in the driver
> >> > that we lose all possibilities to debug what packets get dropped.  We
> >> > get a single kernel log warning, but we cannot inspect the packets any
> >> > longer.  By defaulting to XDP_PASS all the normal stack tools (e.g.
> >> > tcpdump) are available.
> >> >  
> >>
> >> It's an API issue though not a problem with the packet. Allowing
> >> unknown return codes to pass seems like a major security problem also.  
> >
> > We have the full power and flexibility of the normal Linux stack to
> > drop these packets.  And from a usability perspective it gives insight
> > into what is wrong and counters metrics.  Would you rather blindly drop
> > e.g. 0.01% of the packets in your data-centers without knowing.
> >  
> This is not blindly dropping packets; the bad action should be logged,
> counters incremented, and packet could be passed to the stack as an
> error if deeper inspection is needed. 

Well, the patch only logs a single warning.  There is no method of
counting or passing to the stack in this proposal.  And adding such
things is a performance regression risk, and a DoS vector in itself.

> IMO, I would rather drop
> something not understood than accept it-- determinism is a goal also.
> 
> > We already talk about XDP as an offload mechanism.  Normally when
> > loading a (XDP) "offload" program it should be rejected, e.g. by the
> > validator.  BUT we cannot validate all return eBPF codes, because they
> > can originate from a table lookup.  Thus, we _do_ allow programs to be
> > loaded, with future unknown return code.
> >  This then corresponds to only part of the program can be offloaded,
> > thus the natural response is to fallback, handling this is the
> > non-offloaded slower-path.
> >
> > I see the XDP_PASS fallback as a natural way of supporting loading
> > newer/future programs on older "versions" of XDP.  
> 
> Then in this model we could only add codes that allow passing packets.
> For instance, what if a new return code means "Drop this packet and
> log it as critical because if you receive it the stack will crash"?

Drop is drop. I don't see how we would need to drop in a "new" way.
If you need to log a critical event do it in the eBPF program.
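
A minimal sketch of doing that from inside the program itself, in the
style of samples/bpf/xdp1_kern.c (the map name, section name and the
reason-code key below are illustrative only, not part of this series):

/* Sketch only: count "critical" drops per reason code before dropping. */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") drop_reason = {
	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(long),
	.max_entries = 16,
};

SEC("xdp_drop_acct")
int xdp_drop_acct(struct xdp_md *ctx)
{
	__u32 key = 0;	/* hypothetical "reason" index */
	long *cnt = bpf_map_lookup_elem(&drop_reason, &key);

	if (cnt)
		*cnt += 1;
	return XDP_DROP;
}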

> ;-) IMO ignoring something not understood for the sake of
> extensibility is a red herring. In the long run doing this actually
> limits our ability to extend things for both APIs and protocols (a
> great example of this is VXLAN, which mandates that unknown flags are
> ignored in RX, so VXLAN-GPE has to be a new incompatible protocol to get
> a next protocol field).
> 
> >   E.g. I can have a XDP program that have a valid filter protection
> > mechanism, but also use a newer mechanism, and my server fleet contains
> > different NIC vendors, some NICs only support the filter part.  Then I
> > want to avoid having to compile and maintain different XDP/eBPF
> > programs per NIC vendor. (Instead I prefer having a Linux stack
> > fallback mechanism, and transparently XDP offload as much as the NIC
> > driver supports).
> >  
> As Brenden points out, fallbacks easily become DOS vectors.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-10 16:05       ` Brenden Blanco
@ 2016-07-11 11:48         ` Saeed Mahameed
  2016-07-11 21:49           ` Brenden Blanco
  0 siblings, 1 reply; 59+ messages in thread
From: Saeed Mahameed @ 2016-07-11 11:48 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Tariq Toukan, David S. Miller, Linux Netdev List,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Sun, Jul 10, 2016 at 7:05 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Sun, Jul 10, 2016 at 06:25:40PM +0300, Tariq Toukan wrote:
>>
>> On 09/07/2016 10:58 PM, Saeed Mahameed wrote:
>> >On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
>> >>+               /* A bpf program gets first chance to drop the packet. It may
>> >>+                * read bytes but not past the end of the frag.
>> >>+                */
>> >>+               if (prog) {
>> >>+                       struct xdp_buff xdp;
>> >>+                       dma_addr_t dma;
>> >>+                       u32 act;
>> >>+
>> >>+                       dma = be64_to_cpu(rx_desc->data[0].addr);
>> >>+                       dma_sync_single_for_cpu(priv->ddev, dma,
>> >>+                                               priv->frag_info[0].frag_size,
>> >>+                                               DMA_FROM_DEVICE);
>> >In case of XDP_PASS we will dma_sync again in the normal path, this
>> >can be improved by doing the dma_sync as soon as we can and once and
>> >for all, regardless of the path the packet is going to take
>> >(XDP_DROP/mlx4_en_complete_rx_desc/mlx4_en_rx_skb).
>> I agree with Saeed, dma_sync is a heavy operation that is now done
>> twice for all packets with XDP_PASS.
>> We should try our best to avoid performance degradation in the flow
>> of unfiltered packets.
> Makes sense, do folks here see a way to do this cleanly?

yes, we need something like:

+static inline void
+mlx4_en_sync_dma(struct mlx4_en_priv *priv,
+                struct mlx4_en_rx_desc *rx_desc,
+                int length)
+{
+       dma_addr_t dma;
+       int nr;
+
+       /* Sync dma addresses from HW descriptor */
+       for (nr = 0; nr < priv->num_frags; nr++) {
+               struct mlx4_en_frag_info *frag_info = &priv->frag_info[nr];
+
+               if (length <= frag_info->frag_prefix_size)
+                       break;
+
+               dma = be64_to_cpu(rx_desc->data[nr].addr);
+               dma_sync_single_for_cpu(priv->ddev, dma, frag_info->frag_size,
+                                       DMA_FROM_DEVICE);
+       }
+}


@@ -790,6 +808,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
struct mlx4_en_cq *cq, int bud
                        goto next;
                }

+               length = be32_to_cpu(cqe->byte_cnt);
+               length -= ring->fcs_del;
+
+               mlx4_en_sync_dma(priv,rx_desc, length);
                 /* data is available continue processing the packet */

and make sure to remove all explicit dma_sync_single_for_cpu calls.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-11 10:15             ` Daniel Borkmann
@ 2016-07-11 12:58               ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-11 12:58 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Tom Herbert, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, brouer

Trying to sum up the main points of the discussion.

Two main issues:

1) Allowing an XDP/eBPF program that uses return codes not compatible
   with the driver to be loaded.

2) Dropping by default at this level makes it hard to debug and leaves
   no metrics.

To solve issue #1, I proposed defining fallback semantics.  I guess
people didn't like those semantics.
 The only other solution I see is to NOT allow programs to be loaded
if they want to use return-codes/features not supported by the driver,
i.e. reject such XDP programs.

Given we cannot automatically deduce the used return codes (if the prog
is table driven), we need some kind of versioning or feature codes.
Could this be modeled after NIC "features"?

I guess this could also help HW offload engines, if eBPF programs
register/annotate their needed capabilities upfront?
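
One possible shape for that, sketched here purely as an illustration
(none of these names exist in the patch set), is a driver-advertised
action mask that is checked when the program is attached:

/* Hypothetical capability bits, loosely modelled on netdev_features_t. */
#define XDP_CAP_DROP	(1U << 0)
#define XDP_CAP_PASS	(1U << 1)
#define XDP_CAP_TX	(1U << 2)

/* Reject attach if the program declares actions the driver
 * does not advertise.  Sketch only.
 */
static int xdp_check_caps(u32 prog_needs, u32 drv_caps)
{
	return (prog_needs & ~drv_caps) ? -EOPNOTSUPP : 0;
}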


For issue #2 (default drop): If the solution for issue #1 is to only
load compatible programs, then I agree that unknown return codes
should default to drop.

For debug-ability, it should be easy to extend bpf_warn_invalid_xdp_action()
to log more information for debugging purposes, which for performance/DoS
reasons should be off by default.
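
As a sketch only (the knob and helper below are not in the patch set),
the opt-in debug path could dump a few bytes of the offending frame,
rate-limited, on top of the existing warning:

/* Hypothetical sketch: off by default, rate-limited when enabled. */
static bool xdp_debug __read_mostly;

void bpf_warn_invalid_xdp_action_dbg(int act, const struct xdp_buff *xdp)
{
	bpf_warn_invalid_xdp_action(act);

	if (xdp_debug && printk_ratelimit())
		print_hex_dump(KERN_DEBUG, "xdp: ", DUMP_PREFIX_OFFSET,
			       16, 1, xdp->data,
			       min_t(long, 64, xdp->data_end - xdp->data),
			       false);
}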

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 05/12] Add sample for adding simple drop program to link
  2016-07-11 11:09   ` Jamal Hadi Salim
@ 2016-07-11 13:37     ` Jesper Dangaard Brouer
  2016-07-16 14:55       ` Jamal Hadi Salim
  0 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-11 13:37 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Brenden Blanco, davem, netdev, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, brouer

On Mon, 11 Jul 2016 07:09:26 -0400
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On 16-07-07 10:15 PM, Brenden Blanco wrote:
> > Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
> > hook of a link. With the drop-only program, observed single core rate is
> > ~20Mpps.
> >
> > Other tests were run, for instance without the dropcnt increment or
> > without reading from the packet header, the packet rate was mostly
> > unchanged.
> >
> > $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
> > proto 17:   20403027 drops/s
> >  
> 
> 
> So - devil's advocate speaking:
> I can filter and drop with this very specific NIC at 10x as fast
> in hardware, correct?

After avoiding the cache-miss, I believe we have actually reached the
NIC HW limit.  I base this on my measurements, which show that the CPU
starts to go idle, even entering sleep C-states.  And we exit NAPI mode
without using the full budget, having emptied the RX ring.


> Would a different NIC (pick something like e1000) have served a better
> example?
> BTW: Brenden, now that i looked closer here, you really dont have
> apple-apple comparison with dropping at tc ingress. You have a
> tweaked prefetch and are intentionally running things on a single
> core. Note: We are able to do 20Mpps drops with tc with a single
> core (as shown in netdev11) on a NUC with removing driver overhead.

AFAIK you were using the pktgen "xmit_mode netif_receive" which injects
packets directly into the stack, thus removing the NIC driver from the
equation.  Brenden is only measuring the driver.
  Thus, you are both doing zoom-in measuring (of a very specific and
limited section of the code), but of two completely different pieces of
code.

Notice, Jamal, in your 20Mpps results, you are also avoiding
interacting with the memory allocator, as you are recycling the same
SKB (and don't be confused by seeing kfree_skb() in perf-top as it only
does atomic_dec() [1]).

In this code-zoom-in benchmark (given the single CPU is kept 100% busy) you
are actually measuring that the code path (on average) takes 50 nanosec
(1/20*1000) to execute.  That is cool, but it is only a zoom-in on a
specific code path (which avoids any I-cache misses).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://www.youtube.com/watch?v=M6l1rxZCqLM

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-10 21:04   ` Tom Herbert
@ 2016-07-11 13:53     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-11 13:53 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann, brouer

On Sun, 10 Jul 2016 16:04:16 -0500
Tom Herbert <tom@herbertland.com> wrote:

> > +/* User return codes for XDP prog type.
> > + * A valid XDP program must return one of these defined values. All other
> > + * return codes are reserved for future use. Unknown return codes will result
> > + * in driver-dependent behavior.
> > + */
> > +enum xdp_action {
> > +       XDP_DROP,
> > +       XDP_PASS,  
> 
> I think that we should be able to distinguish an abort in BPF program
> from a normal programmatic drop. e.g.:
> 
> enum xdp_action {
>       XDP_ABORTED = 0,
>       XDP_DROP,
>       XDP_PASS,
> };

I agree.  And maybe we can re-use the bpf_warn_invalid_xdp_action() call
to keep the branch/jump-table as simple as possible, handling the
distinction on the slow path.
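
For illustration (editor's sketch based on the mlx4 hunk quoted earlier
in this thread, not code from the series), the driver switch could then
stay a simple fall-through, with the distinction only mattering in the
warning path:

		switch (act) {
		case XDP_PASS:
			break;
		case XDP_ABORTED:	/* program hit an internal error */
		default:		/* unknown/future return code */
			bpf_warn_invalid_xdp_action(act);
			/* fall through */
		case XDP_DROP:
			goto next;
		}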

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path
  2016-07-10 20:50           ` Tom Herbert
@ 2016-07-11 14:54             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-11 14:54 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, Eric Dumazet, Alexei Starovoitov,
	David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann, brouer

On Sun, 10 Jul 2016 15:50:10 -0500
Tom Herbert <tom@herbertland.com> wrote:

> > This particular patch in the series is meant to be standalone exactly
> > for this reason. I don't pretend to assert that this optimization will
> > work for everybody, or even for a future version of me with different
> > hardware. But, it passes my internal criteria for usefulness:
> > 1. It provides a measurable gain in the experiments that I have at hand
> > 2. The code is easy to review
> > 3. The change does not negatively impact non-XDP users
> >
> > I would love to have a solution for all mlx4 driver users, but this
> > patch set is focused on a different goal. So, without munging a
> > different set of changes for the universal use case, and probably
> > violating criteria #2 or #3, I went with what you see.
> >
> > In hopes of not derailing the whole patch series, what is an actionable
> > next step for this patch #12?
> > Ideas:
> > Pick a safer N? (I saw improvements with N=1 as well)
> > Drop this patch?
> >  
> As Alexei mentioned the right prefetch model may be dependent on
> workload. For instance, the XDP program for an ILA router is a far
> shorter code path than packets going through TCP, so it makes sense that
> we would want different prefetch characteristics to optimize for each
> case. Can we make this a configurable knob for each RX queue to allow
> that?

Please see my RFC patch[1] solution.  I believe it solves the prefetch
problem in a generic way, that both benefit XDP and the normal network
stack. 

[1] http://thread.gmane.org/gmane.linux.network/420677/focus=420787

> > One thing I definitely don't want to do is go into the weeds trying to
> > get a universal prefetch logic in order to merge the XDP framework, even
> > though I agree the net result would benefit everybody.  
> 
> Agreed, a salient point of XDP is that it's _not_ a generic mechanism
> meant for all applications. We don't want to sacrifice performance for
> generality.

I've just documented[2] that my generic solution does not sacrifice
performance.

[2] http://thread.gmane.org/gmane.linux.network/420677/focus=420999

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-11 11:09         ` Jesper Dangaard Brouer
@ 2016-07-11 16:00           ` Brenden Blanco
  2016-07-11 23:05           ` Alexei Starovoitov
  1 sibling, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-11 16:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, kafai, daniel, tom, john.fastabend, gerlitz.or, hannes,
	rana.shahot, tgraf, David S. Miller, as754m, saeedm, amira,
	tzahio, Eric Dumazet

On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote:
[...]
> This patch is based on top of Brenden's patch 11/12, and is mean to
> replace patch 12/12.
> 
> Prefetching is very important for XDP, especially when using a CPU
> without DDIO (here i7-4790K CPU @ 4.00GHz).
> 
> Program xdp1: touching-data and dropping packets:
>  * 11,363,925 pkt/s == no-prefetch
>  * 21,031,096 pkt/s == brenden's-prefetch
>  * 21,062,728 pkt/s == this-prefetch-patch
> 
> Program xdp2: write-data (swap src_dst_mac) TX-bounce out same interface:
>  *  6,726,482 pkt/s == no-prefetch
>  * 10,378,163 pkt/s == brenden's-prefetch
>  * 10,622,350 pkt/s == this-prefetch-patch
> 
I see the same XDP numbers in my setup with this patch as well.

[...]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-10 20:56   ` Tom Herbert
@ 2016-07-11 16:51     ` Brenden Blanco
  2016-07-11 21:21       ` Daniel Borkmann
  0 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-11 16:51 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann

On Sun, Jul 10, 2016 at 03:56:02PM -0500, Tom Herbert wrote:
> On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> > Add a new bpf prog type that is intended to run in early stages of the
> > packet rx path. Only minimal packet metadata will be available, hence a
> > new context type, struct xdp_md, is exposed to userspace. So far only
> > expose the packet start and end pointers, and only in read mode.
> >
> > An XDP program must return one of the well known enum values, all other
> > return codes are reserved for future use. Unfortunately, this
> > restriction is hard to enforce at verification time, so take the
> > approach of warning at runtime when such programs are encountered. The
> > driver can choose to implement unknown return codes however it wants,
> > but must invoke the warning helper with the action value.
> >
> > Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> > ---
> >  include/linux/filter.h   | 18 ++++++++++
> >  include/uapi/linux/bpf.h | 19 ++++++++++
> >  kernel/bpf/verifier.c    |  1 +
> >  net/core/filter.c        | 91 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 129 insertions(+)
> >
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 6fc31ef..522dbc9 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -368,6 +368,11 @@ struct bpf_skb_data_end {
> >         void *data_end;
> >  };
> >
> > +struct xdp_buff {
> > +       void *data;
> > +       void *data_end;
> > +};
> > +
> >  /* compute the linear packet data range [data, data_end) which
> >   * will be accessed by cls_bpf and act_bpf programs
> >   */
> > @@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
> >         return BPF_PROG_RUN(prog, skb);
> >  }
> >
> > +static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
> > +                                  struct xdp_buff *xdp)
> > +{
> > +       u32 ret;
> > +
> > +       rcu_read_lock();
> > +       ret = BPF_PROG_RUN(prog, (void *)xdp);
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> > +}
> > +
> >  static inline unsigned int bpf_prog_size(unsigned int proglen)
> >  {
> >         return max(sizeof(struct bpf_prog),
> > @@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
> >
> >  struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
> >                                        const struct bpf_insn *patch, u32 len);
> > +void bpf_warn_invalid_xdp_action(int act);
> >
> >  #ifdef CONFIG_BPF_JIT
> >  extern int bpf_jit_enable;
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index c14ca1c..5b47ac3 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -94,6 +94,7 @@ enum bpf_prog_type {
> >         BPF_PROG_TYPE_SCHED_CLS,
> >         BPF_PROG_TYPE_SCHED_ACT,
> >         BPF_PROG_TYPE_TRACEPOINT,
> > +       BPF_PROG_TYPE_XDP,
> >  };
> >
> >  #define BPF_PSEUDO_MAP_FD      1
> > @@ -430,4 +431,22 @@ struct bpf_tunnel_key {
> >         __u32 tunnel_label;
> >  };
> >
> > +/* User return codes for XDP prog type.
> > + * A valid XDP program must return one of these defined values. All other
> > + * return codes are reserved for future use. Unknown return codes will result
> > + * in driver-dependent behavior.
> > + */
> > +enum xdp_action {
> > +       XDP_DROP,
> > +       XDP_PASS,
> > +};
> > +
> > +/* user accessible metadata for XDP packet hook
> > + * new fields must be added to the end of this structure
> > + */
> > +struct xdp_md {
> > +       __u32 data;
> > +       __u32 data_end;
> > +};
> > +
> >  #endif /* _UAPI__LINUX_BPF_H__ */
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index e206c21..a8d67d0 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
> >         switch (env->prog->type) {
> >         case BPF_PROG_TYPE_SCHED_CLS:
> >         case BPF_PROG_TYPE_SCHED_ACT:
> > +       case BPF_PROG_TYPE_XDP:
> >                 break;
> >         default:
> >                 verbose("verifier is misconfigured\n");
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 10c4a2f..4ba446f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -2369,6 +2369,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
> >         }
> >  }
> >
> > +static const struct bpf_func_proto *
> > +xdp_func_proto(enum bpf_func_id func_id)
> > +{
> > +       return sk_filter_func_proto(func_id);
> > +}
> > +
> >  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
> >  {
> >         if (off < 0 || off >= sizeof(struct __sk_buff))
> 
> Can off be size_t to eliminate <0 check?
No, this is coming directly from struct bpf_insn, where it is a __s16.
This function signature shows up in lots of different places and would
be a pain to fixup.
> 
> > @@ -2436,6 +2442,56 @@ static bool tc_cls_act_is_valid_access(int off, int size,
> >         return __is_valid_access(off, size, type);
> >  }
> >
> > +static bool __is_valid_xdp_access(int off, int size,
> > +                                 enum bpf_access_type type)
> > +{
> > +       if (off < 0 || off >= sizeof(struct xdp_md))
> > +               return false;
> > +       if (off % size != 0)
> 
> off & 3 != 0
Feasible, but was intending to keep with the surrounding style. What do
the other bpf maintainers think?
> 
> > +               return false;
> > +       if (size != 4)
> > +               return false;
> 
> If size must always be 4 why is it even an argument?
Because this is the first time that the verifier has a chance to check
it, and size == 4 could potentially be a prog_type-specific requirement.
> 
> > +
> > +       return true;
> > +}
> > +
> > +static bool xdp_is_valid_access(int off, int size,
> > +                               enum bpf_access_type type,
> > +                               enum bpf_reg_type *reg_hint)
> > +{
> > +       if (type == BPF_WRITE)
> > +               return false;
> > +
> > +       switch (off) {
> > +       case offsetof(struct xdp_md, data):
> > +               *reg_hint = PTR_TO_PACKET;
> > +               break;
> > +       case offsetof(struct xdp_md, data_end):
> > +               *reg_hint = PTR_TO_PACKET_END;
> > +               break;
> case sizeof(int) for below?
> 
> > +       }
> > +
> > +       return __is_valid_xdp_access(off, size, type);
> > +}
> > +
> > +void bpf_warn_invalid_xdp_action(int act)
> > +{
> > +       WARN_ONCE(1, "\n"
> > +                    "*****************************************************\n"
> > +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> > +                    "**                                               **\n"
> > +                    "** XDP program returned unknown value %-10u **\n"
> > +                    "**                                               **\n"
> > +                    "** XDP programs must return a well-known return  **\n"
> > +                    "** value. Invalid return values will result in   **\n"
> > +                    "** undefined packet actions.                     **\n"
> > +                    "**                                               **\n"
> > +                    "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> > +                    "*****************************************************\n",
> > +                 act);
> 
> Seems a little verbose to be. Just do a simple WARN_ONCE and probably
> more important to bump a counter.
The verbosity is intentional, modeled after the warning in
trace_printk_init_buffers(). Do you feel strongly on this?

A counter where? It would have to be outside of this function; there is
no common bpf infra for such things. Driver ethtool-aware stats were
already mentioned.
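
For what it's worth, a driver-side counter could look roughly like this
(sketch only; the per-ring field is made up here, not part of the patch):

		/* in the mlx4 rx switch, next to the existing warning */
		default:
			ring->xdp_bad_act++;	/* hypothetical counter, exposed via ethtool -S */
			bpf_warn_invalid_xdp_action(act);
		case XDP_DROP:
			goto next;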

> Also function should take the skb in
> question as an argument in case we want to do more inspection or
> logging in the future.
There is no skb to pass in, we're operating on dma'ed memory.
> 
> > +}
> > +EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
> > +
> >  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> >                                       int src_reg, int ctx_off,
> >                                       struct bpf_insn *insn_buf,
> > @@ -2587,6 +2643,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> >         return insn - insn_buf;
> >  }
> >
> > +static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
> > +                                 int src_reg, int ctx_off,
> > +                                 struct bpf_insn *insn_buf,
> > +                                 struct bpf_prog *prog)
> > +{
> > +       struct bpf_insn *insn = insn_buf;
> > +
> > +       switch (ctx_off) {
> > +       case offsetof(struct xdp_md, data):
> > +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
> > +                                     dst_reg, src_reg,
> > +                                     offsetof(struct xdp_buff, data));
> > +               break;
> > +       case offsetof(struct xdp_md, data_end):
> > +               *insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
> > +                                     dst_reg, src_reg,
> > +                                     offsetof(struct xdp_buff, data_end));
> > +               break;
> > +       }
> > +
> > +       return insn - insn_buf;
> > +}
> > +
> >  static const struct bpf_verifier_ops sk_filter_ops = {
> >         .get_func_proto         = sk_filter_func_proto,
> >         .is_valid_access        = sk_filter_is_valid_access,
> > @@ -2599,6 +2678,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
> >         .convert_ctx_access     = bpf_net_convert_ctx_access,
> >  };
> >
> > +static const struct bpf_verifier_ops xdp_ops = {
> > +       .get_func_proto         = xdp_func_proto,
> > +       .is_valid_access        = xdp_is_valid_access,
> > +       .convert_ctx_access     = xdp_convert_ctx_access,
> > +};
> > +
> >  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
> >         .ops    = &sk_filter_ops,
> >         .type   = BPF_PROG_TYPE_SOCKET_FILTER,
> > @@ -2614,11 +2699,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
> >         .type   = BPF_PROG_TYPE_SCHED_ACT,
> >  };
> >
> > +static struct bpf_prog_type_list xdp_type __read_mostly = {
> > +       .ops    = &xdp_ops,
> > +       .type   = BPF_PROG_TYPE_XDP,
> > +};
> > +
> >  static int __init register_sk_filter_ops(void)
> >  {
> >         bpf_register_prog_type(&sk_filter_type);
> >         bpf_register_prog_type(&sched_cls_type);
> >         bpf_register_prog_type(&sched_act_type);
> > +       bpf_register_prog_type(&xdp_type);
> >
> >         return 0;
> >  }
> > --
> > 2.8.2
> >

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter
  2016-07-11 16:51     ` Brenden Blanco
@ 2016-07-11 21:21       ` Daniel Borkmann
  0 siblings, 0 replies; 59+ messages in thread
From: Daniel Borkmann @ 2016-07-11 21:21 UTC (permalink / raw)
  To: Brenden Blanco, Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf

On 07/11/2016 06:51 PM, Brenden Blanco wrote:
> On Sun, Jul 10, 2016 at 03:56:02PM -0500, Tom Herbert wrote:
>> On Thu, Jul 7, 2016 at 9:15 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
[...]
>>> +static bool __is_valid_xdp_access(int off, int size,
>>> +                                 enum bpf_access_type type)
>>> +{
>>> +       if (off < 0 || off >= sizeof(struct xdp_md))
>>> +               return false;
>>> +       if (off % size != 0)
>>
>> off & 3 != 0
> Feasible, but was intending to keep with the surrounding style. What do
> the other bpf maintainers think?
>>
>>> +               return false;
>>> +       if (size != 4)
>>> +               return false;
>>
>> If size must always be 4 why is it even an argument?
> Because this is the first time that the verifier has a chance to check
> it, and size == 4 could potentially be a prog_type-specific requirement.

Yep, and wrt the above, I think it's more important that all is_valid_*_access()
functions are consistent with each other and easily reviewable than adding
optimizations to some of them, which is slow-path anyway. If we find a nice
simplification, then we should apply it also to others obviously.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-11 11:48         ` Saeed Mahameed
@ 2016-07-11 21:49           ` Brenden Blanco
  0 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-11 21:49 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Tariq Toukan, David S. Miller, Linux Netdev List,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On Mon, Jul 11, 2016 at 02:48:17PM +0300, Saeed Mahameed wrote:
[...]
> 
> yes, we need something like:
> 
> +static inline void
> +mlx4_en_sync_dma(struct mlx4_en_priv *priv,
> +                struct mlx4_en_rx_desc *rx_desc,
> +                int length)
> +{
> +       dma_addr_t dma;
> +
> +       /* Sync dma addresses from HW descriptor */
> +       for (nr = 0; nr < priv->num_frags; nr++) {
> +               struct mlx4_en_frag_info *frag_info = &priv->frag_info[nr];
> +
> +               if (length <= frag_info->frag_prefix_size)
> +                       break;
> +
> +               dma = be64_to_cpu(rx_desc->data[nr].addr);
> +               dma_sync_single_for_cpu(priv->ddev, dma, frag_info->frag_size,
> +                                       DMA_FROM_DEVICE);
> +       }
> +}
> 
> 
> @@ -790,6 +808,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
> struct mlx4_en_cq *cq, int bud
>                         goto next;
>                 }
> 
> +               length = be32_to_cpu(cqe->byte_cnt);
> +               length -= ring->fcs_del;
> +
> +               mlx4_en_sync_dma(priv,rx_desc, length);
>                  /* data is available continue processing the packet */
> 
> and make sure to remove all explicit dma_sync_single_for_cpu calls.

I see. At first glance, this may work, but introduces some changes in
the driver that may be unwanted. For instance, the dma sync cost is now
being paid even in the case where no skb will be allocated. So, under
memory pressure, it might cause extra work which would slow down your
ability to recover from the stress.

Let's keep discussing it, but in the context of a standalone cleanup.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-11 11:09         ` Jesper Dangaard Brouer
  2016-07-11 16:00           ` Brenden Blanco
@ 2016-07-11 23:05           ` Alexei Starovoitov
  2016-07-12 12:45             ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-11 23:05 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, kafai, daniel, tom, bblanco, john.fastabend, gerlitz.or,
	hannes, rana.shahot, tgraf, David S. Miller, as754m, saeedm,
	amira, tzahio, Eric Dumazet

On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote:
> > -	/* Process all completed CQEs */
> > +	/* Extract and prefetch completed CQEs */
> >  	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
> >  		    cq->mcq.cons_index & cq->size)) {
> > +		void *data;
> >  
> >  		frags = ring->rx_info + (index << priv->log_rx_info);
> >  		rx_desc = ring->buf + (index << ring->log_stride);
> > +		prefetch(rx_desc);
> >  
> >  		/*
> >  		 * make sure we read the CQE after we read the ownership bit
> >  		 */
> >  		dma_rmb();
> >  
> > +		cqe_array[cqe_idx++] = cqe;
> > +
> > +		/* Base error handling here, free handled in next loop */
> > +		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> > +			     MLX4_CQE_OPCODE_ERROR))
> > +			goto skip;
> > +
> > +		data = page_address(frags[0].page) + frags[0].page_offset;
> > +		prefetch(data);

that's probably not correct in all cases, since doing a prefetch on an address
that is going to be evicted soon may hurt performance.
We need to dma_sync_single_for_cpu() before doing a prefetch, or
somehow figure out that dma_sync is a nop, so we can omit it altogether
and do whatever prefetches we like.
Also, unconditionally doing a batch of 8 may also hurt depending on what
is happening either with the stack, bpf afterwards or even the cpu version.
Doing a single prefetch of the Nth packet is probably ok most of the time,
but asking the cpu to prefetch 8 packets at once is unnecessary, especially
since a single prefetch gives the same performance.
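
Concretely, the ordering being asked for would be something like the
following (editor's sketch, reusing the fields from the RFC patch and
the mlx4 hunk quoted above):

		dma = be64_to_cpu(rx_desc->data[0].addr);
		dma_sync_single_for_cpu(priv->ddev, dma,
					priv->frag_info[0].frag_size,
					DMA_FROM_DEVICE);

		data = page_address(frags[0].page) + frags[0].page_offset;
		prefetch(data);	/* now safe: the CPU owns the data */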

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-11 23:05           ` Alexei Starovoitov
@ 2016-07-12 12:45             ` Jesper Dangaard Brouer
  2016-07-12 16:46               ` Alexander Duyck
  0 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-12 12:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, kafai, daniel, tom, bblanco, john.fastabend, gerlitz.or,
	hannes, rana.shahot, tgraf, David S. Miller, as754m, saeedm,
	amira, tzahio, Eric Dumazet, brouer

On Mon, 11 Jul 2016 16:05:11 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote:
> > > -	/* Process all completed CQEs */
> > > +	/* Extract and prefetch completed CQEs */
> > >  	while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
> > >  		    cq->mcq.cons_index & cq->size)) {
> > > +		void *data;
> > >  
> > >  		frags = ring->rx_info + (index << priv->log_rx_info);
> > >  		rx_desc = ring->buf + (index << ring->log_stride);
> > > +		prefetch(rx_desc);
> > >  
> > >  		/*
> > >  		 * make sure we read the CQE after we read the ownership bit
> > >  		 */
> > >  		dma_rmb();
> > >  
> > > +		cqe_array[cqe_idx++] = cqe;
> > > +
> > > +		/* Base error handling here, free handled in next loop */
> > > +		if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> > > +			     MLX4_CQE_OPCODE_ERROR))
> > > +			goto skip;
> > > +
> > > +		data = page_address(frags[0].page) + frags[0].page_offset;
> > > +		prefetch(data);  
> 
> that's probably not correct in all cases, since doing prefetch on the address
> that is going to be evicted soon may hurt performance.
> We need to dma_sync_single_for_cpu() before doing a prefetch or
> somehow figure out that dma_sync is a nop, so we can omit it altogether
> and do whatever prefetches we like.

Sure, DMA can be synced first (actually already played with this).

> Also unconditionally doing batch of 8 may also hurt depending on what
> is happening either with the stack, bpf afterwards or even cpu version.

See this as software DDIO: in the unlikely case that data gets
evicted, it will still exist in the L2 or L3 cache (like DDIO). Notice,
only 1024 bytes are getting prefetched here.

> Doing single prefetch of Nth packet is probably ok most of the time,
> but asking cpu to prefetch 8 packets at once is unnecessary especially
> since single prefetch gives the same performance.

No, unconditional prefetch of the Nth packet will be wrong most of
the time for real workloads, as Eric Dumazet already pointed out.

This patch does NOT unconditionally prefetch 8 packets.  Prefetching
_only_ happens when it is known that packets are ready in the RX ring.
We know this prefetched data will be used/touched within the NAPI cycle.
Even if the processing of the packet flushes the L1 cache, the data will
still be in L2 or L3 (like DDIO).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-12 12:45             ` Jesper Dangaard Brouer
@ 2016-07-12 16:46               ` Alexander Duyck
  2016-07-12 19:52                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Duyck @ 2016-07-12 16:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Netdev, kafai, Daniel Borkmann, Tom Herbert,
	Brenden Blanco, john fastabend, Or Gerlitz, Hannes Frederic Sowa,
	rana.shahot, Thomas Graf, David S. Miller, as754m,
	Saeed Mahameed, amira, tzahio, Eric Dumazet

On Tue, Jul 12, 2016 at 5:45 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Mon, 11 Jul 2016 16:05:11 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote:
>> > > - /* Process all completed CQEs */
>> > > + /* Extract and prefetch completed CQEs */
>> > >   while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
>> > >               cq->mcq.cons_index & cq->size)) {
>> > > +         void *data;
>> > >
>> > >           frags = ring->rx_info + (index << priv->log_rx_info);
>> > >           rx_desc = ring->buf + (index << ring->log_stride);
>> > > +         prefetch(rx_desc);
>> > >
>> > >           /*
>> > >            * make sure we read the CQE after we read the ownership bit
>> > >            */
>> > >           dma_rmb();
>> > >
>> > > +         cqe_array[cqe_idx++] = cqe;
>> > > +
>> > > +         /* Base error handling here, free handled in next loop */
>> > > +         if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
>> > > +                      MLX4_CQE_OPCODE_ERROR))
>> > > +                 goto skip;
>> > > +
>> > > +         data = page_address(frags[0].page) + frags[0].page_offset;
>> > > +         prefetch(data);
>>
>> that's probably not correct in all cases, since doing prefetch on the address
>> that is going to be evicted soon may hurt performance.
>> We need to dma_sync_single_for_cpu() before doing a prefetch or
>> somehow figure out that dma_sync is a nop, so we can omit it altogether
>> and do whatever prefetches we like.
>
> Sure, DMA can be synced first (actually already played with this).

Yes, but the point I think that Alexei is kind of indirectly getting
at is that you are doing all your tests on x86 architecture are you
not?  The x86 stuff is a very different beast from architectures like
ARM which have a very different architecture when it comes to how they
handle the memory organization of the system.  In the case of x86 the
only time dma_sync is not a nop is if you force swiotlb to be enabled
at which point the whole performance argument is kind of pointless
anyway.

>> Also unconditionally doing batch of 8 may also hurt depending on what
>> is happening either with the stack, bpf afterwards or even cpu version.
>
> See this as software DDIO, if the unlikely case that data will get
> evicted, it will still exist in L2 or L3 cache (like DDIO). Notice,
> only 1024 bytes are getting prefetched here.

I disagree.  DDIO only pushes received frames into the L3 cache.  What
you are potentially doing is flooding the L2 cache.  The difference in
size between the L3 and L2 caches is very significant.  L3 cache size
is in the MB range while the L2 cache is only 256KB or so for Xeon
processors and such.  In addition, DDIO is really meant for an
architecture that has a fairly large cache region to spare, and it
limits itself to that cache region; the approach taken in this code
could potentially prefetch a fairly significant chunk of memory.

>> Doing single prefetch of Nth packet is probably ok most of the time,
>> but asking cpu to prefetch 8 packets at once is unnecessary especially
>> since single prefetch gives the same performance.
>
> No, unconditionally prefetch of the Nth packet, will be wrong most of
> the time, for real work loads, as Eric Dumazet already pointed out.
>
> This patch does NOT unconditionally prefetch 8 packets.  Prefetching
> _only_ happens when it is known that packets are ready in the RX ring.
> We know this prefetch data will be used/touched, within the NAPI cycle.
> Even if the processing of the packet flush L1 cache, then it will be in
> L2 or L3 (like DDIO).

I think the point you are missing here Jesper is that the packet isn't
what will be flushed out of L1.  It will be all the data that had been
fetched before that.  So for example the L1 cache can only hold 32K,
and the way it is set up, if you fetch the first 64 bytes of 8 pages you
will have evicted everything that was in that cache set; it will be
flushed out to L2.

Also it might be worth while to see what instruction is being used for
the prefetch.  Last I knew for read prefetches it was prefetchnta on
x86 which would only pull the data into the L1 cache as a
"non-temporal" store.  If I am not mistaken I think you run the risk
of having the prefetched data evicted back out, bypassing the L2
and L3 caches unless it is modified.  That was kind of the point of
prefetchnta, as it is really meant to be a read-only prefetch,
meant to avoid polluting the L2 and L3 caches.

- Alex

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-12 16:46               ` Alexander Duyck
@ 2016-07-12 19:52                 ` Jesper Dangaard Brouer
  2016-07-13  1:37                   ` Alexei Starovoitov
  0 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-12 19:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexei Starovoitov, Netdev, kafai, Daniel Borkmann, Tom Herbert,
	Brenden Blanco, john fastabend, Or Gerlitz, Hannes Frederic Sowa,
	rana.shahot, Thomas Graf, David S. Miller, as754m,
	Saeed Mahameed, amira, tzahio, Eric Dumazet, brouer

On Tue, 12 Jul 2016 09:46:26 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Tue, Jul 12, 2016 at 5:45 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Mon, 11 Jul 2016 16:05:11 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >  
> >> On Mon, Jul 11, 2016 at 01:09:22PM +0200, Jesper Dangaard Brouer wrote:  
> >> > > - /* Process all completed CQEs */
> >> > > + /* Extract and prefetch completed CQEs */
> >> > >   while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK,
> >> > >               cq->mcq.cons_index & cq->size)) {
> >> > > +         void *data;
> >> > >
> >> > >           frags = ring->rx_info + (index << priv->log_rx_info);
> >> > >           rx_desc = ring->buf + (index << ring->log_stride);
> >> > > +         prefetch(rx_desc);
> >> > >
> >> > >           /*
> >> > >            * make sure we read the CQE after we read the ownership bit
> >> > >            */
> >> > >           dma_rmb();
> >> > >
> >> > > +         cqe_array[cqe_idx++] = cqe;
> >> > > +
> >> > > +         /* Base error handling here, free handled in next loop */
> >> > > +         if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) ==
> >> > > +                      MLX4_CQE_OPCODE_ERROR))
> >> > > +                 goto skip;
> >> > > +
> >> > > +         data = page_address(frags[0].page) + frags[0].page_offset;
> >> > > +         prefetch(data);  
> >>
> >> that's probably not correct in all cases, since doing prefetch on the address
> >> that is going to be evicted soon may hurt performance.
> >> We need to dma_sync_single_for_cpu() before doing a prefetch or
> >> somehow figure out that dma_sync is a nop, so we can omit it altogether
> >> and do whatever prefetches we like.  
> >
> > Sure, DMA can be synced first (actually already played with this).  
> 
> Yes, but the point I think that Alexei is kind of indirectly getting
> at is that you are doing all your tests on x86 architecture are you
> not?  The x86 stuff is a very different beast from architectures like
> ARM which have a very different architecture when it comes to how they
> handle the memory organization of the system.  In the case of x86 the
> only time dma_sync is not a nop is if you force swiotlb to be enabled
> at which point the whole performance argument is kind of pointless
> anyway.
> 
> >> Also unconditionally doing batch of 8 may also hurt depending on what
> >> is happening either with the stack, bpf afterwards or even cpu version.  
> >
> > See this as software DDIO, if the unlikely case that data will get
> > evicted, it will still exist in L2 or L3 cache (like DDIO). Notice,
> > only 1024 bytes are getting prefetched here.  
> 
> I disagree.  DDIO only pushes received frames into the L3 cache.  What
> you are potentially doing is flooding the L2 cache.  The difference in
> size between the L3 and L2 caches is very significant.  L3 cache size
> is in the MB range while the L2 cache is only 256KB or so for Xeon
> processors and such.  In addition DDIO is really meant for an
> architecture that has a fairly large cache region to spare and it it
> limits itself to that cache region, the approach taken in this code
> could potentially prefetch a fairly significant chunk of memory.

No matter how you slice it, reading this memory is needed, as I'm
making sure only to prefetch packets that are "ready" and are within
the NAPI budget.  (eth_type_trans/eth_get_headlen)
 
> >> Doing single prefetch of Nth packet is probably ok most of the
> >> time, but asking cpu to prefetch 8 packets at once is unnecessary
> >> especially since single prefetch gives the same performance.  
> >
> > No, unconditionally prefetch of the Nth packet, will be wrong most
> > of the time, for real work loads, as Eric Dumazet already pointed
> > out.
> >
> > This patch does NOT unconditionally prefetch 8 packets.  Prefetching
> > _only_ happens when it is known that packets are ready in the RX
> > ring. We know this prefetch data will be used/touched, within the
> > NAPI cycle. Even if the processing of the packet flush L1 cache,
> > then it will be in L2 or L3 (like DDIO).  
> 
> I think the point you are missing here Jesper is that the packet isn't
> what will be flushed out of L1.  It will be all the data that had been
> fetched before that.  So for example the L1 cache can only hold 32K,
> and the way it is setup if you fetch the first 64 bytes of 8 pages you
> will have evicted everything that was in that cache set will be
> flushed out to L2.
> 
> Also it might be worth while to see what instruction is being used for
> the prefetch.  Last I knew for read prefetches it was prefetchnta on
> x86 which would only pull the data into the L1 cache as a
> "non-temporal" store.  If I am not mistaken I think you run the risk
> of having the data prefetched evicted back out and bypassing the L2
> and L3 caches unless it is modified.  That was kind of the point of
> the prefetchnta as it really meant to be a read-only prefetch and
> meant to avoid polluting the L2 and L3 caches.

#ifdef CONFIG_X86_32
# define BASE_PREFETCH		""
# define ARCH_HAS_PREFETCH
#else
# define BASE_PREFETCH		"prefetcht0 %P1"
#endif

static inline void prefetch(const void *x)
{
	alternative_input(BASE_PREFETCH, "prefetchnta %P1",
			  X86_FEATURE_XMM,
			  "m" (*(const char *)x));
}

Thanks for the hint. Looking at the code, it does look like 64-bit CPUs
with XMM/SSE do use the prefetchnta instruction.

DPDK uses the prefetcht1 instruction at RX (on 32 packets).  That might be
the better prefetch instruction to use (or prefetcht2).  Looking at the
arm64 code, it does support prefetching, and googling shows arm64 also
supports prefetching to a specific cache level.
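
A helper along those lines could be as small as this (editor's sketch
only, not kernel code; with GCC/Clang on x86 a locality hint of 2 maps
to prefetcht1, comparable to DPDK's rte_prefetch1()):

/* Sketch: prefetch into L2/L3 rather than all the way into L1. */
static inline void prefetch_l2(const void *x)
{
	__builtin_prefetch(x, 0, 2);	/* read, moderate temporal locality */
}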

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [net-next PATCH RFC] mlx4: RX prefetch loop
  2016-07-12 19:52                 ` Jesper Dangaard Brouer
@ 2016-07-13  1:37                   ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-13  1:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Duyck, Netdev, kafai, Daniel Borkmann, Tom Herbert,
	Brenden Blanco, john fastabend, Or Gerlitz, Hannes Frederic Sowa,
	rana.shahot, Thomas Graf, David S. Miller, as754m,
	Saeed Mahameed, amira, tzahio, Eric Dumazet

On Tue, Jul 12, 2016 at 09:52:52PM +0200, Jesper Dangaard Brouer wrote:
> > 
> > >> Also unconditionally doing batch of 8 may also hurt depending on what
> > >> is happening either with the stack, bpf afterwards or even cpu version.  
> > >
> > > See this as software DDIO, if the unlikely case that data will get
> > > evicted, it will still exist in L2 or L3 cache (like DDIO). Notice,
> > > only 1024 bytes are getting prefetched here.  
> > 
> > I disagree.  DDIO only pushes received frames into the L3 cache.  What
> > you are potentially doing is flooding the L2 cache.  The difference in
> > size between the L3 and L2 caches is very significant.  L3 cache size
> > is in the MB range while the L2 cache is only 256KB or so for Xeon
> > processors and such.  In addition DDIO is really meant for an
> > architecture that has a fairly large cache region to spare and it it
> > limits itself to that cache region, the approach taken in this code
> > could potentially prefetch a fairly significant chunk of memory.
> 
> No matter how you slice it, reading this memory is needed, as I'm
> making sure only to prefetch packets that are "ready" and are within
> the NAPI budget.  (eth_type_trans/eth_get_headlen)

when compilers insert prefetches it typically looks like:
for (int i;...; i += S) {
  prefetch(data + i + N);
  access data[i]
}
the N is calculated based on weight of the loop and there is
no check that i + N is within loop bounds.
prefetch by definition is speculative. Too many prefetches hurt.
Wrong prefetch distance N hurts too.
Modern cpus compute stride in hw and do automatic prefetch, so
compilers rarely emit sw prefetch anymore, but the same logic
still applies. The ideal packet processing loop:
for (...) {
  prefetch(packet + i + N);
  access packet + i
}
if there is no loop there is no value in prefetch, since there
is no deterministic way to figure out exact time when packet
data would be accessed.
In case of bpf the program author can tell us 'weight' of
the program and since the program processes the packets
mostly through the same branches and lookups we can issue
prefetch based on author's hint.
Compilers never do:
prefetch data + i
prefetch data + i + 1
prefetch data + i + 2
access data + i
access data + i + 1
access data + i + 2
because by the time access is happening the prefetched data
may be already evicted.
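
Put as a sketch (the ring/packet accessors below are placeholders, not
the mlx4 API), the shape being argued for is a bounded look-ahead inside
the processing loop rather than a separate prefetch pass:

/* Editor's illustration only: prefetch distance bounded by the work
 * actually queued, so nothing outside the NAPI budget is touched. */
#define PREFETCH_DIST 4

static int rx_poll(struct ring *ring, int budget, int ready)
{
	int i;

	for (i = 0; i < budget && i < ready; i++) {
		if (i + PREFETCH_DIST < ready)
			prefetch(pkt_data(ring, i + PREFETCH_DIST));
		process_packet(pkt_data(ring, i));
	}
	return i;
}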

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v6 05/12] Add sample for adding simple drop program to link
  2016-07-11 13:37     ` Jesper Dangaard Brouer
@ 2016-07-16 14:55       ` Jamal Hadi Salim
  0 siblings, 0 replies; 59+ messages in thread
From: Jamal Hadi Salim @ 2016-07-16 14:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, Martin KaFai Lau, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann

On 16-07-11 09:37 AM, Jesper Dangaard Brouer wrote:
> On Mon, 11 Jul 2016 07:09:26 -0400
> Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>

>>> $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
>>> proto 17:   20403027 drops/s

[..]

>> So - devil's advocate speaking:
>> I can filter and drop with this very specific NIC at 10x as fast
>> in hardware, correct?
>
> After avoiding the cache-miss, I believe, we have actually reached the
> NIC HW limit.


The NIC offload can hold thousands of flows (and, my understanding is,
millions by end of year with a firmware upgrade) with basic actions
to drop, redirect etc. So running into the NIC HW limit is questionable.
It is not an impressive use case. Even if you went back 4-5 years
and looked at earlier IGBs, which can hold 1-200 rules, it is still
not impressive. Now an older NIC - that would have been a different
case.

> I base this on, my measurements show that the CPU start
> to go idle, even enter sleep C-states.  And we exit NAPI mode, not
> using the full budget, emptying the RX ring.
>

Yes, this is an issue albeit a separate one.

>> Would a different NIC (pick something like e1000) have served a better
>> example?
>> BTW: Brenden, now that i looked closer here, you really dont have
>> apple-apple comparison with dropping at tc ingress. You have a
>> tweaked prefetch and are intentionally running things on a single
>> core. Note: We are able to do 20Mpps drops with tc with a single
>> core (as shown in netdev11) on a NUC with removing driver overhead.
>
> AFAIK you were using the pktgen "xmit_mode netif_receive" which inject
> packets directly into the stack, thus removing the NIC driver from the
> equation.  Brenden only is measuring the driver.
>    Thus, you are both doing zoom-in-measuring (of a very specific and
> limited section of the code) but two completely different pieces of
> code.
>
> Notice, Jamal, in your 20Mpps results, your are also avoiding
> interacting with the memory allocator, as you are recycling the same
> SKB (and don't be confused by seeing kfree_skb() in perf-top as it only
> does atomic_dec() [1]).
>

That was design intent, in order to isolate the system under test. The
paper goes to lengths explaining how we narrow down what it is we
are testing. If we are testing the classifier - it is unfair to
factor in driver overhead. My point to Brenden is that 20Mpps is not
an issue for dropping at tc ingress; the driver overhead and
the magic prefetch strides definitely affect the results.

BTW: The biggest surprise (if you are looking for low-hanging fruit) was
that IPv4 forwarding was more of a bottleneck than the egress
qdisc lock. And you are right, memory issues were the main challenge.

> In this code-zoom-in benchmark (given the single CPU is kept 100% busy) you
> are actually measuring that the code path (on average) takes 50 nanosec
> (1/20 Mpps = 50 ns) to execute.  Which is cool, but it is only a zoom-in on a
> specific code path (which avoids any I-cache misses).
>

Using nanosec as a metric is not a good idea; it ignores the fact
that processing is affected by more than CPU cycles,
i.e. even on the same hardware the "nanosec" changes if you use
lower-frequency RAM.

cheers,
jamal

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-07-16 14:55 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-08  2:15 [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 01/12] bpf: add XDP prog type for early driver filter Brenden Blanco
2016-07-09  8:14   ` Jesper Dangaard Brouer
2016-07-09 13:47     ` Tom Herbert
2016-07-10 13:37       ` Jesper Dangaard Brouer
2016-07-10 17:09         ` Brenden Blanco
2016-07-10 20:30           ` Tom Herbert
2016-07-11 10:15             ` Daniel Borkmann
2016-07-11 12:58               ` Jesper Dangaard Brouer
2016-07-10 20:27         ` Tom Herbert
2016-07-11 11:36           ` Jesper Dangaard Brouer
2016-07-10 20:56   ` Tom Herbert
2016-07-11 16:51     ` Brenden Blanco
2016-07-11 21:21       ` Daniel Borkmann
2016-07-10 21:04   ` Tom Herbert
2016-07-11 13:53     ` Jesper Dangaard Brouer
2016-07-08  2:15 ` [PATCH v6 02/12] net: add ndo to set xdp prog in adapter rx Brenden Blanco
2016-07-10 20:59   ` Tom Herbert
2016-07-11 10:35     ` Daniel Borkmann
2016-07-08  2:15 ` [PATCH v6 03/12] rtnl: add option for setting link xdp prog Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
2016-07-09 14:07   ` Or Gerlitz
2016-07-10 15:40     ` Brenden Blanco
2016-07-10 16:38       ` Tariq Toukan
2016-07-09 19:58   ` Saeed Mahameed
2016-07-09 21:37     ` Or Gerlitz
2016-07-10 15:25     ` Tariq Toukan
2016-07-10 16:05       ` Brenden Blanco
2016-07-11 11:48         ` Saeed Mahameed
2016-07-11 21:49           ` Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 05/12] Add sample for adding simple drop program to link Brenden Blanco
2016-07-09 20:21   ` Saeed Mahameed
2016-07-11 11:09   ` Jamal Hadi Salim
2016-07-11 13:37     ` Jesper Dangaard Brouer
2016-07-16 14:55       ` Jamal Hadi Salim
2016-07-08  2:15 ` [PATCH v6 06/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 07/12] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 08/12] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 09/12] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 10/12] bpf: enable direct packet data write for xdp progs Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 11/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
2016-07-08  2:15 ` [PATCH v6 12/12] net/mlx4_en: add prefetch in xdp rx path Brenden Blanco
2016-07-08  3:56   ` Eric Dumazet
2016-07-08  4:16     ` Alexei Starovoitov
2016-07-08  6:56       ` Eric Dumazet
2016-07-08 16:49         ` Brenden Blanco
2016-07-10 20:48           ` Tom Herbert
2016-07-10 20:50           ` Tom Herbert
2016-07-11 14:54             ` Jesper Dangaard Brouer
2016-07-08 15:20     ` Jesper Dangaard Brouer
2016-07-08 16:02       ` [net-next PATCH RFC] mlx4: RX prefetch loop Jesper Dangaard Brouer
2016-07-11 11:09         ` Jesper Dangaard Brouer
2016-07-11 16:00           ` Brenden Blanco
2016-07-11 23:05           ` Alexei Starovoitov
2016-07-12 12:45             ` Jesper Dangaard Brouer
2016-07-12 16:46               ` Alexander Duyck
2016-07-12 19:52                 ` Jesper Dangaard Brouer
2016-07-13  1:37                   ` Alexei Starovoitov
2016-07-10 16:14 ` [PATCH v6 00/12] Add driver bpf hook for early packet drop and forwarding Tariq Toukan
