* [RFC PATCH 0/5] Add driver bpf hook for early packet drop
@ 2016-04-02  1:21 Brenden Blanco
  2016-04-02  1:21 ` [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter Brenden Blanco
                   ` (6 more replies)
  0 siblings, 7 replies; 62+ messages in thread
From: Brenden Blanco @ 2016-04-02  1:21 UTC (permalink / raw)
  To: davem
  Cc: Brenden Blanco, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer

This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling Express Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, run
before an skb has even been allocated.

The goal is to enable line-rate filtering; this initial implementation
provides a drop/allow action only.

Patch 1 introduces the new prog type and helpers for validating the bpf
program. A new userspace struct is defined containing only len as a field,
with others to follow in the future.
Patch 2 adds a new ndo for passing the fd to supporting drivers.
Patch 3 exposes a new rtnl option to userspace.
Patch 4 enables support in the mlx4 driver. No skb allocation is required;
instead a static percpu skb is kept in the driver and minimally initialized
for each rx frag.
Patch 5 adds a sample drop-and-count program. With a single core, it
achieved a ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data
access, a bpf array lookup, and an increment.

Interestingly, accessing packet data from the program did not have a
noticeable impact on performance. Even so, future enhancements to
prefetching / batching / page allocation should improve performance in
this path further.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

Brenden Blanco (5):
  bpf: add PHYS_DEV prog type for early driver filter
  net: add ndo to set bpf prog in adapter rx
  rtnl: add option for setting link bpf prog
  mlx4: add support for fast rx drop bpf program
  Add sample for adding simple drop program to link

 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     |  18 +++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
 include/linux/netdevice.h                      |   8 ++
 include/uapi/linux/bpf.h                       |   5 +
 include/uapi/linux/if_link.h                   |   1 +
 kernel/bpf/verifier.c                          |   1 +
 net/core/dev.c                                 |  12 ++
 net/core/filter.c                              |  68 +++++++++++
 net/core/rtnetlink.c                           |  10 ++
 samples/bpf/Makefile                           |   4 +
 samples/bpf/bpf_load.c                         |   8 ++
 samples/bpf/netdrvx1_kern.c                    |  26 +++++
 samples/bpf/netdrvx1_user.c                    | 155 +++++++++++++++++++++++++
 14 files changed, 379 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c

-- 
2.8.0

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
@ 2016-04-02  1:21 ` Brenden Blanco
  2016-04-02 16:39   ` Tom Herbert
  2016-04-04  8:49   ` Daniel Borkmann
  2016-04-02  1:21 ` [RFC PATCH 2/5] net: add ndo to set bpf prog in adapter rx Brenden Blanco
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 62+ messages in thread
From: Brenden Blanco @ 2016-04-02  1:21 UTC (permalink / raw)
  To: davem
  Cc: Brenden Blanco, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer

Add a new bpf prog type that is intended to run in the early stages of the
packet rx path. Only minimal packet metadata will be available, hence a new
context type, struct xdp_metadata, is exposed to userspace. For now, only
the packet length is exposed, and only for reading.

The PHYS_DEV name is chosen to represent that the program is meant only
for physical adapters, rather than all netdevs.

While the user-visible struct is new, the underlying context must be
implemented as a minimal skb in order for the packet load_* instructions
to work. The skb filled in by the driver must have skb->len, skb->head,
and skb->data set, and skb->data_len == 0.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/bpf.h |  5 ++++
 kernel/bpf/verifier.c    |  1 +
 net/core/filter.c        | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 924f537..b8a4ef2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -92,6 +92,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_KPROBE,
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
+	BPF_PROG_TYPE_PHYS_DEV,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -367,6 +368,10 @@ struct __sk_buff {
 	__u32 tc_classid;
 };
 
+struct xdp_metadata {
+	__u32 len;
+};
+
 struct bpf_tunnel_key {
 	__u32 tunnel_id;
 	union {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2e08f8e..804ca70 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_SOCKET_FILTER:
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_PHYS_DEV:
 		return true;
 	default:
 		return false;
diff --git a/net/core/filter.c b/net/core/filter.c
index b7177d0..c417db6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+phys_dev_func_proto(enum bpf_func_id func_id)
+{
+	return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	/* check bounds */
@@ -2073,6 +2079,36 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+				  enum bpf_access_type type)
+{
+	if (off < 0 || off >= sizeof(struct xdp_metadata))
+		return false;
+
+	if (off % size != 0)
+		return false;
+
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static bool phys_dev_is_valid_access(int off, int size,
+				     enum bpf_access_type type)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct xdp_metadata, len):
+		break;
+	default:
+		return false;
+	}
+	return __is_valid_xdp_access(off, size, type);
+}
+
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 				      int src_reg, int ctx_off,
 				      struct bpf_insn *insn_buf,
@@ -2210,6 +2246,26 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type,
+					   int dst_reg, int src_reg,
+					   int ctx_off,
+					   struct bpf_insn *insn_buf,
+					   struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct xdp_metadata, len):
+		BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
+				      offsetof(struct sk_buff, len));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto = sk_filter_func_proto,
 	.is_valid_access = sk_filter_is_valid_access,
@@ -2222,6 +2278,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.convert_ctx_access = bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops phys_dev_ops = {
+	.get_func_proto = phys_dev_func_proto,
+	.is_valid_access = phys_dev_is_valid_access,
+	.convert_ctx_access = bpf_phys_dev_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops = &sk_filter_ops,
 	.type = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2237,11 +2299,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
 	.type = BPF_PROG_TYPE_SCHED_ACT,
 };
 
+static struct bpf_prog_type_list phys_dev_type __read_mostly = {
+	.ops = &phys_dev_ops,
+	.type = BPF_PROG_TYPE_PHYS_DEV,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
+	bpf_register_prog_type(&phys_dev_type);
 
 	return 0;
 }
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH 2/5] net: add ndo to set bpf prog in adapter rx
  2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
  2016-04-02  1:21 ` [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter Brenden Blanco
@ 2016-04-02  1:21 ` Brenden Blanco
  2016-04-02  1:21 ` [RFC PATCH 3/5] rtnl: add option for setting link bpf prog Brenden Blanco
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 62+ messages in thread
From: Brenden Blanco @ 2016-04-02  1:21 UTC (permalink / raw)
  To: davem
  Cc: Brenden Blanco, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer

Add a new netdev op through which drivers implementing the
BPF_PROG_TYPE_PHYS_DEV filter receive their configuration. Since the fd is
only used by the driver to fetch the prog, the netdev only needs to keep a
bit indicating that a valid program is attached.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/netdevice.h |  8 ++++++++
 net/core/dev.c            | 12 ++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cb0d5d0..c46e2e3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1102,6 +1102,11 @@ struct tc_to_netdev {
  *	appropriate rx headroom value allows avoiding skb head copy on
  *	forward. Setting a negative value resets the rx headroom to the
  *	default value.
+ * int  (*ndo_bpf_set)(struct net_device *dev, int fd);
+ *	This function is used to set or clear a bpf program used in the
+ *	earliest stages of packet rx. The fd must be a program loaded as
+ *	BPF_PROG_TYPE_PHYS_DEV. Negative values of fd indicate the program
+ *	should be removed.
  *
  */
 struct net_device_ops {
@@ -1292,6 +1297,7 @@ struct net_device_ops {
 						       struct sk_buff *skb);
 	void			(*ndo_set_rx_headroom)(struct net_device *dev,
 						       int needed_headroom);
+	int			(*ndo_bpf_set)(struct net_device *dev, int fd);
 };
 
 /**
@@ -1875,6 +1881,7 @@ struct net_device {
 	struct phy_device	*phydev;
 	struct lock_class_key	*qdisc_tx_busylock;
 	bool			proto_down;
+	bool			bpf_valid;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -3268,6 +3275,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_bpf_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index b9bcbe7..eb93414 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6480,6 +6480,18 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
+int dev_change_bpf_fd(struct net_device *dev, int fd)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (!ops->ndo_bpf_set)
+		return -EOPNOTSUPP;
+	if (!netif_device_present(dev))
+		return -ENODEV;
+	return ops->ndo_bpf_set(dev, fd);
+}
+EXPORT_SYMBOL(dev_change_bpf_fd);
+
 /**
  *	dev_new_index	-	allocate an ifindex
  *	@net: the applicable net namespace
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH 3/5] rtnl: add option for setting link bpf prog
  2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
  2016-04-02  1:21 ` [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter Brenden Blanco
  2016-04-02  1:21 ` [RFC PATCH 2/5] net: add ndo to set bpf prog in adapter rx Brenden Blanco
@ 2016-04-02  1:21 ` Brenden Blanco
  2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 62+ messages in thread
From: Brenden Blanco @ 2016-04-02  1:21 UTC (permalink / raw)
  To: davem
  Cc: Brenden Blanco, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer

Set the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must refer to a program loaded as
BPF_PROG_TYPE_PHYS_DEV. Providing a negative fd clears the program.
Getting the fd back via rtnl is not possible, therefore reading this
attribute merely returns a bool indicating whether a program is attached
to the link.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c         | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index c488066..08d66a3 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -155,6 +155,7 @@ enum {
 	IFLA_PROTO_DOWN,
 	IFLA_GSO_MAX_SEGS,
 	IFLA_GSO_MAX_SIZE,
+	IFLA_BPF_FD,
 	__IFLA_MAX
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index f206677..1c4a603 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -909,6 +909,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + rtnl_link_get_af_size(dev, ext_filter_mask) /* IFLA_AF_SPEC */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
+	       + nla_total_size(4) /* IFLA_BPF_FD */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1241,6 +1242,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	     nla_put_string(skb, IFLA_IFALIAS, dev->ifalias)) ||
 	    nla_put_u32(skb, IFLA_CARRIER_CHANGES,
 			atomic_read(&dev->carrier_changes)) ||
+	    nla_put_s32(skb, IFLA_BPF_FD, dev->bpf_valid) ||
 	    nla_put_u8(skb, IFLA_PROTO_DOWN, dev->proto_down))
 		goto nla_put_failure;
 
@@ -1374,6 +1376,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 	[IFLA_PROTO_DOWN]	= { .type = NLA_U8 },
+	[IFLA_BPF_FD]		= { .type = NLA_S32 },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2028,6 +2031,13 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_NOTIFY;
 	}
 
+	if (tb[IFLA_BPF_FD]) {
+		err = dev_change_bpf_fd(dev, nla_get_s32(tb[IFLA_BPF_FD]));
+		if (err)
+			goto errout;
+		status |= DO_SETLINK_NOTIFY;
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if (status & DO_SETLINK_NOTIFY)
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
                   ` (2 preceding siblings ...)
  2016-04-02  1:21 ` [RFC PATCH 3/5] rtnl: add option for setting link bpf prog Brenden Blanco
@ 2016-04-02  1:21 ` Brenden Blanco
  2016-04-02  2:08   ` Eric Dumazet
                     ` (4 more replies)
  2016-04-02  1:21 ` [RFC PATCH 5/5] Add sample for adding simple drop program to link Brenden Blanco
                   ` (2 subsequent siblings)
  6 siblings, 5 replies; 62+ messages in thread
From: Brenden Blanco @ 2016-04-02  1:21 UTC (permalink / raw)
  To: davem
  Cc: Brenden Blanco, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer

Add support for the BPF_PROG_TYPE_PHYS_DEV hook in the mlx4 driver. Since
bpf programs require an skb context to navigate the packet, build a
percpu fake skb with the minimal fields. This avoids the costly
allocation for packets that end up being dropped.

Since mlx4 is so far the only user of this pseudo skb, the build
function is defined locally.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++++++++
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
 3 files changed, 81 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index b4b258c..89ca787 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
@@ -1966,6 +1967,9 @@ void mlx4_en_free_resources(struct mlx4_en_priv *priv)
 			mlx4_en_destroy_cq(priv, &priv->rx_cq[i]);
 	}
 
+	if (priv->prog)
+		bpf_prog_put(priv->prog);
+
 }
 
 int mlx4_en_alloc_resources(struct mlx4_en_priv *priv)
@@ -2456,6 +2460,61 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
+
+static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
+				     unsigned int length)
+{
+	/* data_len is intentionally not set here so that skb_is_nonlinear()
+	 * returns false
+	 */
+
+	skb->len = length;
+	skb->head = data;
+	skb->data = data;
+}
+
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
+{
+	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
+	int ret;
+
+	build_pseudo_skb_for_bpf(skb, data, length);
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)skb);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static int mlx4_bpf_set(struct net_device *dev, int fd)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *prog = NULL, *old_prog;
+
+	if (fd >= 0) {
+		prog = bpf_prog_get(fd);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+
+		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+	}
+
+	old_prog = xchg(&priv->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	priv->dev->bpf_valid = !!prog;
+
+	return 0;
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 #endif
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_bpf_set		= mlx4_bpf_set,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 86bcfe5..03fe005 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;
 
+	prog = READ_ONCE(priv->prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag. A non-zero
+		 * return indicates packet should be dropped.
+		 */
+		if (prog) {
+			struct ethhdr *ethh;
+
+			ethh = (struct ethhdr *)(page_address(frags[0].page) +
+						 frags[0].page_offset);
+			if (mlx4_call_bpf(prog, ethh, length)) {
+				priv->stats.rx_dropped++;
+				goto next;
+			}
+		}
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d12ab6a..3d0fc89 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -568,6 +568,7 @@ struct mlx4_en_priv {
 	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
 	struct hwtstamp_config hwtstamp_config;
 	u32 counter_index;
+	struct bpf_prog *prog;
 
 #ifdef CONFIG_MLX4_EN_DCB
 	struct ieee_ets ets;
@@ -682,6 +683,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv);
 void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv);
 int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring);
 void mlx4_en_rx_irq(struct mlx4_cq *mcq);
+int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length);
 
 int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
 int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [RFC PATCH 5/5] Add sample for adding simple drop program to link
  2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
                   ` (3 preceding siblings ...)
  2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
@ 2016-04-02  1:21 ` Brenden Blanco
  2016-04-06 19:48   ` Jesper Dangaard Brouer
  2016-04-02 16:47 ` [RFC PATCH 0/5] Add driver bpf hook for early packet drop Tom Herbert
  2016-04-02 18:41 ` Johannes Berg
  6 siblings, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-02  1:21 UTC (permalink / raw)
  To: davem
  Cc: Brenden Blanco, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer

Add a sample program that simply drops packets at the
BPF_PROG_TYPE_PHYS_DEV hook of a link. With this drop-only program, the
observed single-core rate is ~14.6 Mpps.

Other variants were also tested; for instance, without the dropcnt
increment or without reading from the packet header, the packet rate was
mostly unchanged.

$ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
proto 17:   14597724 drops/s

./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
Running... ctrl^C to stop
Device: eth4@0
Result: OK: 6486875(c6485849+d1026) usec, 23689465 (60byte,0frags)
  3651906pps 1752Mb/sec (1752914880bps) errors: 0
Device: eth4@1
Result: OK: 6486874(c6485656+d1217) usec, 23689489 (60byte,0frags)
  3651911pps 1752Mb/sec (1752917280bps) errors: 0
Device: eth4@2
Result: OK: 6486851(c6485730+d1120) usec, 23687853 (60byte,0frags)
  3651672pps 1752Mb/sec (1752802560bps) errors: 0
Device: eth4@3
Result: OK: 6486879(c6485807+d1071) usec, 23688954 (60byte,0frags)
  3651825pps 1752Mb/sec (1752876000bps) errors: 0

perf report --no-children:
  18.36%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_process_rx_cq
  15.98%  swapper        [kernel.vmlinux]  [k] poll_idle
  12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
   6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
   4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
   4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
   3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
   2.39%  ksoftirqd/1    [mdio]            [k] 0x00000000000074cd
   2.23%  swapper        [mlx4_en]         [k] mlx4_en_alloc_frags
   2.20%  ksoftirqd/1    [kernel.vmlinux]  [k] free_pages_prepare
   2.08%  ksoftirqd/1    [mlx4_en]         [k] mlx4_call_bpf
   1.57%  ksoftirqd/1    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
   1.35%  ksoftirqd/1    [mdio]            [k] 0x00000000000074fa
   1.09%  ksoftirqd/1    [kernel.vmlinux]  [k] free_one_page
   1.02%  ksoftirqd/1    [kernel.vmlinux]  [k] bpf_map_lookup_elem
   0.90%  ksoftirqd/1    [kernel.vmlinux]  [k] __alloc_pages_nodemask
   0.88%  swapper        [kernel.vmlinux]  [k] intel_idle
   0.82%  ksoftirqd/1    [mdio]            [k] 0x00000000000074be
   0.80%  swapper        [mlx4_en]         [k] mlx4_en_free_frag

machine specs:
 receiver - Intel E5-1630 v3 @ 3.70GHz
 sender - Intel E5645 @ 2.40GHz
 Mellanox ConnectX-3 @40G

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile        |   4 ++
 samples/bpf/bpf_load.c      |   8 +++
 samples/bpf/netdrvx1_kern.c |  26 ++++++++
 samples/bpf/netdrvx1_user.c | 155 ++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 193 insertions(+)
 create mode 100644 samples/bpf/netdrvx1_kern.c
 create mode 100644 samples/bpf/netdrvx1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 502c9fc..ad36bb8 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -19,6 +19,7 @@ hostprogs-y += lathist
 hostprogs-y += offwaketime
 hostprogs-y += spintest
 hostprogs-y += map_perf_test
+hostprogs-y += netdrvx1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -38,6 +39,7 @@ lathist-objs := bpf_load.o libbpf.o lathist_user.o
 offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
 spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
+netdrvx1-objs := bpf_load.o libbpf.o netdrvx1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -56,6 +58,7 @@ always += lathist_kern.o
 always += offwaketime_kern.o
 always += spintest_kern.o
 always += map_perf_test_kern.o
+always += netdrvx1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -75,6 +78,7 @@ HOSTLOADLIBES_lathist += -lelf
 HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
+HOSTLOADLIBES_netdrvx1 += -lelf
 
 # point this to your LLVM backend with bpf support
 LLC=$(srctree)/tools/bpf/llvm/bld/Debug+Asserts/bin/llc
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 58f86bd..9308fbc 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -49,6 +49,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_socket = strncmp(event, "socket", 6) == 0;
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+	bool is_phys_dev = strncmp(event, "phys_dev", 8) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -63,6 +64,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
 	} else if (is_kprobe || is_kretprobe) {
 		prog_type = BPF_PROG_TYPE_KPROBE;
+	} else if (is_phys_dev) {
+		prog_type = BPF_PROG_TYPE_PHYS_DEV;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -76,6 +79,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	prog_fd[prog_cnt++] = fd;
 
+	if (is_phys_dev)
+		return 0;
+
 	if (is_socket) {
 		event += 6;
 		if (*event != '/')
@@ -304,6 +310,7 @@ int load_bpf_file(char *path)
 
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
+			    memcmp(shname_prog, "phys_dev", 8) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -320,6 +327,7 @@ int load_bpf_file(char *path)
 
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
+		    memcmp(shname, "phys_dev", 8) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
diff --git a/samples/bpf/netdrvx1_kern.c b/samples/bpf/netdrvx1_kern.c
new file mode 100644
index 0000000..9837d8a
--- /dev/null
+++ b/samples/bpf/netdrvx1_kern.c
@@ -0,0 +1,26 @@
+#include <uapi/linux/bpf.h>
+#include <uapi/linux/if_ether.h>
+#include <uapi/linux/if_packet.h>
+#include <uapi/linux/ip.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+SEC("phys_dev1")
+int bpf_prog1(struct xdp_metadata *ctx)
+{
+	int index = load_byte(ctx, ETH_HLEN + offsetof(struct iphdr, protocol));
+	long *value;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/netdrvx1_user.c b/samples/bpf/netdrvx1_user.c
new file mode 100644
index 0000000..9e6ec9a
--- /dev/null
+++ b/samples/bpf/netdrvx1_user.c
@@ -0,0 +1,155 @@
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <assert.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+
+static int set_link_bpf_fd(int ifindex, int fd)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct rtattr *rta;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+	rta = (struct rtattr *)(((char *) &req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	rta->rta_type = 42/*IFLA_BPF_FD*/;
+	rta->rta_len = RTA_LENGTH(sizeof(unsigned int));
+	req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len)
+		+ RTA_LENGTH(sizeof(fd));
+	memcpy(RTA_DATA(rta), &fd, sizeof(fd));
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
+
+/* simple per-protocol drop counter
+ */
+static void poll_stats(int secs)
+{
+	unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+	__u64 values[nr_cpus];
+	__u32 key;
+	int i;
+
+	sleep(secs);
+
+	for (key = 0; key < 256; key++) {
+		__u64 sum = 0;
+
+		assert(bpf_lookup_elem(map_fd[0], &key, values) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			sum += values[i];
+		if (sum)
+			printf("proto %u: %10llu drops/s\n", key, sum/secs);
+	}
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+	int ifindex;
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (ac != 2) {
+		printf("usage: %s IFINDEX\n", argv[0]);
+		return 1;
+	}
+
+	ifindex = strtoul(argv[1], NULL, 0);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	if (set_link_bpf_fd(ifindex, prog_fd[0]) < 0) {
+		printf("link set bpf fd failed\n");
+		return 1;
+	}
+
+	poll_stats(5);
+
+	set_link_bpf_fd(ifindex, -1);
+
+	return 0;
+}
-- 
2.8.0

^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
@ 2016-04-02  2:08   ` Eric Dumazet
  2016-04-02  2:47     ` Alexei Starovoitov
  2016-04-03  6:15     ` Brenden Blanco
  2016-04-02  8:23   ` Jesper Dangaard Brouer
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-04-02  2:08 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer

On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
> 


> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {
> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +


1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
frame. 

Still, you pass a single fragment but the total 'length' here: the BPF
program can read past the end of this first fragment and panic the box.

Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
mean.

2) priv->stats.rx_dropped is shared by all the RX queues -> false
sharing.

   This is probably the right time to add a rx_dropped field in struct
mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
higher speed links.


* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  2:08   ` Eric Dumazet
@ 2016-04-02  2:47     ` Alexei Starovoitov
  2016-04-04 14:57       ` Jesper Dangaard Brouer
  2016-04-03  6:15     ` Brenden Blanco
  1 sibling, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-02  2:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	john.fastabend, brouer

On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> > bpf programs require a skb context to navigate the packet, build a
> > percpu fake skb with the minimal fields. This avoids the costly
> > allocation for packets that end up being dropped.
> > 
> 
> 
> > +		/* A bpf program gets first chance to drop the packet. It may
> > +		 * read bytes but not past the end of the frag. A non-zero
> > +		 * return indicates packet should be dropped.
> > +		 */
> > +		if (prog) {
> > +			struct ethhdr *ethh;
> > +
> > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +						 frags[0].page_offset);
> > +			if (mlx4_call_bpf(prog, ethh, length)) {
> > +				priv->stats.rx_dropped++;
> > +				goto next;
> > +			}
> > +		}
> > +
> 
> 
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame. 
> 
> Still, you pass a single fragment but the total 'length' here: the BPF
> program can read past the end of this first fragment and panic the box.
> 
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.

yep.
my reading of that part was that num_frags > 1 is only used for large
mtu sizes, so if we limit this to num_frags == 1 for now we should be
ok, and it's still applicable to most of the use cases?

> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.

yes, good point. I bet it was copy-pasted from a few lines below.
Should be trivial to convert it to percpu.

>    This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.

yes, could be per ring as well.
My guess is that we're hitting the 14.5 Mpps limit for both the empty
bpf program and the program that actually looks into the packet
because we're hitting the 10G phy limit of the 40G nic, since
physically a 40G nic consists of four 10G phys. There will be the
same problem with 100G and 50G nics; both will be hitting the 25G phy
limit. We need to vary the packets somehow. Hopefully Or can explain
that bit of hw design.
Jesper's experiments with mlx4 showed the same 14.5 Mpps limit when
the sender blasts the same packet over and over again.
Great to see the experiments converging.


* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
  2016-04-02  2:08   ` Eric Dumazet
@ 2016-04-02  8:23   ` Jesper Dangaard Brouer
  2016-04-03  6:11     ` Brenden Blanco
  2016-04-02 18:40   ` Johannes Berg
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-02  8:23 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, gerlitz, daniel,
	john.fastabend, brouer


First of all, I'm very happy to see people start working on this!
Thank you, Brenden!

On Fri,  1 Apr 2016 18:21:57 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
> 
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
[...]
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
[...]
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  	if (budget <= 0)
>  		return polled;
>  
> +	prog = READ_ONCE(priv->prog);
> +
>  	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>  	 * descriptor offset can be deduced from the CQE index instead of
>  	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +

I think you need to DMA sync RX-page before you can safely access
packet data in page (on all arch's).

> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

AFAIK length here covers all the frags[n].page, thus potentially
causing the BPF program to access memory out of bounds (crash).

Having several page fragments is AFAIK an optimization for jumbo-frames
on PowerPC (which is a bit annoying for your use-case ;-)).


> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-02  1:21 ` [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter Brenden Blanco
@ 2016-04-02 16:39   ` Tom Herbert
  2016-04-03  7:02     ` Brenden Blanco
  2016-04-04  8:49   ` Daniel Borkmann
  1 sibling, 1 reply; 62+ messages in thread
From: Tom Herbert @ 2016-04-02 16:39 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer

On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a new
> context type, struct xdp_metadata, is exposed to userspace. So far only
> expose the readable packet length, and only in read mode.
>
This would eventually be a generic abstraction of receive descriptors?

> The PHYS_DEV name is chosen to represent that the program is meant only
> for physical adapters, rather than all netdevs.
>
Is there a hard restriction that this could only work with physical devices?

> While the user visible struct is new, the underlying context must be
> implemented as a minimal skb in order for the packet load_* instructions
> to work. The skb filled in by the driver must have skb->len, skb->head,
> and skb->data set, and skb->data_len == 0.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/uapi/linux/bpf.h |  5 ++++
>  kernel/bpf/verifier.c    |  1 +
>  net/core/filter.c        | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 74 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 924f537..b8a4ef2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -92,6 +92,7 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_KPROBE,
>         BPF_PROG_TYPE_SCHED_CLS,
>         BPF_PROG_TYPE_SCHED_ACT,
> +       BPF_PROG_TYPE_PHYS_DEV,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -367,6 +368,10 @@ struct __sk_buff {
>         __u32 tc_classid;
>  };
>
> +struct xdp_metadata {
> +       __u32 len;
> +};
> +
>  struct bpf_tunnel_key {
>         __u32 tunnel_id;
>         union {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2e08f8e..804ca70 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>         case BPF_PROG_TYPE_SOCKET_FILTER:
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
> +       case BPF_PROG_TYPE_PHYS_DEV:
>                 return true;
>         default:
>                 return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b7177d0..c417db6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +phys_dev_func_proto(enum bpf_func_id func_id)
> +{
> +       return sk_filter_func_proto(func_id);
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         /* check bounds */
> @@ -2073,6 +2079,36 @@ static bool tc_cls_act_is_valid_access(int off, int size,
>         return __is_valid_access(off, size, type);
>  }
>
> +static bool __is_valid_xdp_access(int off, int size,
> +                                 enum bpf_access_type type)
> +{
> +       if (off < 0 || off >= sizeof(struct xdp_metadata))
> +               return false;
> +
> +       if (off % size != 0)
> +               return false;
> +
> +       if (size != 4)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool phys_dev_is_valid_access(int off, int size,
> +                                    enum bpf_access_type type)
> +{
> +       if (type == BPF_WRITE)
> +               return false;
> +
> +       switch (off) {
> +       case offsetof(struct xdp_metadata, len):
> +               break;
> +       default:
> +               return false;
> +       }
> +       return __is_valid_xdp_access(off, size, type);
> +}
> +
>  static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>                                       int src_reg, int ctx_off,
>                                       struct bpf_insn *insn_buf,
> @@ -2210,6 +2246,26 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
>         return insn - insn_buf;
>  }
>
> +static u32 bpf_phys_dev_convert_ctx_access(enum bpf_access_type type,
> +                                          int dst_reg, int src_reg,
> +                                          int ctx_off,
> +                                          struct bpf_insn *insn_buf,
> +                                          struct bpf_prog *prog)
> +{
> +       struct bpf_insn *insn = insn_buf;
> +
> +       switch (ctx_off) {
> +       case offsetof(struct xdp_metadata, len):
> +               BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, len) != 4);
> +
> +               *insn++ = BPF_LDX_MEM(BPF_W, dst_reg, src_reg,
> +                                     offsetof(struct sk_buff, len));
> +               break;
> +       }
> +
> +       return insn - insn_buf;
> +}
> +
>  static const struct bpf_verifier_ops sk_filter_ops = {
>         .get_func_proto = sk_filter_func_proto,
>         .is_valid_access = sk_filter_is_valid_access,
> @@ -2222,6 +2278,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
>         .convert_ctx_access = bpf_net_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops phys_dev_ops = {
> +       .get_func_proto = phys_dev_func_proto,
> +       .is_valid_access = phys_dev_is_valid_access,
> +       .convert_ctx_access = bpf_phys_dev_convert_ctx_access,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops = &sk_filter_ops,
>         .type = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2237,11 +2299,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
>         .type = BPF_PROG_TYPE_SCHED_ACT,
>  };
>
> +static struct bpf_prog_type_list phys_dev_type __read_mostly = {
> +       .ops = &phys_dev_ops,
> +       .type = BPF_PROG_TYPE_PHYS_DEV,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
> +       bpf_register_prog_type(&phys_dev_type);
>
>         return 0;
>  }
> --
> 2.8.0
>


* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
                   ` (4 preceding siblings ...)
  2016-04-02  1:21 ` [RFC PATCH 5/5] Add sample for adding simple drop program to link Brenden Blanco
@ 2016-04-02 16:47 ` Tom Herbert
  2016-04-03  5:41   ` Brenden Blanco
  2016-04-02 18:41 ` Johannes Berg
  6 siblings, 1 reply; 62+ messages in thread
From: Tom Herbert @ 2016-04-02 16:47 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer

On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before even
> an skb has been allocated.
>
> With this, hope to enable line rate filtering, with this initial
> implementation providing drop/allow action only.
>
> Patch 1 introduces the new prog type and helpers for validating the bpf
> program. A new userspace struct is defined containing only len as a field,
> with others to follow in the future.
> In patch 2, create a new ndo to pass the fd to support drivers.
> In patch 3, expose a new rtnl option to userspace.
> In patch 4, enable support in mlx4 driver. No skb allocation is required,
> instead a static percpu skb is kept in the driver and minimally initialized
> for each driver frag.
> In patch 5, create a sample drop and count program. With single core,
> achieved ~14.5 Mpps drop rate on a 40G mlx4. This includes packet data
> access, bpf array lookup, and increment.
>
Very nice! Do you think this hook will be sufficient to implement a
fast forward path also?

Tom

> Interestingly, accessing packet data from the program did not have a
> noticeable impact on performance. Even so, future enhancements to
> prefetching / batching / page-allocs should hopefully improve the
> performance in this path.
>
> [1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf
>
> Brenden Blanco (5):
>   bpf: add PHYS_DEV prog type for early driver filter
>   net: add ndo to set bpf prog in adapter rx
>   rtnl: add option for setting link bpf prog
>   mlx4: add support for fast rx drop bpf program
>   Add sample for adding simple drop program to link
>
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  61 ++++++++++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     |  18 +++
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   2 +
>  include/linux/netdevice.h                      |   8 ++
>  include/uapi/linux/bpf.h                       |   5 +
>  include/uapi/linux/if_link.h                   |   1 +
>  kernel/bpf/verifier.c                          |   1 +
>  net/core/dev.c                                 |  12 ++
>  net/core/filter.c                              |  68 +++++++++++
>  net/core/rtnetlink.c                           |  10 ++
>  samples/bpf/Makefile                           |   4 +
>  samples/bpf/bpf_load.c                         |   8 ++
>  samples/bpf/netdrvx1_kern.c                    |  26 +++++
>  samples/bpf/netdrvx1_user.c                    | 155 +++++++++++++++++++++++++
>  14 files changed, 379 insertions(+)
>  create mode 100644 samples/bpf/netdrvx1_kern.c
>  create mode 100644 samples/bpf/netdrvx1_user.c
>
> --
> 2.8.0
>


* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
  2016-04-02  2:08   ` Eric Dumazet
  2016-04-02  8:23   ` Jesper Dangaard Brouer
@ 2016-04-02 18:40   ` Johannes Berg
  2016-04-03  6:38     ` Brenden Blanco
  2016-04-04  8:33   ` Jesper Dangaard Brouer
  2016-04-04  9:22   ` Daniel Borkmann
  4 siblings, 1 reply; 62+ messages in thread
From: Johannes Berg @ 2016-04-02 18:40 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, gerlitz, daniel, john.fastabend, brouer

On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:

> +static int mlx4_bpf_set(struct net_device *dev, int fd)
> +{
[...]
> +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> +			bpf_prog_put(prog);
> +			return -EINVAL;
> +		}
> +	}

Why wouldn't this check be done in the generic code that calls
ndo_bpf_set()?

johannes


* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
                   ` (5 preceding siblings ...)
  2016-04-02 16:47 ` [RFC PATCH 0/5] Add driver bpf hook for early packet drop Tom Herbert
@ 2016-04-02 18:41 ` Johannes Berg
  2016-04-02 22:57   ` Tom Herbert
  6 siblings, 1 reply; 62+ messages in thread
From: Johannes Berg @ 2016-04-02 18:41 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, gerlitz, daniel, john.fastabend,
	brouer, Lorenzo Colitti

On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before
> even
> an skb has been allocated.
> 
> With this, hope to enable line rate filtering, with this initial
> implementation providing drop/allow action only.

Since this is handed to the driver in some way, I assume the API would
also allow offloading the program to the NIC itself, and as such be
useful for what Android wants to do to save power in wireless?

johannes


* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-02 18:41 ` Johannes Berg
@ 2016-04-02 22:57   ` Tom Herbert
  2016-04-03  2:28     ` Lorenzo Colitti
  0 siblings, 1 reply; 62+ messages in thread
From: Tom Herbert @ 2016-04-02 22:57 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer, Lorenzo Colitti

On Sat, Apr 2, 2016 at 2:41 PM, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
>> This patch set introduces new infrastructure for programmatically
>> processing packets in the earliest stages of rx, as part of an effort
>> others are calling Express Data Path (XDP) [1]. Start this effort by
>> introducing a new bpf program type for early packet filtering, before
>> even
>> an skb has been allocated.
>>
>> With this, hope to enable line rate filtering, with this initial
>> implementation providing drop/allow action only.
>
> Since this is handed to the driver in some way, I assume the API would
> also allow offloading the program to the NIC itself, and as such be
> useful for what Android wants to do to save power in wireless?
>
Conceptually, yes. There is some ongoing work to offload BPF and one
goal is that BPF programs (like for XDP) could be portable between
userspace, kernel (maybe even other OSes), and devices.

I am curious though, how do you think this would specifically help
Android with power? Seems like the receiver still needs to be powered
to receive packets to filter them anyway...

Thanks,
Tom

> johannes


* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-02 22:57   ` Tom Herbert
@ 2016-04-03  2:28     ` Lorenzo Colitti
  2016-04-04  7:37       ` Johannes Berg
  0 siblings, 1 reply; 62+ messages in thread
From: Lorenzo Colitti @ 2016-04-03  2:28 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Johannes Berg, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, gerlitz,
	Daniel Borkmann, john fastabend, Jesper Dangaard Brouer

On Sun, Apr 3, 2016 at 7:57 AM, Tom Herbert <tom@herbertland.com> wrote:
> I am curious though, how do you think this would specifically help
> Android with power? Seems like the receiver still needs to be powered
> to receive packets to filter them anyway...

The receiver is powered up, but its wake/sleep cycles are much shorter
than the main CPU's. On a phone, leaving the CPU asleep with wifi on
might consume ~5mA average, but getting the CPU out of suspend might
average ~200mA for ~300ms as the system comes out of sleep,
initializes other hardware, wakes up userspace processes whose
timeouts have fired, freezes, and suspends again. Receiving one such
superfluous packet every 3 seconds (e.g., on networks that send
identical IPv6 RAs once every 3 seconds) works out to ~25mA, which is
5x the cost of idle. Pushing down filters to the hardware so it can
drop the packet without waking up the CPU thus saves a lot of idle
power.

That said, getting BPF to the driver is part of the picture. On the
chipsets we're targeting for APF, we're only seeing 2k-4k of memory
(that's 256-512 BPF instructions) available for filtering code, which
means that BPF might be too large.


* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-02 16:47 ` [RFC PATCH 0/5] Add driver bpf hook for early packet drop Tom Herbert
@ 2016-04-03  5:41   ` Brenden Blanco
  2016-04-04  7:48     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-03  5:41 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer

On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> Very nice! Do you think this hook will be sufficient to implement a
> fast forward path also?
That is the goal, but more work needs to be done of course. It won't be
possible with just a single pseudo skb; the driver will need a fast way to get
batches of pseudo skbs (per core?) through from rx to tx. In mlx4 for
instance, either the skb needs to be much more complete to be handled from the
start of mlx4_en_xmit(), or that function would need to be split so that the
fast tx could start midway through.

Or, skb allocation just gets much faster. Then it should be pretty
straightforward.
> 
> Tom


* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  8:23   ` Jesper Dangaard Brouer
@ 2016-04-03  6:11     ` Brenden Blanco
  2016-04-04 18:27       ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-03  6:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel, john.fastabend

On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
[...]
> 
> I think you need to DMA sync RX-page before you can safely access
> packet data in page (on all arch's).
> 
Thanks, I will give that a try in the next spin.
> > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > +						 frags[0].page_offset);
> > +			if (mlx4_call_bpf(prog, ethh, length)) {
> 
> AFAIK length here covers all the frags[n].page, thus potentially
> causing the BPF program to access memory out of bounds (crash).
> 
> Having several page fragments is AFAIK an optimization for jumbo-frames
> on PowerPC (which is a bit annoying for your use-case ;-)).
> 
Yeah, this needs some more work. I can think of some options:
1. limit pseudo skb.len to first frag's length only, and signal to
program that the packet is incomplete
2. for nfrags>1 skip bpf processing, but this could be functionally
incorrect for some use cases
3. run the program for each frag
4. reject ndo_bpf_set when frags are possible (large mtu?)

My preference is to go with 1, thoughts?
> 
[...]


* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  2:08   ` Eric Dumazet
  2016-04-02  2:47     ` Alexei Starovoitov
@ 2016-04-03  6:15     ` Brenden Blanco
  2016-04-05  2:20       ` Brenden Blanco
  1 sibling, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-03  6:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer

On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
[...]
> 
> 
> 1) mlx4 can use multiple fragments (priv->num_frags) to hold an Ethernet
> frame. 
> 
> Still, you pass a single fragment but the total 'length' here: the BPF
> program can read past the end of this first fragment and panic the box.
> 
> Please take a look at mlx4_en_complete_rx_desc() and you'll see what I
> mean.
Sure, I will do some reading. Jesper also raised the issue after you
did. Please let me know what you think of the proposals.
> 
> 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> sharing.
> 
>    This is probably the right time to add a rx_dropped field in struct
> mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> higher speed links.
> 
This sounds reasonable! Will look into it for the next spin.


* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02 18:40   ` Johannes Berg
@ 2016-04-03  6:38     ` Brenden Blanco
  2016-04-04  7:35       ` Johannes Berg
  0 siblings, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-03  6:38 UTC (permalink / raw)
  To: Johannes Berg
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer

On Sat, Apr 02, 2016 at 08:40:55PM +0200, Johannes Berg wrote:
> On Fri, 2016-04-01 at 18:21 -0700, Brenden Blanco wrote:
> 
> > +static int mlx4_bpf_set(struct net_device *dev, int fd)
> > +{
> [...]
> > +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> > +			bpf_prog_put(prog);
> > +			return -EINVAL;
> > +		}
> > +	}
> 
> Why wouldn't this check be done in the generic code that calls
> ndo_bpf_set()?
Having a common check makes sense. The tricky thing is that the type can
only be checked after taking the reference, and I wanted to keep the
scope of the prog brief in the case of errors. I would have to move the
bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
ndo instead. Would that API look fine to you?

A possible extension of this is just to keep the bpf_prog * in the
netdev itself and expose a feature flag from the driver rather than an
ndo. But that would mean another 8 bytes in the netdev.
> 
> johannes


* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-02 16:39   ` Tom Herbert
@ 2016-04-03  7:02     ` Brenden Blanco
  2016-04-04 22:07       ` Thomas Graf
  0 siblings, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-03  7:02 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, ogerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer

On Sat, Apr 02, 2016 at 12:39:45PM -0400, Tom Herbert wrote:
> On Fri, Apr 1, 2016 at 9:21 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> > Add a new bpf prog type that is intended to run in early stages of the
> > packet rx path. Only minimal packet metadata will be available, hence a new
> > context type, struct xdp_metadata, is exposed to userspace. So far only
> > expose the readable packet length, and only in read mode.
> >
> This would eventually be a generic abstraction of receive descriptors?
Exactly. For instance, maybe let the hw's hash be available.
> 
> > The PHYS_DEV name is chosen to represent that the program is meant only
> > for physical adapters, rather than all netdevs.
> >
> Is there a hard restriction that this could only work with physical devices?
I suppose not, but there wouldn't be much of a use case compared to tc
ingress, no? Since I was imagining that this new hook would be more
restricted in functionality due to operating on descriptors rather than
with a full skb, I tried to think of an appropriate name.

If you think that this hook would spread, then a better name is needed.
> 
[...]

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-03  6:38     ` Brenden Blanco
@ 2016-04-04  7:35       ` Johannes Berg
  2016-04-04  9:57         ` Daniel Borkmann
  0 siblings, 1 reply; 62+ messages in thread
From: Johannes Berg @ 2016-04-04  7:35 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer

On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> 
> Having a common check makes sense. The tricky thing is that the type can
> only be checked after taking the reference, and I wanted to keep the
> scope of the prog brief in the case of errors. I would have to move the
> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> ndo instead. Would that API look fine to you?

I can't really comment, I wasn't planning on using the API right now :)

However, what else is there that the driver could possibly do with the
FD, other than getting the bpf_prog?

> A possible extension of this is just to keep the bpf_prog * in the
> netdev itself and expose a feature flag from the driver rather than
> an ndo. But that would mean another 8 bytes in the netdev.

That also misses the signal to the driver when the program is
set/removed, so I don't think that works. I'd argue it's not really
desirable anyway though since I wouldn't expect a majority of drivers
to start supporting this.

johannes

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-03  2:28     ` Lorenzo Colitti
@ 2016-04-04  7:37       ` Johannes Berg
  0 siblings, 0 replies; 62+ messages in thread
From: Johannes Berg @ 2016-04-04  7:37 UTC (permalink / raw)
  To: Lorenzo Colitti, Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer

On Sun, 2016-04-03 at 11:28 +0900, Lorenzo Colitti wrote:

> That said, getting BPF to the driver is part of the picture. On the
> chipsets we're targeting for APF, we're only seeing 2k-4k of memory
> (that's 256-512 BPF instructions) available for filtering code, which
> means that BPF programs might be too large.

That's true, but I think that as far as the userspace API is concerned
that shouldn't really be an issue. I think we can compile the BPF into
APF, similar to how BPF can be compiled into machine code today.
Additionally, I'm not sure we can realistically expect all devices to
really implement APF "natively"; I think there's a good chance, but
there's also the possibility of compiling to the native firmware
environment, for example.

johannes

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-03  5:41   ` Brenden Blanco
@ 2016-04-04  7:48     ` Jesper Dangaard Brouer
  2016-04-04 18:10       ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-04  7:48 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, gerlitz, Daniel Borkmann, john fastabend,
	brouer, Alexander Duyck

On Sat, 2 Apr 2016 22:41:04 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
>
> > Very nice! Do you think this hook will be sufficient to implement a
> > fast forward patch also?

(DMA experts please verify and correct me!)

One of the gotchas is how DMA sync/unmap works.  For forwarding you
need to modify the headers.  The DMA sync API (DMA_FROM_DEVICE) specifies
that the data is to be _considered_ read-only.  AFAIK you can write into
the data, BUT on DMA_unmap the API/DMA-engine is allowed to overwrite the
data... note that on most archs the DMA_unmap does not overwrite.

This DMA issue should not block the work on a hook for early packet drop.
Maybe we should add a flag option that can tell the hook whether the
packet is read-only (e.g. if the driver uses page-fragments and DMA_sync).


We should have another track/thread on how to solve the DMA issue:
I see two solutions.

Solution 1: Simply use a "full" page per packet and do the DMA_unmap.
This results in a slowdown on archs with expensive DMA-map/unmap.  And
we stress the page allocator more (can be solved with a page-pool-cache).
Eric will not like this due to memory usage, but we can just add a
"copy-break" step for normal stack hand-off.

Solution 2: (Due credit to Alex Duyck, this idea came up while
discussing issue with him).  Remember DMA_sync'ed data is only
considered read-only, because the DMA_unmap can be destructive.  In many
cases DMA_unmap is not.  Thus, we could take advantage of this, and
allow modifying DMA sync'ed data on those DMA setups.
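A minimal sketch of the read-only flag idea, with hypothetical names (none of this exists in the patch set): the driver fills in per-packet context telling the hook whether writes are safe, which would be false for the page-fragment + DMA_sync case unless the arch's DMA_unmap is known to be non-destructive:

```c
#include <assert.h>

/* Hypothetical context the driver would hand to the early-rx hook.
 * 'writable' is 1 only when the driver knows DMA_unmap will not
 * clobber the data (Solution 2), or when a full page was unmapped
 * per packet (Solution 1). */
struct early_rx_ctx {
	void		*data;		/* start of the DMA-synced frag */
	unsigned int	len;		/* bytes the program may read */
	unsigned int	writable;	/* 0: treat packet data as read-only */
};
```

A forwarding-capable program would then refuse to rewrite headers when the driver reports the data as read-only.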


> That is the goal, but more work needs to be done of course. It won't be
> possible with just a single pseudo skb; the driver will need a fast
> way to get batches of pseudo skbs (per core?) through from rx to tx.
> In mlx4 for instance, either the skb needs to be much more complete
> to be handled from the start of mlx4_en_xmit(), or that function
> would need to be split so that the fast tx could start midway through.
> 
> Or, skb allocation just gets much faster. Then it should be pretty
> straightforward.

With the bulking SLUB API, we can reduce the bare kmem_cache_alloc+free
cost per SKB from 90 cycles to 27 cycles.  It is good, but for really
fast forwarding it would be good to avoid allocating any extra data
structures.  We just want to move a RX packet-page to a TX ring queue.

Maybe the 27-cycle kmem_cache/slab cost is considered "fast enough"
for what we gain in ease of implementation.  The really expensive part of
SKB processing is the memset/clearing of the SKB, which the fast-forward
use-case could avoid.  Splitting the SKB alloc and clearing parts would
be a needed first step.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
                     ` (2 preceding siblings ...)
  2016-04-02 18:40   ` Johannes Berg
@ 2016-04-04  8:33   ` Jesper Dangaard Brouer
  2016-04-04  9:22   ` Daniel Borkmann
  4 siblings, 0 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-04  8:33 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, Or Gerlitz, daniel,
	john.fastabend, brouer

On Fri,  1 Apr 2016 18:21:57 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>  			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>  
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {
> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +

For future API, I can imagine more return codes being needed.

For forwarding I could imagine returning "STOLEN", which should not
increment rx_dropped.

One could also imagine supporting tcpdump/af_packet-like facilities at
this packet-page level (e.g. af_packet queues packets into an RX ring
buffer, to be processed/read async later). It could return "SHARED",
bumping the refcnt on the page and indicating that the page is now
read-only, thus affecting the driver's processing flow.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-02  1:21 ` [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter Brenden Blanco
  2016-04-02 16:39   ` Tom Herbert
@ 2016-04-04  8:49   ` Daniel Borkmann
  2016-04-04 13:07     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2016-04-04  8:49 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, gerlitz, john.fastabend, brouer

On 04/02/2016 03:21 AM, Brenden Blanco wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a new
> context type, struct xdp_metadata, is exposed to userspace. So far only
> expose the readable packet length, and only in read mode.
>
> The PHYS_DEV name is chosen to represent that the program is meant only
> for physical adapters, rather than all netdevs.
>
> While the user visible struct is new, the underlying context must be
> implemented as a minimal skb in order for the packet load_* instructions
> to work. The skb filled in by the driver must have skb->len, skb->head,
> and skb->data set, and skb->data_len == 0.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>   include/uapi/linux/bpf.h |  5 ++++
>   kernel/bpf/verifier.c    |  1 +
>   net/core/filter.c        | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 74 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 924f537..b8a4ef2 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -92,6 +92,7 @@ enum bpf_prog_type {
>   	BPF_PROG_TYPE_KPROBE,
>   	BPF_PROG_TYPE_SCHED_CLS,
>   	BPF_PROG_TYPE_SCHED_ACT,
> +	BPF_PROG_TYPE_PHYS_DEV,
>   };
>
>   #define BPF_PSEUDO_MAP_FD	1
> @@ -367,6 +368,10 @@ struct __sk_buff {
>   	__u32 tc_classid;
>   };
>
> +struct xdp_metadata {
> +	__u32 len;
> +};

Should this consistently be called 'xdp' or rather 'phys dev',
because currently it's a mixture of both everywhere?

>   struct bpf_tunnel_key {
>   	__u32 tunnel_id;
>   	union {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2e08f8e..804ca70 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1340,6 +1340,7 @@ static bool may_access_skb(enum bpf_prog_type type)
>   	case BPF_PROG_TYPE_SOCKET_FILTER:
>   	case BPF_PROG_TYPE_SCHED_CLS:
>   	case BPF_PROG_TYPE_SCHED_ACT:
> +	case BPF_PROG_TYPE_PHYS_DEV:
>   		return true;
>   	default:
>   		return false;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index b7177d0..c417db6 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2018,6 +2018,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
>   	}
>   }
>
> +static const struct bpf_func_proto *
> +phys_dev_func_proto(enum bpf_func_id func_id)
> +{
> +	return sk_filter_func_proto(func_id);

Do you plan to support bpf_skb_load_bytes() as well? I like using
this API especially when dealing with larger chunks (>4 bytes) to
load into stack memory, plus content is kept in network byte order.

What about other helpers such as bpf_skb_store_bytes() et al that
> work on skbs? Do you intend to reuse them as is and thus populate
> the per cpu skb with the needed fields (faking linear data), or do you
> see larger obstacles that prevent this?

Thanks,
Daniel

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
                     ` (3 preceding siblings ...)
  2016-04-04  8:33   ` Jesper Dangaard Brouer
@ 2016-04-04  9:22   ` Daniel Borkmann
  4 siblings, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2016-04-04  9:22 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, gerlitz, john.fastabend, brouer

On 04/02/2016 03:21 AM, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx4 driver.  Since
> bpf programs require a skb context to navigate the packet, build a
> percpu fake skb with the minimal fields. This avoids the costly
> allocation for packets that end up being dropped.
>
> Since mlx4 is so far the only user of this pseudo skb, the build
> function is defined locally.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 61 ++++++++++++++++++++++++++
>   drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 18 ++++++++
>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  2 +
>   3 files changed, 81 insertions(+)
>
[...]
>
> +static DEFINE_PER_CPU(struct sk_buff, percpu_pseudo_skb);
> +
> +static void build_pseudo_skb_for_bpf(struct sk_buff *skb, void *data,
> +				     unsigned int length)
> +{
> +	/* data_len is intentionally not set here so that skb_is_nonlinear()
> +	 * returns false
> +	 */
> +
> +	skb->len = length;
> +	skb->head = data;
> +	skb->data = data;
> +}
> +
> +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length)
> +{
> +	struct sk_buff *skb = this_cpu_ptr(&percpu_pseudo_skb);
> +	int ret;
> +
> +	build_pseudo_skb_for_bpf(skb, data, length);
> +
> +	rcu_read_lock();
> +	ret = BPF_PROG_RUN(prog, (void *)skb);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

Couldn't this diff rather live in filter.c? It doesn't seem mlx4-specific. When
placed there, the api would also make the requirements clear for every driver
wanting to implement xdp wrt the meta data that needs to be passed, and would
make the code easier to review (as drivers just call a few core helpers rather
than needing to re-implement the pseudo skb et al).

> +static int mlx4_bpf_set(struct net_device *dev, int fd)
> +{
> +	struct mlx4_en_priv *priv = netdev_priv(dev);
> +	struct bpf_prog *prog = NULL, *old_prog;
> +
> +	if (fd >= 0) {
> +		prog = bpf_prog_get(fd);
> +		if (IS_ERR(prog))
> +			return PTR_ERR(prog);
> +
> +		if (prog->type != BPF_PROG_TYPE_PHYS_DEV) {
> +			bpf_prog_put(prog);
> +			return -EINVAL;
> +		}

This block could just be a generic helper that mlx4_bpf_set() calls from here.

> +	}
> +
> +	old_prog = xchg(&priv->prog, prog);
> +	if (old_prog) {
> +		synchronize_net();
> +		bpf_prog_put(old_prog);
> +	}
> +
> +	priv->dev->bpf_valid = !!prog;

Could the 'bpf_valid' addition to the net_device be avoided altogether?

The API could probably just be named .ndo_bpf() and depending how you invoke
it, either set/deletes the program or tell (return code) whether a program is
currently attached.

> +	return 0;
> +}
> +
>   static const struct net_device_ops mlx4_netdev_ops = {
>   	.ndo_open		= mlx4_en_open,
>   	.ndo_stop		= mlx4_en_close,
> @@ -2486,6 +2545,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   #endif
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
> +	.ndo_bpf_set		= mlx4_bpf_set,
>   };
>
>   static const struct net_device_ops mlx4_netdev_ops_master = {
> @@ -2524,6 +2584,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
>   	.ndo_features_check	= mlx4_en_features_check,
>   #endif
>   	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
> +	.ndo_bpf_set		= mlx4_bpf_set,
>   };
>
>   struct mlx4_en_bond {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 86bcfe5..03fe005 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -748,6 +748,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
>   	struct mlx4_en_rx_alloc *frags;
>   	struct mlx4_en_rx_desc *rx_desc;
> +	struct bpf_prog *prog;
>   	struct sk_buff *skb;
>   	int index;
>   	int nr;
> @@ -764,6 +765,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	if (budget <= 0)
>   		return polled;
>
> +	prog = READ_ONCE(priv->prog);
> +
>   	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>   	 * descriptor offset can be deduced from the CQE index instead of
>   	 * reading 'cqe->index' */
> @@ -840,6 +843,21 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>   			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +		/* A bpf program gets first chance to drop the packet. It may
> +		 * read bytes but not past the end of the frag. A non-zero
> +		 * return indicates packet should be dropped.
> +		 */
> +		if (prog) {
> +			struct ethhdr *ethh;
> +
> +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> +						 frags[0].page_offset);
> +			if (mlx4_call_bpf(prog, ethh, length)) {

Since such a program will be ABI, the return code might get more additions in
the future (e.g. forwarding, etc.), so it needs to be thought through so that
we don't burn ourselves later.

Maybe reuse tc opcodes, or define own ones?

We currently would have:

  0    - Drop.
  1    - Pass to stack.
  rest - Reserved for future use.
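That verdict space could be sketched as an enum. The only values defined by this RFC are 0 and 1; the reserved names below are invented placeholders for the extensions discussed in this thread (for reference, the XDP API eventually merged settled on XDP_ABORTED/XDP_DROP/XDP_PASS/XDP_TX):

```c
#include <assert.h>

/* Sketch only: 0 and 1 match the RFC semantics; everything above is
 * reserved, and the names are hypothetical. */
enum phys_dev_verdict {
	PHYS_DEV_DROP = 0,	/* recycle the frag, rx_dropped++ */
	PHYS_DEV_PASS = 1,	/* build an skb, normal stack hand-off */
	/* >= 2 reserved, e.g. a STOLEN verdict for forwarding (no
	 * rx_dropped accounting) or SHARED for refcounted read-only
	 * taps (af_packet-style rings). */
};
```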

> +				priv->stats.rx_dropped++;
> +				goto next;
> +			}
> +		}
> +
>   		if (likely(dev->features & NETIF_F_RXCSUM)) {
>   			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>   						      MLX4_CQE_STATUS_UDP)) {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index d12ab6a..3d0fc89 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -568,6 +568,7 @@ struct mlx4_en_priv {
>   	struct hlist_head mac_hash[MLX4_EN_MAC_HASH_SIZE];
>   	struct hwtstamp_config hwtstamp_config;
>   	u32 counter_index;
> +	struct bpf_prog *prog;
>
>   #ifdef CONFIG_MLX4_EN_DCB
>   	struct ieee_ets ets;
> @@ -682,6 +683,7 @@ int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv);
>   void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv);
>   int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring);
>   void mlx4_en_rx_irq(struct mlx4_cq *mcq);
> +int mlx4_call_bpf(struct bpf_prog *prog, void *data, unsigned int length);
>
>   int mlx4_SET_MCAST_FLTR(struct mlx4_dev *dev, u8 port, u64 mac, u64 clear, u8 mode);
>   int mlx4_SET_VLAN_FLTR(struct mlx4_dev *dev, struct mlx4_en_priv *priv);
>

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04  7:35       ` Johannes Berg
@ 2016-04-04  9:57         ` Daniel Borkmann
  2016-04-04 18:46           ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2016-04-04  9:57 UTC (permalink / raw)
  To: Johannes Berg, Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, john.fastabend, brouer

On 04/04/2016 09:35 AM, Johannes Berg wrote:
> On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
>>
>> Having a common check makes sense. The tricky thing is that the type can
>> only be checked after taking the reference, and I wanted to keep the
>> scope of the prog brief in the case of errors. I would have to move the
>> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
>> ndo instead. Would that API look fine to you?
>
> I can't really comment, I wasn't planning on using the API right now :)
>
> However, what else is there that the driver could possibly do with the
> FD, other than getting the bpf_prog?
>
>> A possible extension of this is just to keep the bpf_prog * in the
>> netdev itself and expose a feature flag from the driver rather than
>> an ndo. But that would mean another 8 bytes in the netdev.
>
> That also misses the signal to the driver when the program is
> set/removed, so I don't think that works. I'd argue it's not really
> desirable anyway though since I wouldn't expect a majority of drivers
> to start supporting this.

I think an ndo is probably fine for this purpose, see also my other mail. I
think currently the only really driver-specific code would be storing
the prog pointer somewhere and passing the needed meta data to populate the
fake skb.

Maybe mid-term drivers might want to reuse this hook/signal for offloading
as well, not yet sure ... how would that relate to offloading of cls_bpf?
Should these be considered two different things (although from an offloading
perspective they are not really). _Conceptually_, XDP could also be seen
as a software offload for the facilities we support with cls_bpf et al.

Thanks,
Daniel

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04  8:49   ` Daniel Borkmann
@ 2016-04-04 13:07     ` Jesper Dangaard Brouer
  2016-04-04 13:36       ` Daniel Borkmann
  2016-04-04 14:33       ` Eric Dumazet
  0 siblings, 2 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-04 13:07 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Brenden Blanco, davem, netdev, tom, alexei.starovoitov, gerlitz,
	john.fastabend, brouer


On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 04/02/2016 03:21 AM, Brenden Blanco wrote:
> > Add a new bpf prog type that is intended to run in early stages of the
> > packet rx path. Only minimal packet metadata will be available, hence a new
> > context type, struct xdp_metadata, is exposed to userspace. So far only
> > expose the readable packet length, and only in read mode.
> >
> > The PHYS_DEV name is chosen to represent that the program is meant only
> > for physical adapters, rather than all netdevs.
> >
> > While the user visible struct is new, the underlying context must be
> > implemented as a minimal skb in order for the packet load_* instructions
> > to work. The skb filled in by the driver must have skb->len, skb->head,
> > and skb->data set, and skb->data_len == 0.
> >
[...]
> 
> Do you plan to support bpf_skb_load_bytes() as well? I like using
> this API especially when dealing with larger chunks (>4 bytes) to
> load into stack memory, plus content is kept in network byte order.
> 
> What about other helpers such as bpf_skb_store_bytes() et al that
> work on skbs? Do you intend to reuse them as is and thus populate
> the per cpu skb with the needed fields (faking linear data), or do you
> see larger obstacles that prevent this?

Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
to users of this API.

The whole idea is that an SKB is NOT allocated yet, and not needed at
this level.  If we start supporting calls to the underlying SKB functions,
then we will end up in the same place (performance-wise).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 13:07     ` Jesper Dangaard Brouer
@ 2016-04-04 13:36       ` Daniel Borkmann
  2016-04-04 14:09         ` Tom Herbert
  2016-04-04 14:33       ` Eric Dumazet
  1 sibling, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2016-04-04 13:36 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, tom, alexei.starovoitov, gerlitz,
	john.fastabend

On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:
> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:
>>> Add a new bpf prog type that is intended to run in early stages of the
>>> packet rx path. Only minimal packet metadata will be available, hence a new
>>> context type, struct xdp_metadata, is exposed to userspace. So far only
>>> expose the readable packet length, and only in read mode.
>>>
>>> The PHYS_DEV name is chosen to represent that the program is meant only
>>> for physical adapters, rather than all netdevs.
>>>
>>> While the user visible struct is new, the underlying context must be
>>> implemented as a minimal skb in order for the packet load_* instructions
>>> to work. The skb filled in by the driver must have skb->len, skb->head,
>>> and skb->data set, and skb->data_len == 0.
>>>
> [...]
>>
>> Do you plan to support bpf_skb_load_bytes() as well? I like using
>> this API especially when dealing with larger chunks (>4 bytes) to
>> load into stack memory, plus content is kept in network byte order.
>>
>> What about other helpers such as bpf_skb_store_bytes() et al that
>> work on skbs? Do you intend to reuse them as is and thus populate
>> the per cpu skb with the needed fields (faking linear data), or do you
>> see larger obstacles that prevent this?
>
> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> to users of this API.
>
> The whole idea is that an SKB is NOT allocated yet, and not needed at
> this level.  If we start supporting calls to the underlying SKB functions,
> then we will end up in the same place (performance-wise).

I'm talking about the current skb-related BPF helper functions we have,
so the question is how much of that code we can reuse under
these constraints (obviously things like the tunnel helpers are a different
story) and if that trade-off is acceptable for us. I'm also thinking
that, for example, if you need to parse the packet data anyway for a drop
verdict, you might as well pass some meta data (that is set in the real
skb later on) for those packets that go up the stack.

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 13:36       ` Daniel Borkmann
@ 2016-04-04 14:09         ` Tom Herbert
  2016-04-04 15:12           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 62+ messages in thread
From: Tom Herbert @ 2016-04-04 14:09 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, gerlitz,
	john fastabend

On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:
>>
>> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
>> wrote:
>>>
>>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:
>>>>
>>>> Add a new bpf prog type that is intended to run in early stages of the
>>>> packet rx path. Only minimal packet metadata will be available, hence a
>>>> new
>>>> context type, struct xdp_metadata, is exposed to userspace. So far only
>>>> expose the readable packet length, and only in read mode.
>>>>
>>>> The PHYS_DEV name is chosen to represent that the program is meant only
>>>> for physical adapters, rather than all netdevs.
>>>>
>>>> While the user visible struct is new, the underlying context must be
>>>> implemented as a minimal skb in order for the packet load_* instructions
>>>> to work. The skb filled in by the driver must have skb->len, skb->head,
>>>> and skb->data set, and skb->data_len == 0.
>>>>
>> [...]
>>>
>>>
>>> Do you plan to support bpf_skb_load_bytes() as well? I like using
>>> this API especially when dealing with larger chunks (>4 bytes) to
>>> load into stack memory, plus content is kept in network byte order.
>>>
>>> What about other helpers such as bpf_skb_store_bytes() et al that
>>> work on skbs? Do you intend to reuse them as is and thus populate
>>> the per cpu skb with the needed fields (faking linear data), or do you
>>> see larger obstacles that prevent this?
>>
>>
>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
>> to users of this API.
>>
>> The whole idea is that an SKB is NOT allocated yet, and not needed at
>> this level.  If we start supporting calls to the underlying SKB functions,
>> then we will end up in the same place (performance-wise).
>
>
> I'm talking about the current skb-related BPF helper functions we have,
> so the question is how much of that code we can reuse under
> these constraints (obviously things like the tunnel helpers are a different
> story) and if that trade-off is acceptable for us. I'm also thinking
> that, for example, if you need to parse the packet data anyway for a drop
> verdict, you might as well pass some meta data (that is set in the real
> skb later on) for those packets that go up the stack.

Right, the meta data in this case is an abstracted receive descriptor.
This would include items that we get in a device receive descriptor
(computed checksum, hash, VLAN tag). This is purposely a small
restricted data structure. I'm hoping we can minimize the size of this
to not much more than 32 bytes (including pointers to data and
linkage).
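One speculative layout for such an abstracted receive descriptor, sized per Tom's estimate; the field names are invented here and only mirror the items he lists (checksum, hash, VLAN tag, plus the data pointer):

```c
#include <assert.h>
#include <stdint.h>

/* Speculative ~24-byte metadata structure (on 64-bit), comfortably
 * under the "not much more than 32 bytes" target. */
struct rx_desc_md {
	void		*data;		/* start of packet data */
	uint32_t	len;		/* bytes available at data */
	uint32_t	hash;		/* hw rx hash, 0 if unavailable */
	uint32_t	csum;		/* hw-computed checksum */
	uint16_t	vlan_tci;	/* VLAN tag, if stripped by hw */
	uint16_t	rx_queue;	/* receive queue index */
};
```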

How this translates to an skb to maintain BPF compatibility is an
interesting question. One other consideration is that skbs are
kernel-specific; we should be able to use the same BPF filter program in
userspace over DPDK, for instance -- so an skb interface as the packet
abstraction might not be the right model...

Tom

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 13:07     ` Jesper Dangaard Brouer
  2016-04-04 13:36       ` Daniel Borkmann
@ 2016-04-04 14:33       ` Eric Dumazet
  2016-04-04 15:18         ` Edward Cree
  1 sibling, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2016-04-04 14:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Brenden Blanco, davem, netdev, tom,
	alexei.starovoitov, gerlitz, john.fastabend

On Mon, 2016-04-04 at 15:07 +0200, Jesper Dangaard Brouer wrote:

> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> to users of this API.
> 
> The whole idea is that an SKB is NOT allocated yet, and not needed at
> this level.  If we start supporting calls to the underlying SKB functions,
> then we will end up in the same place (performance-wise).

A BPF program can access many skb fields.

If you plan to support BPF, your fake skb needs to be populated like a
real one. Looks like some code will be replicated in all drivers that
want this facility...

Or accept (document ?) that some BPF instructions are just not there.
(hash, queue_mapping ...)

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-02  2:47     ` Alexei Starovoitov
@ 2016-04-04 14:57       ` Jesper Dangaard Brouer
  2016-04-04 15:22         ` Eric Dumazet
  0 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-04 14:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Brenden Blanco, davem, netdev, tom, Or Gerlitz,
	daniel, john.fastabend, brouer


On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> My guess is we're hitting the 14.5Mpps limit for both the empty bpf
> program and the program that actually looks into the packet, because
> we're hitting the 10G phy limit of the 40G nic, since physically a 40G
> nic consists of four 10G phys. There will be the same problem
> with 100G and 50G nics; both will be hitting the 25G phy limit.
> We need to vary packets somehow. Hopefully Or can explain that
> bit of hw design.
> Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> when the sender is blasting the same packet over and over again.

That is an interesting observation, Alexei, and could explain the pps limit
I hit on 40G with single-flow testing.  AFAIK 40G is 4x 10G PHYs, and
100G is 4x 25G PHYs.

I have a pktgen script that tries to avoid this pitfall by creating a
new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]

[1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 14:09         ` Tom Herbert
@ 2016-04-04 15:12           ` Jesper Dangaard Brouer
  2016-04-04 15:29             ` Brenden Blanco
  0 siblings, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-04 15:12 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Daniel Borkmann, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, gerlitz,
	john fastabend, brouer

On Mon, 4 Apr 2016 11:09:57 -0300
Tom Herbert <tom@herbertland.com> wrote:

> On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> > On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
> >>
> >> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
> >> wrote:  
> >>>
> >>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
> >>>>
> >>>> Add a new bpf prog type that is intended to run in early stages of the
> >>>> packet rx path. Only minimal packet metadata will be available, hence a
> >>>> new
> >>>> context type, struct xdp_metadata, is exposed to userspace. So far only
> >>>> expose the readable packet length, and only in read mode.
> >>>>
> >>>> The PHYS_DEV name is chosen to represent that the program is meant only
> >>>> for physical adapters, rather than all netdevs.
> >>>>
> >>>> While the user visible struct is new, the underlying context must be
> >>>> implemented as a minimal skb in order for the packet load_* instructions
> >>>> to work. The skb filled in by the driver must have skb->len, skb->head,
> >>>> and skb->data set, and skb->data_len == 0.
> >>>>  
> >> [...]  
> >>>
> >>>
> >>> Do you plan to support bpf_skb_load_bytes() as well? I like using
> >>> this API especially when dealing with larger chunks (>4 bytes) to
> >>> load into stack memory, plus content is kept in network byte order.
> >>>
> >>> What about other helpers such as bpf_skb_store_bytes() et al that
> >>> work on skbs. Do you intend to reuse them as-is and thus populate
> >>> the per cpu skb with needed fields (faking linear data), or do you
> >>> see larger obstacles that prevent this?  
> >>
> >>
> >> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> >> to users of this API.
> >>
> >> The whole idea is that an SKB is NOT allocated yet, and not needed at
> >> this level.  If we start supporting calling underlying SKB functions,
> >> then we will end-up in the same place (performance wise).  
> >
> >
> > I'm talking about the current skb-related BPF helper functions we have,
> > so the question is how much from that code we have we can reuse under
> > these constraints (obviously things like the tunnel helpers are a different
> > story) and if that trade-off is acceptable for us. I'm also thinking
> > that, for example, if you need to parse the packet data anyway for a drop
> > verdict, you might as well pass some meta data (that is set in the real
> > skb later on) for those packets that go up the stack.  
> 
> Right, the meta data in this case is an abstracted receive descriptor.
> This would include items that we get in a device receive descriptor
> (computed checksum, hash, VLAN tag). This is purposely a small
> restricted data structure. I'm hoping we can minimize the size of this
> to not much more than 32 bytes (including pointers to data and
> linkage).

I agree.
 
> How this translates to skb to maintain compatibility with BPF is an
> interesting question. One other consideration is that skb's are kernel
> specific, we should be able to use the same BPF filter program in
> userspace over DPDK for instance-- so an skb interface as the packet
> abstraction might not be the right model...

I agree.  I don't think reusing the SKB data structure is the right
model.  We should drop the SKB pointer from the API.

As Tom also points out, making the BPF interface independent of the SKB
meta-data structure would also make the eBPF program more generally
applicable.
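Tom's ~32-byte target for an abstracted receive descriptor is achievable; a hypothetical layout (all names invented here, not from the patch set) that carries only what a NIC rx descriptor already provides, plus a data pointer and linkage:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an abstracted receive descriptor: hw-provided metadata
 * plus pointers to data and linkage.  Fits in 32 bytes on a 64-bit
 * build (24 bytes on 32-bit).  Purely illustrative. */
struct xdp_rx_desc {
	void *data;         /* packet bytes (the DMA'ed frag)    */
	void *next;         /* linkage for chaining descriptors  */
	uint32_t len;       /* valid bytes at *data              */
	uint32_t csum;      /* hw-computed checksum              */
	uint32_t rxhash;    /* hw flow hash                      */
	uint16_t vlan_tci;  /* hw-stripped VLAN tag              */
	uint16_t flags;     /* csum-valid, vlan-present, ...     */
};
```

Because nothing in it is kernel-specific, the same layout could back the program context in a userspace DPDK-style runtime as well.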

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 14:33       ` Eric Dumazet
@ 2016-04-04 15:18         ` Edward Cree
  0 siblings, 0 replies; 62+ messages in thread
From: Edward Cree @ 2016-04-04 15:18 UTC (permalink / raw)
  To: Eric Dumazet, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Brenden Blanco, davem, netdev, tom,
	alexei.starovoitov, gerlitz, john.fastabend

On 04/04/16 15:33, Eric Dumazet wrote:
> On Mon, 2016-04-04 at 15:07 +0200, Jesper Dangaard Brouer wrote:
> 
>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
>> to users of this API.
>>
>> The whole idea is that an SKB is NOT allocated yet, and not needed at
>> this level.  If we start supporting calling underlying SKB functions,
>> then we will end-up in the same place (performance wise).
> 
> A BPF program can access many skb fields.
> 
> If you plan to support BPF, your fake skb needs to be populated like a
> real one. Looks like some code will be replicated in all drivers that
> want this facility...
> 
> Or accept (document ?) that some BPF instructions are just not there.
> (hash, queue_mapping ...)

If these progs are eventually going to get pushed down into supporting
hardware, many skb things won't make sense at all at that level.  I would
suggest that anything hardware wouldn't reasonably have available should
be "just not there"; I suspect that'll lead you to the right API for early
driver filter as well.  And it probably won't look much like an skb.

-Ed

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04 14:57       ` Jesper Dangaard Brouer
@ 2016-04-04 15:22         ` Eric Dumazet
  2016-04-04 18:50           ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Eric Dumazet @ 2016-04-04 15:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Brenden Blanco, davem, netdev, tom,
	Or Gerlitz, daniel, john.fastabend

On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > My guess is we're hitting the 14.5Mpps limit for an empty bpf program
> > and for program that actually looks into the packet because we're
> > hitting 10G phy limit of 40G nic. Since physically 40G nic
> > consists of four 10G phys. There will be the same problem
> > with 100G and 50G nics. Both will be hitting 25G phy limit.
> > We need to vary packets somehow. Hopefully Or can explain that
> > bit of hw design.
> > Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> > when the sender is blasting the same packet over and over again.
> 
> That is an interesting observation, Alexei, and could explain the pps limit
> I hit on 40G with single-flow testing.  AFAIK 40G is 4x 10G PHYs, and
> 100G is 4x 25G PHYs.
> 
> I have a pktgen script that tries to avoid this pitfall by creating a
> new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]
> 
> [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
> 

A single flow is able to use 40Gbit on those 40Gbit NICs, so a given
flow is not limited to a single 10G trunk.

This 14Mpps thing seems to be a queue limitation on mlx4.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 15:12           ` Jesper Dangaard Brouer
@ 2016-04-04 15:29             ` Brenden Blanco
  2016-04-04 16:07               ` John Fastabend
  0 siblings, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-04 15:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Tom Herbert, Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, ogerlitz,
	john fastabend

On Mon, Apr 04, 2016 at 05:12:27PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 4 Apr 2016 11:09:57 -0300
> Tom Herbert <tom@herbertland.com> wrote:
> 
> > On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
> > >>
> > >> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
> > >> wrote:  
> > >>>
> > >>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
> > >>>>
> > >>>> Add a new bpf prog type that is intended to run in early stages of the
> > >>>> packet rx path. Only minimal packet metadata will be available, hence a
> > >>>> new
> > >>>> context type, struct xdp_metadata, is exposed to userspace. So far only
> > >>>> expose the readable packet length, and only in read mode.
> > >>>>
> > >>>> The PHYS_DEV name is chosen to represent that the program is meant only
> > >>>> for physical adapters, rather than all netdevs.
> > >>>>
> > >>>> While the user visible struct is new, the underlying context must be
> > >>>> implemented as a minimal skb in order for the packet load_* instructions
> > >>>> to work. The skb filled in by the driver must have skb->len, skb->head,
> > >>>> and skb->data set, and skb->data_len == 0.
> > >>>>  
> > >> [...]  
> > >>>
> > >>>
> > >>> Do you plan to support bpf_skb_load_bytes() as well? I like using
> > >>> this API especially when dealing with larger chunks (>4 bytes) to
> > >>> load into stack memory, plus content is kept in network byte order.
> > >>>
> > >>> What about other helpers such as bpf_skb_store_bytes() et al that
> > >>> work on skbs. Do you intend to reuse them as-is and thus populate
> > >>> the per cpu skb with needed fields (faking linear data), or do you
> > >>> see larger obstacles that prevent this?  
> > >>
> > >>
> > >> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> > >> to users of this API.
> > >>
> > >> The whole idea is that an SKB is NOT allocated yet, and not needed at
> > >> this level.  If we start supporting calling underlying SKB functions,
> > >> then we will end-up in the same place (performance wise).  
> > >
> > >
> > > I'm talking about the current skb-related BPF helper functions we have,
> > > so the question is how much from that code we have we can reuse under
> > > these constraints (obviously things like the tunnel helpers are a different
> > > story) and if that trade-off is acceptable for us. I'm also thinking
> > > that, for example, if you need to parse the packet data anyway for a drop
> > > verdict, you might as well pass some meta data (that is set in the real
> > > skb later on) for those packets that go up the stack.  
> > 
> > Right, the meta data in this case is an abstracted receive descriptor.
> > This would include items that we get in a device receive descriptor
> > (computed checksum, hash, VLAN tag). This is purposely a small
> > restricted data structure. I'm hoping we can minimize the size of this
> > to not much more than 32 bytes (including pointers to data and
> > linkage).
> 
> I agree.
>  
> > How this translates to skb to maintain compatibility with BPF is an
> > interesting question. One other consideration is that skb's are kernel
> > specific, we should be able to use the same BPF filter program in
> > userspace over DPDK for instance-- so an skb interface as the packet
> > abstraction might not be the right model...
> 
> I agree.  I don't think reusing the SKB data structure is the right
> model.  We should drop the SKB pointer from the API.
> 
> As Tom also points out, making the BPF interface independent of the SKB
> meta-data structure, would also make the eBPF program more generally
> applicable.
The initial approach that I tried went down this path. Alexei advised
that I use the pseudo skb, and in the future the API between drivers and
bpf can change to adopt non-skb context. The only user-facing ABIs in
this patchset are the IFLA, the xdp_metadata struct, and the name of the
new enum.

The reason to use a pseudo skb for now is that there will be a fair
amount of churn to get bpf jit and interpreter to understand non-skb
context in the bpf_load_pointer() code. I don't see the need for
requiring that for this patchset, as it will be an internal-only change
if/when we use something else.
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 15:29             ` Brenden Blanco
@ 2016-04-04 16:07               ` John Fastabend
  2016-04-04 16:17                 ` Brenden Blanco
  0 siblings, 1 reply; 62+ messages in thread
From: John Fastabend @ 2016-04-04 16:07 UTC (permalink / raw)
  To: Brenden Blanco, Jesper Dangaard Brouer
  Cc: Tom Herbert, Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, ogerlitz

On 16-04-04 08:29 AM, Brenden Blanco wrote:
> On Mon, Apr 04, 2016 at 05:12:27PM +0200, Jesper Dangaard Brouer wrote:
>> On Mon, 4 Apr 2016 11:09:57 -0300
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>>> On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
>>>>>
>>>>> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
>>>>> wrote:  
>>>>>>
>>>>>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
>>>>>>>
>>>>>>> Add a new bpf prog type that is intended to run in early stages of the
>>>>>>> packet rx path. Only minimal packet metadata will be available, hence a
>>>>>>> new
>>>>>>> context type, struct xdp_metadata, is exposed to userspace. So far only
>>>>>>> expose the readable packet length, and only in read mode.
>>>>>>>
>>>>>>> The PHYS_DEV name is chosen to represent that the program is meant only
>>>>>>> for physical adapters, rather than all netdevs.
>>>>>>>
>>>>>>> While the user visible struct is new, the underlying context must be
>>>>>>> implemented as a minimal skb in order for the packet load_* instructions
>>>>>>> to work. The skb filled in by the driver must have skb->len, skb->head,
>>>>>>> and skb->data set, and skb->data_len == 0.
>>>>>>>  
>>>>> [...]  
>>>>>>
>>>>>>
>>>>>> Do you plan to support bpf_skb_load_bytes() as well? I like using
>>>>>> this API especially when dealing with larger chunks (>4 bytes) to
>>>>>> load into stack memory, plus content is kept in network byte order.
>>>>>>
>>>>>> What about other helpers such as bpf_skb_store_bytes() et al that
>>>>>> work on skbs. Do you intend to reuse them as-is and thus populate
>>>>>> the per cpu skb with needed fields (faking linear data), or do you
>>>>>> see larger obstacles that prevent this?  
>>>>>
>>>>>
>>>>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
>>>>> to users of this API.
>>>>>
>>>>> The whole idea is that an SKB is NOT allocated yet, and not needed at
>>>>> this level.  If we start supporting calling underlying SKB functions,
>>>>> then we will end-up in the same place (performance wise).  
>>>>
>>>>
>>>> I'm talking about the current skb-related BPF helper functions we have,
>>>> so the question is how much from that code we have we can reuse under
>>>> these constraints (obviously things like the tunnel helpers are a different
>>>> story) and if that trade-off is acceptable for us. I'm also thinking
>>>> that, for example, if you need to parse the packet data anyway for a drop
>>>> verdict, you might as well pass some meta data (that is set in the real
>>>> skb later on) for those packets that go up the stack.  
>>>
>>> Right, the meta data in this case is an abstracted receive descriptor.
>>> This would include items that we get in a device receive descriptor
>>> (computed checksum, hash, VLAN tag). This is purposely a small
>>> restricted data structure. I'm hoping we can minimize the size of this
>>> to not much more than 32 bytes (including pointers to data and
>>> linkage).
>>
>> I agree.
>>  
>>> How this translates to skb to maintain compatibility with BPF is an
>>> interesting question. One other consideration is that skb's are kernel
>>> specific, we should be able to use the same BPF filter program in
>>> userspace over DPDK for instance-- so an skb interface as the packet
>>> abstraction might not be the right model...
>>
>> I agree.  I don't think reusing the SKB data structure is the right
>> model.  We should drop the SKB pointer from the API.
>>
>> As Tom also points out, making the BPF interface independent of the SKB
>> meta-data structure, would also make the eBPF program more generally
>> applicable.
> The initial approach that I tried went down this path. Alexei advised
> that I use the pseudo skb, and in the future the API between drivers and
> bpf can change to adopt non-skb context. The only user facing ABIs in
> this patchset are the IFLA, the xdp_metadata struct, and the name of the
> new enum.
> 
> The reason to use a pseudo skb for now is that there will be a fair
> amount of churn to get bpf jit and interpreter to understand non-skb
> context in the bpf_load_pointer() code. I don't see the need for
> requiring that for this patchset, as it will be internal-only change
> if/when we use something else.

Another option would be to have per-driver JIT code to patch up the
skb read/loads with descriptor reads and metadata. From a strictly
performance standpoint it should be better than pseudo skbs.
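The per-driver patching John suggests could be driven by a small fixup table the JIT consults to rewrite "load skb->field" into "load descriptor->field"; a sketch with an entirely invented descriptor layout (nothing here is real mlx4 code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented hw rx descriptor for illustration only. */
struct fake_rx_desc {
	uint32_t flow_hash;  /* what skb->hash would be sourced from */
	uint16_t vlan;
	uint16_t byte_cnt;
};

/* One entry per skb field the driver can source from its descriptor. */
struct jit_fixup {
	size_t skb_off;   /* offset the bpf program loads from */
	size_t desc_off;  /* offset the driver substitutes     */
};

static const struct jit_fixup fixups[] = {
	/* 0 stands in for offsetof(struct sk_buff, hash) */
	{ 0, offsetof(struct fake_rx_desc, flow_hash) },
};

/* Returns the substituted offset, or -1 when the hw has no source for
 * the field, meaning the load must be rejected or forced to zero. */
static long lookup_fixup(size_t skb_off)
{
	size_t i;

	for (i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++)
		if (fixups[i].skb_off == skb_off)
			return (long)fixups[i].desc_off;
	return -1;
}
```

The table would be supplied per driver, which is exactly where the verification-time plumbing problem Brenden describes below the quote comes from.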

.John

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 16:07               ` John Fastabend
@ 2016-04-04 16:17                 ` Brenden Blanco
  2016-04-04 20:00                   ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Brenden Blanco @ 2016-04-04 16:17 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, Tom Herbert, Daniel Borkmann,
	David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, ogerlitz

On Mon, Apr 04, 2016 at 09:07:03AM -0700, John Fastabend wrote:
> On 16-04-04 08:29 AM, Brenden Blanco wrote:
> > On Mon, Apr 04, 2016 at 05:12:27PM +0200, Jesper Dangaard Brouer wrote:
> >> On Mon, 4 Apr 2016 11:09:57 -0300
> >> Tom Herbert <tom@herbertland.com> wrote:
> >>
> >>> On Mon, Apr 4, 2016 at 10:36 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>> On 04/04/2016 03:07 PM, Jesper Dangaard Brouer wrote:  
> >>>>>
> >>>>> On Mon, 04 Apr 2016 10:49:09 +0200 Daniel Borkmann <daniel@iogearbox.net>
> >>>>> wrote:  
> >>>>>>
> >>>>>> On 04/02/2016 03:21 AM, Brenden Blanco wrote:  
> >>>>>>>
> >>>>>>> Add a new bpf prog type that is intended to run in early stages of the
> >>>>>>> packet rx path. Only minimal packet metadata will be available, hence a
> >>>>>>> new
> >>>>>>> context type, struct xdp_metadata, is exposed to userspace. So far only
> >>>>>>> expose the readable packet length, and only in read mode.
> >>>>>>>
> >>>>>>> The PHYS_DEV name is chosen to represent that the program is meant only
> >>>>>>> for physical adapters, rather than all netdevs.
> >>>>>>>
> >>>>>>> While the user visible struct is new, the underlying context must be
> >>>>>>> implemented as a minimal skb in order for the packet load_* instructions
> >>>>>>> to work. The skb filled in by the driver must have skb->len, skb->head,
> >>>>>>> and skb->data set, and skb->data_len == 0.
> >>>>>>>  
> >>>>> [...]  
> >>>>>>
> >>>>>>
> >>>>>> Do you plan to support bpf_skb_load_bytes() as well? I like using
> >>>>>> this API especially when dealing with larger chunks (>4 bytes) to
> >>>>>> load into stack memory, plus content is kept in network byte order.
> >>>>>>
> >>>>>> What about other helpers such as bpf_skb_store_bytes() et al that
> >>>>>> work on skbs. Do you intend to reuse them as-is and thus populate
> >>>>>> the per cpu skb with needed fields (faking linear data), or do you
> >>>>>> see larger obstacles that prevent this?  
> >>>>>
> >>>>>
> >>>>> Argh... maybe the minimal pseudo/fake SKB is the wrong "signal" to send
> >>>>> to users of this API.
> >>>>>
> >>>>> The whole idea is that an SKB is NOT allocated yet, and not needed at
> >>>>> this level.  If we start supporting calling underlying SKB functions,
> >>>>> then we will end-up in the same place (performance wise).  
> >>>>
> >>>>
> >>>> I'm talking about the current skb-related BPF helper functions we have,
> >>>> so the question is how much from that code we have we can reuse under
> >>>> these constraints (obviously things like the tunnel helpers are a different
> >>>> story) and if that trade-off is acceptable for us. I'm also thinking
> >>>> that, for example, if you need to parse the packet data anyway for a drop
> >>>> verdict, you might as well pass some meta data (that is set in the real
> >>>> skb later on) for those packets that go up the stack.  
> >>>
> >>> Right, the meta data in this case is an abstracted receive descriptor.
> >>> This would include items that we get in a device receive descriptor
> >>> (computed checksum, hash, VLAN tag). This is purposely a small
> >>> restricted data structure. I'm hoping we can minimize the size of this
> >>> to not much more than 32 bytes (including pointers to data and
> >>> linkage).
> >>
> >> I agree.
> >>  
> >>> How this translates to skb to maintain compatibility with BPF is an
> >>> interesting question. One other consideration is that skb's are kernel
> >>> specific, we should be able to use the same BPF filter program in
> >>> userspace over DPDK for instance-- so an skb interface as the packet
> >>> abstraction might not be the right model...
> >>
> >> I agree.  I don't think reusing the SKB data structure is the right
> >> model.  We should drop the SKB pointer from the API.
> >>
> >> As Tom also points out, making the BPF interface independent of the SKB
> >> meta-data structure, would also make the eBPF program more generally
> >> applicable.
> > The initial approach that I tried went down this path. Alexei advised
> > that I use the pseudo skb, and in the future the API between drivers and
> > bpf can change to adopt non-skb context. The only user facing ABIs in
> > this patchset are the IFLA, the xdp_metadata struct, and the name of the
> > new enum.
> > 
> > The reason to use a pseudo skb for now is that there will be a fair
> > amount of churn to get bpf jit and interpreter to understand non-skb
> > context in the bpf_load_pointer() code. I don't see the need for
> > requiring that for this patchset, as it will be internal-only change
> > if/when we use something else.
> 
> Another option would be to have per driver JIT code to patch up the
> skb read/loads with descriptor reads and metadata. From a strictly
> performance stand point it should be better than pseudo skbs.

I considered (and implemented) this as well, but the problem there was
that I needed to inform the bpf() syscall at BPF_PROG_LOAD time which
ifindex to look at for fixups, so I had to add a new ifindex field to
bpf_attr. Then during verification I had to use a new ndo to get the
driver-specific offsets for its particular descriptor format. It seemed
kludgy.
> 
> .John

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 0/5] Add driver bpf hook for early packet drop
  2016-04-04  7:48     ` Jesper Dangaard Brouer
@ 2016-04-04 18:10       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-04 18:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, Tom Herbert, David S. Miller,
	Linux Kernel Network Developers, ogerlitz, Daniel Borkmann,
	john fastabend, Alexander Duyck

On Mon, Apr 04, 2016 at 09:48:46AM +0200, Jesper Dangaard Brouer wrote:
> On Sat, 2 Apr 2016 22:41:04 -0700
> Brenden Blanco <bblanco@plumgrid.com> wrote:
> 
> > On Sat, Apr 02, 2016 at 12:47:16PM -0400, Tom Herbert wrote:
> >
> > > Very nice! Do you think this hook will be sufficient to implement a
> > > fast forward patch also?
> 
> (DMA experts please verify and correct me!)
> 
> One of the gotchas is how DMA sync/unmap works.  For forwarding you
> need to modify the headers.  The DMA sync API (DMA_FROM_DEVICE) specify
> that the data is to be _considered_ read-only.  AFAIK you can write into
> the data, BUT on DMA_unmap the API/DMA-engine is allowed to overwrite
> data... note on most archs the DMA_unmap does not overwrite.
> 
> This DMA issue should not block the work on a hook for early packet drop.
> Maybe we should add a flag option that can tell the hook whether the
> packet is read-only? (e.g. if the driver uses page-fragments and DMA_sync)
> 
> 
> We should have another track/thread on how to solve the DMA issue:
> I see two solutions.
> 
> Solution 1: Simply use a "full" page per packet and do the DMA_unmap.
> This results in a slowdown on archs with expensive DMA-map/unmap.  And
> we stress the page allocator more (can be solved with a page-pool-cache).
> Eric will not like this due to memory usage, but we can just add a
> "copy-break" step for normal stack hand-off.
> 
> Solution 2: (Due credit to Alex Duyck, this idea came up while
> discussing issue with him).  Remember DMA_sync'ed data is only
> considered read-only, because the DMA_unmap can be destructive.  In many
> cases DMA_unmap is not.  Thus, we could take advantage of this, and
> allow modifying DMA sync'ed data on those DMA setups.

I bet on those devices dma_sync is a noop as well.
In ndo_bpf_set we can check

	if (sync_single_for_cpu != swiotlb_sync_single_for_cpu)
		return -ENOTSUPP;

to avoid all these problems altogether. We're doing this to get the
highest possible performance, so we have to sacrifice generality.

This BPF_PROG_TYPE_PHYS_DEV program type is only applicable to physical
ethernet networking device and the name clearly indicates that.
Devices like taps or veth will not have such ndo.
These are early architectural decisions that we have to make to
actually hit our performance targets.
This is not 'yet another hook in the stack'. We already have tc+cls_bpf
that is pretty fast, but it's generic and works with veth, taps, phys dev
and by design operates on skb.
The BPF_PROG_TYPE_PHYS_DEV is operating on dma buffer. Virtual devices
don't have dma buffers, so no ndo.
Probably the confusion is due to 'pseudo skb' name in the patches.
I guess we have to pick some other name.
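The "no ndo, no attach" rule Alexei describes can be sketched as follows; the ndo name follows the RFC (ndo_bpf_set), but the surrounding types here are simplified stand-ins, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Virtual devices (veth, tap) simply never implement the ndo, so the
 * generic setter rejects them by construction. */
struct net_device_sketch {
	int (*ndo_bpf_set)(struct net_device_sketch *dev, int prog_fd);
};

static int dev_change_bpf_fd(struct net_device_sketch *dev, int prog_fd)
{
	if (!dev->ndo_bpf_set)
		return -95;  /* stand-in for -EOPNOTSUPP: no dma rings */
	return dev->ndo_bpf_set(dev, prog_fd);
}

/* Example driver hook; a real one would stash the prog in its priv or
 * ring struct for fast per-packet access. */
static int example_bpf_set(struct net_device_sketch *dev, int prog_fd)
{
	(void)dev;
	(void)prog_fd;
	return 0;
}
```

This keeps the policy out of generic code: supporting a device means implementing the hook, nothing more.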

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-03  6:11     ` Brenden Blanco
@ 2016-04-04 18:27       ` Alexei Starovoitov
  2016-04-05  6:04         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-04 18:27 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Jesper Dangaard Brouer, davem, netdev, tom, ogerlitz, daniel,
	john.fastabend

On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote:
> On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > 
> > I think you need to DMA sync RX-page before you can safely access
> > packet data in page (on all arch's).
> > 
> Thanks, I will give that a try in the next spin.
> > > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > > +						 frags[0].page_offset);
> > > +			if (mlx4_call_bpf(prog, ethh, length)) {
> > 
> > AFAIK length here covers all the frags[n].page, thus potentially
> > causing the BPF program to access memory out of bound (crash).
> > 
> > Having several page fragments is AFAIK an optimization for jumbo-frames
> > on PowerPC (which is a bit annoying for you use-case ;-)).
> > 
> Yeah, this needs some more work. I can think of some options:
> 1. limit pseudo skb.len to first frag's length only, and signal to
> program that the packet is incomplete
> 2. for nfrags>1 skip bpf processing, but this could be functionally
> incorrect for some use cases
> 3. run the program for each frag
> 4. reject ndo_bpf_set when frags are possible (large mtu?)
> 
> My preference is to go with 1, thoughts?

hmm, and what will a program do with an 'incomplete' packet?
imo option 4 is the only way here. If a phys_dev bpf program is already
attached to the netdev then mlx4_en_change_mtu() can reject jumbo mtus.
My understanding of mlx4_en_calc_rx_buf() is that mtu < 1514
will have num_frags == 1. That's the common case and the one we
want to optimize for.
If later we can find a way to change the mlx4 driver to support
phys_dev bpf programs with jumbo mtus, great.
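Option 4 boils down to a guard in the mtu-change path; a minimal sketch, assuming the function and error-code names (the real mlx4_en_change_mtu() logic is more involved):

```c
#include <assert.h>
#include <stdbool.h>

#define ETH_HLEN_SKETCH 14  /* ethernet header bytes */

/* With a phys_dev bpf program attached, only mtus that keep the rx
 * ring at a single frag per packet are accepted: frame sizes up to
 * 1514 keep num_frags == 1 in mlx4_en_calc_rx_buf().  Illustrative
 * names and return codes, not actual driver code. */
static int hypothetical_change_mtu(bool bpf_attached, int new_mtu)
{
	if (bpf_attached && new_mtu + ETH_HLEN_SKETCH > 1514)
		return -95;  /* stand-in for -EOPNOTSUPP */
	return 0;
}
```

The symmetric check would live in ndo_bpf_set: refuse to attach while the configured mtu can produce multi-frag rx buffers.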

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04  9:57         ` Daniel Borkmann
@ 2016-04-04 18:46           ` Alexei Starovoitov
  2016-04-04 21:01             ` Daniel Borkmann
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-04 18:46 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Johannes Berg, Brenden Blanco, davem, netdev, tom, ogerlitz,
	john.fastabend, brouer

On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
> On 04/04/2016 09:35 AM, Johannes Berg wrote:
> >On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> >>
> >>Having a common check makes sense. The tricky thing is that the type can
> >>only be checked after taking the reference, and I wanted to keep the
> >>scope of the prog brief in the case of errors. I would have to move the
> >>bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> >>ndo instead. Would that API look fine to you?
> >
> >I can't really comment, I wasn't planning on using the API right now :)
> >
> >However, what else is there that the driver could possibly do with the
> >FD, other than getting the bpf_prog?
> >
> >>A possible extension of this is just to keep the bpf_prog * in the
> >>netdev itself and expose a feature flag from the driver rather than
> >>an ndo. But that would mean another 8 bytes in the netdev.
> >
> >That also misses the signal to the driver when the program is
> >set/removed, so I don't think that works. I'd argue it's not really
> >desirable anyway though since I wouldn't expect a majority of drivers
> >to start supporting this.
> 
> I think ndo is probably fine for this purpose, see also my other mail. I
> think currently, the only really driver specific code would be to store
> the prog pointer somewhere and to pass needed meta data to populate the
> fake skb.

yes. I think ndo is better, and having bpf_prog in the driver priv
part is likely better as well, since a driver may decide to put it into
its ring struct for faster fetch, or lay out the prog pointer next to other
priv fields for better cache locality.
Having the prog in 'struct net_device' may look very sensible right now,
since there is not much code around it, but later it may cause
some performance headaches. I think it's better to have complete
freedom in the drivers and later move code to the generic part.
Same applies to your other comment about moving mlx4_bpf_set() and
mlx4_call_bpf() into generic code. It's better for them to be driver
specific at the moment. Right now we have only mlx4 anyway.

> Maybe mid-term drivers might want to reuse this hook/signal for offloading
> as well, not yet sure ... how would that relate to offloading of cls_bpf?
> Should these be considered two different things (although from an offloading
> perspective they are not really). _Conceptually_, XDP could also be seen
> as a software offload for the facilities we support with cls_bpf et al.

agree. 'conceptually' a phys_dev bpf program is similar to a sched_cls bpf program.
If we can reuse some of the helpers that would be great...
but only if we can maintain the highest performance.
hw offloading may be more convenient from the cls_bpf side for some drivers,
but nothing obviously stops them from hw offloading of bpf_type_phys_dev.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04 15:22         ` Eric Dumazet
@ 2016-04-04 18:50           ` Alexei Starovoitov
  2016-04-05 14:15             ` Or Gerlitz
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-04 18:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Brenden Blanco, davem, netdev, tom,
	Or Gerlitz, daniel, john.fastabend

On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote:
> On Mon, 2016-04-04 at 16:57 +0200, Jesper Dangaard Brouer wrote:
> > On Fri, 1 Apr 2016 19:47:12 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > > My guess we're hitting 14.5Mpps limit for empty bpf program
> > > and for program that actually looks into the packet because we're
> > > hitting 10G phy limit of 40G nic. Since physically 40G nic
> > > consists of four 10G phys. There will be the same problem
> > > with 100G and 50G nics. Both will be hitting 25G phy limit.
> > > We need to vary packets somehow. Hopefully Or can explain that
> > > bit of hw design.
> > > Jesper's experiments with mlx4 showed the same 14.5Mpps limit
> > > when the sender is blasting the same packet over and over again.
> > 
> > That is an interesting observation Alexei, and could explain the pps limit
> > I hit on 40G, with single flow testing.  AFAIK 40G is 4x 10G PHYs, and
> > 100G is 4x 25G PHYs.
> > 
> > I have a pktgen script that tried to avoid this pitfall.  By creating a
> > new flow per pktgen kthread. I call it "pktgen_sample05_flow_per_thread.sh"[1]
> > 
> > [1] https://github.com/netoptimizer/network-testing/blob/master/pktgen/pktgen_sample05_flow_per_thread.sh
> > 
> 
> A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not
> a single 10GB trunk used for a given flow.
> 
> This 14Mpps thing seems to be a queue limitation on mlx4.

yeah, could be queueing related.
Multiple cpus can send ~30Mpps of the same 64 byte packet,
but mlx4 can only receive 14.5Mpps. Odd.

Or (and other mellanox guys),
what is really going on inside the 40G nic?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 16:17                 ` Brenden Blanco
@ 2016-04-04 20:00                   ` Alexei Starovoitov
  2016-04-04 22:04                     ` Thomas Graf
  2016-04-05  9:29                     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-04 20:00 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: John Fastabend, Jesper Dangaard Brouer, Tom Herbert,
	Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, ogerlitz

On Mon, Apr 04, 2016 at 09:17:22AM -0700, Brenden Blanco wrote:
> > >>
> > >> As Tom also points out, making the BPF interface independent of the SKB
> > >> meta-data structure, would also make the eBPF program more generally
> > >> applicable.
> > > The initial approach that I tried went down this path. Alexei advised
> > > that I use the pseudo skb, and in the future the API between drivers and
> > > bpf can change to adopt non-skb context. The only user facing ABIs in
> > > this patchset are the IFLA, the xdp_metadata struct, and the name of the
> > > new enum.

Exactly. That's the most important part of this rfc.
Right now redirect to a different queue, batching, prefetch and tons of
other code are missing. We have to plan the whole project, so we can
incrementally add features without breaking the abi.
So new IFLA, xdp_metadata struct and enum for bpf return codes are
the main things to agree on.

> > > The reason to use a pseudo skb for now is that there will be a fair
> > > amount of churn to get bpf jit and interpreter to understand non-skb
> > > context in the bpf_load_pointer() code. I don't see the need for
> > > requiring that for this patchset, as it will be internal-only change
> > > if/when we use something else.
> > 
> > Another option would be to have per driver JIT code to patch up the
> > skb read/loads with descriptor reads and metadata. From a strictly
> > performance stand point it should be better than pseudo skbs.

Per-driver pre/post JIT-like phase is actually on the table. It may
still be needed. If we can avoid it while keeping high performance, great.
 
> I considered (and implemented) this as well, but there my problem was
> that I needed to inform the bpf() syscall at BPF_PROG_LOAD time which
> ifindex to look at for fixups, so I had to add a new ifindex field to
> bpf_attr. Then during verification I had to use a new ndo to get the
> driver-specific offsets for its particular descriptor format. It seemed
> kludgy.

Another reason for going with the 'pseudo skb' structure was to reuse
the load_byte/half/word instructions in the bpf interpreter as-is.
Right now these instructions have to see the in-kernel
'struct sk_buff' as context (note we have the mirror __sk_buff
for user space), so to use load_byte for bpf_prog_type_phys_dev
we have to give a real 'struct sk_buff' to the interpreter with
the data, head, len, data_len fields initialized, so that the
interpreter 'just works'.
The potential fix would be to add phys_dev specific load_byte/word
instructions. Then we can drop all the legacy negative offset
stuff that <1% of programs use but which slows down everyone.
We can also drop the byteswap that load_word does (which turned out
to be confusing; programs often do a 2nd byteswap to get
back to cpu endianness).
And if we do it smart, we can drop the length check as well,
then new_load_byte will actually be a single load-byte cpu instruction.
We can drop the length check when the offset is a constant in the verifier
and that constant is less than 64, since all packets are larger.
As seen in 'perf report' from patch 5:
  3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
this is at 14Mpps, and the 4 assembler instructions in the above function
are consuming 3% of the cpu. Making new_load_byte a single
x86 insn would be really cool.

Of course, there are other pieces to accelerate:
 12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
  6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
  4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
  4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
and I think Jesper's work on batch allocation is going to help that a lot.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04 18:46           ` Alexei Starovoitov
@ 2016-04-04 21:01             ` Daniel Borkmann
  2016-04-05  1:17               ` Alexei Starovoitov
  0 siblings, 1 reply; 62+ messages in thread
From: Daniel Borkmann @ 2016-04-04 21:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Johannes Berg, Brenden Blanco, davem, netdev, tom, ogerlitz,
	john.fastabend, brouer

On 04/04/2016 08:46 PM, Alexei Starovoitov wrote:
> On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
>> On 04/04/2016 09:35 AM, Johannes Berg wrote:
>>> On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
>>>>
>>>> Having a common check makes sense. The tricky thing is that the type can
>>>> only be checked after taking the reference, and I wanted to keep the
>>>> scope of the prog brief in the case of errors. I would have to move the
>>>> bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
>>>> ndo instead. Would that API look fine to you?
>>>
>>> I can't really comment, I wasn't planning on using the API right now :)
>>>
>>> However, what else is there that the driver could possibly do with the
>>> FD, other than getting the bpf_prog?
>>>
>>>> A possible extension of this is just to keep the bpf_prog * in the
>>>> netdev itself and expose a feature flag from the driver rather than
>>>> an ndo. But that would mean another 8 bytes in the netdev.
>>>
>>> That also misses the signal to the driver when the program is
>>> set/removed, so I don't think that works. I'd argue it's not really
>>> desirable anyway though since I wouldn't expect a majority of drivers
>>> to start supporting this.
>>
>> I think ndo is probably fine for this purpose, see also my other mail. I
>> think currently, the only really driver specific code would be to store
>> the prog pointer somewhere and to pass needed meta data to populate the
>> fake skb.
>
> yes. I think ndo is better and having bpf_prog in the driver priv
> part is likely better as well, since driver may decide to put it into
> their ring struct for faster fetch or layout prog pointer next to other
> priv fields for better cache.
> Having prog in 'struct net_device' may look very sensible right now,
> since there is not much code around it, but later it may be causing
> >some performance headaches. I think it's better to have complete
> freedom in the drivers and later move code to generic part.
> Same applies to your other comment about moving mlx4_bpf_set() and
> mlx4_call_bpf() into generic. It's better for them to be driver
> >specific at the moment. Right now we have only mlx4 anyway.

Sure, right now it's only mlx4, but we need to make sure that once this gets
adapted/extended by others, we won't end up with programs that can only
be run by specific drivers, e.g. due to meta data only available for this kind
of driver but not for others supporting XDP. So, some form of generic part will
be needed in any case; it also makes testing changes easier.

>> Maybe mid-term drivers might want to reuse this hook/signal for offloading
>> as well, not yet sure ... how would that relate to offloading of cls_bpf?
>> Should these be considered two different things (although from an offloading
>> perspective they are not really). _Conceptually_, XDP could also be seen
>> as a software offload for the facilities we support with cls_bpf et al.
>
> agree. 'conceptually' phys_dev bpf program is similar to sched_cls bpf program.
> If we can reuse some of the helpers that would be great...
> but only if we can maintain highest performance.
> hw offloading may be more convenient from cls_bpf side for some drivers,
> but nothing obviously stops them from hw offloading of bpf_type_phys_dev

Right, I assume such a JIT backend on the driver side, f.e., might be generic
enough to support both things, as it will mostly boil down to different
helpers, and some subset of the helpers is used by both anyway.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 20:00                   ` Alexei Starovoitov
@ 2016-04-04 22:04                     ` Thomas Graf
  2016-04-05  2:25                       ` Alexei Starovoitov
  2016-04-05  9:29                     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 62+ messages in thread
From: Thomas Graf @ 2016-04-04 22:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, John Fastabend, Jesper Dangaard Brouer,
	Tom Herbert, Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, ogerlitz

On 04/04/16 at 01:00pm, Alexei Starovoitov wrote:
> Exactly. That's the most important part of this rfc.
> Right now redirect to different queue, batching, prefetch and tons of
> other code are missing. We have to plan the whole project, so we can
> incrementally add features without breaking abi.
> So new IFLA, xdp_metadata struct and enum for bpf return codes are
> the main things to agree on.

+1
This is the most important statement in this thread so far. A plan
that gets us from this RFC series to a functional forwarding engine
with redirect and load/write is essential. [...]

> Another reason for going with 'pseudo skb' structure was to reuse
> load_byte/half/word instructions in bpf interpreter as-is.
> Right now these instructions have to see in-kernel
> 'struct sk_buff' as context (note we have mirror __sk_buff
> for user space), so to use load_byte for bpf_prog_type_phys_dev
> we have to give real 'struct sk_buff' to interpreter with
> data, head, len, data_len fields initialized, so that
> interpreter 'just works'.
> The potential fix would be to add phys_dev specific load_byte/word
> instructions. Then we can drop all the legacy negative offset
> stuff that <1% uses, but it slows down everyone.
> We can also drop byteswap that load_word does (which turned out
> to be confusing and often programs do 2nd byteswap to go
> back to cpu endianness).

[...] I would really like to see a common set of helpers which
applies to both cls_bpf and phys_dev. Given the existing skb based
helpers cannot be overloaded, at least the phys_dev helpers should
be made to work in cls_bpf context as well to allow for some
portability. Otherwise we'll end up with half a dozen flavours of
BPF which are all incompatible.

> And if we do it smart, we can drop length check as well,
> then new_load_byte will actually be single load byte cpu instruction.
> We can drop length check when offset is constant in the verifier
> and that constant is less than 64, since all packets are larger.
> As seen in 'perf report' from patch 5:
>   3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> this is 14Mpps and 4 assembler instructions in the above function
> are consuming 3% of the cpu. Making new_load_byte to be single
> x86 insn would be really cool.

Neat.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-03  7:02     ` Brenden Blanco
@ 2016-04-04 22:07       ` Thomas Graf
  2016-04-05  8:19         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 62+ messages in thread
From: Thomas Graf @ 2016-04-04 22:07 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Tom Herbert, David S. Miller, Linux Kernel Network Developers,
	Alexei Starovoitov, ogerlitz, Daniel Borkmann, john fastabend,
	Jesper Dangaard Brouer

On 04/03/16 at 12:02am, Brenden Blanco wrote:
> On Sat, Apr 02, 2016 at 12:39:45PM -0400, Tom Herbert wrote:
> > Is there a hard restriction that this could only work with physical devices?
> I suppose not, but there wouldn't be much use case as compared to tc
> ingress, no? Since I was imagining that this new hook would be more
> restricted in functionality due to operating on descriptors rather than
> with a full skb, I tried to think of an appropriate name.
> 
> If you think that this hook would spread, then a better name is needed.

The thing that comes to mind is that this prog type makes it easier to
implement batched processing of packets in BPF which would require major
surgery across the tc layer to make it available in cls_bpf. Batched
processing will be beneficial for software devices as well.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04 21:01             ` Daniel Borkmann
@ 2016-04-05  1:17               ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-05  1:17 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Johannes Berg, Brenden Blanco, davem, netdev, tom, ogerlitz,
	john.fastabend, brouer

On Mon, Apr 04, 2016 at 11:01:57PM +0200, Daniel Borkmann wrote:
> On 04/04/2016 08:46 PM, Alexei Starovoitov wrote:
> >On Mon, Apr 04, 2016 at 11:57:52AM +0200, Daniel Borkmann wrote:
> >>On 04/04/2016 09:35 AM, Johannes Berg wrote:
> >>>On Sat, 2016-04-02 at 23:38 -0700, Brenden Blanco wrote:
> >>>>
> >>>>Having a common check makes sense. The tricky thing is that the type can
> >>>>only be checked after taking the reference, and I wanted to keep the
> >>>>scope of the prog brief in the case of errors. I would have to move the
> >>>>bpf_prog_get logic into dev_change_bpf_fd and pass a bpf_prog * into the
> >>>>ndo instead. Would that API look fine to you?
> >>>
> >>>I can't really comment, I wasn't planning on using the API right now :)
> >>>
> >>>However, what else is there that the driver could possibly do with the
> >>>FD, other than getting the bpf_prog?
> >>>
> >>>>A possible extension of this is just to keep the bpf_prog * in the
> >>>>netdev itself and expose a feature flag from the driver rather than
> >>>>an ndo. But that would mean another 8 bytes in the netdev.
> >>>
> >>>That also misses the signal to the driver when the program is
> >>>set/removed, so I don't think that works. I'd argue it's not really
> >>>desirable anyway though since I wouldn't expect a majority of drivers
> >>>to start supporting this.
> >>
> >>I think ndo is probably fine for this purpose, see also my other mail. I
> >>think currently, the only really driver specific code would be to store
> >>the prog pointer somewhere and to pass needed meta data to populate the
> >>fake skb.
> >
> >yes. I think ndo is better and having bpf_prog in the driver priv
> >part is likely better as well, since driver may decide to put it into
> >their ring struct for faster fetch or layout prog pointer next to other
> >priv fields for better cache.
> >Having prog in 'struct net_device' may look very sensible right now,
> >since there is not much code around it, but later it may be causing
> >some performance headaches. I think it's better to have complete
> >freedom in the drivers and later move code to generic part.
> >Same applies to your other comment about moving mlx4_bpf_set() and
> >mlx4_call_bpf() into generic. It's better for them to be driver
> >specific at the moment. Right now we have only mlx4 anyway.
> 
> Sure, right now it's only mlx4, but we need to make sure that once this gets
> adapted/extended by others, that we won't end up with programs that can only
> be run by specific drivers e.g., due to meta data only available for this kind
> of driver but not others supporting XDP. So, some form of generic part will
> be needed in any case, also makes it easier for testing changes.

yes. if packet metadata becomes different for different drivers it will
be a major pain to write portable programs and we should strive to avoid that.
Right now it's only 'len', which is obviously available everywhere, and any new
field needs to be argued for.
The same will apply to helper functions. In this rfc it's just sk_filter_func_proto,
which is a good set for filtering packets, but load/store bytes + csum helpers
need to be added along with packet redirect to be useful for eth2eth traffic.
Since there is no skb and there is no legacy of LD_ABS with negative offsets,
we can have direct load/store bytes, so csum operations may become instructions
as well and will be faster too. Yes, it would mean that tc+cls_bpf have to look
different in bpf assembler, but since they're written in C we can abstract
them at the user space C level and can have the same program compiled as cls_bpf
using cls_bpf helpers and as bpf_phys_dev using direct load/store instructions.
Like bpf_skb_load_bytes() may become an inlined memcpy with an extra len check
for the bpf_phys_dev prog type. All options are open.
The only thing we cannot compromise on is max performance.
If performance suffers due to generality/legacy_code/compatibility_with_X_or_Y,
we should pick performance.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-03  6:15     ` Brenden Blanco
@ 2016-04-05  2:20       ` Brenden Blanco
  2016-04-05  2:44         ` Eric Dumazet
  2016-04-05 18:59         ` Eran Ben Elisha
  0 siblings, 2 replies; 62+ messages in thread
From: Brenden Blanco @ 2016-04-05  2:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer

On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote:
> On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
[...]
> > 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> > sharing.
> > 
> >    This is probably the right time to add a rx_dropped field in struct
> > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> > higher speed links.
> > 
> This sounds reasonable! Will look into it for the next spin.
I looked into this, and it seems to me that both the rx and tx dropped
stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616,
specifically the line
  stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
which runs during the periodic ethtool task, whatever ++ was happening
in the rx/tx code is overwritten with the HW value. Since the SW stats
are incremented mostly in edge (oom) cases, nobody probably noticed. To
me it doesn't seem right to mix hard and soft counters, especially at
the risk of making a bad situation worse, so I'm planning to omit the
new bpf dropped++ stat, and we can discuss ways to fix this other bug
separately.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 22:04                     ` Thomas Graf
@ 2016-04-05  2:25                       ` Alexei Starovoitov
  2016-04-05  8:11                         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-05  2:25 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Brenden Blanco, John Fastabend, Jesper Dangaard Brouer,
	Tom Herbert, Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, ogerlitz

On Tue, Apr 05, 2016 at 12:04:39AM +0200, Thomas Graf wrote:
> On 04/04/16 at 01:00pm, Alexei Starovoitov wrote:
> > Exactly. That's the most important part of this rfc.
> > Right now redirect to different queue, batching, prefetch and tons of
> > other code are missing. We have to plan the whole project, so we can
> > incrementally add features without breaking abi.
> > So new IFLA, xdp_metadata struct and enum for bpf return codes are
> > the main things to agree on.
> 
> +1
> This is the most important statement in this thread so far. A plan
> that gets us from this RFC series to a functional forwarding engine
> with redirect and load/write is essential. [...]

exactly. I think the next step (step 2) is to figure out the redirect return code
and the 'rewiring' of the rx dma buffer into the tx ring, plus auto-batching.
As this rfc showed, even when using standard page alloc/free the performance
is hitting the 10Gbps hw limit and is not cpu bound, so recycling of
the pages and avoiding map/unmap will come at step 3.
Batching is necessary even for basic redirect, since ringing the doorbell
for every tx buffer is not an option.

> [...] I would really like to see a common set of helpers which
> applies to both cls_bpf and phys_dev. Given the existing skb based
> helpers cannot be overloaded, at least the phys_dev helpers should
> be made to work in cls_bpf context as well to allow for some
> portability. Otherwise we'll end up with half a dozen flavours of
> BPF which are all incompatible.

The helpers can be 'overloaded'. In my upcoming patches for
bpf+tracepoints the bpf_perf_event_output() helper is different
depending on program type (kprobe vs tracepoint), but logically
it looks exactly the same from the program's point of view and
the BPF_FUNC id is reused.
So for cls_bpf vs bpf_phys_dev we can have the same bpf_csum_diff()
helper, which will have different internal implementations depending
on program type.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-05  2:20       ` Brenden Blanco
@ 2016-04-05  2:44         ` Eric Dumazet
  2016-04-05 18:59         ` Eran Ben Elisha
  1 sibling, 0 replies; 62+ messages in thread
From: Eric Dumazet @ 2016-04-05  2:44 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel,
	john.fastabend, brouer

On Mon, 2016-04-04 at 19:20 -0700, Brenden Blanco wrote:
> On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote:
> > On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> [...]
> > > 2) priv->stats.rx_dropped is shared by all the RX queues -> false
> > > sharing.
> > > 
> > >    This is probably the right time to add a rx_dropped field in struct
> > > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
> > > higher speed links.
> > > 
> > This sounds reasonable! Will look into it for the next spin.
> I looked into this, and it seems to me that both the rx and tx dropped
> stats are buggy. With commit a3333b35da1634f49aca541f2574a084221e2616,
> specifically with the line
>   stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
> that occurs during the periodic ethtool task, whatever ++ was happening
> in the rx/tx code is overwritten with the HW value. Since the SW stats
> are incremented mostly in edge (oom) cases, nobody probably noticed. To
> me it doesn't seem right to mix hard and soft counters, especially at
> the risk of making a bad situation worse, so I'm planning to omit the
> new bpf dropped++ stat and we can discuss ways to fix this other bug
> separately.

Yes, soft stats should not be overwritten.
 
Also, adding 32bit and 64bit fields is wrong, as SNMP software is most
of the time not able to properly handle overflows.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04 18:27       ` Alexei Starovoitov
@ 2016-04-05  6:04         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-05  6:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, davem, netdev, tom, ogerlitz, daniel,
	john.fastabend, brouer

On Mon, 4 Apr 2016 11:27:27 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Sat, Apr 02, 2016 at 11:11:52PM -0700, Brenden Blanco wrote:
> > On Sat, Apr 02, 2016 at 10:23:31AM +0200, Jesper Dangaard Brouer wrote:
> > [...]  
> > > 
> > > I think you need to DMA sync RX-page before you can safely access
> > > packet data in page (on all arch's).
> > >   
> > Thanks, I will give that a try in the next spin.  
> > > > +			ethh = (struct ethhdr *)(page_address(frags[0].page) +
> > > > +						 frags[0].page_offset);
> > > > +			if (mlx4_call_bpf(prog, ethh, length)) {  
> > > 
> > > AFAIK length here covers all the frags[n].page, thus potentially
> > > causing the BPF program to access memory out of bound (crash).
> > > 
> > > Having several page fragments is AFAIK an optimization for jumbo-frames
> > > on PowerPC (which is a bit annoying for you use-case ;-)).
> > >   
> > Yeah, this needs some more work. I can think of some options:
> > 1. limit pseudo skb.len to first frag's length only, and signal to
> > program that the packet is incomplete
> > 2. for nfrags>1 skip bpf processing, but this could be functionally
> > incorrect for some use cases
> > 3. run the program for each frag
> > 4. reject ndo_bpf_set when frags are possible (large mtu?)
> > 
> > My preference is to go with 1, thoughts?  
> 
> hmm, and what will a program do with an 'incomplete' packet?
> imo option 4 is the only way here. If a phys_dev bpf program is already
> attached to the netdev then mlx4_en_change_mtu() can reject jumbo mtus.
> My understanding of mlx4_en_calc_rx_buf is that mtu < 1514
> will have num_frags==1. That's the common case and the one we
> want to optimize for.

I agree, we should only optimize for the common case, where
num_frags==1.


> If later we can find a way to change mlx4 driver to support
> phys_dev bpf programs with jumbo mtus, great.

For getting the DMA-buffer/packet-page writable, some changes are needed
in this code path anyhow.  Let's look at that later, when touching that
code path.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-05  2:25                       ` Alexei Starovoitov
@ 2016-04-05  8:11                         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-05  8:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Thomas Graf, Brenden Blanco, John Fastabend, Tom Herbert,
	Daniel Borkmann, David S. Miller,
	Linux Kernel Network Developers, Or Gerlitz, brouer


On Mon, 4 Apr 2016 19:25:08 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Apr 05, 2016 at 12:04:39AM +0200, Thomas Graf wrote:
> > On 04/04/16 at 01:00pm, Alexei Starovoitov wrote:  
> > > Exactly. That's the most important part of this rfc.
> > > Right now redirect to different queue, batching, prefetch and tons of
> > > other code are missing. We have to plan the whole project, so we can
> > > incrementally add features without breaking abi.
> > > So new IFLA, xdp_metadata struct and enum for bpf return codes are
> > > the main things to agree on.  
> > 
> > +1
> > This is the most important statement in this thread so far. A plan
> > that gets us from this RFC series to a functional forwarding engine
> > with redirect and load/write is essential. [...]  

+1 agreed, I love to see the energy in this thread! :-)
 
> exactly. I think the next step 2 is to figure out the redirect return code
> and 'rewiring' of the rx dma buffer into tx ring and auto-batching.
>
> As this rfc showed even when using standard page alloc/free the performance
> is hitting 10Gbps hw limit and not being cpu bound, so recycling of
> the pages and avoiding map/unmap will come at step 3.

Yes, I've also noticed the standard page alloc/free performance is
slowing us down.  I will be working on identifying (and measuring) page
allocator bottlenecks.

For the early drop case, we should be able to hack the driver to
recycle the page directly (and even avoid the DMA unmap).  But for the
TX (forward) case, we would need some kind of page-pool cache API to
recycle the page through (also useful for normal netstack usage).
  I'm interested in implementing a generic page-pool cache mechanism,
and I plan to bring up this topic at the MM-summit in 2 weeks.  Input
is welcome... but as Alexei says this is likely a step 3 project.


> Batching is necessary even for basic redirect, since ringing doorbell
> for every tx buffer is not an option.

Yes, we know TX batching is essential for performance.  If we create an
RX bundle/batch step (in the driver), then the eBPF forward step can work on
an RX-bundle and build up TX-bundle(s) (per TX device), which can then be
bulk sent.  Notice that I propose building TX-bundles, not sending
each individual packet and flushing the tail/doorbell, because I want to
maximize the icache efficiency of the eBPF program.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 22:07       ` Thomas Graf
@ 2016-04-05  8:19         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-05  8:19 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Brenden Blanco, Tom Herbert, David S. Miller,
	Linux Kernel Network Developers, Alexei Starovoitov, Or Gerlitz,
	Daniel Borkmann, john fastabend, brouer


On Tue, 5 Apr 2016 00:07:14 +0200 Thomas Graf <tgraf@suug.ch> wrote:

> On 04/03/16 at 12:02am, Brenden Blanco wrote:
> > On Sat, Apr 02, 2016 at 12:39:45PM -0400, Tom Herbert wrote:  
> > > Is there a hard restriction that this could only work with physical devices?  
> > I suppose not, but there wouldn't be much use case as compared to tc
> > ingress, no? Since I was imagining that this new hook would be more
> > restricted in functionality due to operating on descriptors rather than
> > with a full skb, I tried to think of an appropriate name.
> > 
> > If you think that this hook would spread, then a better name is needed.  
> 
> The thing that comes to mind is that this prog type makes it easier to
> implement batched processing of packets in BPF which would require major
> surgery across the tc layer to make it available in cls_bpf. Batched
> processing will be beneficial for software devices as well.

+1 agreed, I would really like to see batched processing of packets in
BPF, for this packet type, designed in from the start.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-04 20:00                   ` Alexei Starovoitov
  2016-04-04 22:04                     ` Thomas Graf
@ 2016-04-05  9:29                     ` Jesper Dangaard Brouer
  2016-04-05 22:06                       ` Alexei Starovoitov
  1 sibling, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-05  9:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, John Fastabend, Tom Herbert, Daniel Borkmann,
	David S. Miller, Linux Kernel Network Developers, ogerlitz,
	brouer


On Mon, 4 Apr 2016 13:00:34 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> As seen in 'perf report' from patch 5:
>   3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> this is 14Mpps and 4 assembler instructions in the above function
> are consuming 3% of the cpu.

At this level we also need to take into account the cost/overhead of a
function call, which I've measured to be between 5 and 7 cycles, as
part of my time_bench_sample[1] test.

> Making new_load_byte to be single  x86 insn would be really cool.
> 
> Of course, there are other pieces to accelerate:
>  12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
>   6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
>   4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
>   4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
> and I think Jesper's work on batch allocation is going help that a lot.

Actually, it looks like all of this "overhead" comes from the page
alloc/free (+ dma unmap/map).  We would need a page-pool recycle
mechanism to solve/remove this overhead.  For the early drop case we
might be able to hack the driver to recycle the page directly (and
also avoid the dma_unmap/map cycle).


[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-04 18:50           ` Alexei Starovoitov
@ 2016-04-05 14:15             ` Or Gerlitz
  2016-04-06  4:05               ` Brenden Blanco
  0 siblings, 1 reply; 62+ messages in thread
From: Or Gerlitz @ 2016-04-05 14:15 UTC (permalink / raw)
  To: Alexei Starovoitov, Eric Dumazet
  Cc: Jesper Dangaard Brouer, Brenden Blanco, davem, netdev, tom,
	daniel, john.fastabend, Eran Ben Elisha, Rana Shahout,
	Matan Barak

On 4/4/2016 9:50 PM, Alexei Starovoitov wrote:
> On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote:
>> A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not
>> a single 10GB trunk used for a given flow.
>>
>> This 14Mpps thing seems to be a queue limitation on mlx4.
> yeah, could be queueing related. Multiple cpus can send ~30Mpps of the same 64 byte packet,
> but mlx4 can only receive 14.5Mpps. Odd.
>
> Or (and other mellanox guys), what is really going on inside 40G nic?

Hi Alexei,

Not that I know everything that goes on inside there, and not that I
could have posted it all here if I did (I hear HW sometimes has
IP)... but, anyway, as for your questions:

The ConnectX-3 40Gb/s NIC can receive > 10Gb/s worth of packets (14.5M)
on a single ring, and the Mellanox 100Gb/s NICs can receive > 25Gb/s
worth (37.5M) on a single ring. People that use DPDK (...) actually see
these numbers, and AFAIU we are now attempting to reach them in the
kernel with XDP :)

I realize that we might have some issues in the mlx4 driver's reporting
of HW drops. Eran (cc-ed) and Co are looking into that.

In parallel to doing so, I would suggest you run some experiments that
might shed some more light. On the TX side, do

$ ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4

On the RX side, skip RSS and force the packets that match that traffic
pattern to go to (say) ring (== action) 0

$ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action 0 loc 0

To go back to RSS, remove the rule

$ ethtool -U $DEV delete action 0

FWIW (not that I see how it helps you now), you can do a HW drop on the
RX side with ring -1

$ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action -1 loc 0

Or.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-05  2:20       ` Brenden Blanco
  2016-04-05  2:44         ` Eric Dumazet
@ 2016-04-05 18:59         ` Eran Ben Elisha
  1 sibling, 0 replies; 62+ messages in thread
From: Eran Ben Elisha @ 2016-04-05 18:59 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Eric Dumazet, David S. Miller, Linux Netdev List, tom,
	alexei.starovoitov, Or Gerlitz, daniel, john.fastabend, brouer

On Tue, Apr 5, 2016 at 5:20 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Sat, Apr 02, 2016 at 11:15:38PM -0700, Brenden Blanco wrote:
>> On Fri, Apr 01, 2016 at 07:08:31PM -0700, Eric Dumazet wrote:
> [...]
>> > 2) priv->stats.rx_dropped is shared by all the RX queues -> false
>> > sharing.
>> >
>> >    This is probably the right time to add a rx_dropped field in struct
>> > mlx4_en_rx_ring since you guys want to drop 14 Mpps, and 50 Mpps on
>> > higher speed links.
>> >
>> This sounds reasonable! Will look into it for the next spin.
> I looked into this, and it seems to me that both the rx and tx dropped
> stats are buggy. Since commit a3333b35da1634f49aca541f2574a084221e2616,
> specifically the line
>   stats->rx_dropped = be32_to_cpu(mlx4_en_stats->RDROP);
> which runs during the periodic ethtool task, whatever ++ was happening
> in the rx/tx code is overwritten by the HW value. Since the SW stats
> are incremented mostly in edge (oom) cases, nobody probably noticed. To
> me it doesn't seem right to mix hard and soft counters, especially at
> the risk of making a bad situation worse, so I'm planning to omit the
> new bpf dropped++ stat and we can discuss ways to fix this other bug
> separately.

Thanks Eric and Brenden,
I will make a patch for mlx4_en RX dropped counters to fix the issues
you raised here.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
  2016-04-05  9:29                     ` Jesper Dangaard Brouer
@ 2016-04-05 22:06                       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-05 22:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, John Fastabend, Tom Herbert, Daniel Borkmann,
	David S. Miller, Linux Kernel Network Developers, ogerlitz

On Tue, Apr 05, 2016 at 11:29:05AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > Of course, there are other pieces to accelerate:
> >  12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
> >   6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
> >   4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
> >   4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
> > and I think Jesper's work on batch allocation is going help that a lot.
> 
> Actually, it looks like all of this "overhead" comes from the page
> alloc/free (+ dma unmap/map).  We would need a page-pool recycle
> mechanism to solve/remove this overhead.  For the early drop case we
> might be able to hack the driver to recycle the page directly (and
> also avoid the dma_unmap/map cycle).

Exactly. A cache of allocated and mapped pages will help a lot for both
the drop and redirect use cases. After tx completion we can recycle the
still-mapped page into the cache (we need to make sure to map them
PCI_DMA_BIDIRECTIONAL) and rx can refill the ring with it. For a load
balancer in steady state we won't have any calls to the page allocator
or dma. Being able to do a cheap percpu pool like this is a huge
advantage that no kernel bypass can have. I'm pretty sure it will be
possible to avoid local_cmpxchg as well.
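The steady-state claim above can be checked with a small userspace model. The names (page_cache, rx_refill, tx_complete) are hypothetical, and malloc stands in for alloc_page() plus DMA mapping: once the ring is warmed up, every page handed back by TX completion directly refills RX, and the allocator is never called again.

```c
#include <assert.h>
#include <stdlib.h>

#define RING_SIZE 16

/* Model: a free-list of still-DMA-mapped pages shared by RX refill
 * and TX completion (in a real driver this would be percpu). */
struct page_cache {
	void *free[RING_SIZE];
	int count;
	long allocator_calls;      /* how often we fell back to alloc */
};

static void *rx_refill(struct page_cache *c)
{
	if (c->count)
		return c->free[--c->count]; /* recycled, still mapped */
	c->allocator_calls++;
	return malloc(4096);       /* alloc_page() + dma_map fallback */
}

/* TX completion hands the (bidirectionally mapped) page back. */
static void tx_complete(struct page_cache *c, void *page)
{
	assert(c->count < RING_SIZE);
	c->free[c->count++] = page;
}
```

After one trip through the allocator per ring slot, the cycle tx_complete() -> rx_refill() closes on itself, which is exactly the "no calls to page allocator and dma" steady state described above.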

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program
  2016-04-05 14:15             ` Or Gerlitz
@ 2016-04-06  4:05               ` Brenden Blanco
  0 siblings, 0 replies; 62+ messages in thread
From: Brenden Blanco @ 2016-04-06  4:05 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexei Starovoitov, Eric Dumazet, Jesper Dangaard Brouer, davem,
	netdev, tom, daniel, john.fastabend, Eran Ben Elisha,
	Rana Shahout, Matan Barak

On Tue, Apr 05, 2016 at 05:15:20PM +0300, Or Gerlitz wrote:
> On 4/4/2016 9:50 PM, Alexei Starovoitov wrote:
> >On Mon, Apr 04, 2016 at 08:22:03AM -0700, Eric Dumazet wrote:
> >>A single flow is able to use 40Gbit on those 40Gbit NIC, so there is not
> >>a single 10GB trunk used for a given flow.
> >>
> >>This 14Mpps thing seems to be a queue limitation on mlx4.
> >yeah, could be queueing related. Multiple cpus can send ~30Mpps of the same 64 byte packet,
> >but mlx4 can only receive 14.5Mpps. Odd.
> >
> >Or (and other mellanox guys), what is really going on inside 40G nic?
> 
> Hi Alexei,
> 
> Not that I know everything that goes inside there, and not that if I
> knew it all I could have posted that here (I heard HWs sometimes
> have IP)... but, anyway, as for your questions:
> 
> ConnectX3 40Gbs NIC can receive > 10Gbs packet-worthy (14.5M) in
> single ring and Mellanox
> 100Gbs NICs can receive > 25Gbs packet-worthy (37.5M) in single
> ring, people that use DPDK (...) even see this numbers and AFAIU we
> now attempt to see that in the kernel with XDP :)
> 
> I realize that we might have some issues in the mlx4 driver
> reporting on HW drops. Eran (cc-ed) and Co are looking on that.
Thanks!
> 
> In parallel to doing so, I would suggest you to do some experiments
> that might shed some more light, if on the TX side you do
> 
> $ ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> 
> On the RX side,  skip RSS and force the packets that match that
> traffic pattern to go to (say) ring (==action) 0
> 
> $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action 0 loc 0

I added the module parameter:
  options mlx4_core log_num_mgm_entry_size=-1
And with this I was able to reach >20 Mpps. This is actually
regardless of the ethtool settings mentioned above.

 25.31%  ksoftirqd/0   [mlx4_en]         [k] mlx4_en_process_rx_cq
 20.18%  ksoftirqd/0   [mlx4_en]         [k] mlx4_en_alloc_frags
  8.42%  ksoftirqd/0   [mlx4_en]         [k] mlx4_en_free_frag
  5.59%  swapper       [kernel.vmlinux]  [k] poll_idle
  5.38%  ksoftirqd/0   [kernel.vmlinux]  [k] get_page_from_freelist
  3.06%  ksoftirqd/0   [mlx4_en]         [k] mlx4_call_bpf
  2.73%  ksoftirqd/0   [mlx4_en]         [k] 0x000000000001cf94
  2.72%  ksoftirqd/0   [kernel.vmlinux]  [k] free_pages_prepare
  2.19%  ksoftirqd/0   [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  2.08%  ksoftirqd/0   [kernel.vmlinux]  [k] sk_load_byte_positive_offset
  1.72%  ksoftirqd/0   [kernel.vmlinux]  [k] free_one_page
  1.59%  ksoftirqd/0   [kernel.vmlinux]  [k] bpf_map_lookup_elem
  1.30%  ksoftirqd/0   [mlx4_en]         [k] 0x000000000001cfc1
  1.07%  ksoftirqd/0   [kernel.vmlinux]  [k] __alloc_pages_nodemask
  1.00%  ksoftirqd/0   [mlx4_en]         [k] mlx4_alloc_pages.isra.23

> 
> to go back to RSS remove the rule
> 
> $ ethtool -U $DEV delete action 0
> 
> FWIW (not that I see how it helps you now), you can do HW drop on
> the RX side with ring -1
> 
> $ ethtool -U $DEV flow-type ip4 dst-mac $MAC dst-ip $IP action -1 loc 0
> 
> Or.
> 

Here also is the output from the two machines using a tool to get
ethtool delta stats at 1 second intervals:

----------- sender -----------
           tx_packets: 20,246,059
             tx_bytes: 1,214,763,540 bps    = 9,267.91 Mbps
            xmit_more: 19,463,226
        queue_stopped: 36,982
           wake_queue: 36,982
             rx_pause: 6,351
    tx_pause_duration: 124,974
  tx_pause_transition: 3,176
    tx_novlan_packets: 20,244,344
      tx_novlan_bytes: 1,295,629,440 bps    = 9,884.86 Mbps
          tx0_packets: 5,151,029
            tx0_bytes: 309,061,680 bps      = 2,357.95 Mbps
          tx1_packets: 5,094,532
            tx1_bytes: 305,671,920 bps      = 2,332.9 Mbps
          tx2_packets: 5,130,996
            tx2_bytes: 307,859,760 bps      = 2,348.78 Mbps
          tx3_packets: 5,135,513
            tx3_bytes: 308,130,780 bps      = 2,350.85 Mbps
                 UP 0: 9,389.68             Mbps = 100.00%
                 UP 0: 20,512,070           Tran/sec = 100.00%

----------- receiver -----------
           rx_packets: 20,207,929
             rx_bytes: 1,212,475,740 bps    = 9,250.45 Mbps
           rx_dropped: 236,604
    rx_pause_duration: 128,436
  rx_pause_transition: 3,258
             tx_pause: 6,516
    rx_novlan_packets: 20,208,906
      rx_novlan_bytes: 1,293,369,984 bps    = 9,867.62 Mbps
          rx0_packets: 20,444,526
            rx0_bytes: 1,226,671,560 bps    = 9,358.76 Mbps

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] Add sample for adding simple drop program to link
  2016-04-02  1:21 ` [RFC PATCH 5/5] Add sample for adding simple drop program to link Brenden Blanco
@ 2016-04-06 19:48   ` Jesper Dangaard Brouer
  2016-04-06 20:01     ` Jesper Dangaard Brouer
  2016-04-06 20:03     ` Daniel Borkmann
  0 siblings, 2 replies; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-06 19:48 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, Or Gerlitz, daniel,
	john.fastabend, brouer


I'm testing with this program and these patches, after getting past the
challenge of compiling the samples/bpf files ;-)


On Fri,  1 Apr 2016 18:21:58 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:

> Add a sample program that only drops packets at the
> BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
> observed single core rate is ~14.6Mpps.

On my i7-4790K CPU @ 4.00GHz I'm seeing 9.7Mpps (single flow/cpu).
(generator: pktgen_sample03_burst_single_flow.sh)

 # ./netdrvx1 $(</sys/class/net/mlx4p1/ifindex)
 sh: /sys/kernel/debug/tracing/kprobe_events: No such file or directory
 Success: Loaded file ./netdrvx1_kern.o
 proto 17:    9776320 drops/s

These numbers are quite impressive. Compared to: sending it to a local
socket that drops packets, 1.7Mpps. Compared to: dropping with iptables
in the "raw" table, 3.7Mpps.

If I do multiple flows, via ./pktgen_sample05_flow_per_thread.sh,
then I hit this strange 14.5Mpps limit (proto 17:   14505558 drops/s).
And the 4x RX CPUs are starting to NOT use 100% in softirq; they have
some cycles attributed to %idle. (I verified the generator is sending
at 24Mpps).


> Other tests were run, for instance without the dropcnt increment or
> without reading from the packet header, the packet rate was mostly
> unchanged.

If I change the program to not touch packet data (don't call
load_byte()) then performance increases to 14.6Mpps (single
flow/cpu).  And the RX CPU is mostly idle... mlx4_en_process_rx_cq()
and the page alloc/free functions take the time.
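For reference, the "touch packet data" step being discussed boils down to a single byte load of the IPv4 protocol field. A userspace model of that check (offsets assume an untagged Ethernet + IPv4 frame; the function name is illustrative, not the sample's actual code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN    14
#define IPPROTO_OFF  9  /* offset of the protocol field in the IPv4 header */

/* Return the IPv4 protocol number (17 == UDP), or -1 if too short. */
static int pkt_ip_proto(const uint8_t *data, int len)
{
	if (len < ETH_HLEN + IPPROTO_OFF + 1)
		return -1;
	/* This one-byte read is what shows up for the interpreted program
	 * as sk_load_byte_positive_offset() in the perf profiles: it is
	 * the first touch of the packet payload, hence the likely cache
	 * miss on machines without DDIO. */
	return data[ETH_HLEN + IPPROTO_OFF];
}
```

So the drop/count sample pays for exactly one cache line of packet data; removing this read is why the no-touch variant runs at ring speed with the CPU mostly idle.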

> $ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
> proto 17:   14597724 drops/s
> 
> ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> Running... ctrl^C to stop
> Device: eth4@0
> Result: OK: 6486875(c6485849+d1026) usec, 23689465 (60byte,0frags)
>   3651906pps 1752Mb/sec (1752914880bps) errors: 0
> Device: eth4@1
> Result: OK: 6486874(c6485656+d1217) usec, 23689489 (60byte,0frags)
>   3651911pps 1752Mb/sec (1752917280bps) errors: 0
> Device: eth4@2
> Result: OK: 6486851(c6485730+d1120) usec, 23687853 (60byte,0frags)
>   3651672pps 1752Mb/sec (1752802560bps) errors: 0
> Device: eth4@3
> Result: OK: 6486879(c6485807+d1071) usec, 23688954 (60byte,0frags)
>   3651825pps 1752Mb/sec (1752876000bps) errors: 0
> 
> perf report --no-children:
>   18.36%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_process_rx_cq
>   15.98%  swapper        [kernel.vmlinux]  [k] poll_idle
>   12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
>    6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
>    4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
>    4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
>    3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
>    2.39%  ksoftirqd/1    [mdio]            [k] 0x00000000000074cd
>    2.23%  swapper        [mlx4_en]         [k] mlx4_en_alloc_frags
>    2.20%  ksoftirqd/1    [kernel.vmlinux]  [k] free_pages_prepare
>    2.08%  ksoftirqd/1    [mlx4_en]         [k] mlx4_call_bpf
>    1.57%  ksoftirqd/1    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
>    1.35%  ksoftirqd/1    [mdio]            [k] 0x00000000000074fa
>    1.09%  ksoftirqd/1    [kernel.vmlinux]  [k] free_one_page
>    1.02%  ksoftirqd/1    [kernel.vmlinux]  [k] bpf_map_lookup_elem
>    0.90%  ksoftirqd/1    [kernel.vmlinux]  [k] __alloc_pages_nodemask
>    0.88%  swapper        [kernel.vmlinux]  [k] intel_idle
>    0.82%  ksoftirqd/1    [mdio]            [k] 0x00000000000074be
>    0.80%  swapper        [mlx4_en]         [k] mlx4_en_free_frag

My picture (single flow/cpu) looks a little bit different:

 +   64.33%  ksoftirqd/7    [kernel.vmlinux]  [k] __bpf_prog_run
 +    9.60%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_alloc_frags
 +    7.71%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_process_rx_cq
 +    5.47%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_free_frag
 +    1.68%  ksoftirqd/7    [kernel.vmlinux]  [k] get_page_from_freelist
 +    1.52%  ksoftirqd/7    [mlx4_en]         [k] mlx4_call_bpf
 +    1.02%  ksoftirqd/7    [kernel.vmlinux]  [k] free_pages_prepare
 +    0.72%  ksoftirqd/7    [mlx4_en]         [k] mlx4_alloc_pages.isra.20
 +    0.70%  ksoftirqd/7    [kernel.vmlinux]  [k] __rcu_read_unlock
 +    0.65%  ksoftirqd/7    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem

On my i7-4790K CPU, I don't have DDIO, thus I assume this high cost in
__bpf_prog_run is due to a cache-miss on the packet data.

> machine specs:
>  receiver - Intel E5-1630 v3 @ 3.70GHz
>  sender - Intel E5645 @ 2.40GHz
>  Mellanox ConnectX-3 @40G

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] Add sample for adding simple drop program to link
  2016-04-06 19:48   ` Jesper Dangaard Brouer
@ 2016-04-06 20:01     ` Jesper Dangaard Brouer
  2016-04-06 23:11       ` Alexei Starovoitov
  2016-04-06 20:03     ` Daniel Borkmann
  1 sibling, 1 reply; 62+ messages in thread
From: Jesper Dangaard Brouer @ 2016-04-06 20:01 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, Or Gerlitz, daniel,
	john.fastabend, brouer

On Wed, 6 Apr 2016 21:48:48 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> I'm testing with this program and these patches, after getting past the
> challenge of compiling the samples/bpf files ;-)
> 
> 
> On Fri,  1 Apr 2016 18:21:58 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
> 
> > Add a sample program that only drops packets at the
> > BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
> > observed single core rate is ~14.6Mpps.  
> 
> On my i7-4790K CPU @ 4.00GHz I'm seeing 9.7Mpps (single flow/cpu).
> (generator: pktgen_sample03_burst_single_flow.sh)
> 
>  # ./netdrvx1 $(</sys/class/net/mlx4p1/ifindex)
>  sh: /sys/kernel/debug/tracing/kprobe_events: No such file or directory
>  Success: Loaded file ./netdrvx1_kern.o
>  proto 17:    9776320 drops/s
> 
> These numbers are quite impressive. Compared to: sending it to a local
> socket that drops packets, 1.7Mpps. Compared to: dropping with iptables
> in the "raw" table, 3.7Mpps.
> 
> If I do multiple flows, via ./pktgen_sample05_flow_per_thread.sh
> then I hit this strange 14.5Mpps limit (proto 17:   14505558 drops/s).
> And the RX 4x CPUs are starting to NOT use 100% in softirq, they have
> some cycles attributed to %idle. (I verified generator is sending at
> 24Mpps).
> 
> 
> > Other tests were run, for instance without the dropcnt increment or
> > without reading from the packet header, the packet rate was mostly
> > unchanged.  
> 
> If I change the program to not touch packet data (don't call
> load_byte()) then performance increases to 14.6Mpps (single
> flow/cpu).  And the RX CPU is mostly idle... mlx4_en_process_rx_cq()
> and the page alloc/free functions take the time.
> 
> > $ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
> > proto 17:   14597724 drops/s
> > 
> > ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> > Running... ctrl^C to stop
> > Device: eth4@0
> > Result: OK: 6486875(c6485849+d1026) usec, 23689465 (60byte,0frags)
> >   3651906pps 1752Mb/sec (1752914880bps) errors: 0
> > Device: eth4@1
> > Result: OK: 6486874(c6485656+d1217) usec, 23689489 (60byte,0frags)
> >   3651911pps 1752Mb/sec (1752917280bps) errors: 0
> > Device: eth4@2
> > Result: OK: 6486851(c6485730+d1120) usec, 23687853 (60byte,0frags)
> >   3651672pps 1752Mb/sec (1752802560bps) errors: 0
> > Device: eth4@3
> > Result: OK: 6486879(c6485807+d1071) usec, 23688954 (60byte,0frags)
> >   3651825pps 1752Mb/sec (1752876000bps) errors: 0
> > 
> > perf report --no-children:
> >   18.36%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_process_rx_cq
> >   15.98%  swapper        [kernel.vmlinux]  [k] poll_idle
> >   12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
> >    6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
> >    4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
> >    4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
> >    3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> >    2.39%  ksoftirqd/1    [mdio]            [k] 0x00000000000074cd
> >    2.23%  swapper        [mlx4_en]         [k] mlx4_en_alloc_frags
> >    2.20%  ksoftirqd/1    [kernel.vmlinux]  [k] free_pages_prepare
> >    2.08%  ksoftirqd/1    [mlx4_en]         [k] mlx4_call_bpf
> >    1.57%  ksoftirqd/1    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
> >    1.35%  ksoftirqd/1    [mdio]            [k] 0x00000000000074fa
> >    1.09%  ksoftirqd/1    [kernel.vmlinux]  [k] free_one_page
> >    1.02%  ksoftirqd/1    [kernel.vmlinux]  [k] bpf_map_lookup_elem
> >    0.90%  ksoftirqd/1    [kernel.vmlinux]  [k] __alloc_pages_nodemask
> >    0.88%  swapper        [kernel.vmlinux]  [k] intel_idle
> >    0.82%  ksoftirqd/1    [mdio]            [k] 0x00000000000074be
> >    0.80%  swapper        [mlx4_en]         [k] mlx4_en_free_frag  
> 
> My picture (single flow/cpu) looks a little bit different:
> 
>  +   64.33%  ksoftirqd/7    [kernel.vmlinux]  [k] __bpf_prog_run
>  +    9.60%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_alloc_frags
>  +    7.71%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_process_rx_cq
>  +    5.47%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_free_frag
>  +    1.68%  ksoftirqd/7    [kernel.vmlinux]  [k] get_page_from_freelist
>  +    1.52%  ksoftirqd/7    [mlx4_en]         [k] mlx4_call_bpf
>  +    1.02%  ksoftirqd/7    [kernel.vmlinux]  [k] free_pages_prepare
>  +    0.72%  ksoftirqd/7    [mlx4_en]         [k] mlx4_alloc_pages.isra.20
>  +    0.70%  ksoftirqd/7    [kernel.vmlinux]  [k] __rcu_read_unlock
>  +    0.65%  ksoftirqd/7    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
> 
> On my i7-4790K CPU, I don't have DDIO, thus I assume this high cost in
> __bpf_prog_run is due to a cache-miss on the packet data.

Before someone else points out the obvious... I forgot to enable JIT.
Enable it:

 # echo 1 > /proc/sys/net/core/bpf_jit_enable

Performance increased to: 10.8Mpps (proto 17:   10819446 drops/s)

 Samples: 51K of event 'cycles', Event count (approx.): 56775706510
   Overhead  Command      Shared Object     Symbol
 +   55.90%  ksoftirqd/7  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
 +   10.71%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_alloc_frags
 +    8.26%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_process_rx_cq
 +    5.94%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_free_frag
 +    2.04%  ksoftirqd/7  [kernel.vmlinux]  [k] get_page_from_freelist
 +    2.03%  ksoftirqd/7  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
 +    1.42%  ksoftirqd/7  [mlx4_en]         [k] mlx4_call_bpf
 +    1.04%  ksoftirqd/7  [kernel.vmlinux]  [k] free_pages_prepare
 +    1.03%  ksoftirqd/7  [kernel.vmlinux]  [k] __rcu_read_unlock
 +    0.97%  ksoftirqd/7  [mlx4_en]         [k] mlx4_alloc_pages.isra.20
 +    0.95%  ksoftirqd/7  [devlink]         [k] 0x0000000000005f87
 +    0.58%  ksoftirqd/7  [devlink]         [k] 0x0000000000005f8f
 +    0.49%  ksoftirqd/7  [kernel.vmlinux]  [k] __free_pages_ok
 +    0.47%  ksoftirqd/7  [kernel.vmlinux]  [k] __rcu_read_lock
 +    0.46%  ksoftirqd/7  [kernel.vmlinux]  [k] free_one_page
 +    0.38%  ksoftirqd/7  [kernel.vmlinux]  [k] net_rx_action
 +    0.36%  ksoftirqd/7  [kernel.vmlinux]  [k] bpf_map_lookup_elem
 +    0.36%  ksoftirqd/7  [kernel.vmlinux]  [k] __mod_zone_page_state
 +    0.34%  ksoftirqd/7  [kernel.vmlinux]  [k] __alloc_pages_nodemask
 +    0.32%  ksoftirqd/7  [kernel.vmlinux]  [k] _raw_spin_lock
 +    0.31%  ksoftirqd/7  [devlink]         [k] 0x0000000000005f0a
 +    0.29%  ksoftirqd/7  [kernel.vmlinux]  [k] next_zones_zonelist

It is very likely a cache miss in sk_load_byte_positive_offset().

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] Add sample for adding simple drop program to link
  2016-04-06 19:48   ` Jesper Dangaard Brouer
  2016-04-06 20:01     ` Jesper Dangaard Brouer
@ 2016-04-06 20:03     ` Daniel Borkmann
  1 sibling, 0 replies; 62+ messages in thread
From: Daniel Borkmann @ 2016-04-06 20:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, Or Gerlitz, john.fastabend

On 04/06/2016 09:48 PM, Jesper Dangaard Brouer wrote:
>
> I'm testing with this program and these patches, after getting past the
> challenge of compiling the samples/bpf files ;-)
>
> On Fri,  1 Apr 2016 18:21:58 -0700 Brenden Blanco <bblanco@plumgrid.com> wrote:
>
>> Add a sample program that only drops packets at the
>> BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
>> observed single core rate is ~14.6Mpps.
>
> On my i7-4790K CPU @ 4.00GHz I'm seeing 9.7Mpps (single flow/cpu).
> (generator: pktgen_sample03_burst_single_flow.sh)
>
>   # ./netdrvx1 $(</sys/class/net/mlx4p1/ifindex)
>   sh: /sys/kernel/debug/tracing/kprobe_events: No such file or directory
>   Success: Loaded file ./netdrvx1_kern.o
>   proto 17:    9776320 drops/s
>
> These numbers are quite impressive. Compared to: sending it to a local
> socket that drops packets, 1.7Mpps. Compared to: dropping with iptables
> in the "raw" table, 3.7Mpps.
>
> If I do multiple flows, via ./pktgen_sample05_flow_per_thread.sh
> then I hit this strange 14.5Mpps limit (proto 17:   14505558 drops/s).
> And the RX 4x CPUs are starting to NOT use 100% in softirq, they have
> some cycles attributed to %idle. (I verified generator is sending at
> 24Mpps).
>
>> Other tests were run, for instance without the dropcnt increment or
>> without reading from the packet header, the packet rate was mostly
>> unchanged.
>
> If I change the program to not touch packet data (don't call
> load_byte()) then performance increases to 14.6Mpps (single
> flow/cpu).  And the RX CPU is mostly idle... mlx4_en_process_rx_cq()
> and the page alloc/free functions take the time.
>
>> $ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
>> proto 17:   14597724 drops/s
>>
>> ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
>> Running... ctrl^C to stop
>> Device: eth4@0
>> Result: OK: 6486875(c6485849+d1026) usec, 23689465 (60byte,0frags)
>>    3651906pps 1752Mb/sec (1752914880bps) errors: 0
>> Device: eth4@1
>> Result: OK: 6486874(c6485656+d1217) usec, 23689489 (60byte,0frags)
>>    3651911pps 1752Mb/sec (1752917280bps) errors: 0
>> Device: eth4@2
>> Result: OK: 6486851(c6485730+d1120) usec, 23687853 (60byte,0frags)
>>    3651672pps 1752Mb/sec (1752802560bps) errors: 0
>> Device: eth4@3
>> Result: OK: 6486879(c6485807+d1071) usec, 23688954 (60byte,0frags)
>>    3651825pps 1752Mb/sec (1752876000bps) errors: 0
>>
>> perf report --no-children:
>>    18.36%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_process_rx_cq
>>    15.98%  swapper        [kernel.vmlinux]  [k] poll_idle
>>    12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
>>     6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
>>     4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
>>     4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
>>     3.32%  ksoftirqd/1    [kernel.vmlinux]  [k] sk_load_byte_positive_offset
>>     2.39%  ksoftirqd/1    [mdio]            [k] 0x00000000000074cd
>>     2.23%  swapper        [mlx4_en]         [k] mlx4_en_alloc_frags
>>     2.20%  ksoftirqd/1    [kernel.vmlinux]  [k] free_pages_prepare
>>     2.08%  ksoftirqd/1    [mlx4_en]         [k] mlx4_call_bpf
>>     1.57%  ksoftirqd/1    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
>>     1.35%  ksoftirqd/1    [mdio]            [k] 0x00000000000074fa
>>     1.09%  ksoftirqd/1    [kernel.vmlinux]  [k] free_one_page
>>     1.02%  ksoftirqd/1    [kernel.vmlinux]  [k] bpf_map_lookup_elem
>>     0.90%  ksoftirqd/1    [kernel.vmlinux]  [k] __alloc_pages_nodemask
>>     0.88%  swapper        [kernel.vmlinux]  [k] intel_idle
>>     0.82%  ksoftirqd/1    [mdio]            [k] 0x00000000000074be
>>     0.80%  swapper        [mlx4_en]         [k] mlx4_en_free_frag
>
> My picture (single flow/cpu) looks a little bit different:
>
>   +   64.33%  ksoftirqd/7    [kernel.vmlinux]  [k] __bpf_prog_run

Looks like 'echo 1 > /proc/sys/net/core/bpf_jit_enable' is missing?

>   +    9.60%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_alloc_frags
>   +    7.71%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_process_rx_cq
>   +    5.47%  ksoftirqd/7    [mlx4_en]         [k] mlx4_en_free_frag
>   +    1.68%  ksoftirqd/7    [kernel.vmlinux]  [k] get_page_from_freelist
>   +    1.52%  ksoftirqd/7    [mlx4_en]         [k] mlx4_call_bpf
>   +    1.02%  ksoftirqd/7    [kernel.vmlinux]  [k] free_pages_prepare
>   +    0.72%  ksoftirqd/7    [mlx4_en]         [k] mlx4_alloc_pages.isra.20
>   +    0.70%  ksoftirqd/7    [kernel.vmlinux]  [k] __rcu_read_unlock
>   +    0.65%  ksoftirqd/7    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
>
> On my i7-4790K CPU, I don't have DDIO, thus I assume this high cost in
> __bpf_prog_run is due to a cache-miss on the packet data.
>
>> machine specs:
>>   receiver - Intel E5-1630 v3 @ 3.70GHz
>>   sender - Intel E5645 @ 2.40GHz
>>   Mellanox ConnectX-3 @40G
>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [RFC PATCH 5/5] Add sample for adding simple drop program to link
  2016-04-06 20:01     ` Jesper Dangaard Brouer
@ 2016-04-06 23:11       ` Alexei Starovoitov
  0 siblings, 0 replies; 62+ messages in thread
From: Alexei Starovoitov @ 2016-04-06 23:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Brenden Blanco, davem, netdev, tom, Or Gerlitz, daniel, john.fastabend

On Wed, Apr 06, 2016 at 10:01:00PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 6 Apr 2016 21:48:48 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> > If I do multiple flows, via ./pktgen_sample05_flow_per_thread.sh
> > then I hit this strange 14.5Mpps limit (proto 17:   14505558 drops/s).
> > And the RX 4x CPUs are starting to NOT use 100% in softirq, they have
> > some cycles attributed to %idle. (I verified generator is sending at
> > 24Mpps).
...
> > If I change the program to not touch packet data (don't call
> > load_byte()) then the performance increase to 14.6Mpps (single
> > flow/cpu).  And the RX CPU is mostly idle... mlx4_en_process_rx_cq()
> > and page alloc/free functions taking the time.

Please try it with the module param log_num_mgm_entry_size=-1.
It should get to 20Mpps when bpf doesn't touch the packet.
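For context, log_num_mgm_entry_size is an mlx4_core module parameter (-1 selects device-managed flow steering), so it can only be set at module load time; a hedged sketch, with paths assuming a modprobe.d-based setup:

```shell
# The parameter belongs to mlx4_core and only takes effect on module
# load, e.g. (as root):
#
#   echo 'options mlx4_core log_num_mgm_entry_size=-1' > /etc/modprobe.d/mlx4.conf
#   modprobe -r mlx4_en mlx4_core && modprobe mlx4_core
#
# With the driver loaded, the active value is visible in sysfs:
p=/sys/module/mlx4_core/parameters/log_num_mgm_entry_size
[ -r "$p" ] && cat "$p" || echo "mlx4_core not loaded on this machine"
```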

> Before someone else point out the obvious... I forgot to enable JIT.
> Enable it::
> 
>  # echo 1 > /proc/sys/net/core/bpf_jit_enable
> 
> Performance increased to: 10.8Mpps (proto 17:   10819446 drops/s)
> 
>  Samples: 51K of event 'cycles', Event count (approx.): 56775706510
>    Overhead  Command      Shared Object     Symbol
>  +   55.90%  ksoftirqd/7  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
>  +   10.71%  ksoftirqd/7  [mlx4_en]         [k] mlx4_en_alloc_frags
... 
> It is a very likely cache-miss in sk_load_byte_positive_offset().

yes, likely due to missing DDIO as you said.


end of thread (newest: 2016-04-06 23:11 UTC)

Thread overview: 62+ messages
2016-04-02  1:21 [RFC PATCH 0/5] Add driver bpf hook for early packet drop Brenden Blanco
2016-04-02  1:21 ` [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter Brenden Blanco
2016-04-02 16:39   ` Tom Herbert
2016-04-03  7:02     ` Brenden Blanco
2016-04-04 22:07       ` Thomas Graf
2016-04-05  8:19         ` Jesper Dangaard Brouer
2016-04-04  8:49   ` Daniel Borkmann
2016-04-04 13:07     ` Jesper Dangaard Brouer
2016-04-04 13:36       ` Daniel Borkmann
2016-04-04 14:09         ` Tom Herbert
2016-04-04 15:12           ` Jesper Dangaard Brouer
2016-04-04 15:29             ` Brenden Blanco
2016-04-04 16:07               ` John Fastabend
2016-04-04 16:17                 ` Brenden Blanco
2016-04-04 20:00                   ` Alexei Starovoitov
2016-04-04 22:04                     ` Thomas Graf
2016-04-05  2:25                       ` Alexei Starovoitov
2016-04-05  8:11                         ` Jesper Dangaard Brouer
2016-04-05  9:29                     ` Jesper Dangaard Brouer
2016-04-05 22:06                       ` Alexei Starovoitov
2016-04-04 14:33       ` Eric Dumazet
2016-04-04 15:18         ` Edward Cree
2016-04-02  1:21 ` [RFC PATCH 2/5] net: add ndo to set bpf prog in adapter rx Brenden Blanco
2016-04-02  1:21 ` [RFC PATCH 3/5] rtnl: add option for setting link bpf prog Brenden Blanco
2016-04-02  1:21 ` [RFC PATCH 4/5] mlx4: add support for fast rx drop bpf program Brenden Blanco
2016-04-02  2:08   ` Eric Dumazet
2016-04-02  2:47     ` Alexei Starovoitov
2016-04-04 14:57       ` Jesper Dangaard Brouer
2016-04-04 15:22         ` Eric Dumazet
2016-04-04 18:50           ` Alexei Starovoitov
2016-04-05 14:15             ` Or Gerlitz
2016-04-06  4:05               ` Brenden Blanco
2016-04-03  6:15     ` Brenden Blanco
2016-04-05  2:20       ` Brenden Blanco
2016-04-05  2:44         ` Eric Dumazet
2016-04-05 18:59         ` Eran Ben Elisha
2016-04-02  8:23   ` Jesper Dangaard Brouer
2016-04-03  6:11     ` Brenden Blanco
2016-04-04 18:27       ` Alexei Starovoitov
2016-04-05  6:04         ` Jesper Dangaard Brouer
2016-04-02 18:40   ` Johannes Berg
2016-04-03  6:38     ` Brenden Blanco
2016-04-04  7:35       ` Johannes Berg
2016-04-04  9:57         ` Daniel Borkmann
2016-04-04 18:46           ` Alexei Starovoitov
2016-04-04 21:01             ` Daniel Borkmann
2016-04-05  1:17               ` Alexei Starovoitov
2016-04-04  8:33   ` Jesper Dangaard Brouer
2016-04-04  9:22   ` Daniel Borkmann
2016-04-02  1:21 ` [RFC PATCH 5/5] Add sample for adding simple drop program to link Brenden Blanco
2016-04-06 19:48   ` Jesper Dangaard Brouer
2016-04-06 20:01     ` Jesper Dangaard Brouer
2016-04-06 23:11       ` Alexei Starovoitov
2016-04-06 20:03     ` Daniel Borkmann
2016-04-02 16:47 ` [RFC PATCH 0/5] Add driver bpf hook for early packet drop Tom Herbert
2016-04-03  5:41   ` Brenden Blanco
2016-04-04  7:48     ` Jesper Dangaard Brouer
2016-04-04 18:10       ` Alexei Starovoitov
2016-04-02 18:41 ` Johannes Berg
2016-04-02 22:57   ` Tom Herbert
2016-04-03  2:28     ` Lorenzo Colitti
2016-04-04  7:37       ` Johannes Berg
