* [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding
@ 2016-07-19 19:16 Brenden Blanco
  2016-07-19 19:16 ` [PATCH v10 01/12] bpf: add bpf_prog_add api for bulk prog refcnt Brenden Blanco
                   ` (12 more replies)
  0 siblings, 13 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

This patch set introduces new infrastructure for programmatically
processing packets in the earliest stages of rx, as part of an effort
others are calling eXpress Data Path (XDP) [1]. Start this effort by
introducing a new bpf program type for early packet filtering, before
even an skb has been allocated.

Extend this further with the ability to modify packet data and send it
back out on the same port.
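
As a rough illustration of the programming model (the real samples are
added in patches 6 and 12), an XDP program is a small restricted-C
function operating on the struct xdp_md context and returning one of the
action codes defined in patch 2. The sketch below assumes those
definitions plus the samples' bpf_helpers.h:

#define KBUILD_MODNAME "xdp_sketch"
#include <uapi/linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include "bpf_helpers.h"

SEC("xdp_sketch")
int xdp_drop_non_ip(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;

	/* every access must be bounds-checked against data_end */
	if (data + sizeof(*eth) > data_end)
		return XDP_DROP;

	if (eth->h_proto == htons(ETH_P_IP) ||
	    eth->h_proto == htons(ETH_P_IPV6))
		return XDP_PASS;

	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";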

Patch 1 adds an API for bulk bpf prog refcnt increment.
Patch 2 introduces the new prog type and helpers for validating the bpf
  program. A new userspace struct is defined containing only data and
  data_end as fields, with others to follow in the future.
In patch 3, create a new ndo to pass the fd to supported drivers.
In patch 4, expose a new rtnl option to userspace.
In patch 5, enable support in mlx4 driver.
In patch 6, create a sample drop and count program. With a single core,
  achieved a ~20 Mpps drop rate on a 40G ConnectX3-Pro. This includes
  packet data access, bpf array lookup, and increment.
In patch 7, add a page recycle facility to mlx4 rx, enabled when xdp is
  active.
In patch 8, add the XDP_TX type to bpf.h.
In patch 9, add a helper in the tx path for writing the tx_desc.
In patch 10, add support in mlx4 for packet data write and forwarding.
In patch 11, turn on packet write support in the bpf verifier.
In patch 12, add a sample program for packet write and forwarding. With
  a single core, achieved ~10 Mpps rewrite and forwarding.

[1] https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf

v10:
 1/12: Add bulk refcnt api.
 5/12: Move prog from priv to ring. This attribute is still only set
   globally, but the path to finer granularity should be clear. No lock
   is taken, so some rings may operate on older programs for a time (one
   napi loop). Looked into options such as napi_synchronize, but they
   were deemed too slow (calls to msleep).
   Rename prog to xdp_prog. Add xdp_ring_num to help with accounting,
   used more heavily in later patches.
 7/12: Adjust to use per-ring xdp prog. Use priv->xdp_ring_num where
   before priv->prog was used to determine buffer allocations.
 9/12: Add cpu_to_be16 to vlan_tag in mlx4_en_xmit(). Remove unused variable
   from mlx4_en_xmit and unused params from build_inline_wqe.

v9:
 4/11: Add missing newline in en_err message.
 6/11: Move page_cache cleanup from mlx4_en_destroy_rx_ring to
   mlx4_en_deactivate_rx_ring. Move mlx4_en_moderation_update back to
   static. Remove calls to mlx4_en_alloc/free_resources in mlx4_xdp_set.
   Adopt instead the approach of mlx4_en_change_mtu to use a watchdog.
 9/11: Use a per-ring function pointer in tx to separate out the code
   for regular and recycle paths of tx completion handling. Add a helper
   function to init the recycle ring and callback, called just after
   activating tx. Remove extra tx ring resource requirement, and instead
   steal from the upper rings. This helps to avoid needing
   mlx4_en_alloc_resources. Add some hopefully meaningful error
   messages for the various error cases. Reverted some of the
   hard-to-follow logic that was accounting for the extra tx rings.

v8:
 1/11: Reduce WARN_ONCE to single line. Also, change act param of that
   function to u32 to match return type of bpf_prog_run_xdp.
 2/11: Clarify locking semantics in ndo comment.
 4/11: Add en_err warning in mlx4_xdp_set on num_frags/mtu violation.

v7:
 Addressing two of the major discussion points: return codes and ndo.
 The rest will be taken as todo items for separate patches.

 Add an XDP_ABORTED type, which explicitly falls through to DROP. The
 same result must be taken for the default case as well, as it is now
 well-defined API behavior.

 Merge ndo_xdp_* into a single ndo. The style is similar to
 ndo_setup_tc, but with a less unidirectional naming convention. The IFLA
 parameter names are unchanged.

 TODOs:
 Add ethtool per-ring stats for aborted, default cases, maybe even drop
 and tx as well.
 Avoid duplicate dma sync operation in XDP_PASS case as mentioned by
 Saeed.

  1/12: Add XDP_ABORTED enum, reword API comment, and update commit
   message.
  2/12: Rewrite ndo_xdp_*() into single ndo_xdp() with type/union style
    calling convention.
  3/12: Switch to ndo_xdp callback.
  4/12: Add XDP_ABORTED case as a fall-through to XDP_DROP. Implement
    ndo_xdp.
 12/12: Dropped, this will need some more work.

v6:
  2/12: drop unnecessary netif_device_present check
  4/12, 6/12, 9/12: Reorder default case statement above drop case to
    remove some copy/paste.

v5:
  0/12: Rebase and remove previous 1/13 patch
  1/12: Fix nits from Daniel. Left the (void *) cast as-is, to be fixed
    in future. Add bpf_warn_invalid_xdp_action() helper, to be used when
    out of bounds action is returned by the program. Add a comment to
    bpf.h denoting the undefined nature of out of bounds returns.
  2/12: Switch to using bpf_prog_get_type(). Rename ndo_xdp_get() to
    ndo_xdp_attached().
  3/12: Add IFLA_XDP as a nested type, and add the associated nla_policy
    for the new subtypes IFLA_XDP_FD and IFLA_XDP_ATTACHED.
  4/12: Fixup the use of READ_ONCE in the ndos. Add a user of
    bpf_warn_invalid_xdp_action helper.
  5/12: Adjust to using the nested netlink options.
  6/12: kbuild was complaining about overflow of u16 on tile
    architecture... bump frag_stride to u32. The page_offset member that
    is computed from this was already u32.

v4:
  2/12: Add inline helper for calling xdp bpf prog under rcu
  3/12: Add detail to ndo comments
  5/12: Remove mlx4_call_xdp and use inline helper instead.
  6/12: Fix checkpatch complaints
  9/12: Introduce new patch 9/12 with common helper for tx_desc write
    Refactor to use common tx_desc write helper
 11/12: Fix checkpatch complaints

v3:
  Rewrite from v2 trying to incorporate feedback from multiple sources.
  Specifically, add ability to forward packets out the same port and
    allow packet modification.
  For packet forwarding, the driver reserves a dedicated set of tx rings
    for exclusive use by xdp. Upon completion, the pages on this ring are
    recycled directly back to a small per-rx-ring page cache without
    being dma unmapped.
  Use of the percpu skb is dropped in favor of a lightweight struct
    xdp_buff. The direct packet access feature is leveraged to remove
    dependence on the skb.
  The mlx4 driver implementation allocates a page-per-packet and maps it
    in PCI_DMA_BIDIRECTIONAL mode when the bpf program is activated.
  Naming is converted to use "xdp" instead of "phys_dev".

v2:
  1/5: Drop xdp from types, instead consistently use bpf_phys_dev_
    Introduce enum for return values from phys_dev hook
  2/5: Move prog->type check to just before invoking ndo
    Change ndo to take a bpf_prog * instead of fd
    Add ndo_bpf_get rather than keeping a bool in the netdev struct
  3/5: Use ndo_bpf_get to fetch bool
  4/5: Enforce that only 1 frag is ever given to bpf prog by disallowing
    mtu to increase beyond FRAG_SZ0 when bpf prog is running, or conversely
    to set a bpf prog when priv->num_frags > 1
    Rename pseudo_skb to bpf_phys_dev_md
    Implement ndo_bpf_get
    Add dma sync just before invoking prog
    Check for explicit bpf return code rather than nonzero
    Remove increment of rx_dropped
  5/5: Use explicit bpf return code in example
    Update commit log with higher pps numbers

Brenden Blanco (12):
  bpf: add bpf_prog_add api for bulk prog refcnt
  bpf: add XDP prog type for early driver filter
  net: add ndo to setup/query xdp prog in adapter rx
  rtnl: add option for setting link xdp prog
  net/mlx4_en: add support for fast rx drop bpf program
  Add sample for adding simple drop program to link
  net/mlx4_en: add page recycle to prepare rx ring for tx support
  bpf: add XDP_TX xdp_action for direct forwarding
  net/mlx4_en: break out tx_desc write into separate function
  net/mlx4_en: add xdp forwarding and data write support
  bpf: enable direct packet data write for xdp progs
  bpf: add sample for xdp forwarding and rewrite

 drivers/infiniband/hw/mlx4/qp.c                 |  11 +-
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |   9 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  | 125 +++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      | 124 +++++++++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 274 ++++++++++++++++++------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  44 +++-
 include/linux/bpf.h                             |   1 +
 include/linux/filter.h                          |  18 ++
 include/linux/mlx4/qp.h                         |  18 +-
 include/linux/netdevice.h                       |  34 +++
 include/uapi/linux/bpf.h                        |  21 ++
 include/uapi/linux/if_link.h                    |  12 ++
 kernel/bpf/syscall.c                            |  12 +-
 kernel/bpf/verifier.c                           |  18 +-
 net/core/dev.c                                  |  33 +++
 net/core/filter.c                               |  79 +++++++
 net/core/rtnetlink.c                            |  64 ++++++
 samples/bpf/Makefile                            |   9 +
 samples/bpf/bpf_load.c                          |   8 +
 samples/bpf/xdp1_kern.c                         |  93 ++++++++
 samples/bpf/xdp1_user.c                         | 181 ++++++++++++++++
 samples/bpf/xdp2_kern.c                         | 114 ++++++++++
 22 files changed, 1206 insertions(+), 96 deletions(-)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c
 create mode 100644 samples/bpf/xdp2_kern.c

-- 
2.8.2


* [PATCH v10 01/12] bpf: add bpf_prog_add api for bulk prog refcnt
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 21:46   ` Alexei Starovoitov
  2016-07-19 19:16 ` [PATCH v10 02/12] bpf: add XDP prog type for early driver filter Brenden Blanco
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

A subsystem may need to store many copies of a bpf program, each
deserving its own reference. Rather than requiring the caller to loop
one by one (with possible mid-loop failure), add a bulk bpf_prog_add
api.
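
As a usage sketch (the mlx4 patch later in this series does essentially
this), a driver that wants each of its rx rings to hold its own
reference can take all of them in one call; the ring and variable names
below are illustrative only:

	/* prog already carries one reference from bpf_prog_get_type();
	 * take rx_ring_num - 1 more so that every ring owns one.
	 */
	prog = bpf_prog_add(prog, rx_ring_num - 1);
	if (IS_ERR(prog))
		return PTR_ERR(prog);

	for (i = 0; i < rx_ring_num; i++) {
		old_prog = xchg(&rx_ring[i]->xdp_prog, prog);
		if (old_prog)
			bpf_prog_put(old_prog);
	}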

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/bpf.h  |  1 +
 kernel/bpf/syscall.c | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index c13e92b..75a5ae6 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -224,6 +224,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl);
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
+struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
 struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
 void bpf_prog_put(struct bpf_prog *prog);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 96d938a..228f962 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -670,14 +670,20 @@ static struct bpf_prog *____bpf_prog_get(struct fd f)
 	return f.file->private_data;
 }
 
-struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
 {
-	if (atomic_inc_return(&prog->aux->refcnt) > BPF_MAX_REFCNT) {
-		atomic_dec(&prog->aux->refcnt);
+	if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) {
+		atomic_sub(i, &prog->aux->refcnt);
 		return ERR_PTR(-EBUSY);
 	}
 	return prog;
 }
+EXPORT_SYMBOL_GPL(bpf_prog_add);
+
+struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+{
+	return bpf_prog_add(prog, 1);
+}
 
 static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *type)
 {
-- 
2.8.2


* [PATCH v10 02/12] bpf: add XDP prog type for early driver filter
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
  2016-07-19 19:16 ` [PATCH v10 01/12] bpf: add bpf_prog_add api for bulk prog refcnt Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 21:33   ` Alexei Starovoitov
  2016-07-19 19:16 ` [PATCH v10 03/12] net: add ndo to setup/query xdp prog in adapter rx Brenden Blanco
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

Add a new bpf prog type that is intended to run in early stages of the
packet rx path. Only minimal packet metadata will be available, hence a
new context type, struct xdp_md, is exposed to userspace. So far only
expose the packet start and end pointers, and only in read mode.

An XDP program must return one of the well known enum values, all other
return codes are reserved for future use. Unfortunately, this
restriction is hard to enforce at verification time, so take the
approach of warning at runtime when such programs are encountered. Out
of bounds return codes should alias to XDP_ABORTED.
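
On the driver side, the intended handling of these return values is a
switch like the sketch below (the mlx4 patch later in the series follows
this pattern), where any unknown value warns once and falls through to a
drop:

	u32 act = bpf_prog_run_xdp(xdp_prog, &xdp);

	switch (act) {
	case XDP_PASS:
		break;			/* continue normal rx processing */
	default:
		bpf_warn_invalid_xdp_action(act);
		/* fall through */
	case XDP_ABORTED:
	case XDP_DROP:
		goto next;		/* recycle the buffer, no skb is built */
	}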

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/filter.h   | 18 +++++++++++
 include/uapi/linux/bpf.h | 20 ++++++++++++
 kernel/bpf/verifier.c    |  1 +
 net/core/filter.c        | 79 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 118 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6fc31ef..15d816a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -368,6 +368,11 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
+struct xdp_buff {
+	void *data;
+	void *data_end;
+};
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf and act_bpf programs
  */
@@ -429,6 +434,18 @@ static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 	return BPF_PROG_RUN(prog, skb);
 }
 
+static inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+				   struct xdp_buff *xdp)
+{
+	u32 ret;
+
+	rcu_read_lock();
+	ret = BPF_PROG_RUN(prog, (void *)xdp);
+	rcu_read_unlock();
+
+	return ret;
+}
+
 static inline unsigned int bpf_prog_size(unsigned int proglen)
 {
 	return max(sizeof(struct bpf_prog),
@@ -509,6 +526,7 @@ bool bpf_helper_changes_skb_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
+void bpf_warn_invalid_xdp_action(u32 act);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c4d9224..a517865 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_SCHED_CLS,
 	BPF_PROG_TYPE_SCHED_ACT,
 	BPF_PROG_TYPE_TRACEPOINT,
+	BPF_PROG_TYPE_XDP,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -439,4 +440,23 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* User return codes for XDP prog type.
+ * A valid XDP program must return one of these defined values. All other
+ * return codes are reserved for future use. Unknown return codes will result
+ * in packet drop.
+ */
+enum xdp_action {
+	XDP_ABORTED = 0,
+	XDP_DROP,
+	XDP_PASS,
+};
+
+/* user accessible metadata for XDP packet hook
+ * new fields must be added to the end of this structure
+ */
+struct xdp_md {
+	__u32 data;
+	__u32 data_end;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e206c21..a8d67d0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -713,6 +713,7 @@ static int check_ptr_alignment(struct verifier_env *env, struct reg_state *reg,
 	switch (env->prog->type) {
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
+	case BPF_PROG_TYPE_XDP:
 		break;
 	default:
 		verbose("verifier is misconfigured\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 22e3992..6c627bc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2410,6 +2410,12 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+xdp_func_proto(enum bpf_func_id func_id)
+{
+	return sk_filter_func_proto(func_id);
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2477,6 +2483,44 @@ static bool tc_cls_act_is_valid_access(int off, int size,
 	return __is_valid_access(off, size, type);
 }
 
+static bool __is_valid_xdp_access(int off, int size,
+				  enum bpf_access_type type)
+{
+	if (off < 0 || off >= sizeof(struct xdp_md))
+		return false;
+	if (off % size != 0)
+		return false;
+	if (size != 4)
+		return false;
+
+	return true;
+}
+
+static bool xdp_is_valid_access(int off, int size,
+				enum bpf_access_type type,
+				enum bpf_reg_type *reg_type)
+{
+	if (type == BPF_WRITE)
+		return false;
+
+	switch (off) {
+	case offsetof(struct xdp_md, data):
+		*reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*reg_type = PTR_TO_PACKET_END;
+		break;
+	}
+
+	return __is_valid_xdp_access(off, size, type);
+}
+
+void bpf_warn_invalid_xdp_action(u32 act)
+{
+	WARN_ONCE(1, "Illegal XDP return value %u, expect packet loss\n", act);
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
+
 static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 				      int src_reg, int ctx_off,
 				      struct bpf_insn *insn_buf,
@@ -2628,6 +2672,29 @@ static u32 bpf_net_convert_ctx_access(enum bpf_access_type type, int dst_reg,
 	return insn - insn_buf;
 }
 
+static u32 xdp_convert_ctx_access(enum bpf_access_type type, int dst_reg,
+				  int src_reg, int ctx_off,
+				  struct bpf_insn *insn_buf,
+				  struct bpf_prog *prog)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (ctx_off) {
+	case offsetof(struct xdp_md, data):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data));
+		break;
+	case offsetof(struct xdp_md, data_end):
+		*insn++ = BPF_LDX_MEM(bytes_to_bpf_size(FIELD_SIZEOF(struct xdp_buff, data_end)),
+				      dst_reg, src_reg,
+				      offsetof(struct xdp_buff, data_end));
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
 static const struct bpf_verifier_ops sk_filter_ops = {
 	.get_func_proto		= sk_filter_func_proto,
 	.is_valid_access	= sk_filter_is_valid_access,
@@ -2640,6 +2707,12 @@ static const struct bpf_verifier_ops tc_cls_act_ops = {
 	.convert_ctx_access	= bpf_net_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops xdp_ops = {
+	.get_func_proto		= xdp_func_proto,
+	.is_valid_access	= xdp_is_valid_access,
+	.convert_ctx_access	= xdp_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2655,11 +2728,17 @@ static struct bpf_prog_type_list sched_act_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_SCHED_ACT,
 };
 
+static struct bpf_prog_type_list xdp_type __read_mostly = {
+	.ops	= &xdp_ops,
+	.type	= BPF_PROG_TYPE_XDP,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
+	bpf_register_prog_type(&xdp_type);
 
 	return 0;
 }
-- 
2.8.2


* [PATCH v10 03/12] net: add ndo to setup/query xdp prog in adapter rx
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
  2016-07-19 19:16 ` [PATCH v10 01/12] bpf: add bpf_prog_add api for bulk prog refcnt Brenden Blanco
  2016-07-19 19:16 ` [PATCH v10 02/12] bpf: add XDP prog type for early driver filter Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 19:16 ` [PATCH v10 04/12] rtnl: add option for setting link xdp prog Brenden Blanco
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

Add one new netdev op for drivers implementing the BPF_PROG_TYPE_XDP
filter. The single op is used for both setup/query of the xdp program,
modelled after ndo_setup_tc.
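
A driver implementation is expected to look roughly like the sketch
below; the foo_* names and the priv layout are placeholders, and the
real mlx4 version follows in patch 5:

static int foo_xdp(struct net_device *dev, struct netdev_xdp *xdp)
{
	struct foo_priv *priv = netdev_priv(dev);

	switch (xdp->command) {
	case XDP_SETUP_PROG:
		/* takes ownership of xdp->prog on success */
		return foo_xdp_set(dev, xdp->prog);
	case XDP_QUERY_PROG:
		xdp->prog_attached = !!priv->xdp_prog;
		return 0;
	default:
		return -EINVAL;
	}
}

The driver then points .ndo_xdp at this function in its net_device_ops.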

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/linux/netdevice.h | 34 ++++++++++++++++++++++++++++++++++
 net/core/dev.c            | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 67 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 49736a3..fab9a1c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -63,6 +63,7 @@ struct wpan_dev;
 struct mpls_dev;
 /* UDP Tunnel offloads */
 struct udp_tunnel_info;
+struct bpf_prog;
 
 void netdev_set_default_ethtool_ops(struct net_device *dev,
 				    const struct ethtool_ops *ops);
@@ -799,6 +800,33 @@ struct tc_to_netdev {
 	};
 };
 
+/* These structures hold the attributes of xdp state that are being passed
+ * to the netdevice through the xdp op.
+ */
+enum xdp_netdev_command {
+	/* Set or clear a bpf program used in the earliest stages of packet
+	 * rx. The prog will have been loaded as BPF_PROG_TYPE_XDP. The callee
+	 * is responsible for calling bpf_prog_put on any old progs that are
+	 * stored. In case of error, the callee need not release the new prog
+	 * reference, but on success it takes ownership and must bpf_prog_put
+	 * when it is no longer used.
+	 */
+	XDP_SETUP_PROG,
+	/* Check if a bpf program is set on the device.  The callee should
+	 * return true if a program is currently attached and running.
+	 */
+	XDP_QUERY_PROG,
+};
+
+struct netdev_xdp {
+	enum xdp_netdev_command command;
+	union {
+		/* XDP_SETUP_PROG */
+		struct bpf_prog *prog;
+		/* XDP_QUERY_PROG */
+		bool prog_attached;
+	};
+};
 
 /*
  * This structure defines the management hooks for network devices.
@@ -1087,6 +1115,9 @@ struct tc_to_netdev {
  *	appropriate rx headroom value allows avoiding skb head copy on
  *	forward. Setting a negative value resets the rx headroom to the
  *	default value.
+ * int (*ndo_xdp)(struct net_device *dev, struct netdev_xdp *xdp);
+ *	This function is used to set or query state related to XDP on the
+ *	netdevice. See definition of enum xdp_netdev_command for details.
  *
  */
 struct net_device_ops {
@@ -1271,6 +1302,8 @@ struct net_device_ops {
 						       struct sk_buff *skb);
 	void			(*ndo_set_rx_headroom)(struct net_device *dev,
 						       int needed_headroom);
+	int			(*ndo_xdp)(struct net_device *dev,
+					   struct netdev_xdp *xdp);
 };
 
 /**
@@ -3257,6 +3290,7 @@ int dev_get_phys_port_id(struct net_device *dev,
 int dev_get_phys_port_name(struct net_device *dev,
 			   char *name, size_t len);
 int dev_change_proto_down(struct net_device *dev, bool proto_down);
+int dev_change_xdp_fd(struct net_device *dev, int fd);
 struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *dev);
 struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
diff --git a/net/core/dev.c b/net/core/dev.c
index 7894e40..2a9c39f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -94,6 +94,7 @@
 #include <linux/ethtool.h>
 #include <linux/notifier.h>
 #include <linux/skbuff.h>
+#include <linux/bpf.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/busy_poll.h>
@@ -6615,6 +6616,38 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 EXPORT_SYMBOL(dev_change_proto_down);
 
 /**
+ *	dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ *	@dev: device
+ *	@fd: new program fd or negative value to clear
+ *
+ *	Set or clear a bpf program for a device
+ */
+int dev_change_xdp_fd(struct net_device *dev, int fd)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	struct bpf_prog *prog = NULL;
+	struct netdev_xdp xdp = {};
+	int err;
+
+	if (!ops->ndo_xdp)
+		return -EOPNOTSUPP;
+	if (fd >= 0) {
+		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	xdp.command = XDP_SETUP_PROG;
+	xdp.prog = prog;
+	err = ops->ndo_xdp(dev, &xdp);
+	if (err < 0 && prog)
+		bpf_prog_put(prog);
+
+	return err;
+}
+EXPORT_SYMBOL(dev_change_xdp_fd);
+
+/**
  *	dev_new_index	-	allocate an ifindex
  *	@net: the applicable net namespace
  *
-- 
2.8.2


* [PATCH v10 04/12] rtnl: add option for setting link xdp prog
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (2 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 03/12] net: add ndo to setup/query xdp prog in adapter rx Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-20  8:38   ` Daniel Borkmann
  2016-07-19 19:16 ` [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

Sets the bpf program represented by fd as an early filter in the rx path
of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
Providing a negative value as fd clears the program. Getting the fd back
via rtnl is not possible; therefore, reading this value merely reports a
bool indicating whether or not a program is attached to the link.
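
For a userspace tool, setting the program comes down to nesting an
IFLA_XDP_FD attribute inside IFLA_XDP in an RTM_SETLINK request. A
condensed sketch follows (socket setup and error handling omitted; the
complete version is the sample added in patch 6). Here req is the usual
nlmsghdr plus ifinfomsg request buffer and prog_fd is the loaded
program's fd:

	struct nlattr *nla, *nla_xdp;

	nla = (struct nlattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
	nla->nla_type = NLA_F_NESTED | IFLA_XDP;

	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
	nla_xdp->nla_type = IFLA_XDP_FD;
	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
	memcpy((char *)nla_xdp + NLA_HDRLEN, &prog_fd, sizeof(prog_fd));
	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;

	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
	/* then send(sock, &req, req.nh.nlmsg_len, 0) on a NETLINK_ROUTE socket */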

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/if_link.h | 12 +++++++++
 net/core/rtnetlink.c         | 64 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4285ac3..a1b5202 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -156,6 +156,7 @@ enum {
 	IFLA_GSO_MAX_SEGS,
 	IFLA_GSO_MAX_SIZE,
 	IFLA_PAD,
+	IFLA_XDP,
 	__IFLA_MAX
 };
 
@@ -843,4 +844,15 @@ enum {
 };
 #define LINK_XSTATS_TYPE_MAX (__LINK_XSTATS_TYPE_MAX - 1)
 
+/* XDP section */
+
+enum {
+	IFLA_XDP_UNSPEC,
+	IFLA_XDP_FD,
+	IFLA_XDP_ATTACHED,
+	__IFLA_XDP_MAX,
+};
+
+#define IFLA_XDP_MAX (__IFLA_XDP_MAX - 1)
+
 #endif /* _UAPI_LINUX_IF_LINK_H */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a9e3805..eba2b82 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -891,6 +891,16 @@ static size_t rtnl_port_size(const struct net_device *dev,
 		return port_self_size;
 }
 
+static size_t rtnl_xdp_size(const struct net_device *dev)
+{
+	size_t xdp_size = nla_total_size(1);	/* XDP_ATTACHED */
+
+	if (!dev->netdev_ops->ndo_xdp)
+		return 0;
+	else
+		return xdp_size;
+}
+
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
 				     u32 ext_filter_mask)
 {
@@ -927,6 +937,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
+	       + rtnl_xdp_size(dev) /* IFLA_XDP */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1211,6 +1222,33 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	struct netdev_xdp xdp_op = {};
+	struct nlattr *xdp;
+	int err;
+
+	if (!dev->netdev_ops->ndo_xdp)
+		return 0;
+	xdp = nla_nest_start(skb, IFLA_XDP);
+	if (!xdp)
+		return -EMSGSIZE;
+	xdp_op.command = XDP_QUERY_PROG;
+	err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
+	if (err)
+		goto err_cancel;
+	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_op.prog_attached);
+	if (err)
+		goto err_cancel;
+
+	nla_nest_end(skb, xdp);
+	return 0;
+
+err_cancel:
+	nla_nest_cancel(skb, xdp);
+	return err;
+}
+
 static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			    int type, u32 pid, u32 seq, u32 change,
 			    unsigned int flags, u32 ext_filter_mask)
@@ -1307,6 +1345,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	if (rtnl_port_fill(skb, dev, ext_filter_mask))
 		goto nla_put_failure;
 
+	if (rtnl_xdp_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;
@@ -1392,6 +1433,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
 	[IFLA_LINK_NETNSID]	= { .type = NLA_S32 },
 	[IFLA_PROTO_DOWN]	= { .type = NLA_U8 },
+	[IFLA_XDP]		= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -1429,6 +1471,11 @@ static const struct nla_policy ifla_port_policy[IFLA_PORT_MAX+1] = {
 	[IFLA_PORT_RESPONSE]	= { .type = NLA_U16, },
 };
 
+static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
+	[IFLA_XDP_FD]		= { .type = NLA_S32 },
+	[IFLA_XDP_ATTACHED]	= { .type = NLA_U8 },
+};
+
 static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
 {
 	const struct rtnl_link_ops *ops = NULL;
@@ -2054,6 +2101,23 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_NOTIFY;
 	}
 
+	if (tb[IFLA_XDP]) {
+		struct nlattr *xdp[IFLA_XDP_MAX + 1];
+
+		err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
+				       ifla_xdp_policy);
+		if (err < 0)
+			goto errout;
+
+		if (xdp[IFLA_XDP_FD]) {
+			err = dev_change_xdp_fd(dev,
+						nla_get_s32(xdp[IFLA_XDP_FD]));
+			if (err)
+				goto errout;
+			status |= DO_SETLINK_NOTIFY;
+		}
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if (status & DO_SETLINK_NOTIFY)
-- 
2.8.2


* [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (3 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 04/12] rtnl: add option for setting link xdp prog Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 21:41   ` Alexei Starovoitov
                     ` (3 more replies)
  2016-07-19 19:16 ` [PATCH v10 06/12] Add sample for adding simple drop program to link Brenden Blanco
                   ` (7 subsequent siblings)
  12 siblings, 4 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.

In tc/socket bpf programs, helpers linearize skb fragments as needed
when the program touches the packet data. However, in the pursuit of
speed, XDP programs will not be allowed to use these slower functions,
especially if it involves allocating an skb.

Therefore, disallow MTU settings that would produce a multi-fragment
packet that XDP programs would fail to access. Future enhancements could
be done to increase the allowable MTU.
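
Concretely, assuming the driver's usual first fragment size of
FRAG_SZ0 = 1536 bytes, the restriction works out to:

	/* MLX4_EN_EFF_MTU(mtu) = mtu + ETH_HLEN + 2 * VLAN_HLEN
	 *                      = mtu + 14 + 2 * 4 = mtu + 22
	 *
	 * Single-fragment rx (and hence XDP) requires
	 * MLX4_EN_EFF_MTU(mtu) <= FRAG_SZ0, i.e. mtu <= 1536 - 22 = 1514.
	 */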

The xdp program is stored as a per-ring data structure, but it is not
yet possible to set it at that granularity through any ndo.

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 60 ++++++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 40 +++++++++++++++--
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  6 +++
 3 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 6083775..c34a33d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -31,6 +31,7 @@
  *
  */
 
+#include <linux/bpf.h>
 #include <linux/etherdevice.h>
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
@@ -2112,6 +2113,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 		en_err(priv, "Bad MTU size:%d.\n", new_mtu);
 		return -EPERM;
 	}
+	if (priv->xdp_ring_num && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
+		en_err(priv, "MTU size:%d requires frags but XDP running\n",
+		       new_mtu);
+		return -EOPNOTSUPP;
+	}
 	dev->mtu = new_mtu;
 
 	if (netif_running(dev)) {
@@ -2520,6 +2526,58 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 	return err;
 }
 
+static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+	int xdp_ring_num;
+	int i;
+
+	xdp_ring_num = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;
+
+	if (priv->num_frags > 1) {
+		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (prog) {
+		prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	priv->xdp_ring_num = xdp_ring_num;
+
+	/* This xchg is paired with READ_ONCE in the fast path */
+	for (i = 0; i < priv->rx_ring_num; i++) {
+		old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
+		if (old_prog)
+			bpf_prog_put(old_prog);
+	}
+
+	return 0;
+}
+
+static bool mlx4_xdp_attached(struct net_device *dev)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+
+	return !!priv->xdp_ring_num;
+}
+
+static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return mlx4_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = mlx4_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_open		= mlx4_en_open,
 	.ndo_stop		= mlx4_en_close,
@@ -2548,6 +2606,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp		= mlx4_xdp,
 };
 
 static const struct net_device_ops mlx4_netdev_ops_master = {
@@ -2584,6 +2643,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_udp_tunnel_del	= mlx4_en_del_vxlan_port,
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
+	.ndo_xdp		= mlx4_xdp,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index c1b3a9c..6729545 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -32,6 +32,7 @@
  */
 
 #include <net/busy_poll.h>
+#include <linux/bpf.h>
 #include <linux/mlx4/cq.h>
 #include <linux/slab.h>
 #include <linux/mlx4/qp.h>
@@ -509,6 +510,8 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 	struct mlx4_en_dev *mdev = priv->mdev;
 	struct mlx4_en_rx_ring *ring = *pring;
 
+	if (ring->xdp_prog)
+		bpf_prog_put(ring->xdp_prog);
 	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
 	vfree(ring->rx_info);
 	ring->rx_info = NULL;
@@ -743,6 +746,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
+	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb;
 	int index;
 	int nr;
@@ -759,6 +763,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	if (budget <= 0)
 		return polled;
 
+	xdp_prog = READ_ONCE(ring->xdp_prog);
+
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
 	 * reading 'cqe->index' */
@@ -835,6 +841,35 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
 			(cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
 
+		/* A bpf program gets first chance to drop the packet. It may
+		 * read bytes but not past the end of the frag.
+		 */
+		if (xdp_prog) {
+			struct xdp_buff xdp;
+			dma_addr_t dma;
+			u32 act;
+
+			dma = be64_to_cpu(rx_desc->data[0].addr);
+			dma_sync_single_for_cpu(priv->ddev, dma,
+						priv->frag_info[0].frag_size,
+						DMA_FROM_DEVICE);
+
+			xdp.data = page_address(frags[0].page) +
+							frags[0].page_offset;
+			xdp.data_end = xdp.data + length;
+
+			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			switch (act) {
+			case XDP_PASS:
+				break;
+			default:
+				bpf_warn_invalid_xdp_action(act);
+			case XDP_ABORTED:
+			case XDP_DROP:
+				goto next;
+			}
+		}
+
 		if (likely(dev->features & NETIF_F_RXCSUM)) {
 			if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
 						      MLX4_CQE_STATUS_UDP)) {
@@ -1062,10 +1097,7 @@ static const int frag_sizes[] = {
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
-	 * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
-	 */
-	int eff_mtu = dev->mtu + ETH_HLEN + (2 * VLAN_HLEN);
+	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
 	int buf_size = 0;
 	int i = 0;
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index d39bf59..eb1238d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -164,6 +164,10 @@ enum {
 #define MLX4_LOOPBACK_TEST_PAYLOAD (HEADER_COPY_SIZE - ETH_HLEN)
 
 #define MLX4_EN_MIN_MTU		46
+/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
+ * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
+ */
+#define MLX4_EN_EFF_MTU(mtu)	((mtu) + ETH_HLEN + (2 * VLAN_HLEN))
 #define ETH_BCAST		0xffffffffffffULL
 
 #define MLX4_EN_LOOPBACK_RETRIES	5
@@ -319,6 +323,7 @@ struct mlx4_en_rx_ring {
 	u8  fcs_del;
 	void *buf;
 	void *rx_info;
+	struct bpf_prog *xdp_prog;
 	unsigned long bytes;
 	unsigned long packets;
 	unsigned long csum_ok;
@@ -558,6 +563,7 @@ struct mlx4_en_priv {
 	struct mlx4_en_frag_info frag_info[MLX4_EN_MAX_RX_FRAGS];
 	u16 num_frags;
 	u16 log_rx_info;
+	int xdp_ring_num;
 
 	struct mlx4_en_tx_ring **tx_ring;
 	struct mlx4_en_rx_ring *rx_ring[MAX_RX_RINGS];
-- 
2.8.2


* [PATCH v10 06/12] Add sample for adding simple drop program to link
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (4 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 21:44   ` Alexei Starovoitov
  2016-07-19 19:16 ` [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP
hook of a link. With the drop-only program, the observed single-core
rate is ~20 Mpps.

Other tests were run as well; for instance, without the dropcnt
increment or without reading from the packet header, the packet rate was
mostly unchanged.

$ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
proto 17:   20403027 drops/s

./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
Running... ctrl^C to stop
Device: eth4@0
Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
  5056638pps 2427Mb/sec (2427186240bps) errors: 0
Device: eth4@1
Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
  5133311pps 2463Mb/sec (2463989280bps) errors: 0
Device: eth4@2
Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
  5077431pps 2437Mb/sec (2437166880bps) errors: 0
Device: eth4@3
Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
  5043067pps 2420Mb/sec (2420672160bps) errors: 0

perf report --no-children:
 26.05%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 17.84%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
  5.52%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
  4.90%  swapper      [kernel.vmlinux]  [k] poll_idle
  4.14%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  2.78%  ksoftirqd/0  [kernel.vmlinux]  [k] __free_pages_ok
  2.57%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.51%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
  1.94%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
  1.45%  swapper      [mlx4_en]         [k] mlx4_en_alloc_frags
  1.35%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
  1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
  1.04%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5c5
  0.96%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c58d
  0.93%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6ee
  0.92%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6b9
  0.89%  ksoftirqd/0  [kernel.vmlinux]  [k] __alloc_pages_nodemask
  0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c686
  0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5d5
  0.78%  ksoftirqd/0  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
  0.77%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5b4
  0.77%  ksoftirqd/0  [kernel.vmlinux]  [k] net_rx_action

machine specs:
 receiver - Intel E5-1630 v3 @ 3.70GHz
 sender - Intel E5645 @ 2.40GHz
 Mellanox ConnectX-3 @40G

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile    |   4 ++
 samples/bpf/bpf_load.c  |   8 +++
 samples/bpf/xdp1_kern.c |  93 +++++++++++++++++++++++++
 samples/bpf/xdp1_user.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 286 insertions(+)
 create mode 100644 samples/bpf/xdp1_kern.c
 create mode 100644 samples/bpf/xdp1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index a98b780..0e4ab3a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -21,6 +21,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += xdp1
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -42,6 +43,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -64,6 +66,7 @@ always += test_overhead_tp_kern.o
 always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
+always += xdp1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -84,6 +87,7 @@ HOSTLOADLIBES_offwaketime += -lelf
 HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
+HOSTLOADLIBES_xdp1 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 022af71..0cfda23 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -50,6 +50,7 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
 	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
 	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+	bool is_xdp = strncmp(event, "xdp", 3) == 0;
 	enum bpf_prog_type prog_type;
 	char buf[256];
 	int fd, efd, err, id;
@@ -66,6 +67,8 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 		prog_type = BPF_PROG_TYPE_KPROBE;
 	} else if (is_tracepoint) {
 		prog_type = BPF_PROG_TYPE_TRACEPOINT;
+	} else if (is_xdp) {
+		prog_type = BPF_PROG_TYPE_XDP;
 	} else {
 		printf("Unknown event '%s'\n", event);
 		return -1;
@@ -79,6 +82,9 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
 
 	prog_fd[prog_cnt++] = fd;
 
+	if (is_xdp)
+		return 0;
+
 	if (is_socket) {
 		event += 6;
 		if (*event != '/')
@@ -319,6 +325,7 @@ int load_bpf_file(char *path)
 			if (memcmp(shname_prog, "kprobe/", 7) == 0 ||
 			    memcmp(shname_prog, "kretprobe/", 10) == 0 ||
 			    memcmp(shname_prog, "tracepoint/", 11) == 0 ||
+			    memcmp(shname_prog, "xdp", 3) == 0 ||
 			    memcmp(shname_prog, "socket", 6) == 0)
 				load_and_attach(shname_prog, insns, data_prog->d_size);
 		}
@@ -336,6 +343,7 @@ int load_bpf_file(char *path)
 		if (memcmp(shname, "kprobe/", 7) == 0 ||
 		    memcmp(shname, "kretprobe/", 10) == 0 ||
 		    memcmp(shname, "tracepoint/", 11) == 0 ||
+		    memcmp(shname, "xdp", 3) == 0 ||
 		    memcmp(shname, "socket", 6) == 0)
 			load_and_attach(shname, data->d_buf, data->d_size);
 	}
diff --git a/samples/bpf/xdp1_kern.c b/samples/bpf/xdp1_kern.c
new file mode 100644
index 0000000..e7dd8ac
--- /dev/null
+++ b/samples/bpf/xdp1_kern.c
@@ -0,0 +1,93 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+	struct iphdr *iph = data + nh_off;
+
+	if (iph + 1 > data_end)
+		return 0;
+	return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+	struct ipv6hdr *ip6h = data + nh_off;
+
+	if (ip6h + 1 > data_end)
+		return 0;
+	return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	u64 nh_off;
+	u32 index;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+
+	if (h_proto == htons(ETH_P_IP))
+		index = parse_ipv4(data, nh_off, data_end);
+	else if (h_proto == htons(ETH_P_IPV6))
+		index = parse_ipv6(data, nh_off, data_end);
+	else
+		index = 0;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	return rc;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
new file mode 100644
index 0000000..a5e109e
--- /dev/null
+++ b/samples/bpf/xdp1_user.c
@@ -0,0 +1,181 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <linux/bpf.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <unistd.h>
+#include "bpf_load.h"
+#include "libbpf.h"
+
+static int set_link_xdp_fd(int ifindex, int fd)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+
+	nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len = NLA_HDRLEN + nla_xdp->nla_len;
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
+
+static int ifindex;
+
+static void int_exit(int sig)
+{
+	set_link_xdp_fd(ifindex, -1);
+	exit(0);
+}
+
+/* simple per-protocol drop counter
+ */
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
+	const unsigned int nr_keys = 256;
+	__u64 values[nr_cpus], prev[nr_keys][nr_cpus];
+	__u32 key;
+	int i;
+
+	memset(prev, 0, sizeof(prev));
+
+	while (1) {
+		sleep(interval);
+
+		for (key = 0; key < nr_keys; key++) {
+			__u64 sum = 0;
+
+			assert(bpf_lookup_elem(map_fd[0], &key, values) == 0);
+			for (i = 0; i < nr_cpus; i++)
+				sum += (values[i] - prev[key][i]);
+			if (sum)
+				printf("proto %u: %10llu pkt/s\n",
+				       key, sum / interval);
+			memcpy(prev[key], values, sizeof(values));
+		}
+	}
+}
+
+int main(int ac, char **argv)
+{
+	char filename[256];
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+	if (ac != 2) {
+		printf("usage: %s IFINDEX\n", argv[0]);
+		return 1;
+	}
+
+	ifindex = strtoul(argv[1], NULL, 0);
+
+	if (load_bpf_file(filename)) {
+		printf("%s", bpf_log_buf);
+		return 1;
+	}
+
+	if (!prog_fd[0]) {
+		printf("load_bpf_file: %s\n", strerror(errno));
+		return 1;
+	}
+
+	signal(SIGINT, int_exit);
+
+	if (set_link_xdp_fd(ifindex, prog_fd[0]) < 0) {
+		printf("link set xdp fd failed\n");
+		return 1;
+	}
+
+	poll_stats(2);
+
+	return 0;
+}
-- 
2.8.2


* [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (5 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 06/12] Add sample for adding simple drop program to link Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 21:49   ` Alexei Starovoitov
  2016-07-25  7:35   ` Eric Dumazet
  2016-07-19 19:16 ` [PATCH v10 08/12] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
                   ` (5 subsequent siblings)
  12 siblings, 2 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

The mlx4 driver by default allocates order-3 pages for the ring to
consume in multiple fragments. When the device has an xdp program, this
behavior will prevent tx actions since the page must be re-mapped in
TODEVICE mode, which cannot be done if the page is still shared.

Start by making the allocator configurable based on whether xdp is
running, such that order-0 pages are always used and never shared.

Since this will stress the page allocator, add a simple page cache to
each rx ring. Pages in the cache are left dma-mapped, and in drop-only
stress tests the page allocator is eliminated from the perf report.

Note that setting an xdp program will now require the rings to be
reconfigured.
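
Conceptually, the recycle cache is a small per-ring stack of pages that
stay dma-mapped: the tx completion path pushes a page into it instead of
freeing, and the rx refill path pops from it before falling back to the
page allocator. An illustrative sketch of that idea (the names and the
cache size here are not necessarily the exact ones in the patch):

struct rx_page_cache_sketch {
	u32 index;			/* number of cached pages */
	struct {
		struct page *page;
		dma_addr_t dma;		/* mapping is kept across recycles */
	} buf[128];			/* small, fixed-size cache */
};

/* tx completion: if (cache->index < ARRAY_SIZE(cache->buf)), stash
 *                {page, dma} and skip dma_unmap_page()/put_page()
 * rx refill:     if (cache->index > 0), reuse buf[--cache->index]
 *                before calling the page allocator
 */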

Before:
 26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
 17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
  6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
  4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
  3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
  2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
  2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq

After:
 31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
  8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
  7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
  6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
  4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
  4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
  3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
  2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
  1.37%  swapper      [kernel.vmlinux]       [k] menu_select
  1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 38 +++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 70 +++++++++++++++++++++++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   | 11 +++-
 3 files changed, 109 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index c34a33d..47ae2a2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2529,12 +2529,33 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
 static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_dev *mdev = priv->mdev;
 	struct bpf_prog *old_prog;
 	int xdp_ring_num;
+	int port_up = 0;
+	int err;
 	int i;
 
 	xdp_ring_num = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;
 
+	/* No need to reconfigure buffers when simply swapping the
+	 * program for a new one.
+	 */
+	if (priv->xdp_ring_num == xdp_ring_num) {
+		if (prog) {
+			prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
+			if (IS_ERR(prog))
+				return PTR_ERR(prog);
+		}
+		for (i = 0; i < priv->rx_ring_num; i++) {
+			/* This xchg is paired with READ_ONCE in the fastpath */
+			old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
+			if (old_prog)
+				bpf_prog_put(old_prog);
+		}
+		return 0;
+	}
+
 	if (priv->num_frags > 1) {
 		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
 		return -EOPNOTSUPP;
@@ -2546,15 +2567,30 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 			return PTR_ERR(prog);
 	}
 
+	mutex_lock(&mdev->state_lock);
+	if (priv->port_up) {
+		port_up = 1;
+		mlx4_en_stop_port(dev, 1);
+	}
+
 	priv->xdp_ring_num = xdp_ring_num;
 
-	/* This xchg is paired with READ_ONCE in the fast path */
 	for (i = 0; i < priv->rx_ring_num; i++) {
 		old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
 		if (old_prog)
 			bpf_prog_put(old_prog);
 	}
 
+	if (port_up) {
+		err = mlx4_en_start_port(dev);
+		if (err) {
+			en_err(priv, "Failed starting port %d for XDP change\n",
+			       priv->port);
+			queue_work(mdev->workqueue, &priv->watchdog_task);
+		}
+	}
+
+	mutex_unlock(&mdev->state_lock);
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6729545..9dd5dc1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -58,7 +58,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 	struct page *page;
 	dma_addr_t dma;
 
-	for (order = MLX4_EN_ALLOC_PREFER_ORDER; ;) {
+	for (order = frag_info->order; ;) {
 		gfp_t gfp = _gfp;
 
 		if (order)
@@ -71,7 +71,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
 			return -ENOMEM;
 	}
 	dma = dma_map_page(priv->ddev, page, 0, PAGE_SIZE << order,
-			   PCI_DMA_FROMDEVICE);
+			   frag_info->dma_dir);
 	if (dma_mapping_error(priv->ddev, dma)) {
 		put_page(page);
 		return -ENOMEM;
@@ -125,7 +125,8 @@ out:
 	while (i--) {
 		if (page_alloc[i].page != ring_alloc[i].page) {
 			dma_unmap_page(priv->ddev, page_alloc[i].dma,
-				page_alloc[i].page_size, PCI_DMA_FROMDEVICE);
+				page_alloc[i].page_size,
+				priv->frag_info[i].dma_dir);
 			page = page_alloc[i].page;
 			/* Revert changes done by mlx4_alloc_pages */
 			page_ref_sub(page, page_alloc[i].page_size /
@@ -146,7 +147,7 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
 
 	if (next_frag_end > frags[i].page_size)
 		dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
-			       PCI_DMA_FROMDEVICE);
+			       frag_info->dma_dir);
 
 	if (frags[i].page)
 		put_page(frags[i].page);
@@ -177,7 +178,8 @@ out:
 
 		page_alloc = &ring->page_alloc[i];
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-			       page_alloc->page_size, PCI_DMA_FROMDEVICE);
+			       page_alloc->page_size,
+			       priv->frag_info[i].dma_dir);
 		page = page_alloc->page;
 		/* Revert changes done by mlx4_alloc_pages */
 		page_ref_sub(page, page_alloc->page_size /
@@ -202,7 +204,7 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
 		       i, page_count(page_alloc->page));
 
 		dma_unmap_page(priv->ddev, page_alloc->dma,
-				page_alloc->page_size, PCI_DMA_FROMDEVICE);
+				page_alloc->page_size, frag_info->dma_dir);
 		while (page_alloc->page_offset + frag_info->frag_stride <
 		       page_alloc->page_size) {
 			put_page(page_alloc->page);
@@ -245,6 +247,12 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 	struct mlx4_en_rx_alloc *frags = ring->rx_info +
 					(index << priv->log_rx_info);
 
+	if (ring->page_cache.index > 0) {
+		frags[0] = ring->page_cache.buf[--ring->page_cache.index];
+		rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
+		return 0;
+	}
+
 	return mlx4_en_alloc_frags(priv, rx_desc, frags, ring->page_alloc, gfp);
 }
 
@@ -503,6 +511,24 @@ void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv)
 	}
 }
 
+/* When the rx ring is running in page-per-packet mode, a released frame can go
+ * directly into a small cache, to avoid unmapping or touching the page
+ * allocator. In bpf prog performance scenarios, buffers are either forwarded
+ * or dropped, never converted to skbs, so every page can come directly from
+ * this cache when it is sized to be a multiple of the napi budget.
+ */
+bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
+			struct mlx4_en_rx_alloc *frame)
+{
+	struct mlx4_en_page_cache *cache = &ring->page_cache;
+
+	if (cache->index >= MLX4_EN_CACHE_SIZE)
+		return false;
+
+	cache->buf[cache->index++] = *frame;
+	return true;
+}
+
 void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 			     struct mlx4_en_rx_ring **pring,
 			     u32 size, u16 stride)
@@ -525,6 +551,16 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
 void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
 				struct mlx4_en_rx_ring *ring)
 {
+	int i;
+
+	for (i = 0; i < ring->page_cache.index; i++) {
+		struct mlx4_en_rx_alloc *frame = &ring->page_cache.buf[i];
+
+		dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
+			       priv->frag_info[0].dma_dir);
+		put_page(frame->page);
+	}
+	ring->page_cache.index = 0;
 	mlx4_en_free_rx_buf(priv, ring);
 	if (ring->stride <= TXBB_SIZE)
 		ring->buf -= TXBB_SIZE;
@@ -866,6 +902,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 				bpf_warn_invalid_xdp_action(act);
 			case XDP_ABORTED:
 			case XDP_DROP:
+				if (mlx4_en_rx_recycle(ring, frags))
+					goto consumed;
 				goto next;
 			}
 		}
@@ -1021,6 +1059,7 @@ next:
 		for (nr = 0; nr < priv->num_frags; nr++)
 			mlx4_en_free_frag(priv, frags, nr);
 
+consumed:
 		++cq->mcq.cons_index;
 		index = (cq->mcq.cons_index) & ring->size_mask;
 		cqe = mlx4_en_get_cqe(cq->buf, index, priv->cqe_size) + factor;
@@ -1096,19 +1135,34 @@ static const int frag_sizes[] = {
 
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
+	enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
+	int order = MLX4_EN_ALLOC_PREFER_ORDER;
+	u32 align = SMP_CACHE_BYTES;
 	int buf_size = 0;
 	int i = 0;
 
+	/* bpf requires buffers to be set up as 1 packet per page.
+	 * This only works when num_frags == 1.
+	 */
+	if (priv->xdp_ring_num) {
+		/* This will gain efficient xdp frame recycling at the expense
+		 * of more costly truesize accounting
+		 */
+		align = PAGE_SIZE;
+		order = 0;
+	}
+
 	while (buf_size < eff_mtu) {
+		priv->frag_info[i].order = order;
 		priv->frag_info[i].frag_size =
 			(eff_mtu > buf_size + frag_sizes[i]) ?
 				frag_sizes[i] : eff_mtu - buf_size;
 		priv->frag_info[i].frag_prefix_size = buf_size;
 		priv->frag_info[i].frag_stride =
-				ALIGN(priv->frag_info[i].frag_size,
-				      SMP_CACHE_BYTES);
+				ALIGN(priv->frag_info[i].frag_size, align);
+		priv->frag_info[i].dma_dir = dma_dir;
 		buf_size += priv->frag_info[i].frag_size;
 		i++;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index eb1238d..eff4be0 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -259,6 +259,12 @@ struct mlx4_en_rx_alloc {
 	u32		page_size;
 };
 
+#define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
+struct mlx4_en_page_cache {
+	u32 index;
+	struct mlx4_en_rx_alloc buf[MLX4_EN_CACHE_SIZE];
+};
+
 struct mlx4_en_tx_ring {
 	/* cache line used and dirtied in tx completion
 	 * (mlx4_en_free_tx_buf())
@@ -324,6 +330,7 @@ struct mlx4_en_rx_ring {
 	void *buf;
 	void *rx_info;
 	struct bpf_prog *xdp_prog;
+	struct mlx4_en_page_cache page_cache;
 	unsigned long bytes;
 	unsigned long packets;
 	unsigned long csum_ok;
@@ -443,7 +450,9 @@ struct mlx4_en_mc_list {
 struct mlx4_en_frag_info {
 	u16 frag_size;
 	u16 frag_prefix_size;
-	u16 frag_stride;
+	u32 frag_stride;
+	enum dma_data_direction dma_dir;
+	int order;
 };
 
 #ifdef CONFIG_MLX4_EN_DCB
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v10 08/12] bpf: add XDP_TX xdp_action for direct forwarding
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (6 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 21:53   ` Alexei Starovoitov
  2016-07-19 19:16 ` [PATCH v10 09/12] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

When a program returns this action, XDP-enabled drivers must transmit
the received packet back out on the same port it was received on.
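
A minimal, purely illustrative program using the new action (program
name hypothetical; SEC()/context conventions as in the samples later in
this series):

	SEC("xdp1")
	int xdp_reflect(struct xdp_md *ctx)
	{
		/* bounce every received frame back out the ingress port */
		return XDP_TX;
	}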

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 include/uapi/linux/bpf.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a517865..2b7076f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -449,6 +449,7 @@ enum xdp_action {
 	XDP_ABORTED = 0,
 	XDP_DROP,
 	XDP_PASS,
+	XDP_TX,
 };
 
 /* user accessible metadata for XDP packet hook
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v10 09/12] net/mlx4_en: break out tx_desc write into separate function
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (7 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 08/12] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 19:16 ` [PATCH v10 10/12] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

In preparation for writing the tx descriptor from multiple functions,
create a helper for both normal and blueflame access.
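
After this change a caller fills a union mlx4_wqe_qpn_vlan on the stack
and hands it to the helper; sketched from the hunks below (this patch's
mlx4_en_xmit, and the xdp forwarding path added later in the series):

	union mlx4_wqe_qpn_vlan qpn_vlan = {};

	if (bf_ok)
		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
	else
		qpn_vlan.fence_size = real_size;

	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
			      op_own, bf_ok, send_doorbell);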

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/infiniband/hw/mlx4/qp.c            |  11 +--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c | 134 ++++++++++++++++-------------
 include/linux/mlx4/qp.h                    |  18 ++--
 3 files changed, 92 insertions(+), 71 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 8db8405..768085f 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -232,7 +232,7 @@ static void stamp_send_wqe(struct mlx4_ib_qp *qp, int n, int size)
 		}
 	} else {
 		ctrl = buf = get_send_wqe(qp, n & (qp->sq.wqe_cnt - 1));
-		s = (ctrl->fence_size & 0x3f) << 4;
+		s = (ctrl->qpn_vlan.fence_size & 0x3f) << 4;
 		for (i = 64; i < s; i += 64) {
 			wqe = buf + i;
 			*wqe = cpu_to_be32(0xffffffff);
@@ -264,7 +264,7 @@ static void post_nop_wqe(struct mlx4_ib_qp *qp, int n, int size)
 		inl->byte_count = cpu_to_be32(1 << 31 | (size - s - sizeof *inl));
 	}
 	ctrl->srcrb_flags = 0;
-	ctrl->fence_size = size / 16;
+	ctrl->qpn_vlan.fence_size = size / 16;
 	/*
 	 * Make sure descriptor is fully written before setting ownership bit
 	 * (because HW can start executing as soon as we do).
@@ -1992,7 +1992,8 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 			ctrl = get_send_wqe(qp, i);
 			ctrl->owner_opcode = cpu_to_be32(1 << 31);
 			if (qp->sq_max_wqes_per_wr == 1)
-				ctrl->fence_size = 1 << (qp->sq.wqe_shift - 4);
+				ctrl->qpn_vlan.fence_size =
+						1 << (qp->sq.wqe_shift - 4);
 
 			stamp_send_wqe(qp, i, 1 << qp->sq.wqe_shift);
 		}
@@ -3169,8 +3170,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		wmb();
 		*lso_wqe = lso_hdr_sz;
 
-		ctrl->fence_size = (wr->send_flags & IB_SEND_FENCE ?
-				    MLX4_WQE_CTRL_FENCE : 0) | size;
+		ctrl->qpn_vlan.fence_size = (wr->send_flags & IB_SEND_FENCE ?
+					     MLX4_WQE_CTRL_FENCE : 0) | size;
 
 		/*
 		 * Make sure descriptor is fully written before
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 76aa4d2..2f56018 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -631,8 +631,7 @@ static int get_real_size(const struct sk_buff *skb,
 static void build_inline_wqe(struct mlx4_en_tx_desc *tx_desc,
 			     const struct sk_buff *skb,
 			     const struct skb_shared_info *shinfo,
-			     int real_size, u16 *vlan_tag,
-			     int tx_ind, void *fragptr)
+			     void *fragptr)
 {
 	struct mlx4_wqe_inline_seg *inl = &tx_desc->inl;
 	int spc = MLX4_INLINE_ALIGN - CTRL_SIZE - sizeof *inl;
@@ -700,10 +699,66 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
 	__iowrite64_copy(dst, src, bytecnt / 8);
 }
 
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
+{
+	wmb();
+	/* Since there is no iowrite*_native() that writes the
+	 * value as is, without byteswapping - using the one
+	 * that doesn't do byteswapping in the relevant arch
+	 * endianness.
+	 */
+#if defined(__LITTLE_ENDIAN)
+	iowrite32(
+#else
+	iowrite32be(
+#endif
+		  ring->doorbell_qpn,
+		  ring->bf.uar->map + MLX4_SEND_DOORBELL);
+}
+
+static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
+				  struct mlx4_en_tx_desc *tx_desc,
+				  union mlx4_wqe_qpn_vlan qpn_vlan,
+				  int desc_size, int bf_index,
+				  __be32 op_own, bool bf_ok,
+				  bool send_doorbell)
+{
+	tx_desc->ctrl.qpn_vlan = qpn_vlan;
+
+	if (bf_ok) {
+		op_own |= htonl((bf_index & 0xffff) << 8);
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
+		dma_wmb();
+		tx_desc->ctrl.owner_opcode = op_own;
+
+		wmb();
+
+		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
+			     desc_size);
+
+		wmb();
+
+		ring->bf.offset ^= ring->bf.buf_size;
+	} else {
+		/* Ensure new descriptor hits memory
+		 * before setting ownership of this descriptor to HW
+		 */
+		dma_wmb();
+		tx_desc->ctrl.owner_opcode = op_own;
+		if (send_doorbell)
+			mlx4_en_xmit_doorbell(ring);
+		else
+			ring->xmit_more++;
+	}
+}
+
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct skb_shared_info *shinfo = skb_shinfo(skb);
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
 	struct device *ddev = priv->ddev;
 	struct mlx4_en_tx_ring *ring;
 	struct mlx4_en_tx_desc *tx_desc;
@@ -715,7 +770,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	int real_size;
 	u32 index, bf_index;
 	__be32 op_own;
-	u16 vlan_tag = 0;
 	u16 vlan_proto = 0;
 	int i_frag;
 	int lso_header_size;
@@ -725,6 +779,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	bool stop_queue;
 	bool inline_ok;
 	u32 ring_cons;
+	bool bf_ok;
 
 	tx_ind = skb_get_queue_mapping(skb);
 	ring = priv->tx_ring[tx_ind];
@@ -749,9 +804,17 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto tx_drop;
 	}
 
+	bf_ok = ring->bf_enabled;
 	if (skb_vlan_tag_present(skb)) {
-		vlan_tag = skb_vlan_tag_get(skb);
+		qpn_vlan.vlan_tag = cpu_to_be16(skb_vlan_tag_get(skb));
 		vlan_proto = be16_to_cpu(skb->vlan_proto);
+		if (vlan_proto == ETH_P_8021AD)
+			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
+		else if (vlan_proto == ETH_P_8021Q)
+			qpn_vlan.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
+		else
+			qpn_vlan.ins_vlan = 0;
+		bf_ok = false;
 	}
 
 	netdev_txq_bql_enqueue_prefetchw(ring->tx_queue);
@@ -771,6 +834,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	else {
 		tx_desc = (struct mlx4_en_tx_desc *) ring->bounce_buf;
 		bounce = true;
+		bf_ok = false;
 	}
 
 	/* Save skb in tx_info ring */
@@ -907,8 +971,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, skb->len);
 
 	if (tx_info->inl)
-		build_inline_wqe(tx_desc, skb, shinfo, real_size, &vlan_tag,
-				 tx_ind, fragptr);
+		build_inline_wqe(tx_desc, skb, shinfo, fragptr);
 
 	if (skb->encapsulation) {
 		union {
@@ -946,60 +1009,15 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	real_size = (real_size / 16) & 0x3f;
 
-	if (ring->bf_enabled && desc_size <= MAX_BF && !bounce &&
-	    !skb_vlan_tag_present(skb) && send_doorbell) {
-		tx_desc->ctrl.bf_qpn = ring->doorbell_qpn |
-				       cpu_to_be32(real_size);
-
-		op_own |= htonl((bf_index & 0xffff) << 8);
-		/* Ensure new descriptor hits memory
-		 * before setting ownership of this descriptor to HW
-		 */
-		dma_wmb();
-		tx_desc->ctrl.owner_opcode = op_own;
-
-		wmb();
-
-		mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
-			     desc_size);
-
-		wmb();
-
-		ring->bf.offset ^= ring->bf.buf_size;
-	} else {
-		tx_desc->ctrl.vlan_tag = cpu_to_be16(vlan_tag);
-		if (vlan_proto == ETH_P_8021AD)
-			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_SVLAN;
-		else if (vlan_proto == ETH_P_8021Q)
-			tx_desc->ctrl.ins_vlan = MLX4_WQE_CTRL_INS_CVLAN;
-		else
-			tx_desc->ctrl.ins_vlan = 0;
+	bf_ok &= desc_size <= MAX_BF && send_doorbell;
 
-		tx_desc->ctrl.fence_size = real_size;
+	if (bf_ok)
+		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
+	else
+		qpn_vlan.fence_size = real_size;
 
-		/* Ensure new descriptor hits memory
-		 * before setting ownership of this descriptor to HW
-		 */
-		dma_wmb();
-		tx_desc->ctrl.owner_opcode = op_own;
-		if (send_doorbell) {
-			wmb();
-			/* Since there is no iowrite*_native() that writes the
-			 * value as is, without byteswapping - using the one
-			 * the doesn't do byteswapping in the relevant arch
-			 * endianness.
-			 */
-#if defined(__LITTLE_ENDIAN)
-			iowrite32(
-#else
-			iowrite32be(
-#endif
-				  ring->doorbell_qpn,
-				  ring->bf.uar->map + MLX4_SEND_DOORBELL);
-		} else {
-			ring->xmit_more++;
-		}
-	}
+	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
+			      op_own, bf_ok, send_doorbell);
 
 	if (unlikely(stop_queue)) {
 		/* If queue was emptied after the if (stop_queue) , and before
diff --git a/include/linux/mlx4/qp.h b/include/linux/mlx4/qp.h
index 587cdf9..deaa221 100644
--- a/include/linux/mlx4/qp.h
+++ b/include/linux/mlx4/qp.h
@@ -291,16 +291,18 @@ enum {
 	MLX4_WQE_CTRL_FORCE_LOOPBACK	= 1 << 0,
 };
 
+union mlx4_wqe_qpn_vlan {
+	struct {
+		__be16	vlan_tag;
+		u8	ins_vlan;
+		u8	fence_size;
+	};
+	__be32		bf_qpn;
+};
+
 struct mlx4_wqe_ctrl_seg {
 	__be32			owner_opcode;
-	union {
-		struct {
-			__be16			vlan_tag;
-			u8			ins_vlan;
-			u8			fence_size;
-		};
-		__be32			bf_qpn;
-	};
+	union mlx4_wqe_qpn_vlan	qpn_vlan;
 	/*
 	 * High 24 bits are SRC remote buffer; low 8 bits are flags:
 	 * [7]   SO (strong ordering)
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v10 10/12] net/mlx4_en: add xdp forwarding and data write support
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (8 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 09/12] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 19:16 ` [PATCH v10 11/12] bpf: enable direct packet data write for xdp progs Brenden Blanco
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

A user will now be able to loop packets back out of the same port using
a bpf program attached to the xdp hook. Updates to the packet contents
from the bpf program are also supported.

For the packet write feature to work, the rx buffers are now mapped as
bidirectional when the page is allocated. This occurs only when the xdp
hook is active.

When the program returns a TX action, enqueue the packet directly to a
dedicated tx ring, so as to completely avoid any locking. This requires
the tx rings to be allocated 1:1 with the rx rings, and the tx
completion to run in the same softirq.

Upon tx completion, this dedicated tx ring recycles pages, without
unmapping them, directly back to the original rx ring. In a steady-state
tx/drop workload, effectively zero page allocs/frees occur.

In order to separate the free and recycle paths, a free_tx_desc func
pointer is introduced; it is switched to the recycle variant whenever
recycle_ring is activated. By default it is initialized to the original
free function.
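
The rx-to-tx pairing described above boils down to a fixed index
offset; a condensed sketch with names as in the patch:

	/* rx ring i forwards into the dedicated tx ring at
	 * (tx_ring_num - xdp_ring_num) + i ...
	 */
	tx_index = (priv->tx_ring_num - priv->xdp_ring_num) + cq->ring;

	/* ... and, when the port starts, that tx ring is pointed back at
	 * the rx ring whose pages it will recycle on completion.
	 */
	rr_index = (priv->xdp_ring_num - priv->tx_ring_num) + tx_ring_idx;
	if (rr_index >= 0) {
		tx_ring->free_tx_desc = mlx4_en_recycle_tx_desc;
		tx_ring->recycle_ring = priv->rx_ring[rr_index];
	}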

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_ethtool.c |   9 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c  |  29 +++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c      |  14 +++
 drivers/net/ethernet/mellanox/mlx4/en_tx.c      | 140 +++++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h    |  27 ++++-
 5 files changed, 211 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
index 51a2e82..f32e272 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_ethtool.c
@@ -1722,6 +1722,12 @@ static int mlx4_en_set_channels(struct net_device *dev,
 	    !channel->tx_count || !channel->rx_count)
 		return -EINVAL;
 
+	if (channel->tx_count * MLX4_EN_NUM_UP <= priv->xdp_ring_num) {
+		en_err(priv, "Minimum %d tx channels required with XDP on\n",
+		       priv->xdp_ring_num / MLX4_EN_NUM_UP + 1);
+		return -EINVAL;
+	}
+
 	mutex_lock(&mdev->state_lock);
 	if (priv->port_up) {
 		port_up = 1;
@@ -1740,7 +1746,8 @@ static int mlx4_en_set_channels(struct net_device *dev,
 		goto out;
 	}
 
-	netif_set_real_num_tx_queues(dev, priv->tx_ring_num);
+	netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
+							priv->xdp_ring_num);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
 
 	if (dev->num_tc)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 47ae2a2..9abbba6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1522,6 +1522,24 @@ static void mlx4_en_free_affinity_hint(struct mlx4_en_priv *priv, int ring_idx)
 	free_cpumask_var(priv->rx_ring[ring_idx]->affinity_mask);
 }
 
+static void mlx4_en_init_recycle_ring(struct mlx4_en_priv *priv,
+				      int tx_ring_idx)
+{
+	struct mlx4_en_tx_ring *tx_ring = priv->tx_ring[tx_ring_idx];
+	int rr_index;
+
+	rr_index = (priv->xdp_ring_num - priv->tx_ring_num) + tx_ring_idx;
+	if (rr_index >= 0) {
+		tx_ring->free_tx_desc = mlx4_en_recycle_tx_desc;
+		tx_ring->recycle_ring = priv->rx_ring[rr_index];
+		en_dbg(DRV, priv,
+		       "Set tx_ring[%d]->recycle_ring = rx_ring[%d]\n",
+		       tx_ring_idx, rr_index);
+	} else {
+		tx_ring->recycle_ring = NULL;
+	}
+}
+
 int mlx4_en_start_port(struct net_device *dev)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -1644,6 +1662,8 @@ int mlx4_en_start_port(struct net_device *dev)
 		}
 		tx_ring->tx_queue = netdev_get_tx_queue(dev, i);
 
+		mlx4_en_init_recycle_ring(priv, i);
+
 		/* Arm CQ for TX completions */
 		mlx4_en_arm_cq(priv, cq);
 
@@ -2561,6 +2581,13 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 		return -EOPNOTSUPP;
 	}
 
+	if (priv->tx_ring_num < xdp_ring_num + MLX4_EN_NUM_UP) {
+		en_err(priv,
+		       "Minimum %d tx channels required to run XDP\n",
+		       (xdp_ring_num + MLX4_EN_NUM_UP) / MLX4_EN_NUM_UP);
+		return -EINVAL;
+	}
+
 	if (prog) {
 		prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
 		if (IS_ERR(prog))
@@ -2574,6 +2601,8 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
 	}
 
 	priv->xdp_ring_num = xdp_ring_num;
+	netif_set_real_num_tx_queues(dev, priv->tx_ring_num -
+							priv->xdp_ring_num);
 
 	for (i = 0; i < priv->rx_ring_num; i++) {
 		old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 9dd5dc1..11d88c8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -783,7 +783,9 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	struct mlx4_en_rx_alloc *frags;
 	struct mlx4_en_rx_desc *rx_desc;
 	struct bpf_prog *xdp_prog;
+	int doorbell_pending;
 	struct sk_buff *skb;
+	int tx_index;
 	int index;
 	int nr;
 	unsigned int length;
@@ -800,6 +802,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 		return polled;
 
 	xdp_prog = READ_ONCE(ring->xdp_prog);
+	doorbell_pending = 0;
+	tx_index = (priv->tx_ring_num - priv->xdp_ring_num) + cq->ring;
 
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
 	 * descriptor offset can be deduced from the CQE index instead of
@@ -898,6 +902,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			switch (act) {
 			case XDP_PASS:
 				break;
+			case XDP_TX:
+				if (!mlx4_en_xmit_frame(frags, dev,
+							length, tx_index,
+							&doorbell_pending))
+					goto consumed;
+				break;
 			default:
 				bpf_warn_invalid_xdp_action(act);
 			case XDP_ABORTED:
@@ -1068,6 +1078,9 @@ consumed:
 	}
 
 out:
+	if (doorbell_pending)
+		mlx4_en_xmit_doorbell(priv->tx_ring[tx_index]);
+
 	AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
 	mlx4_cq_set_ci(&cq->mcq);
 	wmb(); /* ensure HW sees CQ consumer before we post new buffers */
@@ -1147,6 +1160,7 @@ void mlx4_en_calc_rx_buf(struct net_device *dev)
 	 * This only works when num_frags == 1.
 	 */
 	if (priv->xdp_ring_num) {
+		dma_dir = PCI_DMA_BIDIRECTIONAL;
 		/* This will gain efficient xdp frame recycling at the expense
 		 * of more costly truesize accounting
 		 */
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 2f56018..9df87ca 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -196,6 +196,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
 	ring->last_nr_txbb = 1;
 	memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
 	memset(ring->buf, 0, ring->buf_size);
+	ring->free_tx_desc = mlx4_en_free_tx_desc;
 
 	ring->qp_state = MLX4_QP_STATE_RST;
 	ring->doorbell_qpn = cpu_to_be32(ring->qp.qpn << 8);
@@ -265,10 +266,10 @@ static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
 }
 
 
-static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
-				struct mlx4_en_tx_ring *ring,
-				int index, u8 owner, u64 timestamp,
-				int napi_mode)
+u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
+			 struct mlx4_en_tx_ring *ring,
+			 int index, u8 owner, u64 timestamp,
+			 int napi_mode)
 {
 	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
 	struct mlx4_en_tx_desc *tx_desc = ring->buf + index * TXBB_SIZE;
@@ -344,6 +345,27 @@ static u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 	return tx_info->nr_txbb;
 }
 
+u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
+			    struct mlx4_en_tx_ring *ring,
+			    int index, u8 owner, u64 timestamp,
+			    int napi_mode)
+{
+	struct mlx4_en_tx_info *tx_info = &ring->tx_info[index];
+	struct mlx4_en_rx_alloc frame = {
+		.page = tx_info->page,
+		.dma = tx_info->map0_dma,
+		.page_offset = 0,
+		.page_size = PAGE_SIZE,
+	};
+
+	if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
+		dma_unmap_page(priv->ddev, tx_info->map0_dma,
+			       PAGE_SIZE, priv->frag_info[0].dma_dir);
+		put_page(tx_info->page);
+	}
+
+	return tx_info->nr_txbb;
+}
 
 int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 {
@@ -362,7 +384,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 	}
 
 	while (ring->cons != ring->prod) {
-		ring->last_nr_txbb = mlx4_en_free_tx_desc(priv, ring,
+		ring->last_nr_txbb = ring->free_tx_desc(priv, ring,
 						ring->cons & ring->size_mask,
 						!!(ring->cons & ring->size), 0,
 						0 /* Non-NAPI caller */);
@@ -444,7 +466,7 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 				timestamp = mlx4_en_get_cqe_ts(cqe);
 
 			/* free next descriptor */
-			last_nr_txbb = mlx4_en_free_tx_desc(
+			last_nr_txbb = ring->free_tx_desc(
 					priv, ring, ring_index,
 					!!((ring_cons + txbbs_skipped) &
 					ring->size), timestamp, napi_budget);
@@ -476,6 +498,9 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
 	ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
 	ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
 
+	if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
+		return done < budget;
+
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
 	/* Wakeup Tx queue if this stopped, and ring is not full.
@@ -1052,3 +1077,106 @@ tx_drop:
 	return NETDEV_TX_OK;
 }
 
+netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
+			       struct net_device *dev, unsigned int length,
+			       int tx_ind, int *doorbell_pending)
+{
+	struct mlx4_en_priv *priv = netdev_priv(dev);
+	union mlx4_wqe_qpn_vlan	qpn_vlan = {};
+	struct mlx4_en_tx_ring *ring;
+	struct mlx4_en_tx_desc *tx_desc;
+	struct mlx4_wqe_data_seg *data;
+	struct mlx4_en_tx_info *tx_info;
+	int index, bf_index;
+	bool send_doorbell;
+	int nr_txbb = 1;
+	bool stop_queue;
+	dma_addr_t dma;
+	int real_size;
+	__be32 op_own;
+	u32 ring_cons;
+	bool bf_ok;
+
+	BUILD_BUG_ON_MSG(ALIGN(CTRL_SIZE + DS_SIZE, TXBB_SIZE) != TXBB_SIZE,
+			 "mlx4_en_xmit_frame requires minimum size tx desc");
+
+	ring = priv->tx_ring[tx_ind];
+
+	if (!priv->port_up)
+		goto tx_drop;
+
+	if (mlx4_en_is_tx_ring_full(ring))
+		goto tx_drop;
+
+	/* fetch ring->cons far ahead before needing it to avoid stall */
+	ring_cons = READ_ONCE(ring->cons);
+
+	index = ring->prod & ring->size_mask;
+	tx_info = &ring->tx_info[index];
+
+	bf_ok = ring->bf_enabled;
+
+	/* Track current inflight packets for performance analysis */
+	AVG_PERF_COUNTER(priv->pstats.inflight_avg,
+			 (u32)(ring->prod - ring_cons - 1));
+
+	bf_index = ring->prod;
+	tx_desc = ring->buf + index * TXBB_SIZE;
+	data = &tx_desc->data;
+
+	dma = frame->dma;
+
+	tx_info->page = frame->page;
+	frame->page = NULL;
+	tx_info->map0_dma = dma;
+	tx_info->map0_byte_count = length;
+	tx_info->nr_txbb = nr_txbb;
+	tx_info->nr_bytes = max_t(unsigned int, length, ETH_ZLEN);
+	tx_info->data_offset = (void *)data - (void *)tx_desc;
+	tx_info->ts_requested = 0;
+	tx_info->nr_maps = 1;
+	tx_info->linear = 1;
+	tx_info->inl = 0;
+
+	dma_sync_single_for_device(priv->ddev, dma, length, PCI_DMA_TODEVICE);
+
+	data->addr = cpu_to_be64(dma);
+	data->lkey = ring->mr_key;
+	dma_wmb();
+	data->byte_count = cpu_to_be32(length);
+
+	/* tx completion can avoid cache line miss for common cases */
+	tx_desc->ctrl.srcrb_flags = priv->ctrl_flags;
+
+	op_own = cpu_to_be32(MLX4_OPCODE_SEND) |
+		((ring->prod & ring->size) ?
+		 cpu_to_be32(MLX4_EN_BIT_DESC_OWN) : 0);
+
+	ring->packets++;
+	ring->bytes += tx_info->nr_bytes;
+	AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
+
+	ring->prod += nr_txbb;
+
+	stop_queue = mlx4_en_is_tx_ring_full(ring);
+	send_doorbell = stop_queue ||
+				*doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
+	bf_ok &= send_doorbell;
+
+	real_size = ((CTRL_SIZE + nr_txbb * DS_SIZE) / 16) & 0x3f;
+
+	if (bf_ok)
+		qpn_vlan.bf_qpn = ring->doorbell_qpn | cpu_to_be32(real_size);
+	else
+		qpn_vlan.fence_size = real_size;
+
+	mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
+			      op_own, bf_ok, send_doorbell);
+	*doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
+
+	return NETDEV_TX_OK;
+
+tx_drop:
+	ring->tx_dropped++;
+	return NETDEV_TX_BUSY;
+}
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index eff4be0..29c81d2 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -132,6 +132,7 @@ enum {
 					 MLX4_EN_NUM_UP)
 
 #define MLX4_EN_DEFAULT_TX_WORK		256
+#define MLX4_EN_DOORBELL_BUDGET		8
 
 /* Target number of packets to coalesce with interrupt moderation */
 #define MLX4_EN_RX_COAL_TARGET	44
@@ -219,7 +220,10 @@ enum cq_type {
 
 
 struct mlx4_en_tx_info {
-	struct sk_buff *skb;
+	union {
+		struct sk_buff *skb;
+		struct page *page;
+	};
 	dma_addr_t	map0_dma;
 	u32		map0_byte_count;
 	u32		nr_txbb;
@@ -265,6 +269,8 @@ struct mlx4_en_page_cache {
 	struct mlx4_en_rx_alloc buf[MLX4_EN_CACHE_SIZE];
 };
 
+struct mlx4_en_priv;
+
 struct mlx4_en_tx_ring {
 	/* cache line used and dirtied in tx completion
 	 * (mlx4_en_free_tx_buf())
@@ -298,6 +304,11 @@ struct mlx4_en_tx_ring {
 	__be32			mr_key;
 	void			*buf;
 	struct mlx4_en_tx_info	*tx_info;
+	struct mlx4_en_rx_ring	*recycle_ring;
+	u32			(*free_tx_desc)(struct mlx4_en_priv *priv,
+						struct mlx4_en_tx_ring *ring,
+						int index, u8 owner,
+						u64 timestamp, int napi_mode);
 	u8			*bounce_buf;
 	struct mlx4_qp_context	context;
 	int			qpn;
@@ -678,6 +689,12 @@ void mlx4_en_tx_irq(struct mlx4_cq *mcq);
 u16 mlx4_en_select_queue(struct net_device *dev, struct sk_buff *skb,
 			 void *accel_priv, select_queue_fallback_t fallback);
 netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev);
+netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
+			       struct net_device *dev, unsigned int length,
+			       int tx_ind, int *doorbell_pending);
+void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
+bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
+			struct mlx4_en_rx_alloc *frame);
 
 int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
 			   struct mlx4_en_tx_ring **pring,
@@ -706,6 +723,14 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
 			  int budget);
 int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget);
 int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget);
+u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
+			 struct mlx4_en_tx_ring *ring,
+			 int index, u8 owner, u64 timestamp,
+			 int napi_mode);
+u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
+			    struct mlx4_en_tx_ring *ring,
+			    int index, u8 owner, u64 timestamp,
+			    int napi_mode);
 void mlx4_en_fill_qp_context(struct mlx4_en_priv *priv, int size, int stride,
 		int is_tx, int rss, int qpn, int cqn, int user_prio,
 		struct mlx4_qp_context *context);
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v10 11/12] bpf: enable direct packet data write for xdp progs
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (9 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 10/12] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 21:59   ` Alexei Starovoitov
  2016-07-19 19:16 ` [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
  2016-07-20  5:09 ` [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding David Miller
  12 siblings, 1 reply; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

For forwarding to be effective, XDP programs should be allowed to
rewrite packet data.

This requires that drivers supporting XDP map the packet memory as
TODEVICE or BIDIRECTIONAL before invoking the program.
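
With this check relaxed, an XDP program may store through a packet
pointer as long as the usual bounds check against data_end precedes the
access; an illustrative fragment (not part of this patch):

	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;

	if (data + sizeof(*eth) > data_end)	/* verifier still requires the bounds check */
		return XDP_DROP;
	eth->h_source[0] ^= 1;			/* direct packet write, now allowed for XDP */
	return XDP_TX;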

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 kernel/bpf/verifier.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a8d67d0..f72f23b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -653,6 +653,16 @@ static int check_map_access(struct verifier_env *env, u32 regno, int off,
 
 #define MAX_PACKET_OFF 0xffff
 
+static bool may_write_pkt_data(enum bpf_prog_type type)
+{
+	switch (type) {
+	case BPF_PROG_TYPE_XDP:
+		return true;
+	default:
+		return false;
+	}
+}
+
 static int check_packet_access(struct verifier_env *env, u32 regno, int off,
 			       int size)
 {
@@ -806,10 +816,15 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
 			err = check_stack_read(state, off, size, value_regno);
 		}
 	} else if (state->regs[regno].type == PTR_TO_PACKET) {
-		if (t == BPF_WRITE) {
+		if (t == BPF_WRITE && !may_write_pkt_data(env->prog->type)) {
 			verbose("cannot write into packet\n");
 			return -EACCES;
 		}
+		if (t == BPF_WRITE && value_regno >= 0 &&
+		    is_pointer_value(env, value_regno)) {
+			verbose("R%d leaks addr into packet\n", value_regno);
+			return -EACCES;
+		}
 		err = check_packet_access(env, regno, off, size);
 		if (!err && t == BPF_READ && value_regno >= 0)
 			mark_reg_unknown_value(state->regs, value_regno);
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (10 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 11/12] bpf: enable direct packet data write for xdp progs Brenden Blanco
@ 2016-07-19 19:16 ` Brenden Blanco
  2016-07-19 22:05   ` Alexei Starovoitov
  2016-08-03 17:01   ` Tom Herbert
  2016-07-20  5:09 ` [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding David Miller
  12 siblings, 2 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-19 19:16 UTC (permalink / raw)
  To: davem, netdev
  Cc: Brenden Blanco, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

Add a sample that rewrites and forwards packets out on the same
interface. Observed single core forwarding performance of ~10Mpps.

Since the mlx4 driver under test recycles every single packet page, the
perf output shows almost exclusively the ring management and bpf
program work. The remaining slowdowns are likely due to cache misses.
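
The userspace loader is shared with xdp1 (see the Makefile hunk below),
so invocation is the same; for example, with an arbitrarily chosen
interface:

	# samples/bpf/xdp2 $(</sys/class/net/eth0/ifindex)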

Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
---
 samples/bpf/Makefile    |   5 +++
 samples/bpf/xdp2_kern.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+)
 create mode 100644 samples/bpf/xdp2_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 0e4ab3a..d2d2b35 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
+hostprogs-y += xdp2
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
+# reuse xdp1 source intentionally
+xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
 always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
+always += xdp2_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
 HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
+HOSTLOADLIBES_xdp2 += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
new file mode 100644
index 0000000..38fe7e1
--- /dev/null
+++ b/samples/bpf/xdp2_kern.c
@@ -0,0 +1,114 @@
+/* Copyright (c) 2016 PLUMgrid
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") dropcnt = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
+static void swap_src_dst_mac(void *data)
+{
+	unsigned short *p = data;
+	unsigned short dst[3];
+
+	dst[0] = p[0];
+	dst[1] = p[1];
+	dst[2] = p[2];
+	p[0] = p[3];
+	p[1] = p[4];
+	p[2] = p[5];
+	p[3] = dst[0];
+	p[4] = dst[1];
+	p[5] = dst[2];
+}
+
+static int parse_ipv4(void *data, u64 nh_off, void *data_end)
+{
+	struct iphdr *iph = data + nh_off;
+
+	if (iph + 1 > data_end)
+		return 0;
+	return iph->protocol;
+}
+
+static int parse_ipv6(void *data, u64 nh_off, void *data_end)
+{
+	struct ipv6hdr *ip6h = data + nh_off;
+
+	if (ip6h + 1 > data_end)
+		return 0;
+	return ip6h->nexthdr;
+}
+
+SEC("xdp1")
+int xdp_prog1(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	struct ethhdr *eth = data;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	u64 nh_off;
+	u32 index;
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return rc;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(struct vlan_hdr);
+		if (data + nh_off > data_end)
+			return rc;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+
+	if (h_proto == htons(ETH_P_IP))
+		index = parse_ipv4(data, nh_off, data_end);
+	else if (h_proto == htons(ETH_P_IPV6))
+		index = parse_ipv6(data, nh_off, data_end);
+	else
+		index = 0;
+
+	value = bpf_map_lookup_elem(&dropcnt, &index);
+	if (value)
+		*value += 1;
+
+	if (index == 17) {
+		swap_src_dst_mac(data);
+		rc = XDP_TX;
+	}
+
+	return rc;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 02/12] bpf: add XDP prog type for early driver filter
  2016-07-19 19:16 ` [PATCH v10 02/12] bpf: add XDP prog type for early driver filter Brenden Blanco
@ 2016-07-19 21:33   ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 21:33 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:47PM -0700, Brenden Blanco wrote:
> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
> 
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. Out
> of bounds return codes should alias to XDP_ABORTED.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-19 19:16 ` [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
@ 2016-07-19 21:41   ` Alexei Starovoitov
  2016-07-20  9:07   ` Daniel Borkmann
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 21:41 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:50PM -0700, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
> 
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
> 
> Therefore, disallow MTU settings that would produce a multi-fragment
> packet that XDP programs would fail to access. Future enhancements could
> be done to increase the allowable MTU.
> 
> The xdp program is present as a per-ring data structure, but as of yet
> it is not possible to set at that granularity through any ndo.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
...
> +static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
> +{
> +	struct mlx4_en_priv *priv = netdev_priv(dev);
> +	struct bpf_prog *old_prog;
> +	int xdp_ring_num;
> +	int i;
> +
> +	xdp_ring_num = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;
> +
> +	if (priv->num_frags > 1) {
> +		en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (prog) {
> +		prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
> +		if (IS_ERR(prog))
> +			return PTR_ERR(prog);
> +	}
> +
> +	priv->xdp_ring_num = xdp_ring_num;
> +
> +	/* This xchg is paired with READ_ONCE in the fast path */
> +	for (i = 0; i < priv->rx_ring_num; i++) {
> +		old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
> +		if (old_prog)
> +			bpf_prog_put(old_prog);
> +	}

priv->xdp_ring_num looks similar to priv->rx_ring_num, so at first glance
it seemed that the per-ring refactoring broke the detach logic, but no, it's good.
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 06/12] Add sample for adding simple drop program to link
  2016-07-19 19:16 ` [PATCH v10 06/12] Add sample for adding simple drop program to link Brenden Blanco
@ 2016-07-19 21:44   ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 21:44 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:51PM -0700, Brenden Blanco wrote:
> Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
> hook of a link. With the drop-only program, observed single core rate is
> ~20Mpps.
> 
> Other tests were run, for instance without the dropcnt increment or
> without reading from the packet header, the packet rate was mostly
> unchanged.
> 
> $ perf record -a samples/bpf/xdp1 $(</sys/class/net/eth0/ifindex)
> proto 17:   20403027 drops/s
> 
> ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> Running... ctrl^C to stop
> Device: eth4@0
> Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
>   5056638pps 2427Mb/sec (2427186240bps) errors: 0
> Device: eth4@1
> Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
>   5133311pps 2463Mb/sec (2463989280bps) errors: 0
> Device: eth4@2
> Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
>   5077431pps 2437Mb/sec (2437166880bps) errors: 0
> Device: eth4@3
> Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
>   5043067pps 2420Mb/sec (2420672160bps) errors: 0
> 
> perf report --no-children:
>  26.05%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
>  17.84%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
>   5.52%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
>   4.90%  swapper      [kernel.vmlinux]  [k] poll_idle
>   4.14%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
>   2.78%  ksoftirqd/0  [kernel.vmlinux]  [k] __free_pages_ok
>   2.57%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
>   2.51%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
>   1.94%  ksoftirqd/0  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
>   1.45%  swapper      [mlx4_en]         [k] mlx4_en_alloc_frags
>   1.35%  ksoftirqd/0  [kernel.vmlinux]  [k] free_one_page
>   1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
>   1.04%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5c5
>   0.96%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c58d
>   0.93%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6ee
>   0.92%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c6b9
>   0.89%  ksoftirqd/0  [kernel.vmlinux]  [k] __alloc_pages_nodemask
>   0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c686
>   0.83%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5d5
>   0.78%  ksoftirqd/0  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
>   0.77%  ksoftirqd/0  [mlx4_en]         [k] 0x000000000001c5b4
>   0.77%  ksoftirqd/0  [kernel.vmlinux]  [k] net_rx_action
> 
> machine specs:
>  receiver - Intel E5-1630 v3 @ 3.70GHz
>  sender - Intel E5645 @ 2.40GHz
>  Mellanox ConnectX-3 @40G
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
...
> +int main(int ac, char **argv)
> +{
> +	char filename[256];
> +
> +	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
> +
> +	if (ac != 2) {
> +		printf("usage: %s IFINDEX\n", argv[0]);
> +		return 1;
> +	}
> +
> +	ifindex = strtoul(argv[1], NULL, 0);

great test. some future extension could be to use dev name instead of id.

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 01/12] bpf: add bpf_prog_add api for bulk prog refcnt
  2016-07-19 19:16 ` [PATCH v10 01/12] bpf: add bpf_prog_add api for bulk prog refcnt Brenden Blanco
@ 2016-07-19 21:46   ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 21:46 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:46PM -0700, Brenden Blanco wrote:
> A subsystem may need to store many copies of a bpf program, each
> deserving its own reference. Rather than requiring the caller to loop
> one by one (with possible mid-loop failure), add a bulk bpf_prog_add
> api.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/linux/bpf.h  |  1 +
>  kernel/bpf/syscall.c | 12 +++++++++---
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index c13e92b..75a5ae6 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -224,6 +224,7 @@ void bpf_register_map_type(struct bpf_map_type_list *tl);
>  
>  struct bpf_prog *bpf_prog_get(u32 ufd);
>  struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
> +struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
>  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
>  void bpf_prog_put(struct bpf_prog *prog);
>  
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 96d938a..228f962 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -670,14 +670,20 @@ static struct bpf_prog *____bpf_prog_get(struct fd f)
>  	return f.file->private_data;
>  }
>  
> -struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
> +struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
>  {
> -	if (atomic_inc_return(&prog->aux->refcnt) > BPF_MAX_REFCNT) {
> -		atomic_dec(&prog->aux->refcnt);
> +	if (atomic_add_return(i, &prog->aux->refcnt) > BPF_MAX_REFCNT) {
> +		atomic_sub(i, &prog->aux->refcnt);
>  		return ERR_PTR(-EBUSY);
>  	}
>  	return prog;
>  }
> +EXPORT_SYMBOL_GPL(bpf_prog_add);
> +
> +struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
> +{
> +	return bpf_prog_add(prog, 1);
> +}

that extension turned out to be smaller than I thought :)
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-19 19:16 ` [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
@ 2016-07-19 21:49   ` Alexei Starovoitov
  2016-07-25  7:35   ` Eric Dumazet
  1 sibling, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 21:49 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:52PM -0700, Brenden Blanco wrote:
> The mlx4 driver by default allocates order-3 pages for the ring to
> consume in multiple fragments. When the device has an xdp program, this
> behavior will prevent tx actions since the page must be re-mapped in
> TODEVICE mode, which cannot be done if the page is still shared.
> 
> Start by making the allocator configurable based on whether xdp is
> running, such that order-0 pages are always used and never shared.
> 
> Since this will stress the page allocator, add a simple page cache to
> each rx ring. Pages in the cache are left dma-mapped, and in drop-only
> stress tests the page allocator is eliminated from the perf report.
> 
> Note that setting an xdp program will now require the rings to be
> reconfigured.
> 
> Before:
>  26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
>  17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
>   6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
>   4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
>   3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
>   2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
>   2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
> 
> After:
>  31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
>   8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
>   7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
>   6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
>   4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
>   4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
>   3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
>   2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
>   1.37%  swapper      [kernel.vmlinux]       [k] menu_select
>   1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
...
> +#define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
> +struct mlx4_en_page_cache {
> +	u32 index;
> +	struct mlx4_en_rx_alloc buf[MLX4_EN_CACHE_SIZE];
> +};

amazing that this tiny recycling pool makes such a huge difference.
Acked-by: Alexei Starovoitov <ast@kernel.org>
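
(A sketch of the recycle idea, for readers: the helper names below are
illustrative rather than the driver's; the point is a tiny LIFO of
still-DMA-mapped pages sitting in front of the page allocator.)

/* Pages stay DMA-mapped while parked in the per-ring cache, so the hot
 * path skips both alloc_page() and dma_map_page() on a hit.
 */
static bool my_cache_get(struct mlx4_en_page_cache *cache,
                         struct mlx4_en_rx_alloc *frame)
{
        if (cache->index == 0)
                return false;   /* empty: caller allocates and maps a fresh page */

        *frame = cache->buf[--cache->index];
        return true;
}

static bool my_cache_put(struct mlx4_en_page_cache *cache,
                         const struct mlx4_en_rx_alloc *frame)
{
        if (cache->index >= MLX4_EN_CACHE_SIZE)
                return false;   /* full: caller unmaps and frees the page */

        cache->buf[cache->index++] = *frame;
        return true;
}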

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 08/12] bpf: add XDP_TX xdp_action for direct forwarding
  2016-07-19 19:16 ` [PATCH v10 08/12] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
@ 2016-07-19 21:53   ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 21:53 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:53PM -0700, Brenden Blanco wrote:
> XDP enabled drivers must transmit received packets back out on the same
> port they were received on when a program returns this action.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  include/uapi/linux/bpf.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index a517865..2b7076f 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -449,6 +449,7 @@ enum xdp_action {
>  	XDP_ABORTED = 0,
>  	XDP_DROP,
>  	XDP_PASS,
> +	XDP_TX,

Acked-by: Alexei Starovoitov <ast@kernel.org>
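
(For illustration, the smallest possible program using the new action;
the section name and loader glue follow the conventions of the samples
added later in the series and are not part of this patch.)

/* Reflect every received frame back out the port it arrived on. */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp_reflect")
int xdp_reflect_prog(struct xdp_md *ctx)
{
        return XDP_TX;
}

char _license[] SEC("license") = "GPL";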

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 11/12] bpf: enable direct packet data write for xdp progs
  2016-07-19 19:16 ` [PATCH v10 11/12] bpf: enable direct packet data write for xdp progs Brenden Blanco
@ 2016-07-19 21:59   ` Alexei Starovoitov
  0 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 21:59 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:56PM -0700, Brenden Blanco wrote:
> For forwarding to be effective, XDP programs should be allowed to
> rewrite packet data.
> 
> This requires that the drivers supporting XDP must all map the packet
> memory as TODEVICE or BIDIRECTIONAL before invoking the program.
> 
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  kernel/bpf/verifier.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a8d67d0..f72f23b 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -653,6 +653,16 @@ static int check_map_access(struct verifier_env *env, u32 regno, int off,
>  
>  #define MAX_PACKET_OFF 0xffff
>  
> +static bool may_write_pkt_data(enum bpf_prog_type type)
> +{
> +	switch (type) {
> +	case BPF_PROG_TYPE_XDP:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  static int check_packet_access(struct verifier_env *env, u32 regno, int off,
>  			       int size)
>  {
> @@ -806,10 +816,15 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
>  			err = check_stack_read(state, off, size, value_regno);
>  		}
>  	} else if (state->regs[regno].type == PTR_TO_PACKET) {
> -		if (t == BPF_WRITE) {
> +		if (t == BPF_WRITE && !may_write_pkt_data(env->prog->type)) {
>  			verbose("cannot write into packet\n");
>  			return -EACCES;
>  		}
> +		if (t == BPF_WRITE && value_regno >= 0 &&
> +		    is_pointer_value(env, value_regno)) {
> +			verbose("R%d leaks addr into packet\n", value_regno);
> +			return -EACCES;
> +		}

Like this extra security check :) though it's arguably overkill.

Acked-by: Alexei Starovoitov <ast@kernel.org>
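
(In program terms, what the verifier change permits for XDP is a bounded
write into packet data, e.g. the sketch below; the same store without the
length check against data_end would still be rejected.)

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp_mark")
int xdp_mark_prog(struct xdp_md *ctx)
{
        void *data_end = (void *)(long)ctx->data_end;
        unsigned char *data = (void *)(long)ctx->data;

        /* The comparison against data_end is what proves to the verifier
         * that the one-byte store below stays inside the packet.
         */
        if (data + 1 > (unsigned char *)data_end)
                return XDP_DROP;

        data[0] ^= 0x1; /* flip a bit in the first byte, then pass it up */
        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";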

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-07-19 19:16 ` [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
@ 2016-07-19 22:05   ` Alexei Starovoitov
  2016-07-20 17:38     ` Brenden Blanco
  2016-07-27 18:25     ` Jesper Dangaard Brouer
  2016-08-03 17:01   ` Tom Herbert
  1 sibling, 2 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-07-19 22:05 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 12:16:57PM -0700, Brenden Blanco wrote:
> Add a sample that rewrites and forwards packets out on the same
> interface. Observed single core forwarding performance of ~10Mpps.
> 
> Since the mlx4 driver under test recycles every single packet page, the
> perf output shows almost exclusively just the ring management and bpf
> program work. Slowdowns are likely occurring due to cache misses.

long term we need to resurrect your prefetch patch. sux to leave
so much performance on the table.

> +static int parse_ipv4(void *data, u64 nh_off, void *data_end)
> +{
> +	struct iphdr *iph = data + nh_off;
> +
> +	if (iph + 1 > data_end)
> +		return 0;
> +	return iph->protocol;
> +}
> +
> +static int parse_ipv6(void *data, u64 nh_off, void *data_end)
> +{
> +	struct ipv6hdr *ip6h = data + nh_off;
> +
> +	if (ip6h + 1 > data_end)
> +		return 0;
> +	return ip6h->nexthdr;
> +}
...
> +	if (h_proto == htons(ETH_P_IP))
> +		index = parse_ipv4(data, nh_off, data_end);
> +	else if (h_proto == htons(ETH_P_IPV6))
> +		index = parse_ipv6(data, nh_off, data_end);
> +	else
> +		index = 0;
> +
> +	value = bpf_map_lookup_elem(&dropcnt, &index);
> +	if (value)
> +		*value += 1;
> +
> +	if (index == 17) {

not an obvious xdp example. if you have to respin for other
reasons, please consider the 'proto' name and IPPROTO_UDP here.

Acked-by: Alexei Starovoitov <ast@kernel.org>
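
(For reference, the tail of xdp_prog1 after the rename being suggested
would read roughly as below; a sketch of the eventual follow-up, not the
committed code. IPPROTO_UDP comes from <linux/in.h>, which the sample
already includes.)

        long *value;
        u32 proto;                              /* was 'index' */

        if (h_proto == htons(ETH_P_IP))
                proto = parse_ipv4(data, nh_off, data_end);
        else if (h_proto == htons(ETH_P_IPV6))
                proto = parse_ipv6(data, nh_off, data_end);
        else
                proto = 0;

        value = bpf_map_lookup_elem(&dropcnt, &proto);
        if (value)
                *value += 1;

        if (proto == IPPROTO_UDP) {
                swap_src_dst_mac(data);
                rc = XDP_TX;
        }

        return rc;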

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding
  2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
                   ` (11 preceding siblings ...)
  2016-07-19 19:16 ` [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
@ 2016-07-20  5:09 ` David Miller
       [not found]   ` <6a09ce5d-f902-a576-e44e-8e1e111ae26b@gmail.com>
  12 siblings, 1 reply; 59+ messages in thread
From: David Miller @ 2016-07-20  5:09 UTC (permalink / raw)
  To: bblanco
  Cc: netdev, jhs, saeedm, kafai, brouer, as754m, alexei.starovoitov,
	gerlitz.or, john.fastabend, hannes, tgraf, tom, daniel,
	ttoukan.linux

From: Brenden Blanco <bblanco@plumgrid.com>
Date: Tue, 19 Jul 2016 12:16:45 -0700

> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling eXpress Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before
> even an skb has been allocated.
> 
> Extend on this with the ability to modify packet data and send back out
> on the same port.

Series applied, thanks.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 04/12] rtnl: add option for setting link xdp prog
  2016-07-19 19:16 ` [PATCH v10 04/12] rtnl: add option for setting link xdp prog Brenden Blanco
@ 2016-07-20  8:38   ` Daniel Borkmann
  2016-07-20 17:35     ` Brenden Blanco
  0 siblings, 1 reply; 59+ messages in thread
From: Daniel Borkmann @ 2016-07-20  8:38 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert, Tariq Toukan

On 07/19/2016 09:16 PM, Brenden Blanco wrote:
> Sets the bpf program represented by fd as an early filter in the rx path
> of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
> Providing a negative value as fd clears the program. Getting the fd back
> via rtnl is not possible, therefore reading of this value merely
> provides a bool whether the program is valid on the link or not.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
[...]
> @@ -2054,6 +2101,23 @@ static int do_setlink(const struct sk_buff *skb,
>   		status |= DO_SETLINK_NOTIFY;
>   	}
>
> +	if (tb[IFLA_XDP]) {
> +		struct nlattr *xdp[IFLA_XDP_MAX + 1];
> +
> +		err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
> +				       ifla_xdp_policy);
> +		if (err < 0)
> +			goto errout;
> +
> +		if (xdp[IFLA_XDP_FD]) {
> +			err = dev_change_xdp_fd(dev,
> +						nla_get_s32(xdp[IFLA_XDP_FD]));
> +			if (err)
> +				goto errout;
> +			status |= DO_SETLINK_NOTIFY;
> +		}

For the setlink case IFLA_XDP_ATTACHED has no meaning currently, so I'd
suggest it would be good to be strict and still add a:

		/* Only used in rtnl_fill_ifinfo currently. */
		if (xdp[IFLA_XDP_ATTACHED]) {
			err = -EINVAL;
			goto errout;
		}

> +	}
> +
>   errout:
>   	if (status & DO_SETLINK_MODIFIED) {
>   		if (status & DO_SETLINK_NOTIFY)
>
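
(For completeness, the userspace side of this attribute; a rough sketch
using a raw rtnetlink socket with error handling trimmed, assuming the
IFLA_XDP/IFLA_XDP_FD values from the uapi header added by this patch.)

#include <linux/if_link.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Attach (or, with prog_fd < 0, detach) an XDP program on ifindex. */
static int set_link_xdp_fd(int ifindex, int prog_fd)
{
        struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
        struct {
                struct nlmsghdr nh;
                struct ifinfomsg ifinfo;
                char attrbuf[64];
        } req;
        struct rtattr *xdp, *fd_attr;
        int sock, ret;

        sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        if (sock < 0)
                return -1;

        memset(&req, 0, sizeof(req));
        req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
        req.nh.nlmsg_type = RTM_SETLINK;
        req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
        req.ifinfo.ifi_family = AF_UNSPEC;
        req.ifinfo.ifi_index = ifindex;

        /* nested IFLA_XDP { IFLA_XDP_FD (s32) } */
        xdp = (struct rtattr *)((char *)&req + NLMSG_ALIGN(req.nh.nlmsg_len));
        xdp->rta_type = IFLA_XDP | NLA_F_NESTED;
        xdp->rta_len = RTA_LENGTH(0);

        fd_attr = (struct rtattr *)((char *)xdp + RTA_ALIGN(xdp->rta_len));
        fd_attr->rta_type = IFLA_XDP_FD;
        fd_attr->rta_len = RTA_LENGTH(sizeof(prog_fd));
        memcpy(RTA_DATA(fd_attr), &prog_fd, sizeof(prog_fd));

        xdp->rta_len += RTA_ALIGN(fd_attr->rta_len);
        req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + RTA_ALIGN(xdp->rta_len);

        ret = sendto(sock, &req, req.nh.nlmsg_len, 0,
                     (struct sockaddr *)&kernel, sizeof(kernel));
        close(sock);
        return ret < 0 ? -1 : 0;        /* a real caller would also read the ACK */
}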

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-19 19:16 ` [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
  2016-07-19 21:41   ` Alexei Starovoitov
@ 2016-07-20  9:07   ` Daniel Borkmann
  2016-07-20 17:33     ` Brenden Blanco
  2016-07-24 11:56   ` Jesper Dangaard Brouer
  2016-07-24 16:57   ` Tom Herbert
  3 siblings, 1 reply; 59+ messages in thread
From: Daniel Borkmann @ 2016-07-20  9:07 UTC (permalink / raw)
  To: Brenden Blanco, davem, netdev
  Cc: Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert, Tariq Toukan

On 07/19/2016 09:16 PM, Brenden Blanco wrote:
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
>
> Therefore, disallow MTU settings that would produce a multi-fragment
> packet that XDP programs would fail to access. Future enhancements could
> be done to increase the allowable MTU.
>
> The xdp program is present as a per-ring data structure, but as of yet
> it is not possible to set at that granularity through any ndo.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
[...]
>   struct mlx4_en_bond {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index c1b3a9c..6729545 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -32,6 +32,7 @@
>    */
>
>   #include <net/busy_poll.h>
> +#include <linux/bpf.h>
>   #include <linux/mlx4/cq.h>
>   #include <linux/slab.h>
>   #include <linux/mlx4/qp.h>
> @@ -509,6 +510,8 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
>   	struct mlx4_en_dev *mdev = priv->mdev;
>   	struct mlx4_en_rx_ring *ring = *pring;
>
> +	if (ring->xdp_prog)
> +		bpf_prog_put(ring->xdp_prog);

Would be good if you also make this a READ_ONCE() here. I believe this is the
only other spot in your set that has this 'direct' access (besides xchg() and
READ_ONCE() from mlx4_en_process_rx_cq()). It would be mostly for consistency
and to indicate that there's a more complex synchronization behind it. I'm mostly
worried that if it's not consistently used, people might copy this and not use
the READ_ONCE() also in other spots where it matters, and thus add hard to find
bugs.

>   	mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
>   	vfree(ring->rx_info);
>   	ring->rx_info = NULL;
> @@ -743,6 +746,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
>   	struct mlx4_en_rx_alloc *frags;
>   	struct mlx4_en_rx_desc *rx_desc;
> +	struct bpf_prog *xdp_prog;
>   	struct sk_buff *skb;
>   	int index;
>   	int nr;
> @@ -759,6 +763,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>   	if (budget <= 0)
>   		return polled;
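
(The consistency being asked for here amounts to something like the
sketch below in the teardown path; illustrative only, the eventual
follow-up may differ.)

        struct bpf_prog *old_prog;

        /* Match the fast path: even on teardown, go through READ_ONCE() so
         * every access to ring->xdp_prog visibly follows the same protocol.
         */
        old_prog = READ_ONCE(ring->xdp_prog);
        if (old_prog)
                bpf_prog_put(old_prog);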

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding
       [not found]   ` <6a09ce5d-f902-a576-e44e-8e1e111ae26b@gmail.com>
@ 2016-07-20 14:08     ` Brenden Blanco
  2016-07-20 19:14     ` David Miller
  1 sibling, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-20 14:08 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: David Miller, netdev, jhs, saeedm, kafai, brouer, as754m,
	alexei.starovoitov, gerlitz.or, john.fastabend, hannes, tgraf,
	tom, daniel

On Wed, Jul 20, 2016 at 12:18:49PM +0300, Tariq Toukan wrote:
> 
> On 20/07/2016 8:09 AM, David Miller wrote:
> >From: Brenden Blanco <bblanco@plumgrid.com>
> >Date: Tue, 19 Jul 2016 12:16:45 -0700
> >
> >>This patch set introduces new infrastructure for programmatically
> >>processing packets in the earliest stages of rx, as part of an effort
> >>others are calling eXpress Data Path (XDP) [1]. Start this effort by
> >>introducing a new bpf program type for early packet filtering, before
> >>even an skb has been allocated.
> >>
> >>Extend on this with the ability to modify packet data and send back out
> >>on the same port.
> >Series applied, thanks.
> 
> Hi Dave,
> 
> The series causes compilation errors in our driver (and warnings).
> Please revert it.
My bad. The kbuild robot caught it as well. As an alternative to revert,
I can also send a patch to add the necessary inline stub.
> 
> 23:08:37 drivers/net/ethernet/mellanox/mlx4/en_netdev.c: In function 'mlx4_xdp_set':
> 23:08:37 drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2566:4: error: implicit declaration of function 'bpf_prog_add' [-Werror=implicit-function-declaration]
> 23:08:37     prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
> 23:08:37     ^
> 23:08:37   CC [M]  drivers/gpu/drm/nouveau/nvkm/subdev/i2c/padgm200.o
> 23:08:37 drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2566:9: warning: assignment makes pointer from integer without a cast [enabled by default]
> 23:08:37     prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
> 23:08:37          ^
> 23:08:37 drivers/net/ethernet/mellanox/mlx4/en_netdev.c:2592:8: warning: assignment makes pointer from integer without a cast [enabled by default]
> 23:08:37    prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
> 23:08:37         ^
> 23:08:37 cc1: some warnings being treated as errors
> 23:08:37 make[7]: *** [drivers/net/ethernet/mellanox/mlx4/en_netdev.o] Error 1
> 23:08:37 make[7]: *** Waiting for unfinished jobs....
> 23:08:37   CC      drivers/tty/serial/8250/8250_pnp.o
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-20  9:07   ` Daniel Borkmann
@ 2016-07-20 17:33     ` Brenden Blanco
  0 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-20 17:33 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Tariq Toukan

On Wed, Jul 20, 2016 at 11:07:57AM +0200, Daniel Borkmann wrote:
> On 07/19/2016 09:16 PM, Brenden Blanco wrote:
[...]
> >+	if (ring->xdp_prog)
> >+		bpf_prog_put(ring->xdp_prog);
> 
> Would be good if you also make this a READ_ONCE() here. I believe this is the
> only other spot in your set that has this 'direct' access (besides xchg() and
> READ_ONCE() from mlx4_en_process_rx_cq()). It would be mostly for consistency
> and to indicate that there's a more complex synchronization behind it. I'm mostly
> worried that if it's not consistently used, people might copy this and not use
> the READ_ONCE() also in other spots where it matters, and thus add hard to find
> bugs.
I can do that. My thinking was just that this is the cleanup path so the
code would have been superfluous. I think there were a few nits so I'll
collect those and clean them up.
> 
[...]

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 04/12] rtnl: add option for setting link xdp prog
  2016-07-20  8:38   ` Daniel Borkmann
@ 2016-07-20 17:35     ` Brenden Blanco
  0 siblings, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-20 17:35 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Tariq Toukan

On Wed, Jul 20, 2016 at 10:38:49AM +0200, Daniel Borkmann wrote:
> On 07/19/2016 09:16 PM, Brenden Blanco wrote:
> >Sets the bpf program represented by fd as an early filter in the rx path
> >of the netdev. The fd must have been created as BPF_PROG_TYPE_XDP.
> >Providing a negative value as fd clears the program. Getting the fd back
> >via rtnl is not possible, therefore reading of this value merely
> >provides a bool whether the program is valid on the link or not.
> >
> >Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> [...]
> >@@ -2054,6 +2101,23 @@ static int do_setlink(const struct sk_buff *skb,
> >  		status |= DO_SETLINK_NOTIFY;
> >  	}
> >
> >+	if (tb[IFLA_XDP]) {
> >+		struct nlattr *xdp[IFLA_XDP_MAX + 1];
> >+
> >+		err = nla_parse_nested(xdp, IFLA_XDP_MAX, tb[IFLA_XDP],
> >+				       ifla_xdp_policy);
> >+		if (err < 0)
> >+			goto errout;
> >+
> >+		if (xdp[IFLA_XDP_FD]) {
> >+			err = dev_change_xdp_fd(dev,
> >+						nla_get_s32(xdp[IFLA_XDP_FD]));
> >+			if (err)
> >+				goto errout;
> >+			status |= DO_SETLINK_NOTIFY;
> >+		}
> 
> For the setlink case IFLA_XDP_ATTACHED has no meaning currently, so I'd
> suggest it would be good to be strict and still add a:
> 
> 		/* Only used in rtnl_fill_ifinfo currently. */
> 		if (xdp[IFLA_XDP_ATTACHED]) {
> 			err = -EINVAL;
> 			goto errout;
> 		}
> 
Agreed.
> >+	}
> >+
> >  errout:
> >  	if (status & DO_SETLINK_MODIFIED) {
> >  		if (status & DO_SETLINK_NOTIFY)
> >
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-07-19 22:05   ` Alexei Starovoitov
@ 2016-07-20 17:38     ` Brenden Blanco
  2016-07-27 18:25     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-07-20 17:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 03:05:37PM -0700, Alexei Starovoitov wrote:
> On Tue, Jul 19, 2016 at 12:16:57PM -0700, Brenden Blanco wrote:
> > Add a sample that rewrites and forwards packets out on the same
> > interface. Observed single core forwarding performance of ~10Mpps.
> > 
> > Since the mlx4 driver under test recycles every single packet page, the
> > perf output shows almost exclusively just the ring management and bpf
> > program work. Slowdowns are likely occurring due to cache misses.
> 
> long term we need to resurrect your prefetch patch. sux to leave
> so much performance on the table.

I know :( Let's keep working at it, in a way that's good for both
xdp/non-xdp.
> 
> > +static int parse_ipv4(void *data, u64 nh_off, void *data_end)
> > +{
> > +	struct iphdr *iph = data + nh_off;
> > +
> > +	if (iph + 1 > data_end)
> > +		return 0;
> > +	return iph->protocol;
> > +}
> > +
> > +static int parse_ipv6(void *data, u64 nh_off, void *data_end)
> > +{
> > +	struct ipv6hdr *ip6h = data + nh_off;
> > +
> > +	if (ip6h + 1 > data_end)
> > +		return 0;
> > +	return ip6h->nexthdr;
> > +}
> ...
> > +	if (h_proto == htons(ETH_P_IP))
> > +		index = parse_ipv4(data, nh_off, data_end);
> > +	else if (h_proto == htons(ETH_P_IPV6))
> > +		index = parse_ipv6(data, nh_off, data_end);
> > +	else
> > +		index = 0;
> > +
> > +	value = bpf_map_lookup_elem(&dropcnt, &index);
> > +	if (value)
> > +		*value += 1;
> > +
> > +	if (index == 17) {
> 
> not an obvious xdp example. if you have to respin for other
> reasons, please consider the 'proto' name and IPPROTO_UDP here.
Will collect this into a followup.
> 
> Acked-by: Alexei Starovoitov <ast@kernel.org>
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding
       [not found]   ` <6a09ce5d-f902-a576-e44e-8e1e111ae26b@gmail.com>
  2016-07-20 14:08     ` Brenden Blanco
@ 2016-07-20 19:14     ` David Miller
  1 sibling, 0 replies; 59+ messages in thread
From: David Miller @ 2016-07-20 19:14 UTC (permalink / raw)
  To: ttoukan.linux
  Cc: bblanco, netdev, jhs, saeedm, kafai, brouer, as754m,
	alexei.starovoitov, gerlitz.or, john.fastabend, hannes, tgraf,
	tom, daniel

From: Tariq Toukan <ttoukan.linux@gmail.com>
Date: Wed, 20 Jul 2016 12:18:49 +0300

> 
> On 20/07/2016 8:09 AM, David Miller wrote:
>> From: Brenden Blanco <bblanco@plumgrid.com>
>> Date: Tue, 19 Jul 2016 12:16:45 -0700
>>
>>> This patch set introduces new infrastructure for programmatically
>>> processing packets in the earliest stages of rx, as part of an effort
>>> others are calling eXpress Data Path (XDP) [1]. Start this effort by
>>> introducing a new bpf program type for early packet filtering, before
>>> even an skb has been allocated.
>>>
>>> Extend on this with the ability to modify packet data and send back
>>> out
>>> on the same port.
>> Series applied, thanks.
> 
> Hi Dave,
> 
> The series causes compilation errors in our driver (and warnings).
> Please revert it.

Asking for a revert as a knee-jerk reaction like this is not
appropriate.

Never ask me to do this.

Why don't you try to figure out what the simple fix is instead?

Meanwhile I've already pushed out the appropriate fix.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-19 19:16 ` [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
  2016-07-19 21:41   ` Alexei Starovoitov
  2016-07-20  9:07   ` Daniel Borkmann
@ 2016-07-24 11:56   ` Jesper Dangaard Brouer
  2016-07-24 16:57   ` Tom Herbert
  3 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-24 11:56 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, brouer

On Tue, 19 Jul 2016 12:16:50 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> The xdp program is present as a per-ring data structure, but as of yet
> it is not possible to set at that granularity through any ndo.

Thank you for doing this! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-19 19:16 ` [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
                     ` (2 preceding siblings ...)
  2016-07-24 11:56   ` Jesper Dangaard Brouer
@ 2016-07-24 16:57   ` Tom Herbert
  2016-07-24 20:34     ` Daniel Borkmann
  3 siblings, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-07-24 16:57 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann, Tariq Toukan

On Tue, Jul 19, 2016 at 2:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
>
> Therefore, disallow MTU settings that would produce a multi-fragment
> packet that XDP programs would fail to access. Future enhancements could
> be done to increase the allowable MTU.
>
> The xdp program is present as a per-ring data structure, but as of yet
> it is not possible to set at that granularity through any ndo.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 60 ++++++++++++++++++++++++++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c     | 40 +++++++++++++++--
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  6 +++
>  3 files changed, 102 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index 6083775..c34a33d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -31,6 +31,7 @@
>   *
>   */
>
> +#include <linux/bpf.h>
>  #include <linux/etherdevice.h>
>  #include <linux/tcp.h>
>  #include <linux/if_vlan.h>
> @@ -2112,6 +2113,11 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
>                 en_err(priv, "Bad MTU size:%d.\n", new_mtu);
>                 return -EPERM;
>         }
> +       if (priv->xdp_ring_num && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
> +               en_err(priv, "MTU size:%d requires frags but XDP running\n",
> +                      new_mtu);
> +               return -EOPNOTSUPP;
> +       }
>         dev->mtu = new_mtu;
>
>         if (netif_running(dev)) {
> @@ -2520,6 +2526,58 @@ static int mlx4_en_set_tx_maxrate(struct net_device *dev, int queue_index, u32 m
>         return err;
>  }
>
> +static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
> +{
> +       struct mlx4_en_priv *priv = netdev_priv(dev);
> +       struct bpf_prog *old_prog;
> +       int xdp_ring_num;
> +       int i;
> +
> +       xdp_ring_num = prog ? ALIGN(priv->rx_ring_num, MLX4_EN_NUM_UP) : 0;
> +
> +       if (priv->num_frags > 1) {
> +               en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
> +               return -EOPNOTSUPP;
> +       }
> +
> +       if (prog) {
> +               prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
> +               if (IS_ERR(prog))
> +                       return PTR_ERR(prog);
> +       }
> +
> +       priv->xdp_ring_num = xdp_ring_num;
> +
> +       /* This xchg is paired with READ_ONCE in the fast path */
> +       for (i = 0; i < priv->rx_ring_num; i++) {
> +               old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);

This can be done under a lock instead of relying on xchg.

> +               if (old_prog)
> +                       bpf_prog_put(old_prog);

I don't see how this can work. Even after setting the new program, the
old program might still be run (pointer dereferenced before xchg).
Either rcu needs to be used or the queue should be stopped and synced
before setting the new program.

> +       }
> +
> +       return 0;
> +}
> +
> +static bool mlx4_xdp_attached(struct net_device *dev)
> +{
> +       struct mlx4_en_priv *priv = netdev_priv(dev);
> +
> +       return !!priv->xdp_ring_num;
> +}
> +
> +static int mlx4_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> +{
> +       switch (xdp->command) {
> +       case XDP_SETUP_PROG:
> +               return mlx4_xdp_set(dev, xdp->prog);
> +       case XDP_QUERY_PROG:
> +               xdp->prog_attached = mlx4_xdp_attached(dev);
> +               return 0;
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
>  static const struct net_device_ops mlx4_netdev_ops = {
>         .ndo_open               = mlx4_en_open,
>         .ndo_stop               = mlx4_en_close,
> @@ -2548,6 +2606,7 @@ static const struct net_device_ops mlx4_netdev_ops = {
>         .ndo_udp_tunnel_del     = mlx4_en_del_vxlan_port,
>         .ndo_features_check     = mlx4_en_features_check,
>         .ndo_set_tx_maxrate     = mlx4_en_set_tx_maxrate,
> +       .ndo_xdp                = mlx4_xdp,
>  };
>
>  static const struct net_device_ops mlx4_netdev_ops_master = {
> @@ -2584,6 +2643,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
>         .ndo_udp_tunnel_del     = mlx4_en_del_vxlan_port,
>         .ndo_features_check     = mlx4_en_features_check,
>         .ndo_set_tx_maxrate     = mlx4_en_set_tx_maxrate,
> +       .ndo_xdp                = mlx4_xdp,
>  };
>
>  struct mlx4_en_bond {
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index c1b3a9c..6729545 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -32,6 +32,7 @@
>   */
>
>  #include <net/busy_poll.h>
> +#include <linux/bpf.h>
>  #include <linux/mlx4/cq.h>
>  #include <linux/slab.h>
>  #include <linux/mlx4/qp.h>
> @@ -509,6 +510,8 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
>         struct mlx4_en_dev *mdev = priv->mdev;
>         struct mlx4_en_rx_ring *ring = *pring;
>
> +       if (ring->xdp_prog)
> +               bpf_prog_put(ring->xdp_prog);
>         mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
>         vfree(ring->rx_info);
>         ring->rx_info = NULL;
> @@ -743,6 +746,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>         struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring];
>         struct mlx4_en_rx_alloc *frags;
>         struct mlx4_en_rx_desc *rx_desc;
> +       struct bpf_prog *xdp_prog;
>         struct sk_buff *skb;
>         int index;
>         int nr;
> @@ -759,6 +763,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>         if (budget <= 0)
>                 return polled;
>
> +       xdp_prog = READ_ONCE(ring->xdp_prog);
> +
>         /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
>          * descriptor offset can be deduced from the CQE index instead of
>          * reading 'cqe->index' */
> @@ -835,6 +841,35 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>                 l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
>                         (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +               /* A bpf program gets first chance to drop the packet. It may
> +                * read bytes but not past the end of the frag.
> +                */
> +               if (xdp_prog) {
> +                       struct xdp_buff xdp;
> +                       dma_addr_t dma;
> +                       u32 act;
> +
> +                       dma = be64_to_cpu(rx_desc->data[0].addr);
> +                       dma_sync_single_for_cpu(priv->ddev, dma,
> +                                               priv->frag_info[0].frag_size,
> +                                               DMA_FROM_DEVICE);
> +
> +                       xdp.data = page_address(frags[0].page) +
> +                                                       frags[0].page_offset;
> +                       xdp.data_end = xdp.data + length;
> +
> +                       act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +                       switch (act) {
> +                       case XDP_PASS:
> +                               break;
> +                       default:
> +                               bpf_warn_invalid_xdp_action(act);
> +                       case XDP_ABORTED:
> +                       case XDP_DROP:
> +                               goto next;
> +                       }
> +               }
> +
>                 if (likely(dev->features & NETIF_F_RXCSUM)) {
>                         if (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP |
>                                                       MLX4_CQE_STATUS_UDP)) {
> @@ -1062,10 +1097,7 @@ static const int frag_sizes[] = {
>  void mlx4_en_calc_rx_buf(struct net_device *dev)
>  {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
> -       /* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
> -        * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
> -        */
> -       int eff_mtu = dev->mtu + ETH_HLEN + (2 * VLAN_HLEN);
> +       int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
>         int buf_size = 0;
>         int i = 0;
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index d39bf59..eb1238d 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -164,6 +164,10 @@ enum {
>  #define MLX4_LOOPBACK_TEST_PAYLOAD (HEADER_COPY_SIZE - ETH_HLEN)
>
>  #define MLX4_EN_MIN_MTU                46
> +/* VLAN_HLEN is added twice,to support skb vlan tagged with multiple
> + * headers. (For example: ETH_P_8021Q and ETH_P_8021AD).
> + */
> +#define MLX4_EN_EFF_MTU(mtu)   ((mtu) + ETH_HLEN + (2 * VLAN_HLEN))
>  #define ETH_BCAST              0xffffffffffffULL
>
>  #define MLX4_EN_LOOPBACK_RETRIES       5
> @@ -319,6 +323,7 @@ struct mlx4_en_rx_ring {
>         u8  fcs_del;
>         void *buf;
>         void *rx_info;
> +       struct bpf_prog *xdp_prog;
>         unsigned long bytes;
>         unsigned long packets;
>         unsigned long csum_ok;
> @@ -558,6 +563,7 @@ struct mlx4_en_priv {
>         struct mlx4_en_frag_info frag_info[MLX4_EN_MAX_RX_FRAGS];
>         u16 num_frags;
>         u16 log_rx_info;
> +       int xdp_ring_num;
>
>         struct mlx4_en_tx_ring **tx_ring;
>         struct mlx4_en_rx_ring *rx_ring[MAX_RX_RINGS];
> --
> 2.8.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program
  2016-07-24 16:57   ` Tom Herbert
@ 2016-07-24 20:34     ` Daniel Borkmann
  0 siblings, 0 replies; 59+ messages in thread
From: Daniel Borkmann @ 2016-07-24 20:34 UTC (permalink / raw)
  To: Tom Herbert, Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf, Tariq Toukan

On 07/24/2016 06:57 PM, Tom Herbert wrote:
> On Tue, Jul 19, 2016 at 2:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
>> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>>
>> In tc/socket bpf programs, helpers linearize skb fragments as needed
>> when the program touches the packet data. However, in the pursuit of
>> speed, XDP programs will not be allowed to use these slower functions,
>> especially if it involves allocating an skb.
>>
>> Therefore, disallow MTU settings that would produce a multi-fragment
>> packet that XDP programs would fail to access. Future enhancements could
>> be done to increase the allowable MTU.
>>
>> The xdp program is present as a per-ring data structure, but as of yet
>> it is not possible to set at that granularity through any ndo.
>>
>> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
[...]
>> +       if (prog) {
>> +               prog = bpf_prog_add(prog, priv->rx_ring_num - 1);
>> +               if (IS_ERR(prog))
>> +                       return PTR_ERR(prog);
>> +       }
>> +
>> +       priv->xdp_ring_num = xdp_ring_num;
>> +
>> +       /* This xchg is paired with READ_ONCE in the fast path */
>> +       for (i = 0; i < priv->rx_ring_num; i++) {
>> +               old_prog = xchg(&priv->rx_ring[i]->xdp_prog, prog);
>
> This can be done under a lock instead of relying on xchg.
>
>> +               if (old_prog)
>> +                       bpf_prog_put(old_prog);
>
> I don't see how this can work. Even after setting the new program, the
> old program might still be run (pointer dereferenced before xchg).
> Either rcu needs to be used or the queue should be stopped and synced
> before setting the new program.

It's a strict requirement that all BPF programs must run under RCU.
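
(Abstracted from any particular driver, the pattern being referred to is
roughly the sketch below: the poll loop runs the program inside an RCU
read-side section, the control path swaps the pointer, and the release of
the old program has to be RCU-deferred so readers never see a freed prog.)

        /* reader, e.g. the rx poll loop: */
        rcu_read_lock();
        prog = READ_ONCE(ring->xdp_prog);
        if (prog)
                act = bpf_prog_run_xdp(prog, &xdp);
        rcu_read_unlock();

        /* writer, the control path: */
        old_prog = xchg(&ring->xdp_prog, new_prog);
        if (old_prog)
                bpf_prog_put(old_prog); /* underlying free must wait for a grace period */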

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-19 19:16 ` [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
  2016-07-19 21:49   ` Alexei Starovoitov
@ 2016-07-25  7:35   ` Eric Dumazet
  2016-08-03 17:45     ` order-0 vs order-N driver allocation. Was: " Alexei Starovoitov
  1 sibling, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2016-07-25  7:35 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha,
	Alexei Starovoitov, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan

On Tue, 2016-07-19 at 12:16 -0700, Brenden Blanco wrote:
> The mlx4 driver by default allocates order-3 pages for the ring to
> consume in multiple fragments. When the device has an xdp program, this
> behavior will prevent tx actions since the page must be re-mapped in
> TODEVICE mode, which cannot be done if the page is still shared.
> 
> Start by making the allocator configurable based on whether xdp is
> running, such that order-0 pages are always used and never shared.
> 
> Since this will stress the page allocator, add a simple page cache to
> each rx ring. Pages in the cache are left dma-mapped, and in drop-only
> stress tests the page allocator is eliminated from the perf report.
> 
> Note that setting an xdp program will now require the rings to be
> reconfigured.

Again, this has nothing to do with XDP ?

Please submit a separate patch, switching this driver to order-0
allocations.

I mentioned this order-3 vs order-0 issue earlier [1], and proposed to
send a generic patch, but I have been traveling lately and am currently
on vacation.

order-3 pages are problematic when dealing with hostile traffic anyway,
so we should exclusively use order-0 pages with page recycling, as the
Intel drivers do.

http://lists.openwall.net/netdev/2016/04/11/88

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-07-19 22:05   ` Alexei Starovoitov
  2016-07-20 17:38     ` Brenden Blanco
@ 2016-07-27 18:25     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-07-27 18:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john.fastabend, hannes,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan, brouer,
	Rana Shahout

On Tue, 19 Jul 2016 15:05:37 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Jul 19, 2016 at 12:16:57PM -0700, Brenden Blanco wrote:
> > Add a sample that rewrites and forwards packets out on the same
> > interface. Observed single core forwarding performance of ~10Mpps.
> > 
> > Since the mlx4 driver under test recycles every single packet page, the
> > perf output shows almost exclusively just the ring management and bpf
> > program work. Slowdowns are likely occurring due to cache misses.  
> 
> long term we need to resurrect your prefetch patch. sux to leave
> so much performance on the table.

I will make some (more) attempts at getting prefetching working, also in
a way that benefits the normal stack (when I'm back from vacation, and
when net-next opens again).

I think Rana is also looking into this for the mlx5 driver (based on
some patches I sent to her).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-07-19 19:16 ` [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
  2016-07-19 22:05   ` Alexei Starovoitov
@ 2016-08-03 17:01   ` Tom Herbert
  2016-08-03 17:11     ` Alexei Starovoitov
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-08-03 17:01 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Alexei Starovoitov, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf,
	Daniel Borkmann, Tariq Toukan, Aaron Yue

On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> Add a sample that rewrites and forwards packets out on the same
> interface. Observed single core forwarding performance of ~10Mpps.
>
> Since the mlx4 driver under test recycles every single packet page, the
> perf output shows almost exclusively just the ring management and bpf
> program work. Slowdowns are likely occurring due to cache misses.
>
> Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> ---
>  samples/bpf/Makefile    |   5 +++
>  samples/bpf/xdp2_kern.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 119 insertions(+)
>  create mode 100644 samples/bpf/xdp2_kern.c
>
> diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> index 0e4ab3a..d2d2b35 100644
> --- a/samples/bpf/Makefile
> +++ b/samples/bpf/Makefile
> @@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
>  hostprogs-y += test_overhead
>  hostprogs-y += test_cgrp2_array_pin
>  hostprogs-y += xdp1
> +hostprogs-y += xdp2
>
>  test_verifier-objs := test_verifier.o libbpf.o
>  test_maps-objs := test_maps.o libbpf.o
> @@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
>  test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
>  test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
>  xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
> +# reuse xdp1 source intentionally
> +xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
>
>  # Tell kbuild to always build the programs
>  always := $(hostprogs-y)
> @@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
>  always += parse_varlen.o parse_simple.o parse_ldabs.o
>  always += test_cgrp2_tc_kern.o
>  always += xdp1_kern.o
> +always += xdp2_kern.o
>
>  HOSTCFLAGS += -I$(objtree)/usr/include
>
> @@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
>  HOSTLOADLIBES_map_perf_test += -lelf -lrt
>  HOSTLOADLIBES_test_overhead += -lelf -lrt
>  HOSTLOADLIBES_xdp1 += -lelf
> +HOSTLOADLIBES_xdp2 += -lelf
>
>  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
> diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
> new file mode 100644
> index 0000000..38fe7e1
> --- /dev/null
> +++ b/samples/bpf/xdp2_kern.c
> @@ -0,0 +1,114 @@
> +/* Copyright (c) 2016 PLUMgrid
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#define KBUILD_MODNAME "foo"
> +#include <uapi/linux/bpf.h>
> +#include <linux/in.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> +#include <linux/if_vlan.h>
> +#include <linux/ip.h>
> +#include <linux/ipv6.h>
> +#include "bpf_helpers.h"
> +
> +struct bpf_map_def SEC("maps") dropcnt = {
> +       .type = BPF_MAP_TYPE_PERCPU_ARRAY,
> +       .key_size = sizeof(u32),
> +       .value_size = sizeof(long),
> +       .max_entries = 256,
> +};
> +
> +static void swap_src_dst_mac(void *data)
> +{
> +       unsigned short *p = data;
> +       unsigned short dst[3];
> +
> +       dst[0] = p[0];
> +       dst[1] = p[1];
> +       dst[2] = p[2];
> +       p[0] = p[3];
> +       p[1] = p[4];
> +       p[2] = p[5];
> +       p[3] = dst[0];
> +       p[4] = dst[1];
> +       p[5] = dst[2];
> +}
> +
> +static int parse_ipv4(void *data, u64 nh_off, void *data_end)
> +{
> +       struct iphdr *iph = data + nh_off;
> +
> +       if (iph + 1 > data_end)
> +               return 0;
> +       return iph->protocol;
> +}
> +
> +static int parse_ipv6(void *data, u64 nh_off, void *data_end)
> +{
> +       struct ipv6hdr *ip6h = data + nh_off;
> +
> +       if (ip6h + 1 > data_end)
> +               return 0;
> +       return ip6h->nexthdr;
> +}
> +
> +SEC("xdp1")
> +int xdp_prog1(struct xdp_md *ctx)
> +{
> +       void *data_end = (void *)(long)ctx->data_end;
> +       void *data = (void *)(long)ctx->data;

Brendan,

It seems that the cast to long here is done because data_end and data
are u32s in xdp_md. So the effect is that we are upcasting a
thirty-two-bit integer into a sixty-four-bit pointer (in fact without the
cast we see compiler warnings). I don't understand how this can be
correct. Can you shed some light on this?

Thanks,
Tom

> +       struct ethhdr *eth = data;
> +       int rc = XDP_DROP;
> +       long *value;
> +       u16 h_proto;
> +       u64 nh_off;
> +       u32 index;
> +
> +       nh_off = sizeof(*eth);
> +       if (data + nh_off > data_end)
> +               return rc;
> +
> +       h_proto = eth->h_proto;
> +
> +       if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
> +               struct vlan_hdr *vhdr;
> +
> +               vhdr = data + nh_off;
> +               nh_off += sizeof(struct vlan_hdr);
> +               if (data + nh_off > data_end)
> +                       return rc;
> +               h_proto = vhdr->h_vlan_encapsulated_proto;
> +       }
> +       if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
> +               struct vlan_hdr *vhdr;
> +
> +               vhdr = data + nh_off;
> +               nh_off += sizeof(struct vlan_hdr);
> +               if (data + nh_off > data_end)
> +                       return rc;
> +               h_proto = vhdr->h_vlan_encapsulated_proto;
> +       }
> +
> +       if (h_proto == htons(ETH_P_IP))
> +               index = parse_ipv4(data, nh_off, data_end);
> +       else if (h_proto == htons(ETH_P_IPV6))
> +               index = parse_ipv6(data, nh_off, data_end);
> +       else
> +               index = 0;
> +
> +       value = bpf_map_lookup_elem(&dropcnt, &index);
> +       if (value)
> +               *value += 1;
> +
> +       if (index == 17) {
> +               swap_src_dst_mac(data);
> +               rc = XDP_TX;
> +       }
> +
> +       return rc;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.8.2
>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 17:01   ` Tom Herbert
@ 2016-08-03 17:11     ` Alexei Starovoitov
  2016-08-03 17:29       ` Tom Herbert
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-08-03 17:11 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann, Tariq Toukan,
	Aaron Yue

On Wed, Aug 03, 2016 at 10:01:54AM -0700, Tom Herbert wrote:
> On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> > Add a sample that rewrites and forwards packets out on the same
> > interface. Observed single core forwarding performance of ~10Mpps.
> >
> > Since the mlx4 driver under test recycles every single packet page, the
> > perf output shows almost exclusively just the ring management and bpf
> > program work. Slowdowns are likely occurring due to cache misses.
> >
> > Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
> > ---
> >  samples/bpf/Makefile    |   5 +++
> >  samples/bpf/xdp2_kern.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 119 insertions(+)
> >  create mode 100644 samples/bpf/xdp2_kern.c
> >
> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
> > index 0e4ab3a..d2d2b35 100644
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
> >  hostprogs-y += test_overhead
> >  hostprogs-y += test_cgrp2_array_pin
> >  hostprogs-y += xdp1
> > +hostprogs-y += xdp2
> >
> >  test_verifier-objs := test_verifier.o libbpf.o
> >  test_maps-objs := test_maps.o libbpf.o
> > @@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
> >  test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
> >  test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
> >  xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
> > +# reuse xdp1 source intentionally
> > +xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
> >
> >  # Tell kbuild to always build the programs
> >  always := $(hostprogs-y)
> > @@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
> >  always += parse_varlen.o parse_simple.o parse_ldabs.o
> >  always += test_cgrp2_tc_kern.o
> >  always += xdp1_kern.o
> > +always += xdp2_kern.o
> >
> >  HOSTCFLAGS += -I$(objtree)/usr/include
> >
> > @@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
> >  HOSTLOADLIBES_map_perf_test += -lelf -lrt
> >  HOSTLOADLIBES_test_overhead += -lelf -lrt
> >  HOSTLOADLIBES_xdp1 += -lelf
> > +HOSTLOADLIBES_xdp2 += -lelf
> >
> >  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
> >  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
> > diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
> > new file mode 100644
> > index 0000000..38fe7e1
> > --- /dev/null
> > +++ b/samples/bpf/xdp2_kern.c
> > @@ -0,0 +1,114 @@
> > +/* Copyright (c) 2016 PLUMgrid
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of version 2 of the GNU General Public
> > + * License as published by the Free Software Foundation.
> > + */
> > +#define KBUILD_MODNAME "foo"
> > +#include <uapi/linux/bpf.h>
> > +#include <linux/in.h>
> > +#include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> > +#include <linux/if_vlan.h>
> > +#include <linux/ip.h>
> > +#include <linux/ipv6.h>
> > +#include "bpf_helpers.h"
> > +
> > +struct bpf_map_def SEC("maps") dropcnt = {
> > +       .type = BPF_MAP_TYPE_PERCPU_ARRAY,
> > +       .key_size = sizeof(u32),
> > +       .value_size = sizeof(long),
> > +       .max_entries = 256,
> > +};
> > +
> > +static void swap_src_dst_mac(void *data)
> > +{
> > +       unsigned short *p = data;
> > +       unsigned short dst[3];
> > +
> > +       dst[0] = p[0];
> > +       dst[1] = p[1];
> > +       dst[2] = p[2];
> > +       p[0] = p[3];
> > +       p[1] = p[4];
> > +       p[2] = p[5];
> > +       p[3] = dst[0];
> > +       p[4] = dst[1];
> > +       p[5] = dst[2];
> > +}
> > +
> > +static int parse_ipv4(void *data, u64 nh_off, void *data_end)
> > +{
> > +       struct iphdr *iph = data + nh_off;
> > +
> > +       if (iph + 1 > data_end)
> > +               return 0;
> > +       return iph->protocol;
> > +}
> > +
> > +static int parse_ipv6(void *data, u64 nh_off, void *data_end)
> > +{
> > +       struct ipv6hdr *ip6h = data + nh_off;
> > +
> > +       if (ip6h + 1 > data_end)
> > +               return 0;
> > +       return ip6h->nexthdr;
> > +}
> > +
> > +SEC("xdp1")
> > +int xdp_prog1(struct xdp_md *ctx)
> > +{
> > +       void *data_end = (void *)(long)ctx->data_end;
> > +       void *data = (void *)(long)ctx->data;
> 
> Brendan,
> 
> It seems that the cast to long here is done because data_end and data
> are u32s in xdp_md. So the effect is that we are upcasting a
> thirty-bit integer into a sixty-four bit pointer (in fact without the
> cast we see compiler warnings). I don't understand how this can be
> correct. Can you shed some light on this?

please see:
http://lists.iovisor.org/pipermail/iovisor-dev/2016-August/000355.html

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 17:11     ` Alexei Starovoitov
@ 2016-08-03 17:29       ` Tom Herbert
  2016-08-03 18:29         ` David Miller
  2016-08-03 18:29         ` Brenden Blanco
  0 siblings, 2 replies; 59+ messages in thread
From: Tom Herbert @ 2016-08-03 17:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann, Tariq Toukan,
	Aaron Yue

On Wed, Aug 3, 2016 at 10:11 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Wed, Aug 03, 2016 at 10:01:54AM -0700, Tom Herbert wrote:
>> On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
>> > Add a sample that rewrites and forwards packets out on the same
>> > interface. Observed single core forwarding performance of ~10Mpps.
>> >
>> > Since the mlx4 driver under test recycles every single packet page, the
>> > perf output shows almost exclusively just the ring management and bpf
>> > program work. Slowdowns are likely occurring due to cache misses.
>> >
>> > Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
>> > ---
>> >  samples/bpf/Makefile    |   5 +++
>> >  samples/bpf/xdp2_kern.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
>> >  2 files changed, 119 insertions(+)
>> >  create mode 100644 samples/bpf/xdp2_kern.c
>> >
>> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
>> > index 0e4ab3a..d2d2b35 100644
>> > --- a/samples/bpf/Makefile
>> > +++ b/samples/bpf/Makefile
>> > @@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
>> >  hostprogs-y += test_overhead
>> >  hostprogs-y += test_cgrp2_array_pin
>> >  hostprogs-y += xdp1
>> > +hostprogs-y += xdp2
>> >
>> >  test_verifier-objs := test_verifier.o libbpf.o
>> >  test_maps-objs := test_maps.o libbpf.o
>> > @@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
>> >  test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
>> >  test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
>> >  xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
>> > +# reuse xdp1 source intentionally
>> > +xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
>> >
>> >  # Tell kbuild to always build the programs
>> >  always := $(hostprogs-y)
>> > @@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
>> >  always += parse_varlen.o parse_simple.o parse_ldabs.o
>> >  always += test_cgrp2_tc_kern.o
>> >  always += xdp1_kern.o
>> > +always += xdp2_kern.o
>> >
>> >  HOSTCFLAGS += -I$(objtree)/usr/include
>> >
>> > @@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
>> >  HOSTLOADLIBES_map_perf_test += -lelf -lrt
>> >  HOSTLOADLIBES_test_overhead += -lelf -lrt
>> >  HOSTLOADLIBES_xdp1 += -lelf
>> > +HOSTLOADLIBES_xdp2 += -lelf
>> >
>> >  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>> >  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
>> > diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
>> > new file mode 100644
>> > index 0000000..38fe7e1
>> > --- /dev/null
>> > +++ b/samples/bpf/xdp2_kern.c
>> > @@ -0,0 +1,114 @@
>> > +/* Copyright (c) 2016 PLUMgrid
>> > + *
>> > + * This program is free software; you can redistribute it and/or
>> > + * modify it under the terms of version 2 of the GNU General Public
>> > + * License as published by the Free Software Foundation.
>> > + */
>> > +#define KBUILD_MODNAME "foo"
>> > +#include <uapi/linux/bpf.h>
>> > +#include <linux/in.h>
>> > +#include <linux/if_ether.h>
>> > +#include <linux/if_packet.h>
>> > +#include <linux/if_vlan.h>
>> > +#include <linux/ip.h>
>> > +#include <linux/ipv6.h>
>> > +#include "bpf_helpers.h"
>> > +
>> > +struct bpf_map_def SEC("maps") dropcnt = {
>> > +       .type = BPF_MAP_TYPE_PERCPU_ARRAY,
>> > +       .key_size = sizeof(u32),
>> > +       .value_size = sizeof(long),
>> > +       .max_entries = 256,
>> > +};
>> > +
>> > +static void swap_src_dst_mac(void *data)
>> > +{
>> > +       unsigned short *p = data;
>> > +       unsigned short dst[3];
>> > +
>> > +       dst[0] = p[0];
>> > +       dst[1] = p[1];
>> > +       dst[2] = p[2];
>> > +       p[0] = p[3];
>> > +       p[1] = p[4];
>> > +       p[2] = p[5];
>> > +       p[3] = dst[0];
>> > +       p[4] = dst[1];
>> > +       p[5] = dst[2];
>> > +}
>> > +
>> > +static int parse_ipv4(void *data, u64 nh_off, void *data_end)
>> > +{
>> > +       struct iphdr *iph = data + nh_off;
>> > +
>> > +       if (iph + 1 > data_end)
>> > +               return 0;
>> > +       return iph->protocol;
>> > +}
>> > +
>> > +static int parse_ipv6(void *data, u64 nh_off, void *data_end)
>> > +{
>> > +       struct ipv6hdr *ip6h = data + nh_off;
>> > +
>> > +       if (ip6h + 1 > data_end)
>> > +               return 0;
>> > +       return ip6h->nexthdr;
>> > +}
>> > +
>> > +SEC("xdp1")
>> > +int xdp_prog1(struct xdp_md *ctx)
>> > +{
>> > +       void *data_end = (void *)(long)ctx->data_end;
>> > +       void *data = (void *)(long)ctx->data;
>>
>> Brendan,
>>
>> It seems that the cast to long here is done because data_end and data
>> are u32s in xdp_md. So the effect is that we are upcasting a
>> thirty-bit integer into a sixty-four bit pointer (in fact without the
>> cast we see compiler warnings). I don't understand how this can be
>> correct. Can you shed some light on this?
>
> please see:
> http://lists.iovisor.org/pipermail/iovisor-dev/2016-August/000355.html
>
That doesn't explain it. The only thing I can figure is that there is
an implicit assumption somewhere that even though the pointer size may
be 64 bits, only the low order thirty-two bits are relevant in this
environment (i.e. the upper bits are always zero for any pointer)-- so
then it would be safe to store pointers as u32 and upcast them to void *.
If this is indeed the case, then we really need to make this explicit
to the user. Casting to long without comment just to avoid the
compiler warning is not good programming style; maybe a function
xdp_md_data_to_ptr or the like could be used.

Tom

^ permalink raw reply	[flat|nested] 59+ messages in thread
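
For illustration, the kind of wrapper suggested above could be a one-line
static inline in a sample header. The name and placement here are
hypothetical (it assumes the xdp_md definition from uapi/linux/bpf.h and
is not part of the posted series):

/* Hypothetical helper along the lines suggested above; not in the tree. */
static inline void *xdp_md_data_to_ptr(struct xdp_md *ctx)
{
	/* The u32 ctx fields are backed by real pointers in the kernel's
	 * struct xdp_buff; the verifier rewrites this access into a
	 * pointer-sized load, so the cast is purely cosmetic. */
	return (void *)(long)ctx->data;
}

The sample program would then open with
void *data = xdp_md_data_to_ptr(ctx); instead of the bare cast.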

* order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-07-25  7:35   ` Eric Dumazet
@ 2016-08-03 17:45     ` Alexei Starovoitov
  2016-08-04 16:19       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-08-03 17:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Brenden Blanco, davem, netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Jesper Dangaard Brouer, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan

On Mon, Jul 25, 2016 at 09:35:20AM +0200, Eric Dumazet wrote:
> On Tue, 2016-07-19 at 12:16 -0700, Brenden Blanco wrote:
> > The mlx4 driver by default allocates order-3 pages for the ring to
> > consume in multiple fragments. When the device has an xdp program, this
> > behavior will prevent tx actions since the page must be re-mapped in
> > TODEVICE mode, which cannot be done if the page is still shared.
> > 
> > Start by making the allocator configurable based on whether xdp is
> > running, such that order-0 pages are always used and never shared.
> > 
> > Since this will stress the page allocator, add a simple page cache to
> > each rx ring. Pages in the cache are left dma-mapped, and in drop-only
> > stress tests the page allocator is eliminated from the perf report.
> > 
> > Note that setting an xdp program will now require the rings to be
> > reconfigured.
> 
> Again, this has nothing to do with XDP ?
> 
> Please submit a separate patch, switching this driver to order-0
> allocations.
> 
> I mentioned this order-3 vs order-0 issue earlier [1], and proposed to
> send a generic patch, but had been traveling lately, and currently in
> vacation.
> 
> order-3 pages are problematic when dealing with hostile traffic anyway,
> so we should exclusively use order-0 pages, and page recycling like
> Intel drivers.
> 
> http://lists.openwall.net/netdev/2016/04/11/88

Completely agree. These multi-page tricks work only for benchmarks and
not for production.
Eric,
if you can submit that patch for mlx4 that would be awesome.

I think we should default to order-0 for both mlx4 and mlx5.
Alternatively we're thinking of adding a netlink or ethtool switch to
preserve the old behavior, but frankly I don't see who needs these order-N
allocation schemes.

^ permalink raw reply	[flat|nested] 59+ messages in thread
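
For readers following along, the difference being debated is roughly the
allocation order used to fill the rx ring. A simplified sketch, assuming
the behavior described in the quoted commit message (this is not the
actual mlx4 code):

#include <linux/gfp.h>

/* Simplified sketch of the two rx allocation strategies (not mlx4 code). */
static struct page *rx_alloc_ring_page(bool xdp_active)
{
	if (xdp_active)
		/* order-0: one 4KB page per frame, never shared, so it
		 * can be remapped for TX without waiting on other users. */
		return alloc_page(GFP_ATOMIC);

	/* order-3: one 32KB compound page, carved into multiple rx
	 * fragments that share the same page and DMA mapping. */
	return alloc_pages(GFP_ATOMIC | __GFP_COMP, 3);
}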

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 17:29       ` Tom Herbert
@ 2016-08-03 18:29         ` David Miller
  2016-08-03 18:29         ` Brenden Blanco
  1 sibling, 0 replies; 59+ messages in thread
From: David Miller @ 2016-08-03 18:29 UTC (permalink / raw)
  To: tom
  Cc: alexei.starovoitov, bblanco, netdev, jhs, saeedm, kafai, brouer,
	as754m, gerlitz.or, john.fastabend, hannes, tgraf, daniel,
	ttoukan.linux, haoxuany

From: Tom Herbert <tom@herbertland.com>
Date: Wed, 3 Aug 2016 10:29:58 -0700

> On Wed, Aug 3, 2016 at 10:11 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Wed, Aug 03, 2016 at 10:01:54AM -0700, Tom Herbert wrote:
>>> On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
>>> > Add a sample that rewrites and forwards packets out on the same
>>> > interface. Observed single core forwarding performance of ~10Mpps.
>>> >
>>> > Since the mlx4 driver under test recycles every single packet page, the
>>> > perf output shows almost exclusively just the ring management and bpf
>>> > program work. Slowdowns are likely occurring due to cache misses.
>>> >
>>> > Signed-off-by: Brenden Blanco <bblanco@plumgrid.com>
>>> > ---
>>> >  samples/bpf/Makefile    |   5 +++
>>> >  samples/bpf/xdp2_kern.c | 114 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> >  2 files changed, 119 insertions(+)
>>> >  create mode 100644 samples/bpf/xdp2_kern.c
>>> >
>>> > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
>>> > index 0e4ab3a..d2d2b35 100644
>>> > --- a/samples/bpf/Makefile
>>> > +++ b/samples/bpf/Makefile
>>> > @@ -22,6 +22,7 @@ hostprogs-y += map_perf_test
>>> >  hostprogs-y += test_overhead
>>> >  hostprogs-y += test_cgrp2_array_pin
>>> >  hostprogs-y += xdp1
>>> > +hostprogs-y += xdp2
>>> >
>>> >  test_verifier-objs := test_verifier.o libbpf.o
>>> >  test_maps-objs := test_maps.o libbpf.o
>>> > @@ -44,6 +45,8 @@ map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
>>> >  test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
>>> >  test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
>>> >  xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
>>> > +# reuse xdp1 source intentionally
>>> > +xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
>>> >
>>> >  # Tell kbuild to always build the programs
>>> >  always := $(hostprogs-y)
>>> > @@ -67,6 +70,7 @@ always += test_overhead_kprobe_kern.o
>>> >  always += parse_varlen.o parse_simple.o parse_ldabs.o
>>> >  always += test_cgrp2_tc_kern.o
>>> >  always += xdp1_kern.o
>>> > +always += xdp2_kern.o
>>> >
>>> >  HOSTCFLAGS += -I$(objtree)/usr/include
>>> >
>>> > @@ -88,6 +92,7 @@ HOSTLOADLIBES_spintest += -lelf
>>> >  HOSTLOADLIBES_map_perf_test += -lelf -lrt
>>> >  HOSTLOADLIBES_test_overhead += -lelf -lrt
>>> >  HOSTLOADLIBES_xdp1 += -lelf
>>> > +HOSTLOADLIBES_xdp2 += -lelf
>>> >
>>> >  # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
>>> >  #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
>>> > diff --git a/samples/bpf/xdp2_kern.c b/samples/bpf/xdp2_kern.c
>>> > new file mode 100644
>>> > index 0000000..38fe7e1
>>> > --- /dev/null
>>> > +++ b/samples/bpf/xdp2_kern.c
>>> > @@ -0,0 +1,114 @@
>>> > +/* Copyright (c) 2016 PLUMgrid
>>> > + *
>>> > + * This program is free software; you can redistribute it and/or
>>> > + * modify it under the terms of version 2 of the GNU General Public
>>> > + * License as published by the Free Software Foundation.
>>> > + */
>>> > +#define KBUILD_MODNAME "foo"
>>> > +#include <uapi/linux/bpf.h>
>>> > +#include <linux/in.h>
>>> > +#include <linux/if_ether.h>
>>> > +#include <linux/if_packet.h>
>>> > +#include <linux/if_vlan.h>
>>> > +#include <linux/ip.h>
>>> > +#include <linux/ipv6.h>
>>> > +#include "bpf_helpers.h"
>>> > +
>>> > +struct bpf_map_def SEC("maps") dropcnt = {
>>> > +       .type = BPF_MAP_TYPE_PERCPU_ARRAY,
>>> > +       .key_size = sizeof(u32),
>>> > +       .value_size = sizeof(long),
>>> > +       .max_entries = 256,
>>> > +};
>>> > +
>>> > +static void swap_src_dst_mac(void *data)
>>> > +{
>>> > +       unsigned short *p = data;
>>> > +       unsigned short dst[3];
>>> > +
>>> > +       dst[0] = p[0];
>>> > +       dst[1] = p[1];
>>> > +       dst[2] = p[2];
>>> > +       p[0] = p[3];
>>> > +       p[1] = p[4];
>>> > +       p[2] = p[5];
>>> > +       p[3] = dst[0];
>>> > +       p[4] = dst[1];
>>> > +       p[5] = dst[2];
>>> > +}
>>> > +
>>> > +static int parse_ipv4(void *data, u64 nh_off, void *data_end)
>>> > +{
>>> > +       struct iphdr *iph = data + nh_off;
>>> > +
>>> > +       if (iph + 1 > data_end)
>>> > +               return 0;
>>> > +       return iph->protocol;
>>> > +}
>>> > +
>>> > +static int parse_ipv6(void *data, u64 nh_off, void *data_end)
>>> > +{
>>> > +       struct ipv6hdr *ip6h = data + nh_off;
>>> > +
>>> > +       if (ip6h + 1 > data_end)
>>> > +               return 0;
>>> > +       return ip6h->nexthdr;
>>> > +}
>>> > +
>>> > +SEC("xdp1")
>>> > +int xdp_prog1(struct xdp_md *ctx)
>>> > +{
>>> > +       void *data_end = (void *)(long)ctx->data_end;
>>> > +       void *data = (void *)(long)ctx->data;
>>>
>>> Brendan,
>>>
>>> It seems that the cast to long here is done because data_end and data
>>> are u32s in xdp_md. So the effect is that we are upcasting a
>>> thirty-bit integer into a sixty-four bit pointer (in fact without the
>>> cast we see compiler warnings). I don't understand how this can be
>>> correct. Can you shed some light on this?
>>
>> please see:
>> http://lists.iovisor.org/pipermail/iovisor-dev/2016-August/000355.html
>>
> That doesn't explain it.

Yes it does explain it; think more about the word "meta" and what the
code generator might be doing.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 17:29       ` Tom Herbert
  2016-08-03 18:29         ` David Miller
@ 2016-08-03 18:29         ` Brenden Blanco
  2016-08-03 18:31           ` David Miller
  2016-08-03 19:06           ` Tom Herbert
  1 sibling, 2 replies; 59+ messages in thread
From: Brenden Blanco @ 2016-08-03 18:29 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Thomas Graf, Daniel Borkmann, Tariq Toukan, Aaron Yue

On Wed, Aug 03, 2016 at 10:29:58AM -0700, Tom Herbert wrote:
> On Wed, Aug 3, 2016 at 10:11 AM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Wed, Aug 03, 2016 at 10:01:54AM -0700, Tom Herbert wrote:
> >> On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
[...]
> >> > +SEC("xdp1")
> >> > +int xdp_prog1(struct xdp_md *ctx)
> >> > +{
> >> > +       void *data_end = (void *)(long)ctx->data_end;
> >> > +       void *data = (void *)(long)ctx->data;
> >>
> >> Brendan,
> >>
> >> It seems that the cast to long here is done because data_end and data
> >> are u32s in xdp_md. So the effect is that we are upcasting a
> >> thirty-bit integer into a sixty-four bit pointer (in fact without the
> >> cast we see compiler warnings). I don't understand how this can be
> >> correct. Can you shed some light on this?
> >
> > please see:
> > http://lists.iovisor.org/pipermail/iovisor-dev/2016-August/000355.html
> >
> That doesn't explain it. The only thing I can figure is that there is
> an implicit assumption somewhere that even though the pointer size may
> be 64 bits, only the low order thirty-two bits are relevant in this
> environment (i.e. upper bit are always zero for any pointers)-- so
> then it would safe store pointers as u32 and to upcast them to void *.
No, the actual pointer storage is always void* sized (see struct
xdp_buff). The mangling is cosmetic. The verifier converts the
underlying bpf load instruction to the right sized operation.
> If this is indeed the case, then we really need to make this explicit
> to the user. Casting to long without comment just to avoid the
> compiler warning is not good programming style, maybe a function
> xdp_md_data_to_ptr or the like could be used.
> 
> Tom

^ permalink raw reply	[flat|nested] 59+ messages in thread
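
To make the point above concrete, the two views of the same context look
roughly like this (fields abridged to the ones discussed here):

/* What the program is written against (uapi/linux/bpf.h), abridged: */
struct xdp_md {
	__u32 data;
	__u32 data_end;
};

/* What actually backs the ctx pointer inside the kernel, abridged: */
struct xdp_buff {
	void *data;
	void *data_end;
};

The verifier translates each access to the __u32 fields into a load of the
corresponding pointer-sized field, which is why the (void *)(long) cast
involves no truncation.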

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 18:29         ` Brenden Blanco
@ 2016-08-03 18:31           ` David Miller
  2016-08-03 19:06           ` Tom Herbert
  1 sibling, 0 replies; 59+ messages in thread
From: David Miller @ 2016-08-03 18:31 UTC (permalink / raw)
  To: bblanco
  Cc: tom, alexei.starovoitov, netdev, jhs, saeedm, kafai, brouer,
	as754m, gerlitz.or, john.fastabend, hannes, tgraf, daniel,
	ttoukan.linux, haoxuany

From: Brenden Blanco <bblanco@plumgrid.com>
Date: Wed, 3 Aug 2016 11:29:52 -0700

> On Wed, Aug 03, 2016 at 10:29:58AM -0700, Tom Herbert wrote:
>> On Wed, Aug 3, 2016 at 10:11 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Wed, Aug 03, 2016 at 10:01:54AM -0700, Tom Herbert wrote:
>> >> On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> [...]
>> >> > +SEC("xdp1")
>> >> > +int xdp_prog1(struct xdp_md *ctx)
>> >> > +{
>> >> > +       void *data_end = (void *)(long)ctx->data_end;
>> >> > +       void *data = (void *)(long)ctx->data;
>> >>
>> >> Brendan,
>> >>
>> >> It seems that the cast to long here is done because data_end and data
>> >> are u32s in xdp_md. So the effect is that we are upcasting a
>> >> thirty-bit integer into a sixty-four bit pointer (in fact without the
>> >> cast we see compiler warnings). I don't understand how this can be
>> >> correct. Can you shed some light on this?
>> >
>> > please see:
>> > http://lists.iovisor.org/pipermail/iovisor-dev/2016-August/000355.html
>> >
>> That doesn't explain it. The only thing I can figure is that there is
>> an implicit assumption somewhere that even though the pointer size may
>> be 64 bits, only the low order thirty-two bits are relevant in this
>> environment (i.e. upper bit are always zero for any pointers)-- so
>> then it would safe store pointers as u32 and to upcast them to void *.
> No, the actual pointer storage is always void* sized (see struct
> xdp_buff). The mangling is cosmetic. The verifier converts the
> underlying bpf load instruction to the right sized operation.

And this is what Alexei meant by "meta".  Tom, this stuff works just
fine.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 18:29         ` Brenden Blanco
  2016-08-03 18:31           ` David Miller
@ 2016-08-03 19:06           ` Tom Herbert
  2016-08-03 22:36             ` Alexei Starovoitov
  1 sibling, 1 reply; 59+ messages in thread
From: Tom Herbert @ 2016-08-03 19:06 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: Alexei Starovoitov, David S. Miller,
	Linux Kernel Network Developers, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Jesper Dangaard Brouer,
	Ari Saha, Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Thomas Graf, Daniel Borkmann, Tariq Toukan, Aaron Yue

On Wed, Aug 3, 2016 at 11:29 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> On Wed, Aug 03, 2016 at 10:29:58AM -0700, Tom Herbert wrote:
>> On Wed, Aug 3, 2016 at 10:11 AM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Wed, Aug 03, 2016 at 10:01:54AM -0700, Tom Herbert wrote:
>> >> On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> [...]
>> >> > +SEC("xdp1")
>> >> > +int xdp_prog1(struct xdp_md *ctx)
>> >> > +{
>> >> > +       void *data_end = (void *)(long)ctx->data_end;
>> >> > +       void *data = (void *)(long)ctx->data;
>> >>
>> >> Brendan,
>> >>
>> >> It seems that the cast to long here is done because data_end and data
>> >> are u32s in xdp_md. So the effect is that we are upcasting a
>> >> thirty-bit integer into a sixty-four bit pointer (in fact without the
>> >> cast we see compiler warnings). I don't understand how this can be
>> >> correct. Can you shed some light on this?
>> >
>> > please see:
>> > http://lists.iovisor.org/pipermail/iovisor-dev/2016-August/000355.html
>> >
>> That doesn't explain it. The only thing I can figure is that there is
>> an implicit assumption somewhere that even though the pointer size may
>> be 64 bits, only the low order thirty-two bits are relevant in this
>> environment (i.e. upper bit are always zero for any pointers)-- so
>> then it would safe store pointers as u32 and to upcast them to void *.
> No, the actual pointer storage is always void* sized (see struct
> xdp_buff). The mangling is cosmetic. The verifier converts the
> underlying bpf load instruction to the right sized operation.

This is not at all obvious to an XDP programmer. The type of the ctx
structure is xdp_md, and the definition of that structure in
uapi/linux/bpf.h says that its fields are __u32.
So when I, as a user naive to the inner workings of the verifier, read
this code it sure looks like we are upcasting a 32 bit value to a 64
bit value-- that does not seem right at all, and the compiler
apparently concurs with my point of view. If the code ends up being
correct anyway, then the obvious answer is to have an explicit cast
that points out the special nature of this cast. Blindly casting a u32
to long for the purpose of assigning to a pointer is only going to
confuse more people, as it has confused me.

Tom

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 19:06           ` Tom Herbert
@ 2016-08-03 22:36             ` Alexei Starovoitov
  2016-08-03 23:18               ` Daniel Borkmann
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-08-03 22:36 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Daniel Borkmann, Tariq Toukan,
	Aaron Yue

On Wed, Aug 03, 2016 at 12:06:55PM -0700, Tom Herbert wrote:
> On Wed, Aug 3, 2016 at 11:29 AM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> > On Wed, Aug 03, 2016 at 10:29:58AM -0700, Tom Herbert wrote:
> >> On Wed, Aug 3, 2016 at 10:11 AM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >> > On Wed, Aug 03, 2016 at 10:01:54AM -0700, Tom Herbert wrote:
> >> >> On Tue, Jul 19, 2016 at 12:16 PM, Brenden Blanco <bblanco@plumgrid.com> wrote:
> > [...]
> >> >> > +SEC("xdp1")
> >> >> > +int xdp_prog1(struct xdp_md *ctx)
> >> >> > +{
> >> >> > +       void *data_end = (void *)(long)ctx->data_end;
> >> >> > +       void *data = (void *)(long)ctx->data;
> >> >>
> >> >> Brendan,
> >> >>
> >> >> It seems that the cast to long here is done because data_end and data
> >> >> are u32s in xdp_md. So the effect is that we are upcasting a
> >> >> thirty-bit integer into a sixty-four bit pointer (in fact without the
> >> >> cast we see compiler warnings). I don't understand how this can be
> >> >> correct. Can you shed some light on this?
> >> >
> >> > please see:
> >> > http://lists.iovisor.org/pipermail/iovisor-dev/2016-August/000355.html
> >> >
> >> That doesn't explain it. The only thing I can figure is that there is
> >> an implicit assumption somewhere that even though the pointer size may
> >> be 64 bits, only the low order thirty-two bits are relevant in this
> >> environment (i.e. upper bit are always zero for any pointers)-- so
> >> then it would safe store pointers as u32 and to upcast them to void *.
> > No, the actual pointer storage is always void* sized (see struct
> > xdp_buff). The mangling is cosmetic. The verifier converts the
> > underlying bpf load instruction to the right sized operation.
> 
> This is not at all obvious to XDP programmer. The type of ctx
> structure is xdp_md and the definition of that structure in
> uapi/linux/bpf.h says that the fields in the that structure are __u32.
> So when I, as a user naive the inner workings of the verifier, read
> this code it sure looks like we are upcasting a 32 bit value to a 64
> bit value-- that does not seem right at all and the compiler
> apparently concurs my point of view. If the code ends up being correct
> anyway, then the obvious answer to have an explicit cast that points
> out the special nature of this cast. Blindly casting to u32 to long
> for the purposes of assigning to a pointer is only going to confuse
> more people as it has me.

Agree. Would be nice to have a few helpers. The question is whether
they belong in bpf.h. Probably not, since they're not kernel ABI.
For the same reasons we didn't include instruction-building macros
like BPF_ALU64_REG and instead kept them in samples/bpf/libbpf.h.
Here probably four static inline functions are needed: two for __sk_buff
and two for xdp_md. That should make the xdp*_kern.c examples a bit easier
to read.

^ permalink raw reply	[flat|nested] 59+ messages in thread
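
A sketch of what those four helpers might look like if added to a sample
header such as samples/bpf/libbpf.h; the names are illustrative and
nothing here was merged as part of this thread:

#include <uapi/linux/bpf.h>

/* Illustrative accessors; the names and location are hypothetical. */
static inline void *xdp_data(struct xdp_md *ctx)
{
	return (void *)(long)ctx->data;
}

static inline void *xdp_data_end(struct xdp_md *ctx)
{
	return (void *)(long)ctx->data_end;
}

static inline void *skb_data(struct __sk_buff *skb)
{
	return (void *)(long)skb->data;
}

static inline void *skb_data_end(struct __sk_buff *skb)
{
	return (void *)(long)skb->data_end;
}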

* Re: [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite
  2016-08-03 22:36             ` Alexei Starovoitov
@ 2016-08-03 23:18               ` Daniel Borkmann
  0 siblings, 0 replies; 59+ messages in thread
From: Daniel Borkmann @ 2016-08-03 23:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Jesper Dangaard Brouer, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Tariq Toukan, Aaron Yue

On 08/04/2016 12:36 AM, Alexei Starovoitov wrote:
> On Wed, Aug 03, 2016 at 12:06:55PM -0700, Tom Herbert wrote:
[...]
>> This is not at all obvious to XDP programmer. The type of ctx
>> structure is xdp_md and the definition of that structure in
>> uapi/linux/bpf.h says that the fields in the that structure are __u32.
>> So when I, as a user naive the inner workings of the verifier, read
>> this code it sure looks like we are upcasting a 32 bit value to a 64
>> bit value-- that does not seem right at all and the compiler
>> apparently concurs my point of view. If the code ends up being correct
>> anyway, then the obvious answer to have an explicit cast that points
>> out the special nature of this cast. Blindly casting to u32 to long
>> for the purposes of assigning to a pointer is only going to confuse
>> more people as it has me.
>
> Agree. Would be nice to have few helpers. The question is whether
> they belong in bpf.h. Probably not, since they're not kernel abi.

Yeah, fully agree, they don't belong into uapi.

> For the same reasons we didn't include instruction building macros
> like BPF_ALU64_REG and instead kept them in samples/bpf/libbpf.h

+1

> Here probably four static inline functions are needed. Two for __sk_buff
> and two for xpd_md. That should make xdp*_kern.c examples a bit easier
> to read.

Yeah, all these bits should go into some library headers that the test
cases make use of (and make them easier to parse for developers).

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-03 17:45     ` order-0 vs order-N driver allocation. Was: " Alexei Starovoitov
@ 2016-08-04 16:19       ` Jesper Dangaard Brouer
  2016-08-05  0:30         ` Alexander Duyck
  2016-08-05  7:15           ` Eric Dumazet
  0 siblings, 2 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-08-04 16:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Brenden Blanco, davem, netdev, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, brouer, Mel Gorman, linux-mm


On Wed, 3 Aug 2016 10:45:13 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Mon, Jul 25, 2016 at 09:35:20AM +0200, Eric Dumazet wrote:
> > On Tue, 2016-07-19 at 12:16 -0700, Brenden Blanco wrote:  
> > > The mlx4 driver by default allocates order-3 pages for the ring to
> > > consume in multiple fragments. When the device has an xdp program, this
> > > behavior will prevent tx actions since the page must be re-mapped in
> > > TODEVICE mode, which cannot be done if the page is still shared.
> > > 
> > > Start by making the allocator configurable based on whether xdp is
> > > running, such that order-0 pages are always used and never shared.
> > > 
> > > Since this will stress the page allocator, add a simple page cache to
> > > each rx ring. Pages in the cache are left dma-mapped, and in drop-only
> > > stress tests the page allocator is eliminated from the perf report.
> > > 
> > > Note that setting an xdp program will now require the rings to be
> > > reconfigured.  
> > 
> > Again, this has nothing to do with XDP ?
> > 
> > Please submit a separate patch, switching this driver to order-0
> > allocations.
> > 
> > I mentioned this order-3 vs order-0 issue earlier [1], and proposed to
> > send a generic patch, but had been traveling lately, and currently in
> > vacation.
> > 
> > order-3 pages are problematic when dealing with hostile traffic anyway,
> > so we should exclusively use order-0 pages, and page recycling like
> > Intel drivers.
> > 
> > http://lists.openwall.net/netdev/2016/04/11/88  
> 
> Completely agree. These multi-page tricks work only for benchmarks and
> not for production.
> Eric, if you can submit that patch for mlx4 that would be awesome.
> 
> I think we should default to order-0 for both mlx4 and mlx5.
> Alternatively we're thinking to do a netlink or ethtool switch to
> preserve old behavior, but frankly I don't see who needs this order-N
> allocation schemes.

I actually agree that we should switch to order-0 allocations.

*BUT* this will cause performance regressions on platforms with
expensive DMA operations (as they no longer amortize the cost of
mapping a larger page).

Plus, the base cost of an order-0 page is 246 cycles (see [1] slide#9),
and the 10G wirespeed target is approx 201 cycles.  Thus, for these
speeds some page recycling tricks are needed.  I described how the Intel
drivers do a cool trick in [1] slide#14, but it does not address the
DMA part and costs some extra atomic ops.

I've started coding on the page-pool last week, which addresses both the
DMA mapping and recycling (with fewer atomic ops). (p.s. still on
vacation this week).

http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread
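
For reference, the ~201-cycle budget cited above follows from
minimum-sized frames at 10G line rate, assuming a ~3 GHz core:

  10 Gbit/s / ((64 + 20) bytes * 8 bits/byte)  ~= 14.88 Mpps
  3.0e9 cycles/s / 14.88e6 packets/s           ~= 201 cycles/packet

(the extra 20 bytes are the Ethernet preamble plus inter-frame gap).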

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-04 16:19       ` Jesper Dangaard Brouer
@ 2016-08-05  0:30         ` Alexander Duyck
  2016-08-05  3:55           ` Alexei Starovoitov
  2016-08-05  7:15           ` Eric Dumazet
  1 sibling, 1 reply; 59+ messages in thread
From: Alexander Duyck @ 2016-08-05  0:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Eric Dumazet, Brenden Blanco, David Miller,
	Netdev, Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau,
	Ari Saha, Or Gerlitz, john fastabend, Hannes Frederic Sowa,
	Thomas Graf, Tom Herbert, Daniel Borkmann, Tariq Toukan,
	Mel Gorman, linux-mm

On Thu, Aug 4, 2016 at 9:19 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 3 Aug 2016 10:45:13 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
>> On Mon, Jul 25, 2016 at 09:35:20AM +0200, Eric Dumazet wrote:
>> > On Tue, 2016-07-19 at 12:16 -0700, Brenden Blanco wrote:
>> > > The mlx4 driver by default allocates order-3 pages for the ring to
>> > > consume in multiple fragments. When the device has an xdp program, this
>> > > behavior will prevent tx actions since the page must be re-mapped in
>> > > TODEVICE mode, which cannot be done if the page is still shared.
>> > >
>> > > Start by making the allocator configurable based on whether xdp is
>> > > running, such that order-0 pages are always used and never shared.
>> > >
>> > > Since this will stress the page allocator, add a simple page cache to
>> > > each rx ring. Pages in the cache are left dma-mapped, and in drop-only
>> > > stress tests the page allocator is eliminated from the perf report.
>> > >
>> > > Note that setting an xdp program will now require the rings to be
>> > > reconfigured.
>> >
>> > Again, this has nothing to do with XDP ?
>> >
>> > Please submit a separate patch, switching this driver to order-0
>> > allocations.
>> >
>> > I mentioned this order-3 vs order-0 issue earlier [1], and proposed to
>> > send a generic patch, but had been traveling lately, and currently in
>> > vacation.
>> >
>> > order-3 pages are problematic when dealing with hostile traffic anyway,
>> > so we should exclusively use order-0 pages, and page recycling like
>> > Intel drivers.
>> >
>> > http://lists.openwall.net/netdev/2016/04/11/88
>>
>> Completely agree. These multi-page tricks work only for benchmarks and
>> not for production.
>> Eric, if you can submit that patch for mlx4 that would be awesome.
>>
>> I think we should default to order-0 for both mlx4 and mlx5.
>> Alternatively we're thinking to do a netlink or ethtool switch to
>> preserve old behavior, but frankly I don't see who needs this order-N
>> allocation schemes.
>
> I actually agree, that we should switch to order-0 allocations.
>
> *BUT* this will cause performance regressions on platforms with
> expensive DMA operations (as they no longer amortize the cost of
> mapping a larger page).

The trick is to use page reuse like we do for the Intel NICs.  If you
can get away with just reusing the page you don't have to keep making
the expensive map/unmap calls.

> Plus, the base cost of order-0 page is 246 cycles (see [1] slide#9),
> and the 10G wirespeed target is approx 201 cycles.  Thus, for these
> speeds some page recycling tricks are needed.  I described how the Intel
> drives does a cool trick in [1] slide#14, but it does not address the
> DMA part and costs some extra atomic ops.

I'm not sure what you mean about it not addressing the DMA part.  Last
I knew we should be just as fast using the page reuse in the Intel
drivers as the Mellanox driver using the 32K page.  The only real
difference in cost is the spot where we are atomically incrementing
the page count since that is the atomic I assume you are referring to.

I had thought about it and amortizing the atomic operation would
probably be pretty straightforward.  All we would have to do is the
same trick we use in the page frag allocator.  We could add a separate
page_count type variable to the Rx buffer info structure and decrement
that instead.  If I am not mistaken that would allow us to drop it
down to only one atomic update of the page count every 64K or so uses
of the page.

> I've started coding on the page-pool last week, which address both the
> DMA mapping and recycling (with less atomic ops). (p.s. still on
> vacation this week).
>
> http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

I really wonder if we couldn't get away with creating some sort of
2-tiered allocator for this.  So instead of allocating a page pool we
just reserved blocks of memory like we do with huge pages.  Then you
have essentially a huge page that is mapped to a given device for DMA
and reserved for it to use as a memory resource to allocate the order
0 pages out of.  Doing it that way would likely have multiple
advantages when working with things like IOMMU since the pages would
all belong to one linear block so it would likely consume less
resources on those devices, and it wouldn't be that far off from how
DPDK is making use of huge pages in order to improve its memory
access times and such.

- Alex

^ permalink raw reply	[flat|nested] 59+ messages in thread
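
A rough sketch of the reuse scheme described above (simplified, not
literal Intel driver code; struct rx_buffer is a hypothetical stand-in
for the driver's per-descriptor bookkeeping):

#include <linux/mm.h>

/* Simplified half-page rx buffer reuse; struct rx_buffer is made up. */
struct rx_buffer {
	struct page *page;
	unsigned int page_offset;
};

static bool rx_try_reuse_page(struct rx_buffer *buf)
{
	/* Only reuse when the stack has dropped its reference and we are
	 * the sole owner of the page. */
	if (page_count(buf->page) != 1)
		return false;

	/* Flip to the other 2KB half for the next frame; the DMA mapping
	 * is kept, so no unmap/map cycle is needed. */
	buf->page_offset ^= PAGE_SIZE / 2;

	/* Give the half we just consumed back to the hardware side.  This
	 * is the atomic that could be amortized as discussed above. */
	get_page(buf->page);
	return true;
}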

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-05  0:30         ` Alexander Duyck
@ 2016-08-05  3:55           ` Alexei Starovoitov
  2016-08-05 15:15             ` Alexander Duyck
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-08-05  3:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, Eric Dumazet, Brenden Blanco,
	David Miller, Netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Tom Herbert, Daniel Borkmann,
	Tariq Toukan, Mel Gorman, linux-mm

On Thu, Aug 04, 2016 at 05:30:56PM -0700, Alexander Duyck wrote:
> On Thu, Aug 4, 2016 at 9:19 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Wed, 3 Aug 2016 10:45:13 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> >> On Mon, Jul 25, 2016 at 09:35:20AM +0200, Eric Dumazet wrote:
> >> > On Tue, 2016-07-19 at 12:16 -0700, Brenden Blanco wrote:
> >> > > The mlx4 driver by default allocates order-3 pages for the ring to
> >> > > consume in multiple fragments. When the device has an xdp program, this
> >> > > behavior will prevent tx actions since the page must be re-mapped in
> >> > > TODEVICE mode, which cannot be done if the page is still shared.
> >> > >
> >> > > Start by making the allocator configurable based on whether xdp is
> >> > > running, such that order-0 pages are always used and never shared.
> >> > >
> >> > > Since this will stress the page allocator, add a simple page cache to
> >> > > each rx ring. Pages in the cache are left dma-mapped, and in drop-only
> >> > > stress tests the page allocator is eliminated from the perf report.
> >> > >
> >> > > Note that setting an xdp program will now require the rings to be
> >> > > reconfigured.
> >> >
> >> > Again, this has nothing to do with XDP ?
> >> >
> >> > Please submit a separate patch, switching this driver to order-0
> >> > allocations.
> >> >
> >> > I mentioned this order-3 vs order-0 issue earlier [1], and proposed to
> >> > send a generic patch, but had been traveling lately, and currently in
> >> > vacation.
> >> >
> >> > order-3 pages are problematic when dealing with hostile traffic anyway,
> >> > so we should exclusively use order-0 pages, and page recycling like
> >> > Intel drivers.
> >> >
> >> > http://lists.openwall.net/netdev/2016/04/11/88
> >>
> >> Completely agree. These multi-page tricks work only for benchmarks and
> >> not for production.
> >> Eric, if you can submit that patch for mlx4 that would be awesome.
> >>
> >> I think we should default to order-0 for both mlx4 and mlx5.
> >> Alternatively we're thinking to do a netlink or ethtool switch to
> >> preserve old behavior, but frankly I don't see who needs this order-N
> >> allocation schemes.
> >
> > I actually agree, that we should switch to order-0 allocations.
> >
> > *BUT* this will cause performance regressions on platforms with
> > expensive DMA operations (as they no longer amortize the cost of
> > mapping a larger page).

order-0 is mainly about correctness under memory pressure.
As Eric pointed out order-N is a serious issue for hostile traffic,
but even for normal traffic it's a problem. Sooner or later
only order-0 pages will be available.
Performance considerations come second.

> The trick is to use page reuse like we do for the Intel NICs.  If you
> can get away with just reusing the page you don't have to keep making
> the expensive map/unmap calls.

you mean the two-packets-per-page trick?
I think it's trading off performance vs memory.
It's useful. I wish there were a knob to turn it on/off instead
of relying on the mtu size threshold.

> > I've started coding on the page-pool last week, which address both the
> > DMA mapping and recycling (with less atomic ops). (p.s. still on
> > vacation this week).
> >
> > http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
> 
> I really wonder if we couldn't get away with creating some sort of 2
> tiered allocator for this.  So instead of allocating a page pool we
> just reserved blocks of memory like we do with huge pages.  Then you
> have essentially a huge page that is mapped to a given device for DMA
> and reserved for it to use as a memory resource to allocate the order
> 0 pages out of.  Doing it that way would likely have multiple
> advantages when working with things like IOMMU since the pages would
> all belong to one linear block so it would likely consume less
> resources on those devices, and it wouldn't be that far off from how
> DPDK is making use of huge pages in order to improve it's memory
> access times and such.

interesting idea. Like dma_map 1GB region and then allocate
pages from it only? but the rest of the kernel won't be able
to use them? so only some smaller region then? or it will be
a boot time flag to reserve this pseudo-huge page?
I don't think any of that is needed for XDP. As demonstrated by current
mlx4 it's very fast already. No bottlenecks in page allocators.
Tiny page recycle array does the magic because most of the traffic
is not going to the stack.
This order-0 vs order-N discussion is for the main stack.
Not related to XDP.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-04 16:19       ` Jesper Dangaard Brouer
@ 2016-08-05  7:15           ` Eric Dumazet
  2016-08-05  7:15           ` Eric Dumazet
  1 sibling, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2016-08-05  7:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Brenden Blanco, davem, netdev,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, Mel Gorman, linux-mm

On Thu, 2016-08-04 at 18:19 +0200, Jesper Dangaard Brouer wrote:

> I actually agree, that we should switch to order-0 allocations.
> 
> *BUT* this will cause performance regressions on platforms with
> expensive DMA operations (as they no longer amortize the cost of
> mapping a larger page).


We much prefer reliable behavior, even it it is ~1 % slower than the
super-optimized thing that opens highways for attackers.

Anyway, in most cases pages are re-used, so we only call
dma_sync_single_range_for_cpu(), and there is no way to avoid this.

Using order-0 pages [1] is actually faster, since when we use high-order
pages (multiple frames per 'page') we can not reuse the pages.

[1] I had a local patch to allocate these pages using a very simple
allocator allocating max order (order-10) pages and splitting them into
order-0 ages, in order to lower TLB footprint. But I could not measure a
gain doing so on x86, at least on my lab machines.

^ permalink raw reply	[flat|nested] 59+ messages in thread
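
In driver terms, the reuse path described above boils down to a sync
instead of a remap. A rough sketch under that assumption (illustrative,
not actual mlx4 code):

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Sketch: hand a recycled, still-mapped rx page to the CPU. */
static void *rx_pull_data(struct device *dev, struct page *page,
			  dma_addr_t dma, unsigned int offset,
			  unsigned int len)
{
	/* The page keeps its DMA mapping for its whole lifetime; before
	 * the CPU reads the payload only a sync is required. */
	dma_sync_single_range_for_cpu(dev, dma, offset, len,
				      DMA_FROM_DEVICE);
	return page_address(page) + offset;
}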

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-05  3:55           ` Alexei Starovoitov
@ 2016-08-05 15:15             ` Alexander Duyck
  2016-08-05 15:33                 ` David Laight
  0 siblings, 1 reply; 59+ messages in thread
From: Alexander Duyck @ 2016-08-05 15:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Eric Dumazet, Brenden Blanco,
	David Miller, Netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Tom Herbert, Daniel Borkmann,
	Tariq Toukan, Mel Gorman, linux-mm

On Thu, Aug 4, 2016 at 8:55 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Thu, Aug 04, 2016 at 05:30:56PM -0700, Alexander Duyck wrote:
>> On Thu, Aug 4, 2016 at 9:19 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> >
>> > On Wed, 3 Aug 2016 10:45:13 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>> >
>> >> On Mon, Jul 25, 2016 at 09:35:20AM +0200, Eric Dumazet wrote:
>> >> > On Tue, 2016-07-19 at 12:16 -0700, Brenden Blanco wrote:
>> >> > > The mlx4 driver by default allocates order-3 pages for the ring to
>> >> > > consume in multiple fragments. When the device has an xdp program, this
>> >> > > behavior will prevent tx actions since the page must be re-mapped in
>> >> > > TODEVICE mode, which cannot be done if the page is still shared.
>> >> > >
>> >> > > Start by making the allocator configurable based on whether xdp is
>> >> > > running, such that order-0 pages are always used and never shared.
>> >> > >
>> >> > > Since this will stress the page allocator, add a simple page cache to
>> >> > > each rx ring. Pages in the cache are left dma-mapped, and in drop-only
>> >> > > stress tests the page allocator is eliminated from the perf report.
>> >> > >
>> >> > > Note that setting an xdp program will now require the rings to be
>> >> > > reconfigured.
>> >> >
>> >> > Again, this has nothing to do with XDP ?
>> >> >
>> >> > Please submit a separate patch, switching this driver to order-0
>> >> > allocations.
>> >> >
>> >> > I mentioned this order-3 vs order-0 issue earlier [1], and proposed to
>> >> > send a generic patch, but had been traveling lately, and currently in
>> >> > vacation.
>> >> >
>> >> > order-3 pages are problematic when dealing with hostile traffic anyway,
>> >> > so we should exclusively use order-0 pages, and page recycling like
>> >> > Intel drivers.
>> >> >
>> >> > http://lists.openwall.net/netdev/2016/04/11/88
>> >>
>> >> Completely agree. These multi-page tricks work only for benchmarks and
>> >> not for production.
>> >> Eric, if you can submit that patch for mlx4 that would be awesome.
>> >>
>> >> I think we should default to order-0 for both mlx4 and mlx5.
>> >> Alternatively we're thinking to do a netlink or ethtool switch to
>> >> preserve old behavior, but frankly I don't see who needs this order-N
>> >> allocation schemes.
>> >
>> > I actually agree, that we should switch to order-0 allocations.
>> >
>> > *BUT* this will cause performance regressions on platforms with
>> > expensive DMA operations (as they no longer amortize the cost of
>> > mapping a larger page).
>
> order-0 is mainly about correctness under memory pressure.
> As Eric pointed out order-N is a serious issue for hostile traffic,
> but even for normal traffic it's a problem. Sooner or later
> only order-0 pages will be available.
> Performance considerations come second.
>
>> The trick is to use page reuse like we do for the Intel NICs.  If you
>> can get away with just reusing the page you don't have to keep making
>> the expensive map/unmap calls.
>
> you mean two packet per page trick?
> I think it's trading off performance vs memory.
> It's useful. I wish there was a knob to turn it on/off instead
> of relying on mtu size threshold.

The MTU size doesn't really play a role in the Intel drivers with
regard to page reuse anymore.  We pretty much just treat the
page as a pair of 2K buffers.  It does have some disadvantages in that
we cannot pack the frames as tightly in the case of jumbo frames with
GRO, but at the same time jumbo frames are just not that common.

>> > I've started coding on the page-pool last week, which address both the
>> > DMA mapping and recycling (with less atomic ops). (p.s. still on
>> > vacation this week).
>> >
>> > http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
>>
>> I really wonder if we couldn't get away with creating some sort of 2
>> tiered allocator for this.  So instead of allocating a page pool we
>> just reserved blocks of memory like we do with huge pages.  Then you
>> have essentially a huge page that is mapped to a given device for DMA
>> and reserved for it to use as a memory resource to allocate the order
>> 0 pages out of.  Doing it that way would likely have multiple
>> advantages when working with things like IOMMU since the pages would
>> all belong to one linear block so it would likely consume less
>> resources on those devices, and it wouldn't be that far off from how
>> DPDK is making use of huge pages in order to improve it's memory
>> access times and such.
>
> interesting idea. Like dma_map 1GB region and then allocate
> pages from it only? but the rest of the kernel won't be able
> to use them? so only some smaller region then? or it will be
> a boot time flag to reserve this pseudo-huge page?

Yeah, something like that.  If we were already talking about
allocating a pool of pages it might make sense to just set up something
like this where you could reserve a 1GB region for a single 10G device
for instance.  Then it would make the whole thing much easier to deal
with since you would have a block of memory that should perform very
well in terms of DMA accesses.

> I don't think any of that is needed for XDP. As demonstrated by current
> mlx4 it's very fast already. No bottlenecks in page allocators.
> Tiny page recycle array does the magic because most of the traffic
> is not going to the stack.

Agreed.  If you aren't handing the frames up we don't really even
have to bother.  In the Intel drivers for instance if the frame
size is less than 256 bytes we just copy the whole thing out since it
is cheaper to just extend the header copy rather than taking the extra
hit for get_page/put_page.

> This order-0 vs order-N discussion is for the main stack.
> Not related to XDP.

Agreed.

- Alex

^ permalink raw reply	[flat|nested] 59+ messages in thread
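
The small-frame copy mentioned above is commonly called copybreak. A
hedged sketch of the idea; the threshold and helper name are
illustrative, not the actual Intel driver code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

#define RX_COPYBREAK	256	/* illustrative threshold */

/* Copy small frames into a fresh skb so the rx page can be reused
 * immediately, avoiding a get_page/put_page pair per packet. */
static struct sk_buff *rx_copybreak(struct napi_struct *napi,
				    const void *data, unsigned int len)
{
	struct sk_buff *skb;

	if (len > RX_COPYBREAK)
		return NULL;	/* caller attaches the page as a frag */

	skb = napi_alloc_skb(napi, len);
	if (!skb)
		return NULL;

	memcpy(__skb_put(skb, len), data, len);
	return skb;
}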

* RE: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-05 15:15             ` Alexander Duyck
@ 2016-08-05 15:33                 ` David Laight
  0 siblings, 0 replies; 59+ messages in thread
From: David Laight @ 2016-08-05 15:33 UTC (permalink / raw)
  To: 'Alexander Duyck', Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Eric Dumazet, Brenden Blanco,
	David Miller, Netdev, Jamal Hadi Salim, Saeed Mahameed,
	Martin KaFai Lau, Ari Saha, Or Gerlitz, john fastabend,
	Hannes Frederic Sowa, Thomas Graf, Tom Herbert, Daniel Borkmann,
	Tariq Toukan, Mel Gorman, linux-mm

From: Alexander Duyck
> Sent: 05 August 2016 16:15
...
> >
> > interesting idea. Like dma_map 1GB region and then allocate
> > pages from it only? but the rest of the kernel won't be able
> > to use them? so only some smaller region then? or it will be
> > a boot time flag to reserve this pseudo-huge page?
> 
> Yeah, something like that.  If we were already talking about
> allocating a pool of pages it might make sense to just setup something
> like this where you could reserve a 1GB region for a single 10G device
> for instance.  Then it would make the whole thing much easier to deal
> with since you would have a block of memory that should perform very
> well in terms of DMA accesses.

ISTM that the main kernel allocator ought to be keeping a cache
of pages that are mapped into the various IOMMU.
This might be a per-driver cache, but could be much wider.

Then if some code wants such a page it can be allocated one that is
already mapped.
Under memory pressure the pages could then be reused for other purposes.

...
> In the Intel drivers for instance if the frame
> size is less than 256 bytes we just copy the whole thing out since it
> is cheaper to just extend the header copy rather than taking the extra
> hit for get_page/put_page.

How fast is 'rep movsb' (on cached addresses) on recent x86 cpu?
It might actually be worth unconditionally copying the entire frame
on those cpus.

A long time ago we found the breakeven point for the copy to be about
1kb on sparc mbus/sbus systems - and that might not have been aligning
the copy.

	David


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-05 15:33                 ` David Laight
@ 2016-08-05 16:00                   ` Alexander Duyck
  -1 siblings, 0 replies; 59+ messages in thread
From: Alexander Duyck @ 2016-08-05 16:00 UTC (permalink / raw)
  To: David Laight
  Cc: Alexei Starovoitov, Jesper Dangaard Brouer, Eric Dumazet,
	Brenden Blanco, David Miller, Netdev, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john fastabend, Hannes Frederic Sowa, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, Mel Gorman, linux-mm

On Fri, Aug 5, 2016 at 8:33 AM, David Laight <David.Laight@aculab.com> wrote:
> From: Alexander Duyck
>> Sent: 05 August 2016 16:15
> ...
>> >
>> > interesting idea. Like dma_map 1GB region and then allocate
>> > pages from it only? but the rest of the kernel won't be able
>> > to use them? so only some smaller region then? or it will be
>> > a boot time flag to reserve this pseudo-huge page?
>>
>> Yeah, something like that.  If we were already talking about
>> allocating a pool of pages it might make sense to just setup something
>> like this where you could reserve a 1GB region for a single 10G device
>> for instance.  Then it would make the whole thing much easier to deal
>> with since you would have a block of memory that should perform very
>> well in terms of DMA accesses.
>
> ISTM that the main kernel allocator ought to be keeping a cache
> of pages that are mapped into the various IOMMU.
> This might be a per-driver cache, but could be much wider.
>
> Then if some code wants such a page it can be allocated one that is
> already mapped.
> Under memory pressure the pages could then be reused for other purposes.
>
> ...
>> In the Intel drivers for instance if the frame
>> size is less than 256 bytes we just copy the whole thing out since it
>> is cheaper to just extend the header copy rather than taking the extra
>> hit for get_page/put_page.
>
> How fast is 'rep movsb' (on cached addresses) on recent x86 cpu?
> It might actually be worth unconditionally copying the entire frame
> on those cpus.

The cost for rep movsb on modern x86 is about 1 cycle for every 16
bytes plus some fixed amount of setup time.  The cost of an atomic
operation varies, but a get_page/put_page pair is usually in the tens
of cycles, which is why, as I recall, I ended up going with 256 as the
upper limit: it gave the best performance without starting to incur
any penalty.
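
For reference, here is a minimal sketch of that copybreak idea (the
threshold and helper names are illustrative only, not the actual Intel
driver code):

#include <linux/skbuff.h>
#include <linux/mm.h>

#define RX_COPYBREAK	256	/* assumed cutover, per the numbers above */

static struct sk_buff *rx_build_skb(struct napi_struct *napi,
				    struct page *page, unsigned int offset,
				    unsigned int len)
{
	struct sk_buff *skb;

	if (len <= RX_COPYBREAK) {
		/* Small frame: allocate a linear skb and copy the whole
		 * payload; the page never leaves the rx ring, so no
		 * get_page/put_page refcounting is needed. */
		skb = napi_alloc_skb(napi, len);
		if (skb)
			memcpy(skb_put(skb, len),
			       page_address(page) + offset, len);
	} else {
		/* Large frame: attach the page as a fragment and pay the
		 * get_page/put_page cost instead. */
		skb = napi_alloc_skb(napi, 0);
		if (skb) {
			get_page(page);
			skb_add_rx_frag(skb, 0, page, offset, len, len);
		}
	}
	return skb;
}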

> A long time ago we found the breakeven point for the copy to be about
> 1kb on sparc mbus/sbus systems - and that might not have been aligning
> the copy.

I wouldn't know about other architectures.

- Alex

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-05  7:15           ` Eric Dumazet
@ 2016-08-08  2:15             ` Alexei Starovoitov
  -1 siblings, 0 replies; 59+ messages in thread
From: Alexei Starovoitov @ 2016-08-08  2:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Brenden Blanco, davem, netdev,
	Jamal Hadi Salim, Saeed Mahameed, Martin KaFai Lau, Ari Saha,
	Or Gerlitz, john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, Mel Gorman, linux-mm

On Fri, Aug 05, 2016 at 09:15:33AM +0200, Eric Dumazet wrote:
> On Thu, 2016-08-04 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> 
> > I actually agree, that we should switch to order-0 allocations.
> > 
> > *BUT* this will cause performance regressions on platforms with
> > expensive DMA operations (as they no longer amortize the cost of
> > mapping a larger page).
> 
> 
> We much prefer reliable behavior, even if it is ~1% slower than the
> super-optimized thing that opens highways for attackers.

+1
It's more important to have deterministic performance at fresh boot
and after long uptime when high order-N are gone.

> Anyway, in most cases pages are re-used, so we only call
> dma_sync_single_range_for_cpu(), and there is no way to avoid this.
> 
> Using order-0 pages [1] is actually faster, since when we use high-order
> pages (multiple frames per 'page') we can not reuse the pages.
> 
> [1] I had a local patch to allocate these pages using a very simple
> allocator allocating max order (order-10) pages and splitting them into
> order-0 pages, in order to lower TLB footprint. But I could not measure a
> gain doing so on x86, at least on my lab machines.

Which driver was that?
I suspect that should indeed be the case for any driver that
uses build_skb and <256 copybreak.

Saeed,
could you please share the performance numbers for mlx5 order-0 vs order-N ?
You mentioned that there was some performance improvement. We need to know
how much we'll lose when we turn off order-N.
Thanks!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-08  2:15             ` Alexei Starovoitov
  (?)
@ 2016-08-08  8:01             ` Jesper Dangaard Brouer
  2016-08-08 18:34               ` Alexei Starovoitov
  -1 siblings, 1 reply; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-08-08  8:01 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Brenden Blanco, davem, netdev, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, Mel Gorman, linux-mm, brouer


On Sun, 7 Aug 2016 19:15:27 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Fri, Aug 05, 2016 at 09:15:33AM +0200, Eric Dumazet wrote:
> > On Thu, 2016-08-04 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> >   
> > > I actually agree, that we should switch to order-0 allocations.
> > > 
> > > *BUT* this will cause performance regressions on platforms with
> > > expensive DMA operations (as they no longer amortize the cost of
> > > mapping a larger page).  
> > 
> > 
> > We much prefer reliable behavior, even if it is ~1% slower than the
> > super-optimized thing that opens highways for attackers.
> 
> +1
> It's more important to have deterministic performance at fresh boot
> and after long uptime when high order-N are gone.

Yes, exactly. Doing high order-N page allocations might look good on
benchmarks on a freshly booted system, but once the page allocator gets
fragmented (after long uptime) the performance characteristics change.
(Discussed this with Christoph Lameter during MM-summit, and he has
seen issues with this kind of fragmentation in production.)


> > Anyway, in most cases pages are re-used, so we only call
> > dma_sync_single_range_for_cpu(), and there is no way to avoid this.
> > 
> > Using order-0 pages [1] is actually faster, since when we use high-order
> > pages (multiple frames per 'page') we can not reuse the pages.
> > 
> > [1] I had a local patch to allocate these pages using a very simple
> > allocator allocating max order (order-10) pages and splitting them into
> > > order-0 pages, in order to lower TLB footprint. But I could not measure a
> > gain doing so on x86, at least on my lab machines.  
> 
> Which driver was that?
> I suspect that should indeed be the case for any driver that
> uses build_skb and <256 copybreak.
> 
> Saeed,
> could you please share the performance numbers for mlx5 order-0 vs order-N ?
> You mentioned that there was some performance improvement. We need to know
> how much we'll lose when we turn off order-N.

I'm not sure the comparison will be "fair" with the mlx5 driver, because
(1) the order-N page mode (MPWQE) is a hardware feature, plus (2) the
order-0 page mode is done "wrongly" (by preallocating SKBs together
with RX ring entries).

AFAIK the MPWQE (Multi-Packet Work Queue Element), or Striding RQ, is a
hardware feature of the ConnectX4-Lx.  Thus the need to support two
modes in the mlx5 driver.

Commit[1] 461017cb006a ("net/mlx5e: Support RX multi-packet WQE
(Striding RQ)") states this gives a 10-15% performance improvement for
netperf TCP stream (and ability to absorb bursty traffic).

 [1] https://git.kernel.org/torvalds/c/461017cb006


The MPWQE mode uses order-5 pages.  The critical question is: what
happens to performance when order-5 allocations get slower (or
impossible) due to page fragmentation? (Notice that the page allocator
uses a central lock for order-N pages.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-08  8:01             ` Jesper Dangaard Brouer
@ 2016-08-08 18:34               ` Alexei Starovoitov
  2016-08-09 12:14                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 59+ messages in thread
From: Alexei Starovoitov @ 2016-08-08 18:34 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Brenden Blanco, davem, netdev, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, Mel Gorman, linux-mm

On Mon, Aug 08, 2016 at 10:01:15AM +0200, Jesper Dangaard Brouer wrote:
> 
> On Sun, 7 Aug 2016 19:15:27 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Fri, Aug 05, 2016 at 09:15:33AM +0200, Eric Dumazet wrote:
> > > On Thu, 2016-08-04 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> > >   
> > > > I actually agree, that we should switch to order-0 allocations.
> > > > 
> > > > *BUT* this will cause performance regressions on platforms with
> > > > expensive DMA operations (as they no longer amortize the cost of
> > > > mapping a larger page).  
> > > 
> > > 
> > > We much prefer reliable behavior, even if it is ~1% slower than the
> > > super-optimized thing that opens highways for attackers.
> > 
> > +1
> > It's more important to have deterministic performance at fresh boot
> > and after long uptime when high order-N are gone.
> 
> Yes, exactly. Doing high order-N page allocations might look good on
> benchmarks on a freshly booted system, but once the page allocator gets
> fragmented (after long uptime) the performance characteristics change.
> (Discussed this with Christoph Lameter during MM-summit, and he has
> seen issues with this kind of fragmentation in production.)
> 
> 
> > > Anyway, in most cases pages are re-used, so we only call
> > > dma_sync_single_range_for_cpu(), and there is no way to avoid this.
> > > 
> > > Using order-0 pages [1] is actually faster, since when we use high-order
> > > pages (multiple frames per 'page') we can not reuse the pages.
> > > 
> > > [1] I had a local patch to allocate these pages using a very simple
> > > allocator allocating max order (order-10) pages and splitting them into
> > > order-0 pages, in order to lower TLB footprint. But I could not measure a
> > > gain doing so on x86, at least on my lab machines.  
> > 
> > Which driver was that?
> > I suspect that should indeed be the case for any driver that
> > uses build_skb and <256 copybreak.
> > 
> > Saeed,
> > could you please share the performance numbers for mlx5 order-0 vs order-N ?
> > You mentioned that there was some performance improvement. We need to know
> > how much we'll lose when we turn off order-N.
> 
> I'm not sure the comparison will be "fair" with the mlx5 driver, because
> (1) the order-N page mode (MPWQE) is a hardware feature, plus (2) the
> order-0 page mode is done "wrongly" (by preallocating SKBs together
> with RX ring entries).
> 
> AFAIK the MPWQE (Multi-Packet Work Queue Element), or Striding RQ, is a
> hardware feature of the ConnectX4-Lx.  Thus the need to support two
> modes in the mlx5 driver.
> 
> Commit[1] 461017cb006a ("net/mlx5e: Support RX multi-packet WQE
> (Striding RQ)") states this gives a 10-15% performance improvement for
> netperf TCP stream (and ability to absorb bursty traffic).
> 
>  [1] https://git.kernel.org/torvalds/c/461017cb006

I suspect this 10% perf improvement is due to the build_skb approach
rather than MPWQE itself, which works fine with order-0 pages as well.
The request for perf numbers was for mlx5 order-0 vs order-N _with_
build_skb; in other words, using MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
with order-0 pages.
The old mlx5e_handle_rx_cqe path should also be converted to build_skb
even when striding RQ is not available in hw, it's a win.
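
Roughly the shape of that conversion (a sketch with made-up helper names,
not the actual mlx5 code): wrap the buffer that was already DMA'd into,
instead of allocating a new skb and copying the packet:

#include <linux/skbuff.h>

static struct sk_buff *rx_wrap_buffer(void *buf, unsigned int headroom,
				      unsigned int len, unsigned int truesize)
{
	/* build_skb() reuses the existing rx buffer as the skb head;
	 * truesize must account for the skb_shared_info tailroom. */
	struct sk_buff *skb = build_skb(buf, truesize);

	if (!skb)
		return NULL;
	skb_reserve(skb, headroom);	/* skip the reserved headroom */
	skb_put(skb, len);		/* mark the received bytes */
	return skb;
}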

> The MPWQE mode uses order-5 pages.  The critical question is: what
> happens to performance when order-5 allocations get slower (or
> impossible) due to page fragmentation? (Notice that the page allocator
> uses a central lock for order-N pages.)

It is supposed to fall back to order-0; see mlx5e_alloc_rx_fragmented_mpwqe.
That scares me a lot, since I don't see how such logic could have
been stress tested, and we'll be hitting it in production.
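
The general shape of that fallback, just to illustrate what is being
relied on (a sketch only, not the actual mlx5e code):

#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *rx_alloc_order_n(unsigned int order)
{
	struct page *page;

	for (; order; order--) {
		/* Opportunistic high-order attempt; don't warn or dip into
		 * reserves when fragmentation makes it fail. */
		page = alloc_pages(GFP_ATOMIC | __GFP_NOWARN |
				   __GFP_NOMEMALLOC, order);
		if (page)
			return page;
	}
	/* Fragmented (e.g. after long uptime): fall back to order-0. */
	return alloc_page(GFP_ATOMIC);
}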

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: order-0 vs order-N driver allocation. Was: [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support
  2016-08-08 18:34               ` Alexei Starovoitov
@ 2016-08-09 12:14                 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 59+ messages in thread
From: Jesper Dangaard Brouer @ 2016-08-09 12:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Rana Shahout
  Cc: Eric Dumazet, Brenden Blanco, davem, netdev, Jamal Hadi Salim,
	Saeed Mahameed, Martin KaFai Lau, Ari Saha, Or Gerlitz,
	john.fastabend, hannes, Thomas Graf, Tom Herbert,
	Daniel Borkmann, Tariq Toukan, Mel Gorman, linux-mm, brouer


> > On Sun, 7 Aug 2016 19:15:27 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
[...]
> > > could you please share the performance numbers for mlx5 order-0 vs order-N ?
> > > You mentioned that there was some performance improvement. We need to know
> > > how much we'll lose when we turn off order-N.  

There is a really easy way (after XDP) to benchmark this order-0 vs
order-N question for the mlx4 driver.

I simply load an XDP program that returns XDP_PASS, because loading XDP
will reallocate the RX rings to use a single frame per packet and
order-0 pages (for the RX ring slots).
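
The XDP program for such a test is essentially just this (a stripped-down
sketch without the counter map, following the samples/bpf conventions)::

 /* xdp_pass_kern.c - minimal XDP program: pass every packet to the stack */
 #include <uapi/linux/bpf.h>
 #include "bpf_helpers.h"

 SEC("xdp_pass")
 int xdp_pass_prog(struct xdp_md *ctx)
 {
 	return XDP_PASS;
 }

 char _license[] SEC("license") = "GPL";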

Result summary: (order-3 pages) 4,453,022 -> (XDP_PASS) 3,295,798 pps
 * 3295798 - 4453022 = -1157224 pps slower
 * (3295798/4453022-1)*100 = -25.98% slower
 * (1/4453022-1/3295798)*10^9 = -78.85 nanosec slower
 * Approx. conversion from nanosec to cycles (78.85 ns * 4 GHz) = 315 cycles slower

Where does this performance regression originate from? Well, this
basically only changed the page allocation strategy and the number of
DMA calls in the driver.  Thus, let's look at the performance of the
page allocator (see the Page_bench_ tool and MM_slides_ page 9).

On this machine:
 * Cost of order-0: 237 cycles(tsc)  59.336 ns
 * Cost of order-3: 423 cycles(tsc) 106.029 ns

The order-3 cost is amortized, as an order-3 page (8 * 4096 = 32768
bytes) can store 21 frames of size 1536, giving a per-page-fragment
cost of 20 cycles / 5.049 ns.  Thus, I would expect to see a
(59.336 - 5.049) 54.287 ns performance reduction, not 78.85 ns, which
is 24.563 ns higher than expected (extra DMA maps cannot explain this
on an Intel platform).

There is a higher percentage of L3/LLC-load-misses, which is strange,
as I thought the simple XDP program (inc map cnt and return XDP_PASS)
should not touch the packet data.  A quick experiment with an xdp-prog
that touches the data like xdp1, but always returns XDP_PASS, shows
3,209,235 pps, which is only 8 ns slower ((1/3209235-1/3295798)*10^9 =
8.184 ns).  Thus, the extra 24 ns (or 16 ns) might originate from an
earlier cache-miss.

Conclusion: These measurements confirm that we need a page recycle
facility for the drivers before switching to order-0 allocations.
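
For reference, a per-ring page recycle cache of this kind looks roughly
like the following (a sketch with made-up names, not the mlx4 patch)::

 #include <linux/mm.h>
 #include <linux/gfp.h>

 #define RX_CACHE_SIZE 256

 struct rx_page_cache {
 	int index;
 	struct page *pages[RX_CACHE_SIZE];	/* ring-local stash of rx pages */
 };

 /* Refill path: prefer a page the stack has already released. */
 static struct page *rx_cache_get(struct rx_page_cache *cache)
 {
 	while (cache->index > 0) {
 		struct page *page = cache->pages[--cache->index];

 		if (page_count(page) == 1)
 			return page;	/* sole owner again: recycle it */
 		put_page(page);		/* still referenced elsewhere: drop it */
 	}
 	return alloc_page(GFP_ATOMIC);	/* cache empty: normal order-0 alloc */
 }

 /* Completion path: stash our page reference instead of freeing it. */
 static void rx_cache_put(struct rx_page_cache *cache, struct page *page)
 {
 	if (cache->index < RX_CACHE_SIZE)
 		cache->pages[cache->index++] = page;
 	else
 		put_page(page);		/* cache full */
 }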


Links:

.. _Page_bench: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench01.c

.. _MM_slides: http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.odp



Benchmarking notes and perf results below:

Base setup:
 * Drop packets in iptables RAW
 * Disable Ethernet flow control
 * Disable GRO (changes driver code path)
 * Mlx4 NIC CX3-pro (mlx4_core log_num_mgm_entry_size=-2)
 * CPU: i7-4790K CPU @ 4.00GHz (turbostat report 4.3GHz)

Baseline: 4.7.0-baseline+ #102 SMP PREEMPT
 * instant rx:4558943 tx:0 pps n:162 average: rx:4453022 tx:0 pps
   (instant variation TX 0.000 ns (min:0.000 max:0.000) RX 5.217 ns)

Baseline perf stat::

 $ sudo perf stat -C3 -e L1-icache-load-misses -e cycles:k -e  instructions:k -e cache-misses:k -e   cache-references:k  -e LLC-store-misses:k -e LLC-store -e LLC-load-misses:k -e  LLC-load -r 5 sleep 1

Performance counter stats for 'CPU(s) 3' (5 runs) ::

       271,417  L1-icache-load-misses  ( +-  0.69% )  (33.32%)
 4,383,371,009  cycles:k               ( +-  0.01% )  (44.51%)
 7,587,502,193  instructions:k #  1.50  insns per cycle     (+- 0.01% )(55.62%)
     5,856,640  cache-misses:k # 48.435 % of all cache refs (+- 0.01% )(66.72%)
    12,091,854  cache-references:k                         ( +-  0.04%)(66.72%)
       451,681  LLC-store-misses                           ( +-  0.13%)(66.72%)
       463,152  LLC-store                                  ( +-  0.12%)(66.68%)
     5,408,934  LLC-load-misses # 47.26% of all LL-cache hits (0.01%) (22.19%)
    11,446,060  LLC-load                                 ( +-  0.04%) (22.19%)

 Samples: 40K of event 'cycles', Event count (approx.): 43956150960 ::
  Overhead  Command        Shared Object        Symbol
 +   36.59%  ksoftirqd/3    [kernel.vmlinux]     [k] memcpy_erms
 +    6.76%  ksoftirqd/3    [mlx4_en]            [k] mlx4_en_process_rx_cq
 +    6.66%  ksoftirqd/3    [ip_tables]          [k] ipt_do_table
 +    6.03%  ksoftirqd/3    [kernel.vmlinux]     [k] __build_skb
 +    4.65%  ksoftirqd/3    [kernel.vmlinux]     [k] ip_rcv
 +    4.22%  ksoftirqd/3    [mlx4_en]            [k] mlx4_en_prepare_rx_desc
 +    3.46%  ksoftirqd/3    [mlx4_en]            [k] mlx4_en_free_frag
 +    3.37%  ksoftirqd/3    [kernel.vmlinux]     [k] __netif_receive_skb_core
 +    3.04%  ksoftirqd/3    [kernel.vmlinux]     [k] __netdev_alloc_skb
 +    2.80%  ksoftirqd/3    [kernel.vmlinux]     [k] kmem_cache_alloc
 +    2.38%  ksoftirqd/3    [kernel.vmlinux]     [k] __free_page_frag
 +    1.88%  ksoftirqd/3    [kernel.vmlinux]     [k] kmem_cache_free
 +    1.65%  ksoftirqd/3    [kernel.vmlinux]     [k] nf_iterate
 +    1.59%  ksoftirqd/3    [kernel.vmlinux]     [k] nf_hook_slow
 +    1.31%  ksoftirqd/3    [kernel.vmlinux]     [k] __rcu_read_unlock
 +    0.91%  ksoftirqd/3    [kernel.vmlinux]     [k] __alloc_page_frag
 +    0.88%  ksoftirqd/3    [kernel.vmlinux]     [k] eth_type_trans
 +    0.77%  ksoftirqd/3    [kernel.vmlinux]     [k] dev_gro_receive
 +    0.76%  ksoftirqd/3    [kernel.vmlinux]     [k] skb_release_data
 +    0.76%  ksoftirqd/3    [kernel.vmlinux]     [k] __local_bh_enable_ip
 +    0.72%  ksoftirqd/3    [kernel.vmlinux]     [k] netif_receive_skb_internal
 +    0.66%  ksoftirqd/3    [kernel.vmlinux]     [k] napi_gro_receive
 +    0.66%  ksoftirqd/3    [kernel.vmlinux]     [k] __rcu_read_lock
 +    0.65%  ksoftirqd/3    [kernel.vmlinux]     [k] skb_release_head_state
 +    0.57%  ksoftirqd/3    [kernel.vmlinux]     [k] get_page_from_freelist
 +    0.57%  ksoftirqd/3    [kernel.vmlinux]     [k] __free_pages_ok
 +    0.51%  ksoftirqd/3    [kernel.vmlinux]     [k] kfree_skb
 +    0.43%  ksoftirqd/3    [kernel.vmlinux]     [k] skb_release_all

Result-xdp-pass: loading XDP_PASS program
 * instant rx:3374269 tx:0 pps n:537 average: rx:3295798 tx:0 pps
   (instant variation TX 0.000 ns (min:0.000 max:0.000) RX 7.056 ns)

Difference: 4,453,022 -> 3,295,798 pps
 * 3295798 - 4453022 = -1157224 pps slower
 * (3295798/4453022-1)*100 = -25.98% slower
 * (1/4453022-1/3295798)*10^9 = -78.85 nanosec slower

Perf stats xdp-pass::

  Performance counter stats for 'CPU(s) 3' (5 runs):

       294,219 L1-icache-load-misses  (+-0.25% )  (33.33%)
 4,382,764,897 cycles:k               (+-0.00% )  (44.51%)
 7,223,252,624 instructions:k #  1.65  insns per cycle     (+-0.00%)(55.62%)
     7,166,907 cache-misses:k # 58.792 % of all cache refs (+-0.01%)(66.72%)
    12,190,275 cache-references:k        (+-0.03% )  (66.72%)
       525,262 LLC-store-misses          (+-0.11% )  (66.72%)
       587,354 LLC-store                 (+-0.09% )  (66.68%)
     6,647,957 LLC-load-misses # 58.23% of all LL-cache hits (+-0.02%)(22.19%)
    11,417,001 LLC-load                                      (+-0.03%)(22.19%)

There is a higher percentage of L3/LLC-load-misses, which is strange,
as I thought the simple XDP (return XDP_PASS and inc map cnt) program
would not touch the data.

Perf report xdp-pass::

 Samples: 40K of event 'cycles', Event count (approx.): 43953682891
   Overhead  Command        Shared Object     Symbol
 +   25.79%  ksoftirqd/3    [kernel.vmlinux]  [k] memcpy_erms
 +    7.29%  ksoftirqd/3    [mlx4_en]         [k] mlx4_en_process_rx_cq
 +    5.42%  ksoftirqd/3    [mlx4_en]         [k] mlx4_en_free_frag
 +    5.16%  ksoftirqd/3    [kernel.vmlinux]  [k] get_page_from_freelist
 +    4.55%  ksoftirqd/3    [ip_tables]       [k] ipt_do_table
 +    4.46%  ksoftirqd/3    [mlx4_en]         [k] mlx4_alloc_pages.isra.19
 +    3.97%  ksoftirqd/3    [kernel.vmlinux]  [k] __build_skb
 +    3.67%  ksoftirqd/3    [kernel.vmlinux]  [k] free_hot_cold_page
 +    3.46%  ksoftirqd/3    [kernel.vmlinux]  [k] ip_rcv
 +    2.71%  ksoftirqd/3    [kernel.vmlinux]  [k] __alloc_pages_nodemask
 +    2.62%  ksoftirqd/3    [kernel.vmlinux]  [k] __netif_receive_skb_core
 +    2.46%  ksoftirqd/3    [kernel.vmlinux]  [k] kmem_cache_alloc
 +    2.24%  ksoftirqd/3    [kernel.vmlinux]  [k] __netdev_alloc_skb
 +    2.15%  ksoftirqd/3    [mlx4_en]         [k] mlx4_en_prepare_rx_desc
 +    1.88%  ksoftirqd/3    [kernel.vmlinux]  [k] __free_page_frag
 +    1.55%  ksoftirqd/3    [kernel.vmlinux]  [k] kmem_cache_free
 +    1.42%  ksoftirqd/3    [kernel.vmlinux]  [k] __rcu_read_unlock
 +    1.27%  ksoftirqd/3    [kernel.vmlinux]  [k] nf_iterate
 +    1.14%  ksoftirqd/3    [kernel.vmlinux]  [k] nf_hook_slow
 +    1.05%  ksoftirqd/3    [kernel.vmlinux]  [k] alloc_pages_current
 +    0.83%  ksoftirqd/3    [kernel.vmlinux]  [k] __inc_zone_state
 +    0.73%  ksoftirqd/3    [kernel.vmlinux]  [k] __list_del_entry
 +    0.69%  ksoftirqd/3    [kernel.vmlinux]  [k] __list_add
 +    0.64%  ksoftirqd/3    [kernel.vmlinux]  [k] __local_bh_enable_ip
 +    0.64%  ksoftirqd/3    [kernel.vmlinux]  [k] __rcu_read_lock
 +    0.62%  ksoftirqd/3    [kernel.vmlinux]  [k] dev_gro_receive
 +    0.62%  ksoftirqd/3    [kernel.vmlinux]  [k] swiotlb_map_page
 +    0.61%  ksoftirqd/3    [kernel.vmlinux]  [k] skb_release_data
 +    0.60%  ksoftirqd/3    [kernel.vmlinux]  [k] __alloc_page_frag
 +    0.58%  ksoftirqd/3    [kernel.vmlinux]  [k] eth_type_trans
 +    0.57%  ksoftirqd/3    [kernel.vmlinux]  [k] policy_zonelist
 +    0.51%  ksoftirqd/3    [pps_core]        [k] 0x000000000000692d
 +    0.51%  ksoftirqd/3    [kernel.vmlinux]  [k] netif_receive_skb_internal
 +    0.50%  ksoftirqd/3    [kernel.vmlinux]  [k] napi_gro_receive
 +    0.49%  ksoftirqd/3    [kernel.vmlinux]  [k] __put_page
 +    0.49%  ksoftirqd/3    [kernel.vmlinux]  [k] skb_release_head_state
 +    0.42%  ksoftirqd/3    [kernel.vmlinux]  [k] kfree_skb
 +    0.34%  ksoftirqd/3    [pps_core]        [k] 0x0000000000006935
 +    0.33%  ksoftirqd/3    [kernel.vmlinux]  [k] skb_free_head
 +    0.32%  ksoftirqd/3    [kernel.vmlinux]  [k] __netif_receive_skb
 +    0.31%  ksoftirqd/3    [kernel.vmlinux]  [k] swiotlb_sync_single
 +    0.31%  ksoftirqd/3    [kernel.vmlinux]  [k] skb_gro_reset_offset
 +    0.29%  ksoftirqd/3    [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
 +    0.29%  ksoftirqd/3    [kernel.vmlinux]  [k] list_del
 +    0.27%  ksoftirqd/3    [iptable_raw]     [k] iptable_raw_hook
 +    0.27%  ksoftirqd/3    [kernel.vmlinux]  [k] skb_release_all
 +    0.26%  ksoftirqd/3    [kernel.vmlinux]  [k] kfree_skbmem
 +    0.25%  ksoftirqd/3    [kernel.vmlinux]  [k] swiotlb_unmap_page
 +    0.23%  ksoftirqd/3    [kernel.vmlinux]  [k] bpf_map_lookup_elem
 +    0.22%  ksoftirqd/3    [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
 +    0.20%  ksoftirqd/3    [kernel.vmlinux]  [k] __page_cache_release

In the perf diff, notice the increases for:
 * get_page_from_freelist(0.57%) +4.59%,
 * mlx4_en_free_frag     (3.46%) +1.96%,
 * mlx4_alloc_pages      (0.26%) +4.20%
 * __alloc_pages_nodemask(0.14%) +2.57%
 * swiotlb_map_page      (0.04%) +0.57%

Perf diff::

 # Baseline    Delta  Shared Object        Symbol
 # ........  .......  ...................  ................................
 #
    36.59%  -10.80%  [kernel.vmlinux]     [k] memcpy_erms
     6.76%   +0.53%  [mlx4_en]            [k] mlx4_en_process_rx_cq
     6.66%   -2.11%  [ip_tables]          [k] ipt_do_table
     6.03%   -2.06%  [kernel.vmlinux]     [k] __build_skb
     4.65%   -1.18%  [kernel.vmlinux]     [k] ip_rcv
     4.22%   -2.06%  [mlx4_en]            [k] mlx4_en_prepare_rx_desc
     3.46%   +1.96%  [mlx4_en]            [k] mlx4_en_free_frag
     3.37%   -0.75%  [kernel.vmlinux]     [k] __netif_receive_skb_core
     3.04%   -0.80%  [kernel.vmlinux]     [k] __netdev_alloc_skb
     2.80%   -0.34%  [kernel.vmlinux]     [k] kmem_cache_alloc
     2.38%   -0.50%  [kernel.vmlinux]     [k] __free_page_frag
     1.88%   -0.34%  [kernel.vmlinux]     [k] kmem_cache_free
     1.65%   -0.38%  [kernel.vmlinux]     [k] nf_iterate
     1.59%   -0.45%  [kernel.vmlinux]     [k] nf_hook_slow
     1.31%   +0.11%  [kernel.vmlinux]     [k] __rcu_read_unlock
     0.91%   -0.31%  [kernel.vmlinux]     [k] __alloc_page_frag
     0.88%   -0.30%  [kernel.vmlinux]     [k] eth_type_trans
     0.77%   -0.15%  [kernel.vmlinux]     [k] dev_gro_receive
     0.76%   -0.15%  [kernel.vmlinux]     [k] skb_release_data
     0.76%   -0.12%  [kernel.vmlinux]     [k] __local_bh_enable_ip
     0.72%   -0.21%  [kernel.vmlinux]     [k] netif_receive_skb_internal
     0.66%   -0.16%  [kernel.vmlinux]     [k] napi_gro_receive
     0.66%   -0.02%  [kernel.vmlinux]     [k] __rcu_read_lock
     0.65%   -0.17%  [kernel.vmlinux]     [k] skb_release_head_state
     0.57%   +4.59%  [kernel.vmlinux]     [k] get_page_from_freelist
     0.57%           [kernel.vmlinux]     [k] __free_pages_ok
     0.51%   -0.09%  [kernel.vmlinux]     [k] kfree_skb
     0.43%   -0.15%  [kernel.vmlinux]     [k] skb_release_all
     0.42%   -0.11%  [kernel.vmlinux]     [k] skb_gro_reset_offset
     0.41%   -0.08%  [kernel.vmlinux]     [k] skb_free_head
     0.39%   -0.07%  [kernel.vmlinux]     [k] __netif_receive_skb
     0.36%   -0.08%  [iptable_raw]        [k] iptable_raw_hook
     0.34%   -0.08%  [kernel.vmlinux]     [k] kfree_skbmem
     0.28%   +0.01%  [kernel.vmlinux]     [k] swiotlb_sync_single_for_cpu
     0.26%   +4.20%  [mlx4_en]            [k] mlx4_alloc_pages.isra.19
     0.20%   +0.11%  [kernel.vmlinux]     [k] swiotlb_sync_single
     0.15%   -0.03%  [kernel.vmlinux]     [k] __do_softirq
     0.14%   +2.57%  [kernel.vmlinux]     [k] __alloc_pages_nodemask
     0.14%           [kernel.vmlinux]     [k] free_one_page
     0.13%   -0.13%  [kernel.vmlinux]     [k] _raw_spin_lock_irqsave
     0.13%   -0.12%  [kernel.vmlinux]     [k] _raw_spin_lock
     0.10%           [kernel.vmlinux]     [k] __mod_zone_page_state
     0.09%   +0.06%  [kernel.vmlinux]     [k] net_rx_action
     0.09%           [kernel.vmlinux]     [k] __rmqueue
     0.07%           [kernel.vmlinux]     [k] __zone_watermark_ok
     0.07%           [kernel.vmlinux]     [k] PageHuge
     0.06%   +0.77%  [kernel.vmlinux]     [k] __inc_zone_state
     0.06%   +0.98%  [kernel.vmlinux]     [k] alloc_pages_current
     0.06%   +0.51%  [kernel.vmlinux]     [k] policy_zonelist
     0.06%   +0.01%  [kernel.vmlinux]     [k] delay_tsc
     0.05%   -0.00%  [mlx4_en]            [k] mlx4_en_poll_rx_cq
     0.05%   +0.01%  [kernel.vmlinux]     [k] __memcpy
     0.04%   +0.57%  [kernel.vmlinux]     [k] swiotlb_map_page
     0.04%   +0.69%  [kernel.vmlinux]     [k] __list_del_entry
     0.04%           [kernel.vmlinux]     [k] free_compound_page
     0.04%           [kernel.vmlinux]     [k] __put_compound_page
     0.03%   +0.66%  [kernel.vmlinux]     [k] __list_add



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2016-08-09 12:14 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-19 19:16 [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding Brenden Blanco
2016-07-19 19:16 ` [PATCH v10 01/12] bpf: add bpf_prog_add api for bulk prog refcnt Brenden Blanco
2016-07-19 21:46   ` Alexei Starovoitov
2016-07-19 19:16 ` [PATCH v10 02/12] bpf: add XDP prog type for early driver filter Brenden Blanco
2016-07-19 21:33   ` Alexei Starovoitov
2016-07-19 19:16 ` [PATCH v10 03/12] net: add ndo to setup/query xdp prog in adapter rx Brenden Blanco
2016-07-19 19:16 ` [PATCH v10 04/12] rtnl: add option for setting link xdp prog Brenden Blanco
2016-07-20  8:38   ` Daniel Borkmann
2016-07-20 17:35     ` Brenden Blanco
2016-07-19 19:16 ` [PATCH v10 05/12] net/mlx4_en: add support for fast rx drop bpf program Brenden Blanco
2016-07-19 21:41   ` Alexei Starovoitov
2016-07-20  9:07   ` Daniel Borkmann
2016-07-20 17:33     ` Brenden Blanco
2016-07-24 11:56   ` Jesper Dangaard Brouer
2016-07-24 16:57   ` Tom Herbert
2016-07-24 20:34     ` Daniel Borkmann
2016-07-19 19:16 ` [PATCH v10 06/12] Add sample for adding simple drop program to link Brenden Blanco
2016-07-19 21:44   ` Alexei Starovoitov
2016-07-19 19:16 ` [PATCH v10 07/12] net/mlx4_en: add page recycle to prepare rx ring for tx support Brenden Blanco
2016-07-19 21:49   ` Alexei Starovoitov
2016-07-25  7:35   ` Eric Dumazet
2016-08-03 17:45     ` order-0 vs order-N driver allocation. Was: " Alexei Starovoitov
2016-08-04 16:19       ` Jesper Dangaard Brouer
2016-08-05  0:30         ` Alexander Duyck
2016-08-05  3:55           ` Alexei Starovoitov
2016-08-05 15:15             ` Alexander Duyck
2016-08-05 15:33               ` David Laight
2016-08-05 16:00                 ` Alexander Duyck
2016-08-05  7:15         ` Eric Dumazet
2016-08-08  2:15           ` Alexei Starovoitov
2016-08-08  8:01             ` Jesper Dangaard Brouer
2016-08-08 18:34               ` Alexei Starovoitov
2016-08-09 12:14                 ` Jesper Dangaard Brouer
2016-07-19 19:16 ` [PATCH v10 08/12] bpf: add XDP_TX xdp_action for direct forwarding Brenden Blanco
2016-07-19 21:53   ` Alexei Starovoitov
2016-07-19 19:16 ` [PATCH v10 09/12] net/mlx4_en: break out tx_desc write into separate function Brenden Blanco
2016-07-19 19:16 ` [PATCH v10 10/12] net/mlx4_en: add xdp forwarding and data write support Brenden Blanco
2016-07-19 19:16 ` [PATCH v10 11/12] bpf: enable direct packet data write for xdp progs Brenden Blanco
2016-07-19 21:59   ` Alexei Starovoitov
2016-07-19 19:16 ` [PATCH v10 12/12] bpf: add sample for xdp forwarding and rewrite Brenden Blanco
2016-07-19 22:05   ` Alexei Starovoitov
2016-07-20 17:38     ` Brenden Blanco
2016-07-27 18:25     ` Jesper Dangaard Brouer
2016-08-03 17:01   ` Tom Herbert
2016-08-03 17:11     ` Alexei Starovoitov
2016-08-03 17:29       ` Tom Herbert
2016-08-03 18:29         ` David Miller
2016-08-03 18:29         ` Brenden Blanco
2016-08-03 18:31           ` David Miller
2016-08-03 19:06           ` Tom Herbert
2016-08-03 22:36             ` Alexei Starovoitov
2016-08-03 23:18               ` Daniel Borkmann
2016-07-20  5:09 ` [PATCH v10 00/12] Add driver bpf hook for early packet drop and forwarding David Miller
     [not found]   ` <6a09ce5d-f902-a576-e44e-8e1e111ae26b@gmail.com>
2016-07-20 14:08     ` Brenden Blanco
2016-07-20 19:14     ` David Miller
