* [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Hello,

Although it is far from complete, this series provides the means and
infrastructure to let device drivers share packet meta data and
accelerations with XDP programs, such as hash, stripped vlan, checksum,
flow mark, packet header types, etc.

The series is still a work in progress; I am sending it now as an RFC in
order to get early feedback and to discuss the design, structures and UAPI.

For now the general idea is to help XDP programs accelerate, by providing
them with meta data that is already available from the HW or the device
driver, to save cpu cycles on packet header and data processing and
decision making. Aside from that, we want to avoid having a fixed size
meta data structure that wastes a lot of buffer space on stuff that the
xdp program might not need, like the current "elephant" SKB fields,
kidding :) ..

So my idea here is to provide a dynamic mechanism that allows XDP programs,
at xdp load time, to request only the specific meta data they need; the
device driver or netdevice will then provide it in a predefined order in
the xdp_buff/xdp_md data meta section.

And here is how it is done and what I would like to discuss:

1. The first patch adds the infrastructure to request predefined meta data
flags on xdp load, indicating that the XDP program is going to need them.

1.1) In this patch I am using the current u32 IFLA_XDP_FLAGS field.
TODO: this needs to be improved in order to allow more meta data flags;
maybe use a new dedicated flags field?

1.2) Device drivers that want to support xdp meta data should implement the
new XDP command XDP_QUERY_META_FLAGS and report the meta data flags they
can support.

1.3) The kernel will cross-check the requested flags against the device's
supported flags and will fail the xdp load in case of a discrepancy.

Question: do we want this? Or is it better to return the actual supported
flags to the program and let it decide?


2. This is the interesting part: the 2nd patch adds the meta data set/get
infrastructure that lets device drivers populate meta data into the
xdp_buff data meta area in a well defined, structured format, and lets the
xdp program read it according to that predefined format/structure. The idea
here is that the xdp program and the device driver share a static offsets
array that defines where each meta data item sits inside the xdp_buff data
meta area; this layout is decided once, on xdp load.

Enter struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and are assigned
only for the requested meta data.
Example: for an XDP program that requested the following flags:
flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

the offsets array will be:

xdp_md_info_arr mdi = {
        [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
        [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
        [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
        [XDP_DATA_META_CSUM_COMPLETE] = {.offset = 0, .present = 0},
}

For this example, the hash fields will always appear first, followed by the
vlan, in every xdp_md.

Once requested to provide xdp meta data, the device driver will use a
pre-built xdp_md_info_arr, built via xdp_md_info_build on xdp setup, which
tells the driver the offset of each meta data item.
The user space XDP program will use a similar xdp_md_info_arr to
statically know the offset of each meta data item.
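
To illustrate (this snippet is not part of the series), a minimal sketch of
what the XDP program side could look like for the hash + vlan example above,
using raw offsets instead of the patch 2 helpers:

/* Sketch only: assumes the program was attached with
 * XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN, so the hash sits at
 * offset 0 and the vlan right after it (see the mdi example above).
 * SEC() comes from samples/bpf bpf_helpers.h.
 */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp_md_hash_example")
int xdp_md_hash_example_prog(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data      = (void *)(long)ctx->data;
	struct xdp_md_hash *hash = data_meta;
	struct xdp_md_vlan *vlan = data_meta + sizeof(*hash);

	/* usual data_meta boundary check; vlan is the last meta data item */
	if ((void *)(vlan + 1) > data)
		return XDP_ABORTED;

	/* e.g. steer on the RX hash without touching packet data */
	return (hash->hash & 1) ? XDP_PASS : XDP_DROP;
}

Since the layout is known at compile time here, the offsets collapse to
constants; this is effectively what the work-around in the last sample
patch does.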

* Future meta data will be added to the end of the array with higher flag
values.

This patch also provides helper functions for device drivers to store meta
data into the xdp_buff, and helper functions for XDP programs to fetch
specific xdp meta data from the xdp_md buffer.

Question: currently the XDP program is responsible for building the static
meta data offsets array "xdp_md_info_arr", and the kernel builds it for the
netdevice. If we are going to choose this direction, should we somehow share
the same xdp_md_info_arr built by the kernel with the xdp program?

3. The last mlx5e patch is the actual showcase of how a device driver adds
support for xdp meta data; it is pretty simple and straightforward! The
first two mlx5e patches are just small refactorings that make the
xdp_md_info_arr and packet completion information available in the rx xdp
handlers data path.

4. Just added a small example to demonstrate how an XDP program can request
meta data on xdp load and how it can read it on the critical path.
Of course, more examples and some performance numbers are needed.
Exciting use cases such as:
  - using flow mark to allow fast white/black listing lookups.
  - using flow mark to accelerate DDoS prevention.
  - Generic data path: use the meta data from the xdp_buff to build SKBs! (Jesper's idea)
  - using the hash type to know the packet headers and encapsulation without
    touching the packet data at all.
  - using the packet hash to do RPS- and XPS-like cpu redirecting.
  - and many more.


More ideas:

From Jesper:
=========================
Give XDP richer information about the hash.
By reducing the 'ht' value to the PKT_HASH_TYPE's we are losing information.

I happen to know your hardware well; CQE_RSS_HTYPE_IP tells us if this
is IPv4 or IPv6.  And CQE_RSS_HTYPE_L4 tells us if this is TCP, UDP or
IPSEC. (And you have another bit telling if this is IPv6 with extension
headers).

If we don't want to invent our own xdp_hash_types, we can simply adopt
the RSS Hashing Types defined by Microsoft:
 https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types

An interesting part of the RSS standard, is that the hash type can help
identify if this is a fragment. (XDP could use this info to avoid
touching payload and react, e.g. drop fragments, or redirect all
fragments to another CPU, or skip parsing in XDP and defer to network
stack via XDP_PASS).

By using the RSS standard, we do e.g. lose the ability to say this is
IPSEC traffic, even though your HW supports this.

I have tried to implement my own (non-dynamic) XDP RX-types UAPI here:
 https://marc.info/?l=linux-netdev&m=149512213531769
 https://marc.info/?l=linux-netdev&m=149512213631774
=========================

Thanks,
Saeed.

Saeed Mahameed (6):
  net: xdp: Add support for meta data flags requests
  net: xdp: RX meta data infrastructure
  net/mlx5e: Store xdp flags and meta data info
  net/mlx5e: Pass CQE to RX handlers
  net/mlx5e: Add XDP RX meta data support
  samples/bpf: Add meta data hash example to xdp_redirect_cpu

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
 include/linux/netdevice.h                     |  10 ++
 include/net/xdp.h                             |   6 +
 include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
 include/uapi/linux/if_link.h                  |  20 ++-
 net/core/dev.c                                |  41 +++++++
 samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
 samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
 10 files changed, 354 insertions(+), 35 deletions(-)

-- 
2.17.0


* [RFC bpf-next 1/6] net: xdp: Add support for meta data flags requests
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

A user space application can request to enable a specific set of meta data
to be reported in every xdp buffer provided to the xdp program.

When meta data flags are requested, XDP devices must respond to the
XDP_QUERY_META_FLAGS command with all the meta data flags the device
actually supports, and the kernel will cross-check them against the
requested flags; in case of a discrepancy the xdp install will fail.

If the flags are supported, the device must guarantee to deliver all RX
packets with only the meta data requested in the meta data flags of the
xdp install operation.

The following flags are added, and can be provided via the netlink xdp
flags field.

+#define XDP_FLAGS_META_HASH            (1U << 16)
+#define XDP_FLAGS_META_MARK            (1U << 17)
+#define XDP_FLAGS_META_VLAN            (1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE   (1U << 19)

The format, device delivery methods and XDP program access to such meta
data is discussed in a later patch.
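
For reference, a rough sketch of the user space side, assuming libbpf's
existing bpf_set_link_xdp_fd() (the header location varies by tree):

#include <linux/if_link.h>	/* XDP_FLAGS_* */
#include "bpf/libbpf.h"		/* bpf_set_link_xdp_fd(); path may differ */

/* Attach prog_fd to ifindex and request RX hash + vlan meta data.
 * If the device cannot provide them, the attach is expected to fail.
 */
static int attach_with_meta(int ifindex, int prog_fd)
{
	__u32 xdp_flags = XDP_FLAGS_DRV_MODE |
			  XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

	return bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
}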

TODO: use a different flags field for XDP meta data, to make sure we
have more free bits.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |  9 +++++++++
 include/uapi/linux/if_link.h | 16 +++++++++++++++-
 net/core/dev.c               | 20 ++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..fc8b6ce48a0f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -812,6 +812,10 @@ enum bpf_netdev_command {
 	 * is equivalent to XDP_ATTACHED_DRV.
 	 */
 	XDP_QUERY_PROG,
+	/* Query Supported XDP_FLAGS_META_*, Called before new XDP program setup
+	 * and only if meta_flags were requested by the user to validate if the
+	 * device supports the requested flags, if not program setup will fail */
+	XDP_QUERY_META_FLAGS,
 	/* BPF program for offload callbacks, invoked at program load time. */
 	BPF_OFFLOAD_VERIFIER_PREP,
 	BPF_OFFLOAD_TRANSLATE,
@@ -842,6 +846,11 @@ struct netdev_bpf {
 			/* flags with which program was installed */
 			u32 prog_flags;
 		};
+		/* XDP_QUERY_META_FLAGS */
+		struct {
+			/* TODO u64 */
+			u32 meta_flags;
+		};
 		/* BPF_OFFLOAD_VERIFIER_PREP */
 		struct {
 			struct bpf_prog *prog;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index cf01b6824244..dfb1e26bacef 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -911,8 +911,22 @@ enum {
 #define XDP_FLAGS_MODES			(XDP_FLAGS_SKB_MODE | \
 					 XDP_FLAGS_DRV_MODE | \
 					 XDP_FLAGS_HW_MODE)
+
+/* TODO : add new netlink xdp u64 meta_flags
+ * for meta data only
+ */
+#define XDP_FLAGS_META_HASH		(1U << 16)
+#define XDP_FLAGS_META_MARK		(1U << 17)
+#define XDP_FLAGS_META_VLAN		(1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
+#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+					 XDP_FLAGS_META_MARK      | \
+					 XDP_FLAGS_META_VLAN      | \
+					 XDP_FLAGS_META_CSUM_COMPLETE)
+
 #define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
-					 XDP_FLAGS_MODES)
+					 XDP_FLAGS_MODES             | \
+					 XDP_FLAGS_META_ALL)
 
 /* These are stored into IFLA_XDP_ATTACHED on dump. */
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7444e6..8a5cc2c731ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7295,6 +7295,21 @@ static u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op)
 	return xdp.prog_attached;
 }
 
+static bool __dev_xdp_meta_supported(struct net_device *dev,
+				     bpf_op_t bpf_op, u32 meta_flags)
+{
+	struct netdev_bpf xdp = {};
+
+	 /* Backward compatible, all devices support no meta_flags */
+	if (!meta_flags)
+		return true;
+
+	xdp.command = XDP_QUERY_META_FLAGS;
+	bpf_op(dev, &xdp);
+
+	return ((xdp.meta_flags & meta_flags) == meta_flags);
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7362,12 +7377,17 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		bpf_chk = generic_xdp_install;
 
 	if (fd >= 0) {
+		u32 meta_flags = (flags & XDP_FLAGS_META_ALL);
+
 		if (bpf_chk && __dev_xdp_attached(dev, bpf_chk))
 			return -EEXIST;
 		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
 		    __dev_xdp_attached(dev, bpf_op))
 			return -EBUSY;
 
+		if (!__dev_xdp_meta_supported(dev, bpf_op, meta_flags))
+			return -EINVAL;
+
 		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
 					     bpf_op == ops->ndo_bpf);
 		if (IS_ERR(prog))
-- 
2.17.0


* [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

The idea of this patch is to define a well-known structure for the XDP meta
data fields' format and offset placement inside the xdp data meta buffer.

For that, the driver needs some static information to know what to provide
and where. Enter struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and are assigned
only for the requested meta data.
Example: flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

xdp_md_info_arr mdi = {
	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
	[XDP_DATA_META_CSUM_COMPLETE] = {.offset = 0, .present = 0},
}

i.e. the hash fields will always appear first, followed by the vlan, in
every xdp_md.

Once requested to provide xdp meta data, the device driver will use a
pre-built xdp_md_info_arr, built via xdp_md_info_build on xdp setup, which
tells the driver the offset of each meta data item.

This patch also provides helper functions for device drivers to store meta
data into the xdp_buff, and helper functions for XDP programs to fetch
specific xdp meta data from the xdp_md buffer.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |   1 +
 include/net/xdp.h            |   6 ++
 include/uapi/linux/bpf.h     | 114 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/if_link.h |  16 +++--
 net/core/dev.c               |  21 +++++++
 5 files changed, 152 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fc8b6ce48a0f..172363310ade 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -838,6 +838,7 @@ struct netdev_bpf {
 			u32 flags;
 			struct bpf_prog *prog;
 			struct netlink_ext_ack *extack;
+			xdp_md_info_arr md_info;
 		};
 		/* XDP_QUERY_PROG */
 		struct {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 2deea7166a34..afe302613ae1 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff *xdp)
 	xdp->data_meta = xdp->data + 1;
 }
 
+static __always_inline void
+xdp_reset_data_meta(struct xdp_buff *xdp)
+{
+	xdp->data_meta = xdp->data_hard_start;
+}
+
 static __always_inline bool
 xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..e320e7421565 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2353,6 +2353,120 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 };
 
+enum {
+	XDP_DATA_META_HASH = 0,
+	XDP_DATA_META_MARK,
+	XDP_DATA_META_VLAN,
+	XDP_DATA_META_CSUM_COMPLETE,
+
+	XDP_DATA_META_MAX,
+};
+
+struct xdp_md_hash {
+	__u32 hash;
+	__u8  type;
+} __attribute__((aligned(4)));
+
+struct xdp_md_mark {
+	__u32 mark;
+} __attribute__((aligned(4)));
+
+struct xdp_md_vlan {
+	__u16 vlan;
+} __attribute__((aligned(4)));
+
+struct xdp_md_csum {
+	__u16 csum;
+} __attribute__((aligned(4)));
+
+static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
+	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
+	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
+	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
+	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE */
+};
+
+struct xdp_md_info {
+	__u16 present:1;
+	__u16 offset:15; /* offset from data_meta in xdp_md buffer */
+};
+
+/* XDP meta data offsets info array
+ * present bit describes if a meta data is or will be present in xdp_md buff
+ * offset describes where a meta data is or should be placed in xdp_md buff
+ *
+ * Kernel builds this array using xdp_md_info_build helper on demand.
+ * User space builds it statically in the xdp program.
+ */
+typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
+
+static __always_inline __u8
+xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
+{
+	return mdi[meta_data].present;
+}
+
+static __always_inline void
+xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32 hash, __u8 htype)
+{
+	struct xdp_md_hash *hash_md;
+
+	hash_md = (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+	hash_md->hash = hash;
+	hash_md->type = htype;
+}
+
+static __always_inline struct xdp_md_hash *
+xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_mark(xdp_md_info_arr mdi, void *data_meta, __u32 mark)
+{
+	struct xdp_md_mark *mark_md;
+
+	mark_md = (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+	mark_md->mark = mark;
+}
+
+static __always_inline struct xdp_md_mark *
+xdp_data_meta_get_mark(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_vlan(xdp_md_info_arr mdi, void *data_meta, __u16 vlan_tci)
+{
+	struct xdp_md_vlan *vlan_md;
+
+	vlan_md = (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+	vlan_md->vlan = vlan_tci;
+}
+
+static __always_inline struct xdp_md_vlan *
+xdp_data_meta_get_vlan(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_csum_complete(xdp_md_info_arr mdi, void *data_meta, __u16 csum)
+{
+	struct xdp_md_csum *csum_md;
+
+	csum_md = (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+	csum_md->csum = csum;
+}
+
+static __always_inline struct xdp_md_csum *
+xdp_data_meta_get_csum_complete(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+}
+
 enum sk_action {
 	SK_DROP = 0,
 	SK_PASS,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index dfb1e26bacef..5bdc07bfad35 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/netlink.h>
+#include <linux/bpf.h>
 
 /* This struct should be in sync with struct rtnl_link_stats64 */
 struct rtnl_link_stats {
@@ -913,13 +914,16 @@ enum {
 					 XDP_FLAGS_HW_MODE)
 
 /* TODO : add new netlink xdp u64 meta_flags
- * for meta data only
+ * for meta data only & build according XDP_DATA_META_* enums
  */
-#define XDP_FLAGS_META_HASH		(1U << 16)
-#define XDP_FLAGS_META_MARK		(1U << 17)
-#define XDP_FLAGS_META_VLAN		(1U << 18)
-#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
-#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+
+#define XDP_FLAGS_META_SHIFT            (16)
+#define XDP_FLAG_META(FLAG)             ((1U << XDP_DATA_META_##FLAG) << XDP_FLAGS_META_SHIFT)
+#define XDP_FLAGS_META_HASH             XDP_FLAG_META(HASH)
+#define XDP_FLAGS_META_MARK             XDP_FLAG_META(MARK)
+#define XDP_FLAGS_META_VLAN             XDP_FLAG_META(VLAN)
+#define XDP_FLAGS_META_CSUM_COMPLETE    XDP_FLAG_META(CSUM_COMPLETE)
+#define XDP_FLAGS_META_ALL              (XDP_FLAGS_META_HASH      | \
 					 XDP_FLAGS_META_MARK      | \
 					 XDP_FLAGS_META_VLAN      | \
 					 XDP_FLAGS_META_CSUM_COMPLETE)
diff --git a/net/core/dev.c b/net/core/dev.c
index 8a5cc2c731ec..141dd6632b00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7310,6 +7310,25 @@ static bool __dev_xdp_meta_supported(struct net_device *dev,
 	return ((xdp.meta_flags & meta_flags) == meta_flags);
 }
 
+static void xdp_md_info_build(xdp_md_info_arr mdi, u32 xdp_flags_meta)
+{
+	u16 offset = 0;
+	int i;
+
+	for (i = 0; i < XDP_DATA_META_MAX; i++) {
+		u32 meta_data_flag = BIT(i) << XDP_FLAGS_META_SHIFT;
+
+		if (!(meta_data_flag & xdp_flags_meta)) {
+			mdi[i].present = 0;
+			continue;
+		}
+
+		mdi[i].present = 1;
+		mdi[i].offset = offset;
+		offset += xdp_md_size[i];
+	}
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7325,6 +7344,8 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 	xdp.flags = flags;
 	xdp.prog = prog;
 
+	xdp_md_info_build(xdp.md_info, flags & XDP_FLAGS_META_ALL);
+
 	return bpf_op(dev, &xdp);
 }
 
-- 
2.17.0


* [RFC bpf-next 3/6] net/mlx5e: Store xdp flags and meta data info
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Make the xdp flags and meta data info available to the RQ's data path,
to enable xdp meta data offloads in the next two patches.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 10 +++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 52 +++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  2 +-
 3 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index eb9eb7aa953a..5893acfae307 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -236,6 +236,12 @@ enum mlx5e_priv_flag {
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
 #endif
 
+struct mlx5e_xdp_info {
+	struct bpf_prog      *prog;
+	u32                   flags;
+	xdp_md_info_arr       md_info;
+};
+
 struct mlx5e_params {
 	u8  log_sq_size;
 	u8  rq_wq_type;
@@ -257,7 +263,7 @@ struct mlx5e_params {
 	bool tx_dim_enabled;
 	u32 lro_timeout;
 	u32 pflags;
-	struct bpf_prog *xdp_prog;
+	struct mlx5e_xdp_info xdp;
 	unsigned int sw_mtu;
 	int hard_mtu;
 };
@@ -566,7 +572,7 @@ struct mlx5e_rq {
 	struct net_dim         dim; /* Dynamic Interrupt Moderation */
 
 	/* XDP */
-	struct bpf_prog       *xdp_prog;
+	struct mlx5e_xdp_info  xdp;
 	unsigned int           hw_mtu;
 	struct mlx5e_xdpsq     xdpsq;
 	DECLARE_BITMAP(flags, 8);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 56c1b6f5593e..8debae6b9cab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -96,7 +96,7 @@ bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
 
 static u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params)
 {
-	if (!params->xdp_prog) {
+	if (!params->xdp.prog) {
 		u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 		u16 rq_headroom = MLX5_RX_HEADROOM + NET_IP_ALIGN;
 
@@ -170,7 +170,7 @@ static u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
 static u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
 				 struct mlx5e_params *params)
 {
-	u16 linear_rq_headroom = params->xdp_prog ?
+	u16 linear_rq_headroom = params->xdp.prog ?
 		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
 	bool is_linear_skb;
 
@@ -205,7 +205,7 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
 {
 	return mlx5e_check_fragmented_striding_rq_cap(mdev) &&
 		!MLX5_IPSEC_DEV(mdev) &&
-		!(params->xdp_prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
+		!(params->xdp.prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
 }
 
 void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
@@ -489,11 +489,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->mdev    = mdev;
 	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 	rq->stats   = &c->priv->channel_stats[c->ix].rq;
+	rq->xdp     = params->xdp;
 
-	rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL;
-	if (IS_ERR(rq->xdp_prog)) {
-		err = PTR_ERR(rq->xdp_prog);
-		rq->xdp_prog = NULL;
+	rq->xdp.prog = params->xdp.prog ? bpf_prog_inc(params->xdp.prog) : NULL;
+	if (IS_ERR(rq->xdp.prog)) {
+		err = PTR_ERR(rq->xdp.prog);
+		rq->xdp.prog = NULL;
 		goto err_rq_wq_destroy;
 	}
 
@@ -501,7 +502,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (err < 0)
 		goto err_rq_wq_destroy;
 
-	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
+	rq->buff.map_dir = rq->xdp.prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
 	pool_size = 1 << params->log_rq_mtu_frames;
 
@@ -679,8 +680,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	}
 
 err_rq_wq_destroy:
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
 		page_pool_destroy(rq->page_pool);
@@ -693,8 +694,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 {
 	int i;
 
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
@@ -1906,7 +1907,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->netdev   = priv->netdev;
 	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
 	c->num_tc   = params->num_tc;
-	c->xdp      = !!params->xdp_prog;
+	c->xdp      = !!params->xdp.prog;
 	c->stats    = &priv->channel_stats[ix].ch;
 
 	mlx5_vector2eqn(priv->mdev, ix, &eqn, &irq);
@@ -4090,12 +4091,12 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 	queue_work(priv->wq, &priv->tx_timeout_work);
 }
 
-static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+static int mlx5e_xdp_set(struct net_device *netdev, struct mlx5e_xdp_info *xdp)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
-	struct bpf_prog *old_prog;
-	int err = 0;
+	struct bpf_prog *old_prog, *prog = xdp->prog;
 	bool reset, was_opened;
+	int err = 0;
 	int i;
 
 	mutex_lock(&priv->state_lock);
@@ -4114,7 +4115,7 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 
 	was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
 	/* no need for full reset when exchanging programs */
-	reset = (!priv->channels.params.xdp_prog || !prog);
+	reset = (!priv->channels.params.xdp.prog || !prog);
 
 	if (was_opened && reset)
 		mlx5e_close_locked(netdev);
@@ -4132,10 +4133,12 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	/* exchange programs, extra prog reference we got from caller
 	 * as long as we don't fail from this point onwards.
 	 */
-	old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	old_prog = xchg(&priv->channels.params.xdp.prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
+	priv->channels.params.xdp = *xdp;
+
 	if (reset) /* change RQ type according to priv->xdp_prog */
 		mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
 
@@ -4155,7 +4158,8 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 		napi_synchronize(&c->napi);
 		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
 
-		old_prog = xchg(&c->rq.xdp_prog, prog);
+		old_prog = xchg(&c->rq.xdp.prog, prog);
+		c->rq.xdp = *xdp;
 
 		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
 		/* napi_schedule in case we have missed anything */
@@ -4177,7 +4181,7 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	u32 prog_id = 0;
 
 	mutex_lock(&priv->state_lock);
-	xdp_prog = priv->channels.params.xdp_prog;
+	xdp_prog = priv->channels.params.xdp.prog;
 	if (xdp_prog)
 		prog_id = xdp_prog->aux->id;
 	mutex_unlock(&priv->state_lock);
@@ -4187,9 +4191,15 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
+	struct mlx5e_xdp_info xdp_info;
+
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
-		return mlx5e_xdp_set(dev, xdp->prog);
+		xdp_info.prog  = xdp->prog;
+		xdp_info.flags = xdp->flags;
+		memcpy(xdp_info.md_info, xdp->md_info, sizeof(xdp->md_info));
+
+		return mlx5e_xdp_set(dev, &xdp_info);
 	case XDP_QUERY_PROG:
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d3a1dd20e41d..d12577c17011 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -925,7 +925,7 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
 				    void *va, u16 *rx_headroom, u32 *len)
 {
-	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
 	u32 act;
 	int err;
-- 
2.17.0


* [RFC bpf-next 4/6] net/mlx5e: Pass CQE to RX handlers
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

The CQE has all the meta data information from the HW.
Make it available to the driver's xdp handlers.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 15 +++++++++------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5893acfae307..98bb315fc8a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -505,7 +505,8 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+			       struct mlx5_cqe64 *cqe);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			 struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
@@ -901,10 +902,12 @@ void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			  struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d12577c17011..e37f9747a0e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -923,7 +923,8 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
-				    void *va, u16 *rx_headroom, u32 *len)
+				    void *va, u16 *rx_headroom,
+				    u32 *len, struct mlx5_cqe64 *cqe)
 {
 	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
@@ -1012,7 +1013,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	}
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, cqe);
 	rcu_read_unlock();
 	if (consumed)
 		return NULL; /* page/packet was consumed by XDP */
@@ -1155,7 +1156,8 @@ void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe)
 {
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
@@ -1202,7 +1204,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -1221,7 +1224,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	prefetch(data);
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, cqe);
 	rcu_read_unlock();
 	if (consumed) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1268,7 +1271,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
 
 	skb = rq->mpwqe.skb_from_cqe_mpwrq(rq, wi, cqe_bcnt, head_offset,
-					   page_idx);
+					   page_idx, cqe);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
-- 
2.17.0


* [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Implement XDP meta data hash and vlan support.

1. On xdp setup ndo: add support for XDP_QUERY_META_FLAGS and return the
two supported flags.
2. Use the xdp_data_meta_set_hash and xdp_data_meta_set_vlan helpers to
fill in the meta data fields.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 ++++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 30 ++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8debae6b9cab..3d1066a953cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4189,6 +4189,9 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	return prog_id;
 }
 
+#define MLX5E_SUPPORTED_XDP_META_FLAGS  \
+             (XDP_FLAGS_META_HASH  | XDP_FLAGS_META_VLAN)
+
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct mlx5e_xdp_info xdp_info;
@@ -4204,6 +4207,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
 		return 0;
+	case XDP_QUERY_META_FLAGS:
+		xdp->meta_flags = MLX5E_SUPPORTED_XDP_META_FLAGS;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index e37f9747a0e3..1f3e934d0dd8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -920,6 +920,29 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 	return true;
 }
 
+static void
+mlx5e_xdp_fill_data_meta(xdp_md_info_arr mdi,  void *data_meta, struct mlx5_cqe64 *cqe)
+{
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_HASH))
+	{
+		u8 cht = cqe->rss_hash_type;
+		int ht = (cht & CQE_RSS_HTYPE_L4) ? PKT_HASH_TYPE_L4 :
+			 (cht & CQE_RSS_HTYPE_IP) ? PKT_HASH_TYPE_L3 :
+						    PKT_HASH_TYPE_NONE;
+		u32 hash = be32_to_cpu(cqe->rss_hash_result);
+
+		xdp_data_meta_set_hash(mdi, data_meta, hash, ht);
+	}
+
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_VLAN))
+	{
+		u16 vlan = (!!cqe_has_vlan(cqe) * VLAN_TAG_PRESENT) |
+			   be16_to_cpu(cqe->vlan_info);
+
+		xdp_data_meta_set_vlan(mdi, data_meta, vlan);
+	}
+}
+
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
@@ -935,11 +958,16 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		return false;
 
 	xdp.data = va + *rx_headroom;
-	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
 	xdp.data_hard_start = va;
 	xdp.rxq = &rq->xdp_rxq;
 
+	if (rq->xdp.flags & XDP_FLAGS_META_ALL) {
+		xdp_reset_data_meta(&xdp);
+		mlx5e_xdp_fill_data_meta(rq->xdp.md_info, xdp.data_meta, cqe);
+	} else
+		xdp_set_data_meta_invalid(&xdp);
+
 	act = bpf_prog_run_xdp(prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
-- 
2.17.0


* [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Add a new program (prog_num = 4) that will not parse packets and will
use the meta data hash to spread/redirect traffic into different cpus.

For the new program we set on bpf_set_link_xdp_fd:
	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

On mlx5 it will succeed since mlx5 already supports these flags.

The new program will read the value of the hash from the data_meta
pointer from the xdp_md and will use it to compute the destination cpu.

Note: I didn't test this patch to show redirect works with the hash!
I only used it to see that the hash and vlan values are set correctly
by the driver and can be seen by the xdp program.

* I faced some difficulties reading the hash value using the helper
functions defined in the previous patches, but once I used the same logic
without these functions it worked! Will have to figure this out later.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 samples/bpf/xdp_redirect_cpu_kern.c | 67 +++++++++++++++++++++++++++++
 samples/bpf/xdp_redirect_cpu_user.c |  7 +++
 2 files changed, 74 insertions(+)

diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
index 303e9e7161f3..d6b3f55f342a 100644
--- a/samples/bpf/xdp_redirect_cpu_kern.c
+++ b/samples/bpf/xdp_redirect_cpu_kern.c
@@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md *ctx)
 	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
 }
 
+#if 0
+xdp_md_info_arr mdi = {
+	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
+	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
+};
+#endif
+
+SEC("xdp_cpu_map4_hash_separate")
+int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
+{
+	void *data_meta = (void *)(long)ctx->data_meta;
+	void *data_end  = (void *)(long)ctx->data_end;
+	void *data      = (void *)(long)ctx->data;
+	struct xdp_md_hash *hash;
+	struct xdp_md_vlan *vlan;
+	struct datarec *rec;
+	u32 cpu_dest = 0;
+	u32 cpu_idx = 0;
+	u32 *cpu_lookup;
+	u32 key = 0;
+
+	/* Count RX packet in map */
+	rec = bpf_map_lookup_elem(&rx_cnt, &key);
+	if (!rec)
+		return XDP_ABORTED;
+	rec->processed++;
+
+	/* for some reason this code fails to be verified */
+#if 0
+	hash = xdp_data_meta_get_hash(mdi, data_meta);
+	if (hash + 1 > data)
+		return XDP_ABORTED;
+
+	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
+	if (vlan + 1 > data)
+		return XDP_ABORTED;
+#endif
+
+	/* Work around for the above code */
+	hash = data_meta; /* since we know hash will appear first */
+        if (hash + 1 > data)
+		return XDP_ABORTED;
+
+#if 0
+	// Just for testing
+	/* We know that vlan will appear after the hash */
+	vlan = (void *)((char *)data_meta + sizeof(*hash));
+	if (vlan + 1 > data) {
+		return XDP_ABORTED;
+	}
+#endif
+
+	cpu_idx = reciprocal_scale(hash->hash, MAX_CPUS);
+
+	cpu_lookup = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
+	if (!cpu_lookup)
+		return XDP_ABORTED;
+	cpu_dest = *cpu_lookup;
+
+	if (cpu_dest >= MAX_CPUS) {
+		rec->issue++;
+		return XDP_ABORTED;
+	}
+
+	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
+}
+
 SEC("xdp_cpu_map4_ddos_filter_pktgen")
 int  xdp_prognum4_ddos_filter_pktgen(struct xdp_md *ctx)
 {
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index f6efaefd485b..3429215d5a7b 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -679,6 +679,13 @@ int main(int argc, char **argv)
 		return EXIT_FAIL_OPTION;
 	}
 
+	/*
+	 * prog_num 4 requires xdp meta data hash
+	 * Vlan is not required but added just for testing..
+	 */
+	if (prog_num == 4)
+		xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
+
 	/* Remove XDP program when program is interrupted */
 	signal(SIGINT, int_exit);
 
-- 
2.17.0


* Re: [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
From: Jesper Dangaard Brouer @ 2018-06-27 10:59 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Daniel Borkmann, neerav.parikh, pjwaskiewicz,
	ttoukan.linux, Tariq Toukan, alexander.h.duyck,
	peter.waskiewicz.jr, Opher Reviv, Rony Efraim, netdev,
	Saeed Mahameed, brouer

On Tue, 26 Jun 2018 19:46:15 -0700
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> Add a new program (prog_num = 4) that will not parse packets and will
> use the meta data hash to spread/redirect traffic into different cpus.

You cannot "steal" prognum 4, as it is already used for
"xdp_prognum4_ddos_filter_pktgen".  Please append your new prog as #5.

> For the new program we set on bpf_set_link_xdp_fd:
> 	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> On mlx5 it will succeed since mlx5 already supports these flags.
> 
> The new program will read the value of the hash from the data_meta
> pointer from the xdp_md and will use it to compute the destination cpu.
> 
> Note: I didn't test this patch to show redirect works with the hash!
> I only used it to see that the hash and vlan values are set correctly
> by the driver and can be seen by the xdp program.
> 
> * I faced some difficulties to read the hash value using the helper
> functions defined in the previous patches, but once i used the same logic
> with out these functions it worked ! Will have to figure this out later.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  samples/bpf/xdp_redirect_cpu_kern.c | 67 +++++++++++++++++++++++++++++
>  samples/bpf/xdp_redirect_cpu_user.c |  7 +++
>  2 files changed, 74 insertions(+)
> 
> diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
> index 303e9e7161f3..d6b3f55f342a 100644
> --- a/samples/bpf/xdp_redirect_cpu_kern.c
> +++ b/samples/bpf/xdp_redirect_cpu_kern.c
> @@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md *ctx)
>  	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
>  }
>  
> +#if 0
> +xdp_md_info_arr mdi = {
> +	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> +	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
> +};
> +#endif

Sorry, no global variables avail in the generated BPF byte-code.

> +SEC("xdp_cpu_map4_hash_separate")
> +int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
> +{
> +	void *data_meta = (void *)(long)ctx->data_meta;
> +	void *data_end  = (void *)(long)ctx->data_end;
> +	void *data      = (void *)(long)ctx->data;
> +	struct xdp_md_hash *hash;
> +	struct xdp_md_vlan *vlan;
> +	struct datarec *rec;
> +	u32 cpu_dest = 0;
> +	u32 cpu_idx = 0;
> +	u32 *cpu_lookup;
> +	u32 key = 0;
> +
> +	/* Count RX packet in map */
> +	rec = bpf_map_lookup_elem(&rx_cnt, &key);
> +	if (!rec)
> +		return XDP_ABORTED;
> +	rec->processed++;
> +
> +	/* for some reason this code fails to be verified */
> +#if 0
> +	hash = xdp_data_meta_get_hash(mdi, data_meta);

This will not work, because it is not implemented as a proper
BPF-helper call.

First, you currently store the xdp_md_info_arr inside the driver, which
makes it hard for a helper to access this.  For helper access, we could
store this in xdp_rxq_info.

Second, in your design it looks like you are introducing a helper per
possible item in xdp_md_info_arr.  I think we can reduce this to a
single helper that takes an XDP_DATA_META_xxx flag and returns an
offset.  (The helper could return a direct pointer, but I don't think
the verifier can handle that, as it needs to "see" this is related to the
data_meta pointer, and that we do the proper boundary checks.)

The BPF prog already has direct memory access to the data_meta area,
and all it really needs is an offset.  Sure, the XDP/bpf programmer
could just calculate these offsets as constants, and remember to load
the XDP prog with the flags that correspond to the calculated offsets.

But I think we can do something even smarter... 

It should be possible to convert/patch the BPF instructions of the
helper call that returns an offset, to avoid the call and instead
either (1) provide the offset as a constant/IMM or (2) emit BPF
instructions that do the lookup in xdp_md_info_arr.
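
Roughly, from the BPF program side such a single helper could be used like
below; the helper name and signature are made up, nothing like this exists
today:

	/* Hypothetical helper:
	 *   long bpf_xdp_meta_offset(struct xdp_md *ctx, __u32 md_type);
	 * returns the offset of @md_type inside the data_meta area, or a
	 * negative value when it was not requested/provided.
	 * data_meta/data are the usual pointers derived from ctx.
	 */
	long off = bpf_xdp_meta_offset(ctx, XDP_DATA_META_HASH);
	struct xdp_md_hash *hash;

	if (off < 0)
		return XDP_PASS;

	hash = data_meta + off;
	if ((void *)(hash + 1) > data)	/* normal data_meta boundary check */
		return XDP_ABORTED;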


> +	if (hash + 1 > data)
> +		return XDP_ABORTED;
> +
> +	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
> +	if (vlan + 1 > data)
> +		return XDP_ABORTED;
> +#endif
> +
> +	/* Work around for the above code */
> +	hash = data_meta; /* since we know hash will appear first */
> +        if (hash + 1 > data)
> +		return XDP_ABORTED;
> +
> +#if 0
> +	// Just for testing
> +	/* We know that vlan will appear after the hash */
> +	vlan = (void *)((char *)data_meta + sizeof(*hash));
> +	if (vlan + 1 > data) {
> +		return XDP_ABORTED;
> +	}
> +#endif

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Jesper Dangaard Brouer @ 2018-06-27 14:15 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Daniel Borkmann, neerav.parikh, pjwaskiewicz,
	ttoukan.linux, Tariq Toukan, alexander.h.duyck,
	peter.waskiewicz.jr, Opher Reviv, Rony Efraim, netdev,
	Saeed Mahameed, brouer

On Tue, 26 Jun 2018 19:46:11 -0700
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 2deea7166a34..afe302613ae1 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff *xdp)
>  	xdp->data_meta = xdp->data + 1;
>  }
>  
> +static __always_inline void
> +xdp_reset_data_meta(struct xdp_buff *xdp)
> +{
> +	xdp->data_meta = xdp->data_hard_start;
> +}

This is WRONG ... it should be:

 xdp->data_meta = xdp->data;

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* RE: [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
From: Parikh, Neerav @ 2018-06-27 16:42 UTC (permalink / raw)
  To: Saeed Mahameed, Jesper Dangaard Brouer, Alexei Starovoitov,
	Daniel Borkmann
  Cc: pjwaskiewicz, ttoukan.linux, Tariq Toukan, Duyck, Alexander H,
	Waskiewicz Jr, Peter, Opher Reviv, Rony Efraim, netdev,
	Saeed Mahameed

Thanks Saeed for starting this thread  :)
My comments inline. 

> -----Original Message-----
> From: Saeed Mahameed [mailto:saeedm@dev.mellanox.co.il]
> Sent: Tuesday, June 26, 2018 7:46 PM
> To: Jesper Dangaard Brouer <brouer@redhat.com>; Alexei Starovoitov
> <alexei.starovoitov@gmail.com>; Daniel Borkmann
> <borkmann@iogearbox.net>
> Cc: Parikh, Neerav <neerav.parikh@intel.com>; pjwaskiewicz@gmail.com;
> ttoukan.linux@gmail.com; Tariq Toukan <tariqt@mellanox.com>; Duyck,
> Alexander H <alexander.h.duyck@intel.com>; Waskiewicz Jr, Peter
> <peter.waskiewicz.jr@intel.com>; Opher Reviv <opher@mellanox.com>; Rony
> Efraim <ronye@mellanox.com>; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>
> Subject: [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
> 
> Hello,
> 
> Although it is far from being close to completion, this series provides
> the means and infrastructure to enable device drivers to share packets
> meta data information and accelerations with XDP programs,
> such as hash, stripped vlan, checksum, flow mark, packet header types,
> etc ..
> 
> The series is still work in progress and sending it now as RFC in order
> to get early feedback and to discuss the design, structures and UAPI.
> 
> For now the general idea is to help XDP programs accelerate, and provide
> them with meta data that already available from the HW or device driver
> to save cpu cycles on packet headers and data processing and decision
> making, aside of that we want to avoid having a fixed size meta data
> structure that wastes a lot of buffer space for stuff that the xdp program
> might not need, like the current "elephant" SKB fields, kidding :) ..
> 
> So my idea here is to provide a dynamic mechanism to allow XDP programs
> on xdp load to request only the specific meta data they need, and
> the device driver or netdevice will provide them in a predefined order
> in the xdp_buff/xdp_md data meta section.
>
> And here is how it is done and what i would like to discuss:
> 
> 1. In the first patch: added the infrastructure to request predefined meta data
> flags on xdp load indicating that the XDP program is going to need them.
> 
> 1.1) In this patch I am using the current u32 bit IFLA_XDP_FLAGS,
> TODO: this needs to be improved in order to allow more meta data flags,
> maybe use a new dedicated flags ?
> 
> 1.2) Device driver that want to support xdp meta data should implement
> new XDP command XDP_QUERY_META_FLAGS and report the meta data flags
> they
> can support.
> 
> 1.3) the kernel will cross check the requested flags with the device's
> supported flags and will fail the xdp load in case of discrepancy.
> 
> Question : do we want this ? or is it better to return to the program
> the actual supported flags and let it decide ?
> 
>
The work we are doing in this direction does not assume any specific flags;
instead, the XDP program requests the specific "meta data" it needs, and if
the driver (or HW) can provide it, the program is allowed to load.
If we put the flags and capabilities in the kernel, then it will depend on
the control program that loads the XDP program to pass on that information.
If the XDP program carries that request built-in, say via an ELF "section"
(similar to maps), then the program can be loaded independently and knows
what kind of meta data it wants and receives (a rough sketch of the idea
follows below).
If the meta data is not supported by the device (driver or software mode)
then that would perhaps fail the program load.
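
A minimal sketch of what such an in-program declaration might look like,
purely for illustration: the "xdp_metadata" section name and the request
struct are made up here, only the XDP_FLAGS_META_* names are taken from
this RFC, and nothing below is existing API:

struct xdp_md_request {
	__u32 flags;	/* meta data the program needs */
};

/* Placed in a dedicated ELF section, similar to how maps are declared,
 * so the loader can hand the request to the driver without a separate
 * control program.  SEC() is the usual helper macro from bpf_helpers.h.
 */
struct xdp_md_request md_request SEC("xdp_metadata") = {
	.flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN,
};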
 
> 2. This is the interesting part: in the 2nd patch Added the meta data
> set/get infrastructure to allow device drivers populate them into xdp_buff
> data meta in a well defined and structured format and let the xdp program
> read them according to the predefined format/structure, the idea here is
> that the xdp program and the device driver will share a static information
> offsets array that will define where each meta data will sit inside
> xdp_buff data meta, such structure will be decided once on xdp load.
> 
> Enters struct xdp_md_info and xdp_md_info_arr:
> 
> struct xdp_md_info {
>        __u16 present:1;
>        __u16 offset:15; /* offset from data_meta in xdp_md buffer */
> };
We were trying not to define a generic structure if possible, and were
working towards a model similar to current usage, where the XDP program
produces meta data that is consumed by the eBPF program on the TC classifier
without hard-coding any structure. It is purely a producer-consumer model,
with the kernel helping transfer that data from one end to the other rather
than providing a structure.
So the NIC produces the meta data requested by the XDP program, as opposed
to producing some meta data that then gets translated in the driver into a
generic structure.


> 
> /* XDP meta data offsets info array
>  * present bit describes if a meta data is or will be present in xdp_md buff
>  * offset describes where a meta data is or should be placed in xdp_md buff
>  *
>  * Kernel builds this array using xdp_md_info_build helper on demand.
>  * User space builds it statically in the xdp program.
>  */
> typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> 
> Offsets in xdp_md_info_arr are always in ascending order and only for
> requested meta data:
> example : for XDP prgram that requested the following flags:
> flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> the offsets array will be:
> 
> xdp_md_info_arr mdi = {
>         [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
>         [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
>         [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present
> = 1},
>         [XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> }
> 
> For this example: hash fields will always appear first then the vlan for every
> xdp_md.
> 
> Once requested to provide xdp meta data, device driver will use a pre-built
> xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> xdp_md_info_arr will tell the driver what is the offset of each meta data.
> The user space XDP program will use a similar xdp_md_info_arr to
> statically know what is the offset of each meta data.
> 
> *For future meta data they will be added to the end of the array with
> higher flags value.
> 
> This patch also provides helper functions for the device drivers to store
> meta data into xdp_buff, and helper function for XDP programs to fetch
> specific xdp meta data from xdp_md buffer.
> 
> Question: currently the XDP program is responsible to build the static
> meta data offsets array "xdp_md_info_arr" and the kernel will build it
> for the netdevice, if we are going to choose this direction, should we
> somehow share the same xdp_md_info_arr built by the kernel with the xdp
> program ?
> 
> 3. The last mlx5e patch is the actual show case of how the device driver
> will add the support for xdp meta data, it is pretty simple and straight
> forward ! The first two mlx5e patches are just small refactoring to make
> the xdp_md_info_arr and packet completion information available in the rx
> xdp handlers data path.
> 
> 4. Just added a small example to demonstrate how XDP program can request
> meta data on xdp load and how it can read them on the critical path.
> of course more examples are needed and some performance numbers.
> Exciting use cases such as:
>   - using flow mark to allow fast white/black listing lookups.
>   - flow mark to accelerate DDos prevention.
>   - Generic Data path: Use the meta data from the xdp_buff to build SKBs
> !(Jesper's Idea)
>   - using hash type to know the packet headers and encapsulation without
>     touching the packet data at all.
>   - using packet hash to do RPS and XPS like cpu redirecting.
>   - and many more.
> 
> 
> More ideas:
> 
> From Jesper:
> =========================
> Give XDP more rich information about the hash,
> By reducing the 'ht' value to the PKT_HASH_TYPE's we are loosing information.
> 
> I happen to know your hardware well; CQE_RSS_HTYPE_IP tell us if this
> is IPv4 or IPv6.  And CQE_RSS_HTYPE_L4 tell us if this is TCP, UDP or
> IPSEC. (And you have another bit telling of this is IPv6 with extension
> headers).
> 
Yes, commodity NICs have a lot of rich information about packet parsing and
protocol identification. It would be worth exposing this to XDP programs so
that they can take advantage of the work the NIC has already done rather
than redoing all of it in the XDP program.
Basically, that will allow XDP programs to focus more on the "business
logic" of whatever they decide to do with the packet.


> If we don't want to invent our own xdp_hash_types, we can simply adopt
> the RSS Hashing Types defined by Microsoft:
>  https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-
> hashing-types
> 
> An interesting part of the RSS standard, is that the hash type can help
> identify if this is a fragment. (XDP could use this info to avoid
> touching payload and react, e.g. drop fragments, or redirect all
> fragments to another CPU, or skip parsing in XDP and defer to network
> stack via XDP_PASS).
> 
Yes, and it would also be interesting if an XDP program could monitor any
packet errors that were captured by the NIC. Traditionally the driver may
decide to drop/consume such packets, but exposing them would be a good
use case for debugging.

> By using the RSS standard, we do e.g. loose the ability to say this is
> IPSEC traffic, even-though your HW supports this.
> 
> I have tried to implemented my own (non-dynamic) XDP RX-types UAPI here:
>  https://marc.info/?l=linux-netdev&m=149512213531769
>  https://marc.info/?l=linux-netdev&m=149512213631774
> =========================
> 
> Thanks,
> Saeed.
> 
> Saeed Mahameed (6):
>   net: xdp: Add support for meta data flags requests
>   net: xdp: RX meta data infrastructure
>   net/mlx5e: Store xdp flags and meta data info
>   net/mlx5e: Pass CQE to RX handlers
>   net/mlx5e: Add XDP RX meta data support
>   samples/bpf: Add meta data hash example to xdp_redirect_cpu
> 
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
>  include/linux/netdevice.h                     |  10 ++
>  include/net/xdp.h                             |   6 +
>  include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
>  include/uapi/linux/if_link.h                  |  20 ++-
>  net/core/dev.c                                |  41 +++++++
>  samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
>  samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
>  10 files changed, 354 insertions(+), 35 deletions(-)
> 
> --
> 2.17.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-06-27 14:15   ` Jesper Dangaard Brouer
@ 2018-06-27 17:55     ` Saeed Mahameed
  2018-07-02  8:01       ` Daniel Borkmann
  0 siblings, 1 reply; 36+ messages in thread
From: Saeed Mahameed @ 2018-06-27 17:55 UTC (permalink / raw)
  To: saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim, borkmann,
	Tariq Toukan, neerav.parikh, Opher Reviv, alexei.starovoitov,
	pjwaskiewicz, netdev, ttoukan.linux

On Wed, 2018-06-27 at 16:15 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 26 Jun 2018 19:46:11 -0700
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> 
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 2deea7166a34..afe302613ae1 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff
> > *xdp)
> >  	xdp->data_meta = xdp->data + 1;
> >  }
> >  
> > +static __always_inline void
> > +xdp_reset_data_meta(struct xdp_buff *xdp)
> > +{
> > +	xdp->data_meta = xdp->data_hard_start;
> > +}
> 
> This is WRONG ... it should be:
> 
>  xdp->data_meta = xdp->data;
> 

Maybe the name of the function is not suitable for the use case.
I need to set xdp->data_meta = xdp->data in the driver to start storing
meta data.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
  2018-06-27 10:59   ` Jesper Dangaard Brouer
@ 2018-06-27 18:04     ` Saeed Mahameed
  0 siblings, 0 replies; 36+ messages in thread
From: Saeed Mahameed @ 2018-06-27 18:04 UTC (permalink / raw)
  To: saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim, borkmann,
	Tariq Toukan, neerav.parikh, Opher Reviv, alexei.starovoitov,
	pjwaskiewicz, netdev, ttoukan.linux

On Wed, 2018-06-27 at 12:59 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 26 Jun 2018 19:46:15 -0700
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> 
> > Add a new program (prog_num = 4) that will not parse packets and
> > will
> > use the meta data hash to spread/redirect traffic into different
> > cpus.
> 
> You cannot "steal" prognum 4, as it is already used for
> "xdp_prognum4_ddos_filter_pktgen".  Please append your new prog as
> #5.
> 

Sure.

> > For the new program we set on bpf_set_link_xdp_fd:
> > 	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> > 
> > On mlx5 it will succeed since mlx5 already supports these flags.
> > 
> > The new program will read the value of the hash from the data_meta
> > pointer from the xdp_md and will use it to compute the destination
> > cpu.
> > 
> > Note: I didn't test this patch to show redirect works with the
> > hash!
> > I only used it to see that the hash and vlan values are set
> > correctly
> > by the driver and can be seen by the xdp program.
> > 
> > * I faced some difficulties to read the hash value using the helper
> > functions defined in the previous patches, but once i used the same
> > logic
> > with out these functions it worked ! Will have to figure this out
> > later.
> > 
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  samples/bpf/xdp_redirect_cpu_kern.c | 67
> > +++++++++++++++++++++++++++++
> >  samples/bpf/xdp_redirect_cpu_user.c |  7 +++
> >  2 files changed, 74 insertions(+)
> > 
> > diff --git a/samples/bpf/xdp_redirect_cpu_kern.c
> > b/samples/bpf/xdp_redirect_cpu_kern.c
> > index 303e9e7161f3..d6b3f55f342a 100644
> > --- a/samples/bpf/xdp_redirect_cpu_kern.c
> > +++ b/samples/bpf/xdp_redirect_cpu_kern.c
> > @@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md
> > *ctx)
> >  	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
> >  }
> >  
> > +#if 0
> > +xdp_md_info_arr mdi = {
> > +	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> > +	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct
> > xdp_md_hash), .present = 1},
> > +};
> > +#endif
> 
> Sorry, no global variables avail in the generated BPF byte-code.
> 
Yeah, I found out the hard way :), but for my final solution I would like
to somehow share the same static md info array between the netdev and the
bpf program, so this code was just experimental.

> > +SEC("xdp_cpu_map4_hash_separate")
> > +int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
> > +{
> > +	void *data_meta = (void *)(long)ctx->data_meta;
> > +	void *data_end  = (void *)(long)ctx->data_end;
> > +	void *data      = (void *)(long)ctx->data;
> > +	struct xdp_md_hash *hash;
> > +	struct xdp_md_vlan *vlan;
> > +	struct datarec *rec;
> > +	u32 cpu_dest = 0;
> > +	u32 cpu_idx = 0;
> > +	u32 *cpu_lookup;
> > +	u32 key = 0;
> > +
> > +	/* Count RX packet in map */
> > +	rec = bpf_map_lookup_elem(&rx_cnt, &key);
> > +	if (!rec)
> > +		return XDP_ABORTED;
> > +	rec->processed++;
> > +
> > +	/* for some reason this code fails to be verified */
> > +#if 0
> > +	hash = xdp_data_meta_get_hash(mdi, data_meta);
> 
> This will not work, because it is not implemented as a proper
> BPF-helper call.
> 
> First, you currently store the xdp_md_info_arr inside the driver,
> which
> makes it hard for a helper to access this.  For helper access, we
> could
> store this in xdp_rxq_info.
> 

Good Idea!

> Second, in your design it looks like you are introducing a helper per
> possible item in xdp_md_info_arr.  I think we can reduce this to a
> single helper, that takes a XDP_DATA_META_xxx flag, and returns an
> offset.  (The helper could return a direct pointer, but I don't think
> the verfier can handle that, as it need to "see" this is related to
> data_meta pointer, and that we do the proper boundry checks.).
> 

We could update the verifier to allow access to any offset between
data_meta and data_meta + offset(last meta data) + sizeof(last meta data)?
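
To make the single-helper idea concrete, here is a rough sketch of how an
XDP program might use an offset-returning helper; the helper name
bpf_xdp_meta_offset() is hypothetical, only XDP_DATA_META_HASH and
struct xdp_md_hash come from this RFC, and the verifier would still need
to learn about the bounds involved:

SEC("xdp_meta_hash")
int xdp_use_hash(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data      = (void *)(long)ctx->data;
	struct xdp_md_hash *hash;
	int off;

	/* hypothetical helper: offset of the field inside the data_meta
	 * area, or negative if the driver did not provide it
	 */
	off = bpf_xdp_meta_offset(ctx, XDP_DATA_META_HASH);
	if (off < 0)
		return XDP_PASS;

	hash = data_meta + off;
	if ((void *)(hash + 1) > data)	/* bounds check for the verifier */
		return XDP_ABORTED;

	/* use hash->hash / hash->type for cpu steering, lookups, etc. */
	return XDP_PASS;
}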

> The BPF prog already have direct memory access to the data_meta area,
> and all it really need is an offset.  Sure, the XDP/bpf programmer
> could just calculate these offsets as constants, and remember to load
> the XDP prog with the flags that corresponds to the calculated
> offsets.
> 
> But I think we can do something even smarter... 
> 
> It should be possible to convert/patch the BPF instructions, of the
> helper call that returns an offset, to instead avoid the call and
> either (1) provide the offset as a constant/IMM or (2) make BPF inst
> doing the lookup in xdp_md_info_arr.
> 

I vote (2).

> 
> > +	if (hash + 1 > data)
> > +		return XDP_ABORTED;
> > +
> > +	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
> > +	if (vlan + 1 > data)
> > +		return XDP_ABORTED;
> > +#endif
> > +
> > +	/* Work around for the above code */
> > +	hash = data_meta; /* since we know hash will appear first
> > */
> > +        if (hash + 1 > data)
> > +		return XDP_ABORTED;
> > +
> > +#if 0
> > +	// Just for testing
> > +	/* We know that vlan will appear after the hash */
> > +	vlan = (void *)((char *)data_meta + sizeof(*hash));
> > +	if (vlan + 1 > data) {
> > +		return XDP_ABORTED;
> > +	}
> > +#endif
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-06-27 17:55     ` Saeed Mahameed
@ 2018-07-02  8:01       ` Daniel Borkmann
  2018-07-03 23:52         ` Saeed Mahameed
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2018-07-02  8:01 UTC (permalink / raw)
  To: Saeed Mahameed, saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim,
	Tariq Toukan, neerav.parikh, Opher Reviv, alexei.starovoitov,
	pjwaskiewicz, netdev, ttoukan.linux, john.fastabend

On 06/27/2018 07:55 PM, Saeed Mahameed wrote:
> On Wed, 2018-06-27 at 16:15 +0200, Jesper Dangaard Brouer wrote:
>> On Tue, 26 Jun 2018 19:46:11 -0700
>> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>>
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index 2deea7166a34..afe302613ae1 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff
>>> *xdp)
>>>  	xdp->data_meta = xdp->data + 1;
>>>  }
>>>  
>>> +static __always_inline void
>>> +xdp_reset_data_meta(struct xdp_buff *xdp)
>>> +{
>>> +	xdp->data_meta = xdp->data_hard_start;
>>> +}
>>
>> This is WRONG ... it should be:
>>
>>  xdp->data_meta = xdp->data;
> 
> maybe the name of the function is not suitable for the use case.
> i need to set xdp->data_meta = xdp->data in the driver to start storing
> meta data.

xdp_set_data_meta_invalid() is a straightforward way for XDP drivers to
signal that they do not support xdp->data_meta: setting it to xdp->data + 1
will fail the checks for direct (meta) packet access in the BPF code, and at
the same time bpf_xdp_adjust_meta() will know to bail out with an error when
a program attempts to make headroom for meta data that the driver cannot
handle later on.

So later setting 'xdp->data_meta = xdp->data' to enable it is just like
setting any other of the initializers in xdp_buff, and is done that way
today in the i40e, ixgbe, ixgbevf and nfp drivers. (Theoretically it
wouldn't have to be set to exactly xdp->data; anything <= xdp->data works,
if the driver prepends info from hw in front of it that the program can then
use or later override.)
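
For illustration, the two patterns side by side (a simplified sketch, not
lifted from any specific driver; va, rx_headroom, len and meta_supported
are placeholders):

	xdp.data_hard_start = va;
	xdp.data            = va + rx_headroom;
	xdp.data_end        = xdp.data + len;

	if (!meta_supported)
		/* data_meta = data + 1: direct meta access and
		 * bpf_xdp_adjust_meta() both fail for this buffer
		 */
		xdp_set_data_meta_invalid(&xdp);
	else
		/* valid but empty; the program can grow it with
		 * bpf_xdp_adjust_meta(), or the driver may place
		 * hw-provided info below xdp.data first
		 */
		xdp.data_meta = xdp.data;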

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-06-27  2:46 ` [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure Saeed Mahameed
  2018-06-27 14:15   ` Jesper Dangaard Brouer
@ 2018-07-03 23:01   ` Alexei Starovoitov
  2018-07-04  0:57     ` Saeed Mahameed
  1 sibling, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-03 23:01 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: brouer, borkmann, tariqt, alexander.h.duyck, peter.waskiewicz.jr,
	netdev, john.fastabend

On Tue, Jun 26, 2018 at 07:46:11PM -0700, Saeed Mahameed wrote:
> The idea from this patch is to define a well known structure for XDP meta
> data fields format and offset placement inside the xdp data meta buffer.
> 
> For that driver will need some static information to know what to
> provide and where, enters struct xdp_md_info and xdp_md_info_arr:
> 
> struct xdp_md_info {
>        __u16 present:1;
>        __u16 offset:15; /* offset from data_meta in xdp_md buffer */
> };
> 
> /* XDP meta data offsets info array
>  * present bit describes if a meta data is or will be present in xdp_md buff
>  * offset describes where a meta data is or should be placed in xdp_md buff
>  *
>  * Kernel builds this array using xdp_md_info_build helper on demand.
>  * User space builds it statically in the xdp program.
>  */
> typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> 
> Offsets in xdp_md_info_arr are always in ascending order and only for
> requested meta data:
> example : flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> xdp_md_info_arr mdi = {
> 	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> 	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
> 	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
> 	[XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> }
> 
> i.e: hash fields will always appear first then the vlan for every
> xdp_md.
> 
> Once requested to provide xdp meta data, device driver will use a pre-built
> xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> xdp_md_info_arr will tell the driver what is the offset of each meta data.
> 
> This patch also provides helper functions for the device drivers to store
> meta data into xdp_buff, and helper function for XDP programs to fetch
> specific xdp meta data from xdp_md buffer.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
...
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..e320e7421565 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2353,6 +2353,120 @@ struct xdp_md {
>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>  };
>  
> +enum {
> +	XDP_DATA_META_HASH = 0,
> +	XDP_DATA_META_MARK,
> +	XDP_DATA_META_VLAN,
> +	XDP_DATA_META_CSUM_COMPLETE,
> +
> +	XDP_DATA_META_MAX,
> +};
> +
> +struct xdp_md_hash {
> +	__u32 hash;
> +	__u8  type;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_mark {
> +	__u32 mark;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_vlan {
> +	__u16 vlan;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_csum {
> +	__u16 csum;
> +} __attribute__((aligned(4)));
> +
> +static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
> +	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
> +	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
> +	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
> +	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE */
> +};
> +
> +struct xdp_md_info {
> +	__u16 present:1;
> +	__u16 offset:15; /* offset from data_meta in xdp_md buffer */
> +};
> +
> +/* XDP meta data offsets info array
> + * present bit describes if a meta data is or will be present in xdp_md buff
> + * offset describes where a meta data is or should be placed in xdp_md buff
> + *
> + * Kernel builds this array using xdp_md_info_build helper on demand.
> + * User space builds it statically in the xdp program.
> + */
> +typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> +
> +static __always_inline __u8
> +xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
> +{
> +	return mdi[meta_data].present;
> +}
> +
> +static __always_inline void
> +xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32 hash, __u8 htype)
> +{
> +	struct xdp_md_hash *hash_md;
> +
> +	hash_md = (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
> +	hash_md->hash = hash;
> +	hash_md->type = htype;
> +}
> +
> +static __always_inline struct xdp_md_hash *
> +xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
> +{
> +	return (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
> +}

I'm afraid this is not a scalable uapi.
This looks very much mlx specific. Every NIC vendor cannot add its own
struct definitions into uapi/bpf.h. It doesn't scale.
Also, doing 15-bit bitfield extraction using bpf instructions is not that
simple. mlx should consider doing plain u8 or u16 fields in firmware instead.
The metadata that "hw" provides is a joint product of the actual asic and
firmware. I suspect the format can and will change with different firmware.
Baking this stuff into uapi/bpf.h is not an option.
How about we make driver+firmware provide a BTF definition of the metadata
they can provide? There can be multiple definitions of such structs.
Then in userspace we can have a BTF->plain C converter.
(bpftool is practically ready to do that already).
Then the programmer can take such a generated C definition, add it to a .h
and include it in their programs. llvm will compile the whole thing and will
include BTF of maps, progs and this md struct in the target elf file.
During loading the kernel can check that the BTF in the elf matches
one-to-one what driver+firmware say they support.
No ambiguity and no possibility of mistake, since offsets and field names
are verified.
Every driver can have its own BTF for md and its own special features.
We can try to standardize the names (like vlan and csum), so xdp programs
can stay relatively portable across NICs.
Such an api will address exposing asic+firmware metadata to the xdp program.
Once we tackle this problem, we'll think about how to do the backward config
(firmware reconfig for a specific BTF definition of md supplied by the prog).
What do people think?
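
To make the flow a bit more concrete, a hypothetical example of what such a
generated header and its use could look like (the struct, field names and
the policy in the program are invented for illustration; only the general
BTF->C idea is from the proposal above):

/* md_desc.h, generated from the NIC's BTF by a BTF->C converter */
struct xdp_md_desc {
	__u32 flow_mark;
	__u32 hash32;
	__u16 csum;
	__u16 vlan;
};

/* in the XDP program: offsets are known at build time, single load each */
SEC("xdp")
int prog(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data      = (void *)(long)ctx->data;
	struct xdp_md_desc *md = data_meta;

	if ((void *)(md + 1) > data)
		return XDP_PASS;	/* driver provided no meta data */

	/* direct, single-load accesses to hw hints */
	if (md->hash32 == 0)
		return XDP_DROP;

	return XDP_PASS;
}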

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-02  8:01       ` Daniel Borkmann
@ 2018-07-03 23:52         ` Saeed Mahameed
  0 siblings, 0 replies; 36+ messages in thread
From: Saeed Mahameed @ 2018-07-03 23:52 UTC (permalink / raw)
  To: daniel, saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim,
	Tariq Toukan, john.fastabend, neerav.parikh, Opher Reviv,
	alexei.starovoitov, pjwaskiewicz, netdev, ttoukan.linux

On Mon, 2018-07-02 at 10:01 +0200, Daniel Borkmann wrote:
> On 06/27/2018 07:55 PM, Saeed Mahameed wrote:
> > On Wed, 2018-06-27 at 16:15 +0200, Jesper Dangaard Brouer wrote:
> > > On Tue, 26 Jun 2018 19:46:11 -0700
> > > Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > 
> > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > index 2deea7166a34..afe302613ae1 100644
> > > > --- a/include/net/xdp.h
> > > > +++ b/include/net/xdp.h
> > > > @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff
> > > > *xdp)
> > > >  	xdp->data_meta = xdp->data + 1;
> > > >  }
> > > >  
> > > > +static __always_inline void
> > > > +xdp_reset_data_meta(struct xdp_buff *xdp)
> > > > +{
> > > > +	xdp->data_meta = xdp->data_hard_start;
> > > > +}
> > > 
> > > This is WRONG ... it should be:
> > > 
> > >  xdp->data_meta = xdp->data;
> > 
> > maybe the name of the function is not suitable for the use case.
> > i need to set xdp->data_meta = xdp->data in the driver to start
> > storing
> > meta data.
> 
> The xdp_set_data_meta_invalid() is a straight forward way for XDP
> drivers
> to tell they do not support xdp->data_meta, since setting xdp->data +
> 1 will
> fail the checks for direct (meta) packet access in the BPF code and
> at the
> same time bpf_xdp_adjust_meta() will know to bail out with error when
> program
> attempts to make headroom for meta data that driver cannot handle
> later on.
> 
> So later setting 'xdp->data_meta = xdp->data' to enable it is as
> setting any
> other of the initializers in xdp_buff, and done so today in the i40e,
> ixgbe,
> ixgbevf and nfp drivers. (Theoretically it wouldn't have to be
> exactly set to
> xdp->data, but anything <= xdp->data works, if the driver would
> prepend info
> from hw in front of it that program can then use or later override.)
> 

Right! As Jesper pointed out as well, the driver should set
xdp->data_meta = xdp->data and then adjust it to the offset of the first
meta data hint.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-03 23:01   ` Alexei Starovoitov
@ 2018-07-04  0:57     ` Saeed Mahameed
  2018-07-04  7:51       ` Daniel Borkmann
  0 siblings, 1 reply; 36+ messages in thread
From: Saeed Mahameed @ 2018-07-04  0:57 UTC (permalink / raw)
  To: alexei.starovoitov, saeedm
  Cc: alexander.h.duyck, netdev, Tariq Toukan, john.fastabend, brouer,
	borkmann, peter.waskiewicz.jr

On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:
> On Tue, Jun 26, 2018 at 07:46:11PM -0700, Saeed Mahameed wrote:
> > The idea from this patch is to define a well known structure for
> > XDP meta
> > data fields format and offset placement inside the xdp data meta
> > buffer.
> > 
> > For that driver will need some static information to know what to
> > provide and where, enters struct xdp_md_info and xdp_md_info_arr:
> > 
> > struct xdp_md_info {
> >        __u16 present:1;
> >        __u16 offset:15; /* offset from data_meta in xdp_md buffer
> > */
> > };
> > 
> > /* XDP meta data offsets info array
> >  * present bit describes if a meta data is or will be present in
> > xdp_md buff
> >  * offset describes where a meta data is or should be placed in
> > xdp_md buff
> >  *
> >  * Kernel builds this array using xdp_md_info_build helper on
> > demand.
> >  * User space builds it statically in the xdp program.
> >  */
> > typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> > 
> > Offsets in xdp_md_info_arr are always in ascending order and only
> > for
> > requested meta data:
> > example : flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> > 
> > xdp_md_info_arr mdi = {
> > 	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> > 	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
> > 	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash),
> > .present = 1},
> > 	[XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> > }
> > 
> > i.e: hash fields will always appear first then the vlan for every
> > xdp_md.
> > 
> > Once requested to provide xdp meta data, device driver will use a
> > pre-built
> > xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> > xdp_md_info_arr will tell the driver what is the offset of each
> > meta data.
> > 
> > This patch also provides helper functions for the device drivers to
> > store
> > meta data into xdp_buff, and helper function for XDP programs to
> > fetch
> > specific xdp meta data from xdp_md buffer.
> > 
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> 
> ...
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 59b19b6a40d7..e320e7421565 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2353,6 +2353,120 @@ struct xdp_md {
> >  	__u32 rx_queue_index;  /* rxq->queue_index  */
> >  };
> >  
> > +enum {
> > +	XDP_DATA_META_HASH = 0,
> > +	XDP_DATA_META_MARK,
> > +	XDP_DATA_META_VLAN,
> > +	XDP_DATA_META_CSUM_COMPLETE,
> > +
> > +	XDP_DATA_META_MAX,
> > +};
> > +
> > +struct xdp_md_hash {
> > +	__u32 hash;
> > +	__u8  type;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_mark {
> > +	__u32 mark;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_vlan {
> > +	__u16 vlan;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_csum {
> > +	__u16 csum;
> > +} __attribute__((aligned(4)));
> > +
> > +static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
> > +	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
> > +	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
> > +	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
> > +	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE
> > */
> > +};
> > +
> > +struct xdp_md_info {
> > +	__u16 present:1;
> > +	__u16 offset:15; /* offset from data_meta in xdp_md buffer
> > */
> > +};
> > +
> > +/* XDP meta data offsets info array
> > + * present bit describes if a meta data is or will be present in
> > xdp_md buff
> > + * offset describes where a meta data is or should be placed in
> > xdp_md buff
> > + *
> > + * Kernel builds this array using xdp_md_info_build helper on
> > demand.
> > + * User space builds it statically in the xdp program.
> > + */
> > +typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> > +
> > +static __always_inline __u8
> > +xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
> > +{
> > +	return mdi[meta_data].present;
> > +}
> > +
> > +static __always_inline void
> > +xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32
> > hash, __u8 htype)
> > +{
> > +	struct xdp_md_hash *hash_md;
> > +
> > +	hash_md = (struct xdp_md_hash *)((char*)data_meta +
> > mdi[XDP_DATA_META_HASH].offset);
> > +	hash_md->hash = hash;
> > +	hash_md->type = htype;
> > +}
> > +
> > +static __always_inline struct xdp_md_hash *
> > +xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
> > +{
> > +	return (struct xdp_md_hash *)((char*)data_meta +
> > mdi[XDP_DATA_META_HASH].offset);
> > +}
> 
> I'm afraid this is not scalable uapi.
> This looks very much mlx specific. Every NIC vendor cannot add its
> own
> struct definitions into uapi/bpf.h. It doesn't scale.

No, this is not mlx specific; all you can find here are the standard
hash/csum/vlan/flow mark fields as described in the SKB. We want them to
be well defined so you can construct SKBs with these standard values from
xdp_md with no surprises.

I agree this is not scalable, but this patch is not about arbitrary
vendor-specific hints with unknown sizes and shapes. I understand the
demand, but we do need the notion of standard hints as well in order to
allow a (driver/fw/hw) => (xdp program) => (stack) data flow, since those
three entities must agree on some well defined hints.

Anyway, we discussed this in the last iovisor meeting and agreed that along
with the standard hints suggested by this patch we want to allow arbitrary
hw-specific hints; the question is how to have both.

> Also doing 15 bit bitfield extraction using bpf instructions is not
> that simple.
> mlx should consider doing plain u8 or u16 fields in firmware instead.
> The metadata that "hw" provides is a joint work of actual asic and
> firmware.

Right, this would be a huge flexibility improvement for future hw.
As for current HW, the driver can always translate the current format to
whatever format we decide on.

> I suspect the format can and will change with different firmware.
> Baking this stuff into uapi/bpf.h is not an option.

Yes, but we still need well defined structures for standard hints, so the
hints can flow into the stack as well. Current and future HW/FW/drivers
must respect the well known format of specific hints, such as
hash/csum/vlan, etc.

> How about we make driver+firmware provide a BTF definition of
> metadata that they
> can provide? There can be multiple definitions of such structs.
> Then in userpsace we can have BTF->plain C converter.
> (bpftool practically ready to do that already).
> Then the programmer can take such generated C definition, add it to
> .h and include
> it in their programs. llvm will compile the whole thing and will
> include BTF
> of maps, progs and this md struct in the target elf file.
> During loading the kernel can check that BTF in elf is matching one-
> to-one
> to what driver+firmware are saying they support.

Just thinking out loud, can't we do this at program load? Just run a setup
function in the xdp program to load the NIC md BTF definition into the elf
section?

> No ambiguity and no possibility of mistake, since offsets and field
> names
> are verified.

But what about the dynamic nature of this feature? Sometimes you only want
the HW/driver to provide a subset of whatever the HW can provide and save
the md buffer for other stuff.

Yes, a well defined format is favorable here, but we need to make sure
there is no computational overhead in the data path just to extract each
field! For example, if I want to know what the offset of the hash is, will
I need to parse (for every packet) the whole BTF definition of the metadata
just to find the offset of type=hash?

> Every driver can have their own BTF for md and their own special
> features.
> We can try to standardize the names (like vlan and csum), so xdp
> programs
> can stay relatively portable across NICs.

Yes this is a must.

> Such api will address exposing asic+firmware metadata to the xdp
> program.
> Once we tackle this problem, we'll think how to do the backward
> config
> (to do firmware reconfig for specific BTF definition of md supplied
> by the prog).
> What people think?
> 

For legacy HW, we can do it already in the driver and provide whatever the
prog requested; it is only a matter of translating to the BTF format in the
driver xdp setup and pushing the values accordingly into the md offsets on
the data path.

Question: how can you share the md BTF from the driver/HW with the xdp
program?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-04  0:57     ` Saeed Mahameed
@ 2018-07-04  7:51       ` Daniel Borkmann
  2018-07-05 17:18         ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2018-07-04  7:51 UTC (permalink / raw)
  To: Saeed Mahameed, alexei.starovoitov, saeedm
  Cc: alexander.h.duyck, netdev, Tariq Toukan, john.fastabend, brouer,
	borkmann, peter.waskiewicz.jr

On 07/04/2018 02:57 AM, Saeed Mahameed wrote:
> On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:
[...]
>> How about we make driver+firmware provide a BTF definition of
>> metadata that they
>> can provide? There can be multiple definitions of such structs.
>> Then in userpsace we can have BTF->plain C converter.
>> (bpftool practically ready to do that already).
>> Then the programmer can take such generated C definition, add it to
>> .h and include
>> it in their programs. llvm will compile the whole thing and will
>> include BTF
>> of maps, progs and this md struct in the target elf file.
>> During loading the kernel can check that BTF in elf is matching one-
>> to-one
>> to what driver+firmware are saying they support.

I do like the above idea of utilizing BTF for this; it seems like a good fit.

> Just thinking out loud, can't we do this at program load ? just run a
> setup function in the xdp program to load nic md BTF definition into
> the elf section ?
> 
>> No ambiguity and no possibility of mistake, since offsets and field
>> names
>> are verified.
> 
> But what about the dynamic nature of this feature ? Sometimes you only
> want HW/Driver to provide a subset of whatever the HW can provide and
> save md buffer for other stuff.
> 
> Yes a well defined format is favorable here, but we need to make sure
> there is no computational overhead in data path just to extract each
> field! for example if i want to know what is the offset of the hash
> will i need to go parse (for every packet) the whole BTF definition of
> metadata just to find the offset of type=hash ?

I don't think it would be the case that you'd need to walk BTF in the fast
path here. In the ideal case, the only thing the driver would need to do in
the fast path is set the proper xdp->data_meta offset, and _that_ would be
it. For the rest, the program would know how to access the data since it's
already aware of it from the BTF definition the driver provided. Other
drivers that are less flexible in that regard would internally prep the
buffer based on the prog's needs, more or less similar to
mlx5e_xdp_fill_data_meta(), but it would really be up to the driver how to
handle this internally. The driver would check the BTF at XDP setup time to
do whatever configuration is needed. The verifier would only check the BTF
and pass it along for XDP setup; prog rewrites in the verifier aren't even
needed since LLVM compiled everything already.

>> Every driver can have their own BTF for md and their own special
>> features.
>> We can try to standardize the names (like vlan and csum), so xdp
>> programs
>> can stay relatively portable across NICs.
> 
> Yes this is a must.

Agree, there needs to be a basic common set that would be provided by
every XDP aware driver.

>> Such api will address exposing asic+firmware metadata to the xdp
>> program.
>> Once we tackle this problem, we'll think how to do the backward
>> config
>> (to do firmware reconfig for specific BTF definition of md supplied
>> by the prog).
>> What people think?
> 
> For legacy HW, we can do it already in the driver, provide whatever the
> prog requested, its only a matter of translation to the BTF format in
> the driver xdp setup and pushing the values accordingly into the md
> offsets on data path.
> 
> Question: how can you share the md BTF from the driver/HW with the xdp
> program ?

I think this would likely be a new query, e.g. XDP_QUERY_META_BTF,
implemented in the ndo_bpf callback and then exported e.g. via bpf(2) or
netlink, such that bpftool can generate the BTF -> C from there for the
program to include later in compilation.
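
For illustration only, the driver side of such a query could look roughly
like the sketch below; XDP_QUERY_META_BTF and the md_btf members of
struct netdev_bpf do not exist today, and mlx5e_md_btf_raw()/_size() are
made-up helpers standing in for wherever the driver keeps its BTF blob:

static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
	switch (xdp->command) {
	case XDP_SETUP_PROG:
		return mlx5e_xdp_set(dev, xdp->prog);
	case XDP_QUERY_META_BTF:		/* hypothetical command */
		/* hand back the raw BTF blob describing the md layout */
		xdp->md_btf.data = mlx5e_md_btf_raw(dev);
		xdp->md_btf.size = mlx5e_md_btf_size(dev);
		return 0;
	default:
		return -EINVAL;
	}
}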

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support
  2018-06-27  2:46 ` [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support Saeed Mahameed
@ 2018-07-04  8:28   ` Daniel Borkmann
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Borkmann @ 2018-07-04  8:28 UTC (permalink / raw)
  To: Saeed Mahameed, Jesper Dangaard Brouer, Alexei Starovoitov,
	Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

On 06/27/2018 04:46 AM, Saeed Mahameed wrote:
[...]
> @@ -935,11 +958,16 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  		return false;
>  
>  	xdp.data = va + *rx_headroom;
> -	xdp_set_data_meta_invalid(&xdp);
>  	xdp.data_end = xdp.data + *len;
>  	xdp.data_hard_start = va;
>  	xdp.rxq = &rq->xdp_rxq;
>  
> +	if (rq->xdp.flags & XDP_FLAGS_META_ALL) {
> +		xdp_reset_data_meta(&xdp);
> +		mlx5e_xdp_fill_data_meta(rq->xdp.md_info, xdp.data_meta, cqe);
> +	} else
> +		xdp_set_data_meta_invalid(&xdp);
> +

Just a quick note on this one: it would actually be great not to call
xdp_set_data_meta_invalid() in the else path, as this meta buffer should
also be usable independently of hw hints. Meaning, in any case it would be
great if mlx5 + mlx4 could implement the xdp->data_meta support we have
today; this would probably be a good first step anyway. So far it is
supported on i40e, ixgbe, ixgbevf and nfp.

>  	act = bpf_prog_run_xdp(prog, &xdp);
>  	switch (act) {
>  	case XDP_PASS:
> 

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-04  7:51       ` Daniel Borkmann
@ 2018-07-05 17:18         ` Jakub Kicinski
  2018-07-06 16:30           ` Alexei Starovoitov
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-05 17:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Saeed Mahameed, alexei.starovoitov, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Wed, 4 Jul 2018 09:51:54 +0200, Daniel Borkmann wrote:
> On 07/04/2018 02:57 AM, Saeed Mahameed wrote:
> > On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:  
> >> How about we make driver+firmware provide a BTF definition of
> >> metadata that they
> >> can provide? There can be multiple definitions of such structs.
> >> Then in userpsace we can have BTF->plain C converter.
> >> (bpftool practically ready to do that already).
> >> Then the programmer can take such generated C definition, add it to
> >> .h and include
> >> it in their programs. llvm will compile the whole thing and will
> >> include BTF
> >> of maps, progs and this md struct in the target elf file.
> >> During loading the kernel can check that BTF in elf is matching one-
> >> to-one
> >> to what driver+firmware are saying they support.  
> 
> I do like the above idea of utilizing BTF for this, seems like a good fit.
>
> > Just thinking out loud, can't we do this at program load ? just run a
> > setup function in the xdp program to load nic md BTF definition into
> > the elf section ?
> >   
> >> No ambiguity and no possibility of mistake, since offsets and field
> >> names
> >> are verified.  
> > 
> > But what about the dynamic nature of this feature ? Sometimes you only
> > want HW/Driver to provide a subset of whatever the HW can provide and
> > save md buffer for other stuff.
> > 
> > Yes a well defined format is favorable here, but we need to make sure
> > there is no computational overhead in data path just to extract each
> > field! for example if i want to know what is the offset of the hash
> > will i need to go parse (for every packet) the whole BTF definition of
> > metadata just to find the offset of type=hash ?  
> 
> I don't think this would be the case that you'd need to walk BTF in fast
> path here. In the ideal case, the only thing that driver would need to do
> in fast path would be to set proper xdp->data_meta offset and _that_ would
> be it. For the rest, program would know how to access the data since it's
> already aware of it from BTF definition the driver provided. Other drivers
> which would be less flexible on that regard would internally prep the buffer
> based on the progs needs more or less similar as in mlx5e_xdp_fill_data_meta(),
> but it would be really up to the driver how to handle this internally. The
> BTF it would check at XDP setup time to do the configuration needed in the
> driver. Verifier would only check BTF, pass it along for XDP setup, prog
> rewrites in verifier aren't even needed since LLVM compiled everything
> already.

I don't think we should force drivers to place such meta data in the
buffer.  The moment that happens we lose the "zero-touch" abilities Jesper
was trying to achieve.

Besides, what happens to the meta data after XDP is finished?  We really
need the ability to communicate the modified fields further to the stack.
With meta data in the buffer we don't really know whether the information
placed there after XDP finishes is still valid or whether the program
overwrote it with something completely different.

I'm also not 100% on board with the argument that "future" FW can reshuffle
things whatever way it wants to.  Is the assumption that future ASICs/FW
will be designed to always use the "blessed" BTF format?  Or will it be
reconfigurable at runtime?

> >> Every driver can have their own BTF for md and their own special
> >> features.
> >> We can try to standardize the names (like vlan and csum), so xdp
> >> programs
> >> can stay relatively portable across NICs.  
> > 
> > Yes this is a must.  
> 
> Agree, there needs to be a basic common set that would be provided by
> every XDP aware driver.

I'm sorry to bring this up again, but can we really not let drivers
define their own "get_XYZ/set_XYZ" helpers, and link those to the
program at attachment time?  Sure we'd have to create a new copy of the
program for each driver it's used with, but is that really a problem?
That'd have the lowest impact on the performance and complexity of the
driver fast path.  The BTF solution already has all the same problems
WRT tail calls and not being sure which driver the program is attached
to.

> >> Such api will address exposing asic+firmware metadata to the xdp
> >> program.
> >> Once we tackle this problem, we'll think how to do the backward
> >> config
> >> (to do firmware reconfig for specific BTF definition of md supplied
> >> by the prog).
> >> What people think?  
> > 
> > For legacy HW, we can do it already in the driver, provide whatever the
> > prog requested, its only a matter of translation to the BTF format in
> > the driver xdp setup and pushing the values accordingly into the md
> > offsets on data path.
> > 
> > Question: how can you share the md BTF from the driver/HW with the xdp
> > program ?
> 
> I think this would likely be a new query as in XDP_QUERY_META_BTF
> implemented in ndo_bpf callback and then exported e.g. via bpf(2)
> or netlink such that bpftool can generate BTF -> C from there for the
> program to include later in compilation.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-05 17:18         ` Jakub Kicinski
@ 2018-07-06 16:30           ` Alexei Starovoitov
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
                               ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-06 16:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> 
> I'm also not 100% on board with the argument that "future" FW can
> reshuffle things whatever way it wants to.  Is the assumption that
> future ASICs/FW will be designed to always use the "blessed" BTF
> format?  Or will it be reconfigurable at runtime?

let's table configuration of metadata for a second.

Describing the metadata layout in BTF allows NICs to disclose everything
the NIC has to users in a standard and generic way.
Whether firmware is reconfigurable on the fly or has to be reflashed and
the hw powercycled to get a new md layout (and corresponding BTF
description) is a separate discussion.
Saeed's proposal introduces the concept of an 'offset' inside
'struct xdp_md_info' to reach the 'hash' value in the metadata.
Essentially it's a run-time way to access 'hash' instead of a build-time
one, so the bpf program would need two loads to read the csum or hash field
instead of one. With BTF the layout of the metadata is known to the program
at build time.

To reiterate the proposal:
- driver+firmware keep the layout of the metadata in BTF format (either in
  the driver, or the driver can read it from firmware)
- a 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the
  driver and generate a normal C header file based on the BTF in the given NIC
- the user does #include "md_desc.h" and the bpf program can access md->csum
  or md->hash with a direct single load out of the metadata area in front of
  the packet
- llvm compiles the bpf program and records how the program does these
  md->csum accesses in BTF format as well (the compiler will be keeping such
  records for __sk_buff and all other structs too, but that's a separate
  discussion)
- during sys_bpf(prog_load) the kernel checks (via the supplied BTF) that
  the way the program accesses the metadata (and other structs) matches the
  BTF from the driver, so there are no surprises if driver+firmware got
  updated but the program was not recompiled
- every NIC can have its own layout of metadata and its own meaning of the
  fields, but it would be good to standardize at least a few common fields
  like hash

Once this is working we can do more cool things with BTF, like doing offset
rewriting at program load time, similar to what we plan to do for tracing.
Tracing programs will be doing 'task->pid' accesses, and the kernel will
adjust offsetof(struct task_struct, pid) during program load depending on
the BTF for the kernel.
We can do the same trick for networking metadata: the program will contain
a load instruction for md->hash that gets automatically adjusted to the
proper offset depending on the BTF of the 'hash' field in the NIC.
For now I'm proposing _not_ to go that far with offset rewriting and to
start with the simple steps described above.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 16:30           ` Alexei Starovoitov
@ 2018-07-06 20:44             ` Waskiewicz Jr, Peter
  2018-07-06 23:38               ` Alexei Starovoitov
  2018-07-06 21:33             ` Jakub Kicinski
  2018-07-07  1:27             ` David Miller
  2 siblings, 1 reply; 36+ messages in thread
From: Waskiewicz Jr, Peter @ 2018-07-06 20:44 UTC (permalink / raw)
  To: alexei.starovoitov, jakub.kicinski
  Cc: Duyck, Alexander H, daniel, saeedm, brouer, borkmann, tariqt,
	john.fastabend, netdev, saeedm

On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > 
> > I'm also not 100% on board with the argument that "future" FW can
> > reshuffle things whatever way it wants to.  Is the assumption that
> > future ASICs/FW will be designed to always use the "blessed" BTF
> > format?  Or will it be reconfigurable at runtime?
> 
> let's table configuration of metadata aside for a second.

I agree that this should/could be NIC-specific and shouldn't weigh on
the metadata interface between the drivers and XDP layer.

> Describing metedata layout in BTF allows NICs to disclose everything
> NIC has to users in a standard and generic way.
> Whether firmware is reconfigurable on the fly or has to reflashed
> and hw powercycled to have new md layout (and corresponding BTF
> description)
> is a separate discussion.
> Saeed's proposal introduces the concept of 'offset' inside 'struct
> xdp_md_info'
> to reach 'hash' value in metadata.
> Essentially it's a run-time way to access 'hash' instead of build-
> time.
> So bpf program would need two loads to read csum or hash field
> instead of one.
> With BTF the layout of metadata is known to the program at build-
> time.
> 
> To reiterate the proposal:
> - driver+firmware keep layout of the metadata in BTF format (either
> in the driver
>   or driver can read it from firmware)
> - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> the driver and
>   generate normal C header file based on BTF in the given NIC
> - user does #include "md_desc.h" and bpf program can access md->csum
> or md->hash
>   with direct single load out of metadata area in front of the packet

This piece is where I'd like to discuss more.  When we discussed this in
Seoul, the initial proposal was a static struct whose common layout we'd
try to hammer out between the interested parties.  That obviously wasn't
going to scale, and we wanted to pursue something more dynamic.  But I
thought the goal was that the XDP/eBPF program wouldn't need to care what
the underlying device is, and could just ask for the metadata it's
interested in.  With this approach, your eBPF program is now bound/tied to
the NIC/driver, and if you switched to a different NIC/driver combo, you'd
have to rewrite part of your eBPF program to comprehend that.  I thought we
were trying to avoid that.

Our proposed approach (still working on something ready to RFC) is to
provide a method for the eBPF program to send a struct of requested hints
down to the driver on load.  If the driver can provide the hints, then
that'd be how they'd be laid out in the metadata.  If it can't provide
them, we'd probably reject the program load, or discuss providing a
software fallback (I know this is an area of contention).

I suppose we could get there with the rewriting mechanism described below,
but that'd be a tough sell to set a bit of ABI for metadata, then change it
to be potentially dynamic at runtime in the future.

-PJ

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 16:30           ` Alexei Starovoitov
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
@ 2018-07-06 21:33             ` Jakub Kicinski
  2018-07-06 23:42               ` Alexei Starovoitov
  2018-07-07  1:27             ` David Miller
  2 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-06 21:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > 
> > I'm also not 100% on board with the argument that "future" FW can
> > reshuffle things whatever way it wants to.  Is the assumption that
> > future ASICs/FW will be designed to always use the "blessed" BTF
> > format?  Or will it be reconfigurable at runtime?  
> 
> let's table configuration of metadata aside for a second.
> 
> Describing metedata layout in BTF allows NICs to disclose everything
> NIC has to users in a standard and generic way.
> Whether firmware is reconfigurable on the fly or has to reflashed
> and hw powercycled to have new md layout (and corresponding BTF description)
> is a separate discussion.
> Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> to reach 'hash' value in metadata.
> Essentially it's a run-time way to access 'hash' instead of build-time.
> So bpf program would need two loads to read csum or hash field instead of one.
> With BTF the layout of metadata is known to the program at build-time.
> 
> To reiterate the proposal:
> - driver+firmware keep layout of the metadata in BTF format (either in the driver
>   or driver can read it from firmware)
> - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
>   generate normal C header file based on BTF in the given NIC
> - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
>   with direct single load out of metadata area in front of the packet
> - llvm compiles bpf program and records how program is doing this md->csum accesses
>   in BTF format as well (the compiler will be keeping such records
>   for __sk_buff and all other structs too, but that's separate discussion)
> - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
>   accesses metadata (and other structs) matches BTF from the driver,
>   so no surprises if driver+firmware got updated, but program is not recompiled
> - every NIC can have their own layout of metadata and its own meaning of the fields,
>   but would be good to standardize at least a few common fields like hash

Can I expose HW descriptors this way, though, or is the proposal to
copy this data into the packet buffer?

> Once this is working we can do more cool things with BTF.
> Like doing offset rewriting at program load time similar to what we plan
> to do for tracing. Tracing programs will be doing 'task->pid' access
> and the kernel will adjust offsetof(struct task_struct, pid) during program load
> depending on BTF for the kernel.
> The same trick we can do for networking metadata.
> The program will contain load instruction md->hash that will get automatically
> adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> For now I'm proposing _not_ to go that far with offset rewriting and start
> with simple steps described above.

Why? :(  Could we please go with the rewrite/driver helpers instead of
impacting fast paths of the drivers yet again?  This rewrite should be
easier than task->pid, because we have the synthetic user space struct
xdp_md definition.

The flexibility and power of BPF rewrites/JITing is at our disposal to
deliver maximum performance.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
@ 2018-07-06 23:38               ` Alexei Starovoitov
  2018-07-06 23:49                 ` Waskiewicz Jr, Peter
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-06 23:38 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: jakub.kicinski, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > 
> > > I'm also not 100% on board with the argument that "future" FW can
> > > reshuffle things whatever way it wants to.  Is the assumption that
> > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > format?  Or will it be reconfigurable at runtime?
> > 
> > let's table configuration of metadata aside for a second.
> 
> I agree that this should/could be NIC-specific and shouldn't weigh on
> the metadata interface between the drivers and XDP layer.
> 
> > Describing metedata layout in BTF allows NICs to disclose everything
> > NIC has to users in a standard and generic way.
> > Whether firmware is reconfigurable on the fly or has to reflashed
> > and hw powercycled to have new md layout (and corresponding BTF
> > description)
> > is a separate discussion.
> > Saeed's proposal introduces the concept of 'offset' inside 'struct
> > xdp_md_info'
> > to reach 'hash' value in metadata.
> > Essentially it's a run-time way to access 'hash' instead of build-
> > time.
> > So bpf program would need two loads to read csum or hash field
> > instead of one.
> > With BTF the layout of metadata is known to the program at build-
> > time.
> > 
> > To reiterate the proposal:
> > - driver+firmware keep layout of the metadata in BTF format (either
> > in the driver
> >   or driver can read it from firmware)
> > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> > the driver and
> >   generate normal C header file based on BTF in the given NIC
> > - user does #include "md_desc.h" and bpf program can access md->csum
> > or md->hash
> >   with direct single load out of metadata area in front of the packet
> 
> This piece is where I'd like to discuss more.  When we discussed this
> in Seoul, the initial proposal was a static struct that we'd try to
> hammer out a common layout between the interested parties.  That
> obviously wasn't going to scale, and we wanted to pursue something more
> dynamic.  But I thought the goal was the XDP/eBPF program wouldn't want
> to care what the underlying device is, and could just ask for metadata
> that it's interested in.  With this approach, your eBPF program is now
> bound/tied to the NIC/driver, and if you switched to a different
> NIC/driver combo, then you'd have to rewrite part of your eBPF program
> to comprehend that.  I thought we were trying to avoid that.

It looks to me that NICs have a lot more differences than common pieces.
If we focus the architecture on making common things as generic
as possible we miss the bigger picture of covering distinct
and unique features that hw provides.

In the last email when I said "would be good to standardize at least a few common fields"
I meant that it makes sense only at the later phases.
I don't think we should put things into uapi upfront.
Otherwise we will end up with likely useless fields.
Like vlan field. Surely a lot of nics can store it into metadata, but why?
bpf program can easily read it from the packet.
What's the benefit of adding it to md?
This metadata concept we're discussing in the context of program acceleration.
If metadata doesn't give perf benefits it should not be done by
firmware, it should not be supported by the driver and certainly
should not be part of uapi.
Hence I'd like to see pure BTF description without any common/standard
fields across the drivers and figure out what (if anything) is useful
to accelerate xdp programs (and af_xdp in the future).

> Our proposed approach (still working on something ready to RFC) is to
> provide a method for the eBPF program to send a struct of requested
> hints down to the driver on load.  If the driver can provide the hints,
> then that'd be how they'd be laid out in the metadata.  If it can't
> provide them, we'd probably reject the program loading, or discuss
> providing a software fallback (I know this is an area of contention).

I don't think that approach can work.
My .02 is that the only useful acceleration feature out of intel nics is
an ability to parse the packet and report it to the xdp program
as a node id in the parse graph. As far as I know that is pretty unique
to intel. Configuration of that is a different matter.
Other things like hash are interesting, but we will quickly get
into rss spec definition if we go standardization route now instead
of later phases.
Therefore I'd rather let firmware+driver define its own BTF description
that says here is 'hash' of the packet hw can provide and it's fine
that this hash may be over different tuples depending on the nic
and even version of the firmware in that nic.
We need to see performance numbers and benefits for real world
xdp programs before drilling into specific semantics of the hash.
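
Purely for illustration, a hypothetical vendor-specific layout of the kind
such a per-NIC BTF description could expose; every name and value below is
invented, nothing in this thread defines them.

#include <linux/types.h>

struct vendorx_md {
	__u32 parse_node_id;	/* terminal node of the HW parse graph */
	__u32 rx_hash;		/* hash over a vendor-defined tuple */
};

#define VENDORX_NODE_IPV4_FRAG	42	/* made-up example value */

/* A DDoS-style program could branch on the parser result instead of
 * re-walking the headers in BPF. */
static inline int looks_like_frag_flood(const struct vendorx_md *md)
{
	return md->parse_node_id == VENDORX_NODE_IPV4_FRAG;
}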

> I suppose we could get there with the rewriting mechanism described
> below, but that'd be a tough sell to set a bit of ABI for metadata,
> then change it to be potentially dynamic at runtime in the future.

Hmm. I don't see how future offset rewriting will break abi.
It looks to me like a natural extension that wouldn't break any apps
that would be written for a specific nic with a given BTF description.
With rewriting in place some progs would become portable. That's all.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 21:33             ` Jakub Kicinski
@ 2018-07-06 23:42               ` Alexei Starovoitov
  2018-07-07  0:08                 ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-06 23:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > 
> > > I'm also not 100% on board with the argument that "future" FW can
> > > reshuffle things whatever way it wants to.  Is the assumption that
> > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > format?  Or will it be reconfigurable at runtime?  
> > 
> > let's table configuration of metadata aside for a second.
> > 
> > Describing metedata layout in BTF allows NICs to disclose everything
> > NIC has to users in a standard and generic way.
> > Whether firmware is reconfigurable on the fly or has to reflashed
> > and hw powercycled to have new md layout (and corresponding BTF description)
> > is a separate discussion.
> > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > to reach 'hash' value in metadata.
> > Essentially it's a run-time way to access 'hash' instead of build-time.
> > So bpf program would need two loads to read csum or hash field instead of one.
> > With BTF the layout of metadata is known to the program at build-time.
> > 
> > To reiterate the proposal:
> > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> >   or driver can read it from firmware)
> > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> >   generate normal C header file based on BTF in the given NIC
> > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> >   with direct single load out of metadata area in front of the packet
> > - llvm compiles bpf program and records how program is doing this md->csum accesses
> >   in BTF format as well (the compiler will be keeping such records
> >   for __sk_buff and all other structs too, but that's separate discussion)
> > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> >   accesses metadata (and other structs) matches BTF from the driver,
> >   so no surprises if driver+firmware got updated, but program is not recompiled
> > - every NIC can have their own layout of metadata and its own meaning of the fields,
> >   but would be good to standardize at least a few common fields like hash
> 
> Can I expose HW descriptors this way, though, or is the proposal to
> copy this data into the packet buffer?

That crossed my mind too. We can use BTF to describe HW descriptors too,
but I don't think it would buy us anything. AF_XDP approach is better.

> > Once this is working we can do more cool things with BTF.
> > Like doing offset rewriting at program load time similar to what we plan
> > to do for tracing. Tracing programs will be doing 'task->pid' access
> > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > depending on BTF for the kernel.
> > The same trick we can do for networking metadata.
> > The program will contain load instruction md->hash that will get automatically
> > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > For now I'm proposing _not_ to go that far with offset rewriting and start
> > with simple steps described above.
> 
> Why? :(  Could we please go with the rewrite/driver helpers instead of
> impacting fast paths of the drivers yet again?  This rewrite should be
> easier than task->pid, because we have the synthetic user space struct
> xdp_md definition.

I don't understand 'impact fast path yet again' concern.
If the NIC has certain metadata today, just describe what it has in BTF
and supply the actual per-packet md to the xdp program as-is.
No changes for the fast path at all.
Future rewriting will be done by the bpf/xdp core with zero
changes to the driver. All the driver has to do is to provide BTF.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:38               ` Alexei Starovoitov
@ 2018-07-06 23:49                 ` Waskiewicz Jr, Peter
  2018-07-07  0:40                   ` Jakub Kicinski
  2018-07-07  0:45                   ` Alexei Starovoitov
  0 siblings, 2 replies; 36+ messages in thread
From: Waskiewicz Jr, Peter @ 2018-07-06 23:49 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: Duyck, Alexander H, daniel, saeedm, brouer, borkmann, tariqt,
	john.fastabend, jakub.kicinski, netdev, saeedm

On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > 
> > > > I'm also not 100% on board with the argument that "future" FW
> > > > can
> > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > that
> > > > future ASICs/FW will be designed to always use the "blessed"
> > > > BTF
> > > > format?  Or will it be reconfigurable at runtime?
> > > 
> > > let's table configuration of metadata aside for a second.
> > 
> > I agree that this should/could be NIC-specific and shouldn't weigh
> > on
> > the metadata interface between the drivers and XDP layer.
> > 
> > > Describing metedata layout in BTF allows NICs to disclose
> > > everything
> > > NIC has to users in a standard and generic way.
> > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > and hw powercycled to have new md layout (and corresponding BTF
> > > description)
> > > is a separate discussion.
> > > Saeed's proposal introduces the concept of 'offset' inside
> > > 'struct
> > > xdp_md_info'
> > > to reach 'hash' value in metadata.
> > > Essentially it's a run-time way to access 'hash' instead of
> > > build-
> > > time.
> > > So bpf program would need two loads to read csum or hash field
> > > instead of one.
> > > With BTF the layout of metadata is known to the program at build-
> > > time.
> > > 
> > > To reiterate the proposal:
> > > - driver+firmware keep layout of the metadata in BTF format
> > > (either
> > > in the driver
> > >   or driver can read it from firmware)
> > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > query
> > > the driver and
> > >   generate normal C header file based on BTF in the given NIC
> > > - user does #include "md_desc.h" and bpf program can access md-
> > > >csum
> > > or md->hash
> > >   with direct single load out of metadata area in front of the
> > > packet
> > 
> > This piece is where I'd like to discuss more.  When we discussed
> > this
> > in Seoul, the initial proposal was a static struct that we'd try to
> > hammer out a common layout between the interested parties.  That
> > obviously wasn't going to scale, and we wanted to pursue something
> > more
> > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > want
> > to care what the underlying device is, and could just ask for
> > metadata
> > that it's interested in.  With this approach, your eBPF program is
> > now
> > bound/tied to the NIC/driver, and if you switched to a differen
> > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > program
> > to comprehend that.  I thought we were trying to avoid that.
> 
> It looks to me that NICs have a lot more differences instead of
> common pieces.
> If we focus the architecture on making common things as generic
> as possible we miss the bigger picture of covering distinct
> and unique features that hw provides.
> 
> In the last email when I said "would be good to standardize at least
> a few common fields"
> I meant that it make sense only at the later phases.
> I don't think we should put things into uapi upfront.
> Otherwise we will end up with likely useless fields.
> Like vlan field. Surely a lot of nics can store it into metadata, but
> why?
> bpf program can easily read it from the packet.
> What's the benefit of adding it to md?
> This metadata concept we're discussing in the context of program
> acceleration.
> If metadata doesn't give perf benefits it should not be done by
> firmware, it should not be supported by the driver and certainly
> should not be part of uapi.
> Hence I'd like to see pure BTF description without any
> common/standard
> fields across the drivers and figure out what (if anything) is useful
> to accelerate xdp programs (and af_xdp in the future).

I guess I didn't write my response very well before, apologies.  I
actually really like the BTF description and the dynamic-ness of it. 
It solves lots of problems and keeps things out of the UAPI.

What I was getting at was using the BTF description and dumping a
header file to be used in the program to be loaded.  If we have an
application developer include that header, it will be the Intel BTF
description, or Mellanox BTF description, etc.  I'm fine with that, but
it makes your program not as portable, since you'd need to get a
different header and recompile it if you change the NIC.  Unless I'm
missing something.
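
To make the portability concern concrete, a hedged sketch of what two
generated headers might look like; the layouts and names (BUILD_FOR_NIC_A,
struct xdp_hints, parser_flags) are invented, not defined anywhere in this
thread.

#include <linux/types.h>

/* Hypothetical contents of "md_desc.h" as dumped for two different NICs;
 * only one variant would ever be included into a given program build. */
#ifdef BUILD_FOR_NIC_A
struct xdp_hints {			/* layout as reported by NIC A */
	__u32 hash;
	__u16 csum;
	__u16 reserved;
};
#else
struct xdp_hints {			/* layout as reported by NIC B */
	__u16 csum;
	__u16 parser_flags;
	__u32 hash;			/* same field, different offset */
};
#endif

/* The program source is identical either way, but the offset compiled into
 * this load differs, which is what ties the object to one NIC until
 * load-time offset rewriting exists. */
static inline __u32 rx_hash(const void *data_meta)
{
	return ((const struct xdp_hints *)data_meta)->hash;
}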

> 
> > Our proposed approach (still working on something ready to RFC) is
> > to
> > provide a method for the eBPF program to send a struct of requested
> > hints down to the driver on load.  If the driver can provide the
> > hints,
> > then that'd be how they'd be laid out in the metadata.  If it can't
> > provide them, we'd probably reject the program loading, or discuss
> > providing a software fallback (I know this is an area of
> > contention).
> 
> I don't think that approach can work.
> My .02 is the only useful acceleration feature out if intel nics is
> an ability to parse the packet and report it to the xdp program
> as a node id in the parse graph. As far as I know that is pretty
> unique
> to intel. Configuration of that is a different matter.
> Other things like hash are interesting, but we will quickly get
> into rss spec definition if we go standardization route now instead
> of later phases.
> Therefore I'd rather let firmware+driver define its own BTF
> description
> that says here is 'hash' of the packet hw can provide and it's fine
> that this hash may be over different tuples depending on the nic
> and even version of the firmware in that nic.
> We need to see performance numbers and benefits for real world
> xdp programs before drilling into specific semantics of the hash.

I agree that "useful" fields should be targeted.  I can see hash
getting more interesting though if one can specify what parts of the n-
tuples get included in the hash, or if we want hashes for inner packets
or outer headers if things are tunneled.  But that can be revisited
down the road.

I also agree that changing the configuration of the underlying parser
or pipeline to generate the metadata is a completely separate
discussion.

> 
> > I suppose we could get there with the rewriting mechanism described
> > below, but that'd be a tough sell to set a bit of ABI for metadata,
> > then change it to be potentially dynamic at runtime in the future.
> 
> Hmm. I don't see how future offset rewriting will break abi.
> It looks to me as natural extension that wouldn't break any apps
> that would be written for specific nic with given BTF description.
> With rewritting in place some progs would become portable. That's
> all.

What I meant here was if anything was put into the UAPI, then changing
to something more dynamic might make transitioning more challenging. 
And I can see now what the rewriting can do to help my previous
thoughts on the BTF description locking a program to a specific NIC. 
But I'd need to play with it to be convinced about the portability.

-PJ

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:42               ` Alexei Starovoitov
@ 2018-07-07  0:08                 ` Jakub Kicinski
  2018-07-07  0:53                   ` Alexei Starovoitov
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  0:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:  
> > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > 
> > > > I'm also not 100% on board with the argument that "future" FW can
> > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > format?  Or will it be reconfigurable at runtime?    
> > > 
> > > let's table configuration of metadata aside for a second.
> > > 
> > > Describing metedata layout in BTF allows NICs to disclose everything
> > > NIC has to users in a standard and generic way.
> > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > is a separate discussion.
> > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > to reach 'hash' value in metadata.
> > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > So bpf program would need two loads to read csum or hash field instead of one.
> > > With BTF the layout of metadata is known to the program at build-time.
> > > 
> > > To reiterate the proposal:
> > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > >   or driver can read it from firmware)
> > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > >   generate normal C header file based on BTF in the given NIC
> > > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> > >   with direct single load out of metadata area in front of the packet
> > > - llvm compiles bpf program and records how program is doing this md->csum accesses
> > >   in BTF format as well (the compiler will be keeping such records
> > >   for __sk_buff and all other structs too, but that's separate discussion)
> > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > >   accesses metadata (and other structs) matches BTF from the driver,
> > >   so no surprises if driver+firmware got updated, but program is not recompiled
> > > - every NIC can have their own layout of metadata and its own meaning of the fields,
> > >   but would be good to standardize at least a few common fields like hash  
> > 
> > Can I expose HW descriptors this way, though, or is the proposal to
> > copy this data into the packet buffer?  
> 
> That crossed my mind too. We can use BTF to describe HW descriptors too,
> but I don't think it would buy us anything. AF_XDP approach is better.
>
> > > Once this is working we can do more cool things with BTF.
> > > Like doing offset rewriting at program load time similar to what we plan
> > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > depending on BTF for the kernel.
> > > The same trick we can do for networking metadata.
> > > The program will contain load instruction md->hash that will get automatically
> > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > For now I'm proposing _not_ to go that far with offset rewriting and start
> > > with simple steps described above.  
> > 
> > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > impacting fast paths of the drivers yet again?  This rewrite should be
> > easier than task->pid, because we have the synthetic user space struct
> > xdp_md definition.  
> 
> I don't understand 'impact fast path yet again' concern.
> If NIC has certain metadata today, just derscribe what it has in BTF
> and supply the actual per-packet md to xdp program as-is.
> No changes for fast path at all.
> Future rewritting will be done by the bpf/xdp core with zero
> changes to the driver. All driver has to do is to provide BTF.

I'm confused.  AFAIK most *existing* NICs have the metadata in the
"descriptor", i.e. not in the packet buffer.  So if the NIC just
describes what it has, and there is no data shuffling/copying
(performance) then we have to expose the descriptor, no?

In case of NFP half of the metadata is in the packet buffer, half in
the descriptor.

Maybe I'm conflating your proposal with Saeed's.  In your proposal
would we still use the overload of metadata prepend (as in struct
xdp_md::data_meta, bpf_xdp_adjust_meta() etc.) or are we adding a
pointer to a new driver-defined struct inside struct xdp_md?

Thanks for bearing with me!
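
For reference, a minimal sketch of the existing metadata-prepend model
referred to above.  struct my_hints is a placeholder for whatever layout a
driver/BTF description would provide; only xdp_md::data_meta,
bpf_xdp_adjust_meta() and the bounds-check pattern are existing pieces
(assuming libbpf's bpf_helpers.h).

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct my_hints {		/* placeholder layout */
	__u32 hash;
};

SEC("xdp")
int read_hints(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct my_hints *md = data_meta;

	/* Bounds check required by the verifier; also catches "no metadata". */
	if ((void *)(md + 1) > data)
		return XDP_PASS;

	return md->hash & 1 ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";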

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:49                 ` Waskiewicz Jr, Peter
@ 2018-07-07  0:40                   ` Jakub Kicinski
  2018-07-07  1:00                     ` Alexei Starovoitov
  2018-07-07  0:45                   ` Alexei Starovoitov
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  0:40 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: alexei.starovoitov, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, 6 Jul 2018 23:49:55 +0000, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:  
> > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:  
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW
> > > > > can
> > > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > > that
> > > > > future ASICs/FW will be designed to always use the "blessed"
> > > > > BTF
> > > > > format?  Or will it be reconfigurable at runtime?  
> > > > 
> > > > let's table configuration of metadata aside for a second.  
> > > 
> > > I agree that this should/could be NIC-specific and shouldn't weigh
> > > on
> > > the metadata interface between the drivers and XDP layer.
> > >   
> > > > Describing metedata layout in BTF allows NICs to disclose
> > > > everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF
> > > > description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside
> > > > 'struct
> > > > xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of
> > > > build-
> > > > time.
> > > > So bpf program would need two loads to read csum or hash field
> > > > instead of one.
> > > > With BTF the layout of metadata is known to the program at build-
> > > > time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format
> > > > (either
> > > > in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > > query
> > > > the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md-  
> > > > >csum  
> > > > or md->hash
> > > >   with direct single load out of metadata area in front of the
> > > > packet  
> > > 
> > > This piece is where I'd like to discuss more.  When we discussed
> > > this
> > > in Seoul, the initial proposal was a static struct that we'd try to
> > > hammer out a common layout between the interested parties.  That
> > > obviously wasn't going to scale, and we wanted to pursue something
> > > more
> > > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > > want
> > > to care what the underlying device is, and could just ask for
> > > metadata
> > > that it's interested in.  With this approach, your eBPF program is
> > > now
> > > bound/tied to the NIC/driver, and if you switched to a differen
> > > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > > program
> > > to comprehend that.  I thought we were trying to avoid that.  
> > 
> > It looks to me that NICs have a lot more differences instead of
> > common pieces.
> > If we focus the architecture on making common things as generic
> > as possible we miss the bigger picture of covering distinct
> > and unique features that hw provides.
> > 
> > In the last email when I said "would be good to standardize at least
> > a few common fields"
> > I meant that it make sense only at the later phases.
> > I don't think we should put things into uapi upfront.
> > Otherwise we will end up with likely useless fields.
> > Like vlan field. Surely a lot of nics can store it into metadata, but
> > why?

FWIW vlan is extracted from the packet because of cBPF uABI.  tcpdump
breaks badly if VLAN header is in the packet buffer.  It may not be
entirely without merit to expose the field for writing, perhaps that can
simplify some EVPN use cases?  But that's most likely off-topic.

> > bpf program can easily read it from the packet.
> > What's the benefit of adding it to md?
> > This metadata concept we're discussing in the context of program
> > acceleration.
> > If metadata doesn't give perf benefits it should not be done by
> > firmware, it should not be supported by the driver and certainly
> > should not be part of uapi.
> > Hence I'd like to see pure BTF description without any
> > common/standard
> > fields across the drivers and figure out what (if anything) is useful
> > to accelerate xdp programs (and af_xdp in the future).  
> 
> I guess I didn't write my response very well before, apologies.  I
> actually really like the BTF description and the dynamic-ness of it. 
> It solves lots of problems and keeps things out of the UAPI.
> 
> What I was getting at was using the BTF description and dumping a
> header file to be used in the program to be loaded.  If we have an
> application developer include that header, it will be the Intel BTF
> description, or Mellanox BTF description, etc.  I'm fine with that, but
> it makes your program not as portable, since you'd need to get a
> different header and recompile it if you change the NIC.  Unless I'm
> missing something.

Could we just say that at the start we expose only existing SKB fields
(csum, hash, mark) and the meaning of them is the same as in the SKB?
That covers a lot of use cases and provides clear semantics (at least
to the driver developer) while naturally allowing the fields to be
communicated from XDP to the stack with XDP_PASS?  It covers all Saeed did
in his RFC, for example.
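
A minimal sketch of what such an "existing SKB fields only" layout could
look like; the struct name and ordering are assumptions, only the field
meanings follow the SKB semantics referred to above.

#include <linux/types.h>

struct xdp_skb_hints {
	__u32 hash;	/* same meaning as skb->hash */
	__u32 mark;	/* same meaning as skb->mark, e.g. set by an offloaded TC rule */
	__u32 csum;	/* CHECKSUM_COMPLETE-style sum over the packet */
};

On XDP_PASS a driver could then transfer these straight into the
corresponding skb fields, which is the "communicate the fields from XDP to
the stack" point above.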

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:49                 ` Waskiewicz Jr, Peter
  2018-07-07  0:40                   ` Jakub Kicinski
@ 2018-07-07  0:45                   ` Alexei Starovoitov
  1 sibling, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  0:45 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: Duyck, Alexander H, daniel, saeedm, brouer, borkmann, tariqt,
	john.fastabend, jakub.kicinski, netdev, saeedm

On Fri, Jul 06, 2018 at 11:49:55PM +0000, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW
> > > > > can
> > > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > > that
> > > > > future ASICs/FW will be designed to always use the "blessed"
> > > > > BTF
> > > > > format?  Or will it be reconfigurable at runtime?
> > > > 
> > > > let's table configuration of metadata aside for a second.
> > > 
> > > I agree that this should/could be NIC-specific and shouldn't weigh
> > > on
> > > the metadata interface between the drivers and XDP layer.
> > > 
> > > > Describing metedata layout in BTF allows NICs to disclose
> > > > everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF
> > > > description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside
> > > > 'struct
> > > > xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of
> > > > build-
> > > > time.
> > > > So bpf program would need two loads to read csum or hash field
> > > > instead of one.
> > > > With BTF the layout of metadata is known to the program at build-
> > > > time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format
> > > > (either
> > > > in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > > query
> > > > the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md-
> > > > >csum
> > > > or md->hash
> > > >   with direct single load out of metadata area in front of the
> > > > packet
> > > 
> > > This piece is where I'd like to discuss more.  When we discussed
> > > this
> > > in Seoul, the initial proposal was a static struct that we'd try to
> > > hammer out a common layout between the interested parties.  That
> > > obviously wasn't going to scale, and we wanted to pursue something
> > > more
> > > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > > want
> > > to care what the underlying device is, and could just ask for
> > > metadata
> > > that it's interested in.  With this approach, your eBPF program is
> > > now
> > > bound/tied to the NIC/driver, and if you switched to a differen
> > > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > > program
> > > to comprehend that.  I thought we were trying to avoid that.
> > 
> > It looks to me that NICs have a lot more differences instead of
> > common pieces.
> > If we focus the architecture on making common things as generic
> > as possible we miss the bigger picture of covering distinct
> > and unique features that hw provides.
> > 
> > In the last email when I said "would be good to standardize at least
> > a few common fields"
> > I meant that it make sense only at the later phases.
> > I don't think we should put things into uapi upfront.
> > Otherwise we will end up with likely useless fields.
> > Like vlan field. Surely a lot of nics can store it into metadata, but
> > why?
> > bpf program can easily read it from the packet.
> > What's the benefit of adding it to md?
> > This metadata concept we're discussing in the context of program
> > acceleration.
> > If metadata doesn't give perf benefits it should not be done by
> > firmware, it should not be supported by the driver and certainly
> > should not be part of uapi.
> > Hence I'd like to see pure BTF description without any
> > common/standard
> > fields across the drivers and figure out what (if anything) is useful
> > to accelerate xdp programs (and af_xdp in the future).
> 
> I guess I didn't write my response very well before, apologies.  I
> actually really like the BTF description and the dynamic-ness of it. 
> It solves lots of problems and keeps things out of the UAPI.
> 
> What I was getting at was using the BTF description and dumping a
> header file to be used in the program to be loaded.  If we have an
> application developer include that header, it will be the Intel BTF
> description, or Mellanox BTF description, etc.  I'm fine with that, but
> it makes your program not as portable, since you'd need to get a
> different header and recompile it if you change the NIC.  Unless I'm
> missing something.

Correct. It makes programs not portable.
That is phase one to figure out what's useful.
Rewriting of offsets and figuring out which fields should be
standardized in uapi is a second phase to make those programs
actually portable.

Once it becomes obvious that feature X of vendor Y is useful
to accelerate a particular well-known production xdp application,
I suspect other vendors will find a way to tweak their firmware
to provide the same feature. Right now vendor <-> customer discussions
are somewhat baseless. Vendors believe that what they have in firmware
is useful, but sw owners maybe don't see it this way and don't have
a way to try it and cannot prove it one way or the other.
This will help us get unstuck: expose features that nics have
while not baking anything into uapi.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:08                 ` Jakub Kicinski
@ 2018-07-07  0:53                   ` Alexei Starovoitov
  2018-07-07  1:37                     ` David Miller
  2018-07-07  1:44                     ` Jakub Kicinski
  0 siblings, 2 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  0:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:  
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > format?  Or will it be reconfigurable at runtime?    
> > > > 
> > > > let's table configuration of metadata aside for a second.
> > > > 
> > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > > So bpf program would need two loads to read csum or hash field instead of one.
> > > > With BTF the layout of metadata is known to the program at build-time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> > > >   with direct single load out of metadata area in front of the packet
> > > > - llvm compiles bpf program and records how program is doing this md->csum accesses
> > > >   in BTF format as well (the compiler will be keeping such records
> > > >   for __sk_buff and all other structs too, but that's separate discussion)
> > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > > >   accesses metadata (and other structs) matches BTF from the driver,
> > > >   so no surprises if driver+firmware got updated, but program is not recompiled
> > > > - every NIC can have their own layout of metadata and its own meaning of the fields,
> > > >   but would be good to standardize at least a few common fields like hash  
> > > 
> > > Can I expose HW descriptors this way, though, or is the proposal to
> > > copy this data into the packet buffer?  
> > 
> > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > but I don't think it would buy us anything. AF_XDP approach is better.
> >
> > > > Once this is working we can do more cool things with BTF.
> > > > Like doing offset rewriting at program load time similar to what we plan
> > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > > depending on BTF for the kernel.
> > > > The same trick we can do for networking metadata.
> > > > The program will contain load instruction md->hash that will get automatically
> > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > > For now I'm proposing _not_ to go that far with offset rewriting and start
> > > > with simple steps described above.  
> > > 
> > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > easier than task->pid, because we have the synthetic user space struct
> > > xdp_md definition.  
> > 
> > I don't understand 'impact fast path yet again' concern.
> > If NIC has certain metadata today, just derscribe what it has in BTF
> > and supply the actual per-packet md to xdp program as-is.
> > No changes for fast path at all.
> > Future rewritting will be done by the bpf/xdp core with zero
> > changes to the driver. All driver has to do is to provide BTF.
> 
> I'm confused.  AFAIK most *existing* NICs have the metadata in the
> "descriptor", i.e. not in the packet buffer.  So if the NIC just
> describes what it has, and there is no data shuffling/copying
> (performance) then we have to expose the descriptor, no?

which piece of sw put that data into the descriptor?
I bet it's firmware. It could have stored it into pre-packet data, no?
I'd like to avoid _all_ copies.
Right now xdp program can only see a pointer to packet and pre-packet.
If we need another pointer to a piece of the packet descriptor,
that's also fine. Both pre-packet metadata and pieces of descriptor
can be described in BTF.
I'd like to push back on firmware folks that should be listening
to feedback from driver folks and kernel stack instead of saying
'here is hw spec that firmware provides'. Firmware is software.
It can change and should be open to change by the community
with proper maintainership.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:40                   ` Jakub Kicinski
@ 2018-07-07  1:00                     ` Alexei Starovoitov
  2018-07-07  1:20                       ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  1:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Waskiewicz Jr, Peter, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> 
> Could we just say that at the start we expose only existing SKB fields
> (csum, hash, mark) and the meaning of the is the same as in the SKB?

what would be the meaning of 'hash'? Over which fields?
Does it support inner and outer packets? How about udp encap (vxlan and friends)?
Same question for csum... tcp only? how about crc for sctp?
What is 'mark'?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  1:00                     ` Alexei Starovoitov
@ 2018-07-07  1:20                       ` Jakub Kicinski
  2018-07-07  2:38                         ` Alexei Starovoitov
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  1:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Waskiewicz Jr, Peter, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> > 
> > Could we just say that at the start we expose only existing SKB fields
> > (csum, hash, mark) and the meaning of the is the same as in the SKB?  
> 
> what would be the meaning of 'hash' ? Over which fields?
> Does it support inner and outer packets? How about udp encap (vxlan and friends) ?

We don't seem to need to answer that for the rest of the stack, no?  We
can expose the "hash type" field as well if that's *really* necessary.

> Same question of csum... tcp only? 

Shouldn't we just stick to CHECKSUM_COMPLETE?

> how about crc for sctp?

That's harder to answer.  Can cls_bpf access such info?

> What is 'mark' ?

Same thing it would be on an skb.  Most likely set with an offloaded TC
rule?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 16:30           ` Alexei Starovoitov
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
  2018-07-06 21:33             ` Jakub Kicinski
@ 2018-07-07  1:27             ` David Miller
  2 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2018-07-07  1:27 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: jakub.kicinski, daniel, saeedm, saeedm, alexander.h.duyck,
	netdev, tariqt, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Fri, 6 Jul 2018 09:30:42 -0700

> Like doing offset rewriting at program load time similar to what we plan
> to do for tracing. Tracing programs will be doing 'task->pid' access
> and the kernel will adjust offsetof(struct task_struct, pid) during program load
> depending on BTF for the kernel.
> The same trick we can do for networking metadata.
> The program will contain load instruction md->hash that will get automatically
> adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> For now I'm proposing _not_ to go that far with offset rewriting and start
> with simple steps described above.

+1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:53                   ` Alexei Starovoitov
@ 2018-07-07  1:37                     ` David Miller
  2018-07-07  1:44                     ` Jakub Kicinski
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2018-07-07  1:37 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: jakub.kicinski, daniel, saeedm, saeedm, alexander.h.duyck,
	netdev, tariqt, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Fri, 6 Jul 2018 17:53:18 -0700

> I'd like to push back on firmware folks that should be listening
> to feedback from driver folks and kernel stack instead of saying
> 'here is hw spec that firmware provides'. Firmware is software.
> It can change and should be open to change by the community
> with proper maintainership.

+1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:53                   ` Alexei Starovoitov
  2018-07-07  1:37                     ` David Miller
@ 2018-07-07  1:44                     ` Jakub Kicinski
  2018-07-07  2:51                       ` Alexei Starovoitov
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  1:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:  
> > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:  
> > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:    
> > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:    
> > > > > > 
> > > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > > format?  Or will it be reconfigurable at runtime?      
> > > > > 
> > > > > let's table configuration of metadata aside for a second.
> > > > > 
> > > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > > NIC has to users in a standard and generic way.
> > > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > > > is a separate discussion.
> > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > > > to reach 'hash' value in metadata.
> > > > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > > > So bpf program would need two loads to read csum or hash field instead of one.
> > > > > With BTF the layout of metadata is known to the program at build-time.
> > > > > 
> > > > > To reiterate the proposal:
> > > > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > > > >   or driver can read it from firmware)
> > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > > > >   generate normal C header file based on BTF in the given NIC
> > > > > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> > > > >   with direct single load out of metadata area in front of the packet
> > > > > - llvm compiles bpf program and records how program is doing this md->csum accesses
> > > > >   in BTF format as well (the compiler will be keeping such records
> > > > >   for __sk_buff and all other structs too, but that's separate discussion)
> > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > > > >   accesses metadata (and other structs) matches BTF from the driver,
> > > > >   so no surprises if driver+firmware got updated, but program is not recompiled
> > > > > - every NIC can have their own layout of metadata and its own meaning of the fields,
> > > > >   but would be good to standardize at least a few common fields like hash    
> > > > 
> > > > Can I expose HW descriptors this way, though, or is the proposal to
> > > > copy this data into the packet buffer?    
> > > 
> > > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > > but I don't think it would buy us anything. AF_XDP approach is better.
> > >  
> > > > > Once this is working we can do more cool things with BTF.
> > > > > Like doing offset rewriting at program load time similar to what we plan
> > > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > > > depending on BTF for the kernel.
> > > > > The same trick we can do for networking metadata.
> > > > > The program will contain load instruction md->hash that will get automatically
> > > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > > > For now I'm proposing _not_ to go that far with offset rewriting and start
> > > > > with simple steps described above.    
> > > > 
> > > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > > easier than task->pid, because we have the synthetic user space struct
> > > > xdp_md definition.    
> > > 
> > > I don't understand 'impact fast path yet again' concern.
> > > If NIC has certain metadata today, just derscribe what it has in BTF
> > > and supply the actual per-packet md to xdp program as-is.
> > > No changes for fast path at all.
> > > Future rewritting will be done by the bpf/xdp core with zero
> > > changes to the driver. All driver has to do is to provide BTF.  
> > 
> > I'm confused.  AFAIK most *existing* NICs have the metadata in the
> > "descriptor", i.e. not in the packet buffer.  So if the NIC just
> > describes what it has, and there is no data shuffling/copying
> > (performance) then we have to expose the descriptor, no?  
> 
> which piece of sw put that data into desciptor ?
> I bet it's firmware. It could have stored it into pre-packet data, no?
> I'd like to avoid _all_ copies.
> Right now xdp program can only see a pointer to packet and pre-packet.
> If we need another pointer to a piece of the packet descriptor,
> that's also fine. Both pre-packet metadata and pieces of descriptor
> can be described in BTF.

Okay, if we expose another pointer then it's possible to avoid copies.

But please keep in mind that descriptors are very compact, there are
a lot of interdependencies between fields, and the fields can shift
depending on the type of packet.  HW/FW guys always quote the 64B
packet performance as a reason why things can't be simple.  We can't
consume 50% of PCIe bandwidth to DMA the metadata alone.

> I'd like to push back on firmware folks that should be listening
> to feedback from driver folks and kernel stack instead of saying
> 'here is hw spec that firmware provides'. Firmware is software.
> It can change and should be open to change by the community
> with proper maintainership.

I think the way to influence FW/HW is to provide a strong and well
justified standard, and if we just expose raw HW data we are doing 
just the opposite.  We can claim the BTF a driver provides is not uABI, but
I will personally no longer feel comfortable with modifying descriptor
formats (and I've done that for the NFP in the past - grep for
chained_metadata_format).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  1:20                       ` Jakub Kicinski
@ 2018-07-07  2:38                         ` Alexei Starovoitov
  0 siblings, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  2:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Waskiewicz Jr, Peter, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, Jul 06, 2018 at 06:20:47PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> > > 
> > > Could we just say that at the start we expose only existing SKB fields
> > > (csum, hash, mark) and the meaning of the is the same as in the SKB?  
> > 
> > what would be the meaning of 'hash' ? Over which fields?
> > Does it support inner and outer packets? How about udp encap (vxlan and friends) ?
> 
> We don't seem to need to answer that for the rest of the stack, no?  We
> can expose the "hash type" field as well if that's *really* necessary.
> 
> > Same question of csum... tcp only? 
> 
> Shouldn't we just stick to CHECKSUM_COMPLETE?

I don't yet see how checksum_complete and 'hash just like stack'
help to accelerate xdp programs like ddos protection and load balancers.
'hash like stack' may help to accelerate networking stack itself,
but that's very different discussion. There is no xdp in such case.
No programs and no metadata. If some NICs can do
'hash like stack' and demonstrate performance improvements
for regular tcp/ip processing with netperf in user space
that would definitely be interesting and standardizing such
hash from the drivers into upper layers would warrant its own patches,
but seems orthogonal to this xdp hints discussion.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  1:44                     ` Jakub Kicinski
@ 2018-07-07  2:51                       ` Alexei Starovoitov
  0 siblings, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  2:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, Jul 06, 2018 at 06:44:24PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> > > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:  
> > > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:  
> > > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:    
> > > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:    
> > > > > > > 
> > > > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > > > format?  Or will it be reconfigurable at runtime?      
> > > > > > 
> > > > > > let's table configuration of metadata aside for a second.
> > > > > > 
> > > > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > > > NIC has to users in a standard and generic way.
> > > > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > > > > is a separate discussion.
> > > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > > > > to reach 'hash' value in metadata.
> > > > > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > > > > So bpf program would need two loads to read csum or hash field instead of one.
> > > > > > With BTF the layout of metadata is known to the program at build-time.
> > > > > > 
> > > > > > To reiterate the proposal:
> > > > > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > > > > >   or driver can read it from firmware)
> > > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > > > > >   generate normal C header file based on BTF in the given NIC
> > > > > > - user does #include "md_desc.h" and the bpf program can access md->csum or md->hash
> > > > > >   with a direct single load out of the metadata area in front of the packet
> > > > > > - llvm compiles the bpf program and records how the program does these md->csum accesses
> > > > > >   in BTF format as well (the compiler will be keeping such records
> > > > > >   for __sk_buff and all other structs too, but that's a separate discussion)
> > > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > > > > >   accesses metadata (and other structs) matches the BTF from the driver,
> > > > > >   so there are no surprises if driver+firmware got updated but the program was not recompiled
> > > > > > - every NIC can have its own layout of metadata and its own meaning of the fields,
> > > > > >   but it would be good to standardize at least a few common fields like hash    
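
As a rough sketch of what that flow could look like from the program
author's side: the field names, widths and ordering below are invented for
the example; in the proposal they would come from the NIC's BTF via the
bpftool command above.

#include <linux/bpf.h>

/* md_desc.h -- hypothetical output of
 * 'bpftool read-metadata-desc eth0 > md_desc.h' for some NIC.
 */
struct xdp_md_desc {
	__u32 hash;
	__u16 csum;
	__u16 vlan_tci;
};

/* Program side: every field is one direct load at a build-time-known
 * offset from data_meta.
 */
__attribute__((section("xdp"), used))
int use_hints(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct xdp_md_desc *md = data_meta;

	if ((void *)(md + 1) > data)	/* hints missing or truncated */
		return XDP_PASS;

	__u32 hash = md->hash;		/* single load */
	__u16 csum = md->csum;		/* single load */
	/* ... use hash/csum ... */
	return XDP_PASS;
}
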
> > > > > 
> > > > > Can I expose HW descriptors this way, though, or is the proposal to
> > > > > copy this data into the packet buffer?    
> > > > 
> > > > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > > > but I don't think it would buy us anything. AF_XDP approach is better.
> > > >  
> > > > > > Once this is working we can do more cool things with BTF.
> > > > > > Like doing offset rewriting at program load time similar to what we plan
> > > > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > > > > depending on BTF for the kernel.
> > > > > > The same trick we can do for networking metadata.
> > > > > > The program will contain a load instruction for md->hash that will get automatically
> > > > > > adjusted to the proper offset depending on the BTF of the 'hash' field in the NIC.
> > > > > > For now I'm proposing _not_ to go that far with offset rewriting and to start
> > > > > > with the simple steps described above.    
> > > > > 
> > > > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > > > easier than task->pid, because we have the synthetic user space struct
> > > > > xdp_md definition.    
> > > > 
> > > > I don't understand the 'impact fast path yet again' concern.
> > > > If the NIC has certain metadata today, just describe what it has in BTF
> > > > and supply the actual per-packet md to the xdp program as-is.
> > > > No changes to the fast path at all.
> > > > Future rewriting will be done by the bpf/xdp core with zero
> > > > changes to the driver. All the driver has to do is provide BTF.  
> > > 
> > > I'm confused.  AFAIK most *existing* NICs have the metadata in the
> > > "descriptor", i.e. not in the packet buffer.  So if the NIC just
> > > describes what it has, and there is no data shuffling/copying
> > > (performance) then we have to expose the descriptor, no?  
> > 
> > Which piece of sw put that data into the descriptor?
> > I bet it's firmware. It could have stored it in the pre-packet data, no?
> > I'd like to avoid _all_ copies.
> > Right now the xdp program can only see pointers to the packet and the pre-packet area.
> > If we need another pointer to a piece of the packet descriptor,
> > that's also fine. Both pre-packet metadata and pieces of the descriptor
> > can be described in BTF.
> 
> Okay, if we expose another pointer then it's possible to avoid copies.
> 
> But please keep in mind that descriptors are very compact, there are
> a lot of interdependencies between fields, and the fields can shift
> depending on the type of packet.  HW/FW guys always quote the 64B
> packet performance as a reason why things can't be simple.  We can't
> consume 50% of PCIe bandwidth to DMA the metadata alone.

Certainly. With an l4 load balancer we're after short-packet performance too.

> > I'd like to push back on firmware folks, who should be listening
> > to feedback from driver folks and the kernel stack instead of saying
> > 'here is the hw spec that firmware provides'. Firmware is software.
> > It can change and should be open to change by the community
> > with proper maintainership.
> 
> I think the way to influence FW/HW is to provide a strong and
> well-justified standard, and if we just expose raw HW data we are doing
> just the opposite.  We can claim the BTF a driver provides is not uABI, but
> I will personally no longer feel comfortable with modifying descriptor
> formats (and I've done that for the NFP in the past - grep for
> chained_metadata_format).

I think AF_XDP is doing what you're asking for. It defined a standard
descriptor format. Metadata for the packet is use-case specific.
What is useful for an l4 load balancer could be a waste of pcie bandwidth
for regular tcp/ip. I'm not proposing to have it always on.

What I'm proposing with BTF is that, instead of doing nfp_net_parse_meta()
for this chained format in the nfp driver, we expose it to the xdp program.
Let the program deal with it when it wants to. Would that work?
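
For the sake of discussion, a rough sketch of what "let the program deal
with it" could look like for a chained format. The layout below (a leading
32-bit word of packed 4-bit type tags followed by one 32-bit value per tag)
only approximates what nfp_net_parse_meta() handles today, and all constants
and names are invented for the example; endianness handling is also elided:

#include <linux/bpf.h>

/* invented for the sketch -- not the real NFP definitions */
#define META_FIELD_SIZE	4
#define META_FIELD_MASK	0xf
#define META_TYPE_HASH	1
#define META_TYPE_MARK	2

__attribute__((section("xdp"), used))
int parse_chained_meta(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	__u32 *word = data_meta;
	__u32 types, hash = 0, mark = 0;
	int i;

	if ((void *)(word + 1) > data)
		return XDP_PASS;
	types = *word++;		/* leading word: packed type tags */

#pragma unroll				/* keep the loop bounded for the verifier */
	for (i = 0; i < 8; i++) {
		if (!types || (void *)(word + 1) > data)
			break;
		switch (types & META_FIELD_MASK) {
		case META_TYPE_HASH:
			hash = *word;
			break;
		case META_TYPE_MARK:
			mark = *word;
			break;
		default:
			break;		/* unknown field: skip its value */
		}
		word++;
		types >>= META_FIELD_SIZE;
	}
	/* ... use hash/mark ... */
	return XDP_PASS;
}
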


end of thread

Thread overview: 36+ messages
2018-06-27  2:46 [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP) Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 1/6] net: xdp: Add support for meta data flags requests Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure Saeed Mahameed
2018-06-27 14:15   ` Jesper Dangaard Brouer
2018-06-27 17:55     ` Saeed Mahameed
2018-07-02  8:01       ` Daniel Borkmann
2018-07-03 23:52         ` Saeed Mahameed
2018-07-03 23:01   ` Alexei Starovoitov
2018-07-04  0:57     ` Saeed Mahameed
2018-07-04  7:51       ` Daniel Borkmann
2018-07-05 17:18         ` Jakub Kicinski
2018-07-06 16:30           ` Alexei Starovoitov
2018-07-06 20:44             ` Waskiewicz Jr, Peter
2018-07-06 23:38               ` Alexei Starovoitov
2018-07-06 23:49                 ` Waskiewicz Jr, Peter
2018-07-07  0:40                   ` Jakub Kicinski
2018-07-07  1:00                     ` Alexei Starovoitov
2018-07-07  1:20                       ` Jakub Kicinski
2018-07-07  2:38                         ` Alexei Starovoitov
2018-07-07  0:45                   ` Alexei Starovoitov
2018-07-06 21:33             ` Jakub Kicinski
2018-07-06 23:42               ` Alexei Starovoitov
2018-07-07  0:08                 ` Jakub Kicinski
2018-07-07  0:53                   ` Alexei Starovoitov
2018-07-07  1:37                     ` David Miller
2018-07-07  1:44                     ` Jakub Kicinski
2018-07-07  2:51                       ` Alexei Starovoitov
2018-07-07  1:27             ` David Miller
2018-06-27  2:46 ` [RFC bpf-next 3/6] net/mlx5e: Store xdp flags and meta data info Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 4/6] net/mlx5e: Pass CQE to RX handlers Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support Saeed Mahameed
2018-07-04  8:28   ` Daniel Borkmann
2018-06-27  2:46 ` [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu Saeed Mahameed
2018-06-27 10:59   ` Jesper Dangaard Brouer
2018-06-27 18:04     ` Saeed Mahameed
2018-06-27 16:42 ` [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP) Parikh, Neerav
