* [RFC net-next PATCH 0/5] XDP driver feature API and handling change to xdp_buff
@ 2017-05-18 15:41 Jesper Dangaard Brouer
  2017-05-18 15:41 ` [RFC net-next PATCH 1/5] samples/bpf: xdp_tx_iptunnel make use of map_data[] Jesper Dangaard Brouer
                   ` (4 more replies)
  0 siblings, 5 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-18 15:41 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, Jesper Dangaard Brouer

I would like some comments on introducing a feature API between XDP
drivers and the XDP/BPF core.  The primary issue is that, when
extending struct xdp_buff today, drivers not implementing a given
feature can expose uninitialized memory to the bpf helper associated
with that feature.
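
To illustrate the hazard, here is a minimal sketch (not code from
these patches) of a driver RX path that predates a new xdp_buff
member:

	/* xdp_buff lives on the driver's stack */
	struct xdp_buff xdp;

	xdp.data = va + rx_headroom;
	xdp.data_end = xdp.data + len;
	xdp.data_hard_start = va;
	/* a newly added member, e.g. xdp.rxhash, is never written
	 * here, so a bpf helper reading it would see stack garbage
	 */
	act = bpf_prog_run_xdp(prog, &xdp);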

---

Jesper Dangaard Brouer (5):
      samples/bpf: xdp_tx_iptunnel make use of map_data[]
      mlx5: fix bug reading rss_hash_type from CQE
      net: introduce XDP driver features interface
      net: new XDP feature for reading HW rxhash from drivers
      mlx5: add XDP rxhash feature for driver mlx5


 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   98 ++++++++++++++-------
 include/linux/filter.h                            |   31 ++++++-
 include/linux/mlx5/device.h                       |   10 ++
 include/linux/netdev_features.h                   |   34 +++++++
 include/linux/netdevice.h                         |    1 
 include/uapi/linux/bpf.h                          |   56 ++++++++++++
 kernel/bpf/verifier.c                             |    3 +
 net/core/dev.c                                    |   48 ++++++++++
 net/core/ethtool.c                                |    2 
 net/core/filter.c                                 |   73 ++++++++++++++++
 samples/bpf/bpf_helpers.h                         |    2 
 samples/bpf/xdp_tx_iptunnel_common.h              |    2 
 samples/bpf/xdp_tx_iptunnel_kern.c                |    2 
 samples/bpf/xdp_tx_iptunnel_user.c                |   14 ++-
 tools/include/uapi/linux/bpf.h                    |   10 ++
 16 files changed, 345 insertions(+), 44 deletions(-)


* [RFC net-next PATCH 1/5] samples/bpf: xdp_tx_iptunnel make use of map_data[]
  2017-05-18 15:41 [RFC net-next PATCH 0/5] XDP driver feature API and handling change to xdp_buff Jesper Dangaard Brouer
@ 2017-05-18 15:41 ` Jesper Dangaard Brouer
  2017-05-19 15:45   ` Daniel Borkmann
  2017-05-18 15:41 ` [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-18 15:41 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, Jesper Dangaard Brouer

There is no reason to use a compile-time constant MAX_IPTNL_ENTRIES
shared between _user.c and _kern.c, when map_data[].def.max_entries
can tell us dynamically what max_entries the bpf loader used when
creating the ELF map.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 samples/bpf/xdp_tx_iptunnel_common.h |    2 --
 samples/bpf/xdp_tx_iptunnel_kern.c   |    2 +-
 samples/bpf/xdp_tx_iptunnel_user.c   |   14 +++++++++-----
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/samples/bpf/xdp_tx_iptunnel_common.h b/samples/bpf/xdp_tx_iptunnel_common.h
index dd12cc35110f..b065699cacb5 100644
--- a/samples/bpf/xdp_tx_iptunnel_common.h
+++ b/samples/bpf/xdp_tx_iptunnel_common.h
@@ -9,8 +9,6 @@
 
 #include <linux/types.h>
 
-#define MAX_IPTNL_ENTRIES 256U
-
 struct vip {
 	union {
 		__u32 v6[4];
diff --git a/samples/bpf/xdp_tx_iptunnel_kern.c b/samples/bpf/xdp_tx_iptunnel_kern.c
index 0f4f6e8c8611..b19489eb3c22 100644
--- a/samples/bpf/xdp_tx_iptunnel_kern.c
+++ b/samples/bpf/xdp_tx_iptunnel_kern.c
@@ -30,7 +30,7 @@ struct bpf_map_def SEC("maps") vip2tnl = {
 	.type = BPF_MAP_TYPE_HASH,
 	.key_size = sizeof(struct vip),
 	.value_size = sizeof(struct iptnl_info),
-	.max_entries = MAX_IPTNL_ENTRIES,
+	.max_entries = 256,
 };
 
 static __always_inline void count_tx(u32 protocol)
diff --git a/samples/bpf/xdp_tx_iptunnel_user.c b/samples/bpf/xdp_tx_iptunnel_user.c
index 92b8bde9337c..0500a5cc75c4 100644
--- a/samples/bpf/xdp_tx_iptunnel_user.c
+++ b/samples/bpf/xdp_tx_iptunnel_user.c
@@ -123,11 +123,6 @@ static int parse_ports(const char *port_str, int *min_port, int *max_port)
 		return 1;
 	}
 
-	if (tmp_max_port - tmp_min_port + 1 > MAX_IPTNL_ENTRIES) {
-		fprintf(stderr, "Port range (%s) is larger than %u\n",
-			port_str, MAX_IPTNL_ENTRIES);
-		return 1;
-	}
 	*min_port = tmp_min_port;
 	*max_port = tmp_max_port;
 
@@ -142,6 +137,7 @@ int main(int argc, char **argv)
 	int min_port = 0, max_port = 0;
 	struct iptnl_info tnl = {};
 	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	unsigned int entries, max_entries;
 	struct vip vip = {};
 	char filename[256];
 	int opt;
@@ -238,6 +234,14 @@ int main(int argc, char **argv)
 		return 1;
 	}
 
+	entries = max_port - min_port + 1;
+	max_entries = map_data[1].def.max_entries;
+	if (entries > max_entries) {
+		fprintf(stderr, "Req port entries (%u) is larger than max %u\n",
+			entries, max_entries);
+		return 1;
+	}
+
 	signal(SIGINT, int_exit);
 
 	while (min_port <= max_port) {


* [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE
  2017-05-18 15:41 [RFC net-next PATCH 0/5] XDP driver feature API and handling change to xdp_buff Jesper Dangaard Brouer
  2017-05-18 15:41 ` [RFC net-next PATCH 1/5] samples/bpf: xdp_tx_iptunnel make use of map_data[] Jesper Dangaard Brouer
@ 2017-05-18 15:41 ` Jesper Dangaard Brouer
  2017-05-19 15:47   ` Daniel Borkmann
  2017-05-19 23:38   ` David Miller
  2017-05-18 15:41 ` [RFC net-next PATCH 3/5] net: introduce XDP driver features interface Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-18 15:41 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, Jesper Dangaard Brouer

Masks for extracting parts of the Completion Queue Entry (CQE)
field rss_hash_type were swapped, namely CQE_RSS_HTYPE_IP and
CQE_RSS_HTYPE_L4.

The bug resulted in setting skb->l4_hash, even though the
rss_hash_type indicated that the hash was NOT computed over the
L4 (UDP or TCP) part of the packet.

Add comments from the datasheet, to make it clearer what
these masks are selecting.
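
For illustration (this worked example is not part of the patch):
take rss_hash_type = 0x04, meaning the hash was computed over the
IPv4 addresses only (bits[3:2] = 01) and not over any L4 header
(bits[7:6] = 00).  The swapped masks made the old code take the L4
branch:

	u8 cht = 0x04;	/* hash over IPv4 addresses, no L4 hash */

	/* old, swapped mask: CQE_RSS_HTYPE_L4 was 0x3 << 2 */
	bool l4_old = cht & (0x3 << 2);	/* true: skb->l4_hash wrongly set */

	/* corrected mask: CQE_RSS_HTYPE_L4 is 0x3 << 6 */
	bool l4_new = cht & (0x3 << 6);	/* false: falls back to PKT_HASH_TYPE_L3 */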

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/mlx5/device.h |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index dd9a263ed368..a940ec6a046c 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -787,8 +787,14 @@ enum {
 };
 
 enum {
-	CQE_RSS_HTYPE_IP	= 0x3 << 6,
-	CQE_RSS_HTYPE_L4	= 0x3 << 2,
+	CQE_RSS_HTYPE_IP	= 0x3 << 2,
+	/* cqe->rss_hash_type[3:2] - IP destination selected for hash
+	 * (00 = none,  01 = IPv4, 10 = IPv6, 11 = Reserved)
+	 */
+	CQE_RSS_HTYPE_L4	= 0x3 << 6,
+	/* cqe->rss_hash_type[7:6] - L4 destination selected for hash
+	 * (00 = none, 01 = TCP, 10 = UDP, 11 = IPSEC.SPI)
+	 */
 };
 
 enum {


* [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-18 15:41 [RFC net-next PATCH 0/5] XDP driver feature API and handling change to xdp_buff Jesper Dangaard Brouer
  2017-05-18 15:41 ` [RFC net-next PATCH 1/5] samples/bpf: xdp_tx_iptunnel make use of map_data[] Jesper Dangaard Brouer
  2017-05-18 15:41 ` [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE Jesper Dangaard Brouer
@ 2017-05-18 15:41 ` Jesper Dangaard Brouer
  2017-05-19 17:13   ` Daniel Borkmann
  2017-05-18 15:41 ` [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers Jesper Dangaard Brouer
  2017-05-18 15:41 ` [RFC net-next PATCH 5/5] mlx5: add XDP rxhash feature for driver mlx5 Jesper Dangaard Brouer
  4 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-18 15:41 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, Jesper Dangaard Brouer

There is a fundamental difference between normal eBPF programs
and (XDP) eBPF programs that get attached in a driver. For normal
eBPF programs it is easy to add a new bpf feature, like a bpf
helper, because it is strongly tied to the feature being
available in the current core kernel code.  When drivers invoke a
bpf_prog, it is not sufficient to simply rely on whether a
bpf_helper exists or not.  When a driver hasn't implemented a
given feature yet, it is possible to expose uninitialized parts
of xdp_buff.  The driver passes in a pointer to xdp_buff, usually
"allocated" on the stack, which must not be exposed.

Only two user-visible NETIF_F_XDP_* net_device feature flags are
exposed via ethtool (-k), shown as "xdp" and "xdp-partial".
The "xdp-partial" flag is set when there is no feature equality
between kernel and driver, and a netdev_warn is issued.

The idea is that XDP_DRV_* feature bits define a contract between
the driver and the kernel, giving a reliable way to know which XDP
features a driver has promised to implement, and thus which bpf
side features are safe to allow.

There are 3 levels of features: "required", "devel" and "optional".

The motivation is pushing driver vendors forward to support all
the new XDP features.  Once a given feature bit is moved into
the "required" features, the kernel will reject loading an XDP
program if the feature isn't implemented by the driver.  Features
under development require help from the bpf infrastructure to
detect when a given helper or direct access is used, using a
bpf_prog bit to mark a need for the feature and pulling this bit
into xdp_features_check().  When all drivers have implemented a
"devel" feature, it can be moved to the "required" features and
the bpf_prog bit can be reused.  The "optional" features are for
things that are handled safely at runtime, but drivers will still
get flagged as "xdp-partial" if they don't implement those.
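
As a sketch of the driver-side contract (this mirrors what patch 5/5
does for mlx5; the feature bit itself arrives in patch 4/5, and the
driver function name here is invented):

	static void mydrv_build_netdev(struct net_device *netdev)
	{
		/* Advertise a "devel" feature bit.  The required
		 * XDP_DRV_F_ENABLED bit is added by register_netdevice()
		 * when ->ndo_xdp is implemented.
		 */
		netdev->xdp_features = XDP_DRV_F_RXHASH;
	}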
---
 include/linux/netdev_features.h |   32 ++++++++++++++++++++++++++++++++
 include/linux/netdevice.h       |    1 +
 net/core/dev.c                  |   34 ++++++++++++++++++++++++++++++++++
 net/core/ethtool.c              |    2 ++
 4 files changed, 69 insertions(+)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 1d4737cffc71..ff81ee231410 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -77,6 +77,8 @@ enum {
 	NETIF_F_HW_ESP_BIT,		/* Hardware ESP transformation offload */
 	NETIF_F_HW_ESP_TX_CSUM_BIT,	/* ESP with TX checksum offload */
 
+	NETIF_F_XDP_BASELINE_BIT,	/* Driver supports XDP */
+	NETIF_F_XDP_PARTIAL_BIT,	/* not supporting all XDP features */
 	/*
 	 * Add your fresh new feature above and remember to update
 	 * netdev_features_strings[] in net/core/ethtool.c and maybe
@@ -140,6 +142,8 @@ enum {
 #define NETIF_F_HW_TC		__NETIF_F(HW_TC)
 #define NETIF_F_HW_ESP		__NETIF_F(HW_ESP)
 #define NETIF_F_HW_ESP_TX_CSUM	__NETIF_F(HW_ESP_TX_CSUM)
+#define NETIF_F_XDP_BASELINE	__NETIF_F(XDP_BASELINE)
+#define NETIF_F_XDP_PARTIAL	__NETIF_F(XDP_PARTIAL)
 
 #define for_each_netdev_feature(mask_addr, bit)	\
 	for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
@@ -212,4 +216,32 @@ enum {
 				 NETIF_F_GSO_UDP_TUNNEL |		\
 				 NETIF_F_GSO_UDP_TUNNEL_CSUM)
 
+/* XDP driver flags */
+enum {
+	XDP_DRV_F_ENABLED_BIT,
+};
+
+#define __XDP_DRV_F_BIT(bit)	((netdev_features_t)1 << (bit))
+#define __XDP_DRV_F(name)	__XDP_DRV_F_BIT(XDP_DRV_F_##name##_BIT)
+#define XDP_DRV_F_ENABLED	__XDP_DRV_F(ENABLED)
+
+/* XDP driver MUST support these features, else kernel MUST reject
+ * bpf_prog to guarantee safe access to data structures
+ */
+#define XDP_DRV_FEATURES_REQUIRED	XDP_DRV_F_ENABLED
+
+/* Some XDP features are under development. Based on bpf_prog loading
+ * detect if kernel feature can be activated.
+ */
+#define XDP_DRV_FEATURES_DEVEL		0
+
+/* Some XDP features are optional, like action return code, as they
+ * are handled safely runtime.
+ */
+#define XDP_DRV_FEATURES_OPTIONAL	0
+
+#define XDP_DRV_FEATURES_MASK		(XDP_DRV_FEATURES_REQUIRED |	\
+					 XDP_DRV_FEATURES_DEVEL |	\
+					 XDP_DRV_FEATURES_OPTIONAL)
+
 #endif	/* _LINUX_NETDEV_FEATURES_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9c23bd2efb56..329ae156ff65 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1685,6 +1685,7 @@ struct net_device {
 	netdev_features_t	hw_enc_features;
 	netdev_features_t	mpls_features;
 	netdev_features_t	gso_partial_features;
+	netdev_features_t	xdp_features;
 
 	int			ifindex;
 	int			group;
diff --git a/net/core/dev.c b/net/core/dev.c
index 35a06cebb282..b4af5fbbd9da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6851,6 +6851,25 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
+bool xdp_features_check(struct net_device *dev, struct bpf_prog *xdp_prog,
+			struct netlink_ext_ack *extack, u32 flags)
+{
+	netdev_features_t req_features = XDP_DRV_FEATURES_REQUIRED;
+	netdev_features_t dev_xdp_features;
+
+	/* Generic XDP naturally supports all features */
+	if (flags & XDP_FLAGS_SKB_MODE)
+		return true;
+
+	dev_xdp_features = dev->xdp_features & XDP_DRV_FEATURES_MASK;
+	if (req_features & ~dev_xdp_features) {
+		NL_SET_ERR_MSG(extack,
+			       "Required XDP feature not supported by device");
+		return false;
+	}
+	return true;
+}
+
 /**
  *	dev_change_xdp_fd - set or clear a bpf program for a device rx path
  *	@dev: device
@@ -6890,6 +6909,11 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_XDP);
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
+
+		if (!xdp_features_check(dev, prog, extack, flags)) {
+			bpf_prog_put(prog);
+			return -EOPNOTSUPP;
+		}
 	}
 
 	memset(&xdp, 0, sizeof(xdp));
@@ -7402,6 +7426,16 @@ int register_netdevice(struct net_device *dev)
 	dev->features |= NETIF_F_SOFT_FEATURES;
 	dev->wanted_features = dev->features & dev->hw_features;
 
+	/* Transfer XDP features and detect mismatch */
+	if (dev->netdev_ops->ndo_xdp) {
+		dev->xdp_features |= XDP_DRV_F_ENABLED;
+		dev->features     |= NETIF_F_XDP_BASELINE;
+		if (dev->xdp_features ^ XDP_DRV_FEATURES_MASK) {
+			netdev_warn(dev, "Partial XDP support in driver\n");
+			dev->features |= NETIF_F_XDP_PARTIAL;
+		}
+	}
+
 	if (!(dev->flags & IFF_LOOPBACK))
 		dev->hw_features |= NETIF_F_NOCACHE_COPY;
 
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03111a2d6653..d283cdc9ee25 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -106,6 +106,8 @@ static const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
 	[NETIF_F_HW_TC_BIT] =		 "hw-tc-offload",
 	[NETIF_F_HW_ESP_BIT] =		 "esp-hw-offload",
 	[NETIF_F_HW_ESP_TX_CSUM_BIT] =	 "esp-tx-csum-hw-offload",
+	[NETIF_F_XDP_BASELINE_BIT] =	 "xdp",
+	[NETIF_F_XDP_PARTIAL_BIT] =	 "xdp-partial", /* "xdp-challenged"? */
 };
 
 static const char


* [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-18 15:41 [RFC net-next PATCH 0/5] XDP driver feature API and handling change to xdp_buff Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2017-05-18 15:41 ` [RFC net-next PATCH 3/5] net: introduce XDP driver features interface Jesper Dangaard Brouer
@ 2017-05-18 15:41 ` Jesper Dangaard Brouer
  2017-05-19 11:47   ` Jesper Dangaard Brouer
                     ` (2 more replies)
  2017-05-18 15:41 ` [RFC net-next PATCH 5/5] mlx5: add XDP rxhash feature for driver mlx5 Jesper Dangaard Brouer
  4 siblings, 3 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-18 15:41 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, Jesper Dangaard Brouer

Introducing a new XDP feature and the associated bpf helper
bpf_xdp_rxhash.

The rxhash and its type allow filtering packets without touching
packet memory.  The performance difference on my system with a
100 Gbit/s mlx5 NIC is 12 Mpps vs 19 Mpps.

TODO: desc RXHASH and associated type, and how XDP chooses to map
and export these to bpf_progs.

TODO: desc how this interacts with XDP driver features system.
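
Until the description is written, here is a sketch of the intended
bpf_prog usage (built on the uapi bits added below; not a tested
program): drop UDP packets based purely on the HW rxhash type,
without touching packet memory.

	#include <uapi/linux/bpf.h>
	#include "bpf_helpers.h"

	SEC("xdp_drop_udp")
	int xdp_drop_udp_by_hash(struct xdp_md *ctx)
	{
		/* type==0 with the GET flag skips the type filter and
		 * returns hash in the low 32 bits, type in the high bits
		 */
		__u64 rxhash = bpf_xdp_rxhash(ctx, 0, 0, BPF_F_RXHASH_GET);
		__u32 type = XDP_HASH_TYPE(rxhash);

		if (XDP_HASH_TYPE_L4(type) == XDP_HASH_TYPE_L4_UDP)
			return XDP_DROP;
		return XDP_PASS;
	}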
---
 include/linux/filter.h          |   31 ++++++++++++++++-
 include/linux/netdev_features.h |    4 ++
 include/uapi/linux/bpf.h        |   56 +++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c           |    3 ++
 net/core/dev.c                  |   16 ++++++++-
 net/core/filter.c               |   73 +++++++++++++++++++++++++++++++++++++++
 samples/bpf/bpf_helpers.h       |    2 +
 tools/include/uapi/linux/bpf.h  |   10 +++++
 8 files changed, 190 insertions(+), 5 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 9a7786db14fa..33a254ccd47d 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -413,7 +413,8 @@ struct bpf_prog {
 				locked:1,	/* Program image locked? */
 				gpl_compatible:1, /* Is filter GPL compatible? */
 				cb_access:1,	/* Is control block accessed? */
-				dst_needed:1;	/* Do we need dst entry? */
+				dst_needed:1,	/* Do we need dst entry? */
+				xdp_rxhash_needed:1;/* Req XDP RXHASH support */
 	kmemcheck_bitfield_end(meta);
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	u32			len;		/* Number of filter blocks */
@@ -444,12 +445,40 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
+/* Kernel internal xdp_buff->flags */
+#define XDP_CTX_F_RXHASH_TYPE_MASK	XDP_HASH_TYPE_MASK
+#define XDP_CTX_F_RXHASH_TYPE_BITS	XDP_HASH_TYPE_BITS
+#define XDP_CTX_F_RXHASH_SW		(1ULL <<  XDP_CTX_F_RXHASH_TYPE_BITS)
+#define XDP_CTX_F_RXHASH_HW		(1ULL << (XDP_CTX_F_RXHASH_TYPE_BITS+1))
+
 struct xdp_buff {
 	void *data;
 	void *data_end;
 	void *data_hard_start;
+	u64 flags;
+	u32 rxhash;
 };
 
+/* helper functions for driver setting rxhash */
+static inline void
+xdp_record_hash(struct xdp_buff *xdp, u32 hash, u32 type)
+{
+	xdp->flags |= XDP_CTX_F_RXHASH_HW;
+	xdp->flags |= type & XDP_CTX_F_RXHASH_TYPE_MASK;
+	xdp->rxhash = hash;
+}
+
+static inline void
+xdp_set_skb_hash(struct xdp_buff *xdp, struct sk_buff *skb)
+{
+	if (likely(xdp->flags & (XDP_CTX_F_RXHASH_HW|XDP_CTX_F_RXHASH_SW))) {
+		bool is_sw = !!(xdp->flags & XDP_CTX_F_RXHASH_SW);
+		bool is_l4 = !!(xdp->flags & XDP_HASH_TYPE_L4_MASK);
+
+		__skb_set_hash(skb, xdp->rxhash, is_sw, is_l4);
+	}
+}
+
 /* compute the linear packet data range [data, data_end) which
  * will be accessed by cls_bpf, act_bpf and lwt programs
  */
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index ff81ee231410..4b50e8c606c5 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -219,11 +219,13 @@ enum {
 /* XDP driver flags */
 enum {
 	XDP_DRV_F_ENABLED_BIT,
+	XDP_DRV_F_RXHASH_BIT,
 };
 
 #define __XDP_DRV_F_BIT(bit)	((netdev_features_t)1 << (bit))
 #define __XDP_DRV_F(name)	__XDP_DRV_F_BIT(XDP_DRV_F_##name##_BIT)
 #define XDP_DRV_F_ENABLED	__XDP_DRV_F(ENABLED)
+#define XDP_DRV_F_RXHASH	__XDP_DRV_F(RXHASH)
 
 /* XDP driver MUST support these features, else kernel MUST reject
  * bpf_prog to guarantee safe access to data structures
@@ -233,7 +235,7 @@ enum {
 /* Some XDP features are under development. Based on bpf_prog loading
  * detect if kernel feature can be activated.
  */
-#define XDP_DRV_FEATURES_DEVEL		0
+#define XDP_DRV_FEATURES_DEVEL		XDP_DRV_F_RXHASH
 
 /* Some XDP features are optional, like action return code, as they
  * are handled safely runtime.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 945a1f5f63c5..1d9d3a46217d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -482,6 +482,9 @@ union bpf_attr {
  *     Get the owner uid of the socket stored inside sk_buff.
  *     @skb: pointer to skb
  *     Return: uid of the socket owner on success or overflowuid if failed.
+ *
+ * u64 bpf_xdp_rxhash(xdp_md, new_hash, type, flags)
+ *	TODO: MISSING DESC
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -531,7 +534,8 @@ union bpf_attr {
 	FN(xdp_adjust_head),		\
 	FN(probe_read_str),		\
 	FN(get_socket_cookie),		\
-	FN(get_socket_uid),
+	FN(get_socket_uid),		\
+	FN(xdp_rxhash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -581,6 +585,10 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
 #define BPF_F_CTXLEN_MASK		(0xfffffULL << 32)
 
+/* BPF_FUNC_xdp_rxhash flags */
+#define BPF_F_RXHASH_SET		0ULL
+#define BPF_F_RXHASH_GET		(1ULL << 0)
+
 /* user accessible mirror of in-kernel sk_buff.
  * new fields can only be added to the end of this structure
  */
@@ -660,6 +668,52 @@ enum xdp_action {
 struct xdp_md {
 	__u32 data;
 	__u32 data_end;
+	__u32 rxhash;
+	/* (FIXME delete comment)
+	 * Discussion: If choosing to support direct read, then I
+	 * (believe) having a separate 'rxhash_type' is easier and
+	 * faster to implement. (Else I have to do BPF instruction
+	 * hacks to move the type into upper bits of 'rxhash', which I
+	 * couldn't figure out how to do ;-))
+	 */
+	__u32 rxhash_type;
 };
 
+/* The XDP rxhash has an associated type, which is related to the RSS
+ * (Receive Side Scaling) standard, but NIC HW differs in mapping
+ * and support.  Thus, create a mapping that is interesting for XDP.
+ * XDP would primarily want insight into L3 and L4 protocol info.
+ *
+ * TODO: Likely needs to be extended with "L3_IPV6_EX" due to the RSS standard
+ *
+ * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
+ * the 64-bit rxhash (internally type stored in xdp_buff->flags).
+ */
+#define XDP_HASH(x)		((x) & ((1ULL << 32)-1))
+#define XDP_HASH_TYPE(x)	((x) >> 32)
+
+#define XDP_HASH_TYPE_L3_SHIFT	0
+#define XDP_HASH_TYPE_L3_BITS	3
+#define XDP_HASH_TYPE_L3_MASK	((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
+#define XDP_HASH_TYPE_L3(x)	((x) & XDP_HASH_TYPE_L3_MASK)
+enum {
+	XDP_HASH_TYPE_L3_IPV4 = 1,
+	XDP_HASH_TYPE_L3_IPV6,
+};
+
+#define XDP_HASH_TYPE_L4_SHIFT	XDP_HASH_TYPE_L3_BITS
+#define XDP_HASH_TYPE_L4_BITS	5
+#define XDP_HASH_TYPE_L4_MASK						\
+	(((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
+#define XDP_HASH_TYPE_L4(x)	((x) & XDP_HASH_TYPE_L4_MASK)
+enum {
+	_XDP_HASH_TYPE_L4_TCP = 1,
+	_XDP_HASH_TYPE_L4_UDP,
+};
+#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
+#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)
+
+#define XDP_HASH_TYPE_BITS   (XDP_HASH_TYPE_L3_BITS + XDP_HASH_TYPE_L4_BITS)
+#define XDP_HASH_TYPE_MASK   (XDP_HASH_TYPE_L3_MASK | XDP_HASH_TYPE_L4_MASK)
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6f8b6ed690be..248bc113ad18 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3346,6 +3346,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			prog->dst_needed = 1;
 		if (insn->imm == BPF_FUNC_get_prandom_u32)
 			bpf_user_rnd_init_once();
+		if (insn->imm == BPF_FUNC_xdp_rxhash)
+			prog->xdp_rxhash_needed = 1;
 		if (insn->imm == BPF_FUNC_tail_call) {
 			/* If we tail call into other programs, we
 			 * cannot make any assumptions since they can
@@ -3353,6 +3355,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			 * the program array.
 			 */
 			prog->cb_access = 1;
+			prog->xdp_rxhash_needed = 1;
 
 			/* mark bpf_tail_call as different opcode to avoid
 			 * conditional branch in the interpeter for every normal
diff --git a/net/core/dev.c b/net/core/dev.c
index b4af5fbbd9da..28082067ac00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4318,9 +4318,13 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	xdp.data_end = xdp.data + hlen;
 	xdp.data_hard_start = skb->data - skb_headroom(skb);
 	orig_data = xdp.data;
+	xdp.flags  = 0;
+	xdp.rxhash = skb->hash;
 
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
+	xdp_set_skb_hash(&xdp, skb);
+
 	off = xdp.data - orig_data;
 	if (off > 0)
 		__skb_pull(skb, off);
@@ -6851,10 +6855,20 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
+netdev_features_t bpf_get_xdp_features(struct bpf_prog *prog)
+{
+	netdev_features_t features = XDP_DRV_FEATURES_REQUIRED;
+
+	if (prog->xdp_rxhash_needed)
+		features |= XDP_DRV_F_RXHASH;
+
+	return features;
+}
+
 bool xdp_features_check(struct net_device *dev, struct bpf_prog *xdp_prog,
 			struct netlink_ext_ack *extack, u32 flags)
 {
-	netdev_features_t req_features = XDP_DRV_FEATURES_REQUIRED;
+	netdev_features_t req_features = bpf_get_xdp_features(xdp_prog);
 	netdev_features_t dev_xdp_features;
 
 	/* Generic XDP naturally supports all features */
diff --git a/net/core/filter.c b/net/core/filter.c
index a253a6197e6b..df04ac73f581 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2272,6 +2272,54 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_xdp_rxhash, struct xdp_buff *, xdp, u32, new_hash, u32, type,
+	   unsigned int, flags)
+{
+	/* Read+write access to xdp_buff->rxhash is safe, because
+	 * fixup_bpf_calls() detects when the helper is used, and drivers
+	 * not implementing rxhash will not be allowed to load the bpf_prog.
+	 */
+
+	/* Set hash and type */
+	if (flags == BPF_F_RXHASH_SET) {
+		xdp->rxhash = new_hash;
+		xdp->flags |= XDP_CTX_F_RXHASH_SW; /* Need for skb "is_sw" */
+		xdp->flags |= type & XDP_CTX_F_RXHASH_TYPE_MASK;
+	}
+
+	/* A get can specify the "type" it is interested in */
+	if ((flags == BPF_F_RXHASH_GET) &&
+	    (type & XDP_CTX_F_RXHASH_TYPE_MASK)) {
+		u32 f_type = (xdp->flags & XDP_CTX_F_RXHASH_TYPE_MASK);
+		bool match = false;
+
+		/* Match on either L3 or L4 type rxhash */
+		if (!((type ^ f_type) & XDP_HASH_TYPE_L3_MASK))
+			match = true;
+		if (!((type ^ f_type) & XDP_HASH_TYPE_L4_MASK))
+			match = true;
+		if (match == false)
+			return 0;
+	}
+
+	/* Drivers only call xdp_record_hash() if NETIF_F_RXHASH is enabled */
+	if (likely(xdp->flags & (XDP_CTX_F_RXHASH_HW|XDP_CTX_F_RXHASH_SW))) {
+		u64 rxhash_type = xdp->flags & XDP_CTX_F_RXHASH_TYPE_MASK;
+
+		return (u64)(xdp->rxhash | (rxhash_type << 32));
+	}
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_rxhash_proto = {
+	.func           = bpf_xdp_rxhash,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER, // Q: how do I say "u64" ?
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 bool bpf_helper_changes_pkt_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
@@ -2760,6 +2808,8 @@ xdp_func_proto(enum bpf_func_id func_id)
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_xdp_adjust_head:
 		return &bpf_xdp_adjust_head_proto;
+	case BPF_FUNC_xdp_rxhash:
+		return &bpf_xdp_rxhash_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -3308,6 +3358,29 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 				      si->dst_reg, si->src_reg,
 				      offsetof(struct xdp_buff, data_end));
 		break;
+	case offsetof(struct xdp_md, rxhash):
+		/* Direct read access to rxhash is safe, as drivers
+		 * not implementing it will not be allowed to load the bpf_prog.
+		 *
+		 * Driver gotchas: Even if NETIF_F_RXHASH is disabled
+		 * drivers must init xdp_buff->rxhash, due to this
+		 * direct read.
+		 */
+		prog->xdp_rxhash_needed = 1;
+
+		BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_buff, rxhash) != 4);
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, rxhash));
+		break;
+	case offsetof(struct xdp_md, rxhash_type):
+		/* rxhash_type stored in lower 8-bits of xdp_buff->flags */
+		prog->xdp_rxhash_needed = 1;
+
+		BUILD_BUG_ON(XDP_HASH_TYPE_BITS != 8);
+		/* Load first 8 bits (BPF_B) of flags */
+		*insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_buff, flags));
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 9a9c95f2c9fb..634a976a02c6 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -59,6 +59,8 @@ static unsigned long long (*bpf_get_prandom_u32)(void) =
 	(void *) BPF_FUNC_get_prandom_u32;
 static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
 	(void *) BPF_FUNC_xdp_adjust_head;
+static unsigned long long (*bpf_xdp_rxhash)(void *ctx, __u32 new_hash, __u32 type, unsigned int flags) =
+	(void *) BPF_FUNC_xdp_rxhash;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e553529929f6..a38c544bf6f0 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -483,6 +483,9 @@ union bpf_attr {
  *     @skb: pointer to skb
  *     Return: uid of the socket owner on success or 0 if the socket pointer
  *     inside sk_buff is NULL
+ *
+ * u64 bpf_xdp_rxhash(xdp_md, new_hash, type, flags)
+ *	FIXME: Copy desc from include/uapi/linux/bpf.h
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -532,7 +535,8 @@ union bpf_attr {
 	FN(xdp_adjust_head),		\
 	FN(probe_read_str),		\
 	FN(get_socket_cookie),		\
-	FN(get_socket_uid),
+	FN(get_socket_uid),		\
+	FN(xdp_rxhash),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -661,6 +665,10 @@ enum xdp_action {
 struct xdp_md {
 	__u32 data;
 	__u32 data_end;
+	__u32 rxhash;
+	__u32 rxhash_type;
 };
 
+// FIXME: Sync with include/uapi/linux/bpf.h
+
 #endif /* _UAPI__LINUX_BPF_H__ */


* [RFC net-next PATCH 5/5] mlx5: add XDP rxhash feature for driver mlx5
  2017-05-18 15:41 [RFC net-next PATCH 0/5] XDP driver feature API and handling change to xdp_buff Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2017-05-18 15:41 ` [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers Jesper Dangaard Brouer
@ 2017-05-18 15:41 ` Jesper Dangaard Brouer
  4 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-18 15:41 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, Jesper Dangaard Brouer


---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   98 ++++++++++++++-------
 2 files changed, 70 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index e43411d232ee..3ae90dbdd3de 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3956,6 +3956,9 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_RX;
 	netdev->hw_features      |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
+	/* XDP_DRV_F_ENABLED is added in register_netdevice() */
+	netdev->xdp_features = XDP_DRV_F_RXHASH;
+
 	if (mlx5e_vxlan_allowed(mdev)) {
 		netdev->hw_features     |= NETIF_F_GSO_UDP_TUNNEL |
 					   NETIF_F_GSO_UDP_TUNNEL_CSUM |
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index ae66fad98244..eb9d859bf09d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -514,14 +514,28 @@ static void mlx5e_lro_update_hdr(struct sk_buff *skb, struct mlx5_cqe64 *cqe,
 	}
 }
 
-static inline void mlx5e_skb_set_hash(struct mlx5_cqe64 *cqe,
-				      struct sk_buff *skb)
+u8 mlx5_htype_l3_to_xdp[4] = {
+	0,			/* 00 - none */
+	XDP_HASH_TYPE_L3_IPV4,	/* 01 - IPv4 */
+	XDP_HASH_TYPE_L3_IPV6,	/* 10 - IPv6 */
+	0,			/* 11 - Reserved */
+};
+
+u8 mlx5_htype_l4_to_xdp[4] = {
+	0,			/* 00 - none */
+	XDP_HASH_TYPE_L4_TCP,	/* 01 - TCP  */
+	XDP_HASH_TYPE_L4_UDP,	/* 10 - UDP  */
+	0,			/* 11 - IPSEC.SPI */
+};
+
+static inline void mlx5e_xdp_set_hash(struct mlx5_cqe64 *cqe,
+				      struct xdp_buff *xdp)
 {
 	u8 cht = cqe->rss_hash_type;
-	int ht = (cht & CQE_RSS_HTYPE_L4) ? PKT_HASH_TYPE_L4 :
-		 (cht & CQE_RSS_HTYPE_IP) ? PKT_HASH_TYPE_L3 :
-					    PKT_HASH_TYPE_NONE;
-	skb_set_hash(skb, be32_to_cpu(cqe->rss_hash_result), ht);
+	u32 ht = (mlx5_htype_l4_to_xdp[((cht & CQE_RSS_HTYPE_L4) >> 6)] | \
+		  mlx5_htype_l3_to_xdp[((cht & CQE_RSS_HTYPE_IP) >> 2)]);
+
+	xdp_record_hash(xdp, be32_to_cpu(cqe->rss_hash_result), ht);
 }
 
 static inline bool is_first_ethertype_ip(struct sk_buff *skb)
@@ -570,7 +584,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 				      u32 cqe_bcnt,
 				      struct mlx5e_rq *rq,
-				      struct sk_buff *skb)
+				      struct sk_buff *skb,
+				      struct xdp_buff *xdp)
 {
 	struct net_device *netdev = rq->netdev;
 	struct mlx5e_tstamp *tstamp = rq->tstamp;
@@ -593,8 +608,7 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 
 	skb_record_rx_queue(skb, rq->ix);
 
-	if (likely(netdev->features & NETIF_F_RXHASH))
-		mlx5e_skb_set_hash(cqe, skb);
+	xdp_set_skb_hash(xdp, skb);
 
 	if (cqe_has_vlan(cqe))
 		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
@@ -609,11 +623,12 @@ static inline void mlx5e_build_rx_skb(struct mlx5_cqe64 *cqe,
 static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
 					 struct mlx5_cqe64 *cqe,
 					 u32 cqe_bcnt,
-					 struct sk_buff *skb)
+					 struct sk_buff *skb,
+					 struct xdp_buff *xdp)
 {
 	rq->stats.packets++;
 	rq->stats.bytes += cqe_bcnt;
-	mlx5e_build_rx_skb(cqe, cqe_bcnt, rq, skb);
+	mlx5e_build_rx_skb(cqe, cqe_bcnt, rq, skb, xdp);
 }
 
 static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
@@ -696,27 +711,27 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 /* returns true if packet was consumed by xdp */
 static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				   struct mlx5e_dma_info *di,
-				   void *va, u16 *rx_headroom, u32 *len)
+				   struct xdp_buff *xdp, void *va,
+				   u16 *rx_headroom, u32 *len)
 {
 	const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
-	struct xdp_buff xdp;
 	u32 act;
 
 	if (!prog)
 		return false;
 
-	xdp.data = va + *rx_headroom;
-	xdp.data_end = xdp.data + *len;
-	xdp.data_hard_start = va;
+	xdp->data = va + *rx_headroom;
+	xdp->data_end = xdp->data + *len;
+	xdp->data_hard_start = va;
 
-	act = bpf_prog_run_xdp(prog, &xdp);
+	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
 	case XDP_PASS:
-		*rx_headroom = xdp.data - xdp.data_hard_start;
-		*len = xdp.data_end - xdp.data;
+		*rx_headroom = xdp->data - xdp->data_hard_start;
+		*len = xdp->data_end - xdp->data;
 		return false;
 	case XDP_TX:
-		if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, &xdp)))
+		if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, xdp)))
 			trace_xdp_exception(rq->netdev, prog, act);
 		return true;
 	default:
@@ -731,7 +746,22 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 }
 
 static inline
+void mlx5_fill_xdp_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			    struct xdp_buff *xdp)
+{
+	struct net_device *netdev = rq->netdev;
+
+	xdp->flags = 0;
+
+	if (likely(netdev->features & NETIF_F_RXHASH))
+		mlx5e_xdp_set_hash(cqe, xdp);
+	else
+		xdp->rxhash = 0; /* Due to bpf direct read */
+}
+
+static inline
 struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			     struct xdp_buff *xdp,
 			     u16 wqe_counter, u32 cqe_bcnt)
 {
 	struct mlx5e_dma_info *di;
@@ -756,9 +786,10 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 		mlx5e_page_release(rq, di, true);
 		return NULL;
 	}
+	mlx5_fill_xdp_rx_cqe(rq, cqe, xdp);
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt);
+	consumed = mlx5e_xdp_handle(rq, di, xdp, va, &rx_headroom, &cqe_bcnt);
 	rcu_read_unlock();
 	if (consumed)
 		return NULL; /* page/packet was consumed by XDP */
@@ -784,6 +815,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_rx_wqe *wqe;
 	__be16 wqe_counter_be;
+	struct xdp_buff xdp;
 	struct sk_buff *skb;
 	u16 wqe_counter;
 	u32 cqe_bcnt;
@@ -793,11 +825,11 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
 	cqe_bcnt       = be32_to_cpu(cqe->byte_cnt);
 
-	skb = skb_from_cqe(rq, cqe, wqe_counter, cqe_bcnt);
+	skb = skb_from_cqe(rq, cqe, &xdp, wqe_counter, cqe_bcnt);
 	if (!skb)
 		goto wq_ll_pop;
 
-	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
+	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb, &xdp);
 	napi_gro_receive(rq->cq.napi, skb);
 
 wq_ll_pop:
@@ -811,6 +843,7 @@ void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	struct mlx5e_priv *priv = netdev_priv(netdev);
 	struct mlx5_eswitch_rep *rep = priv->ppriv;
 	struct mlx5e_rx_wqe *wqe;
+	struct xdp_buff xdp;
 	struct sk_buff *skb;
 	__be16 wqe_counter_be;
 	u16 wqe_counter;
@@ -821,11 +854,11 @@ void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
 	cqe_bcnt       = be32_to_cpu(cqe->byte_cnt);
 
-	skb = skb_from_cqe(rq, cqe, wqe_counter, cqe_bcnt);
+	skb = skb_from_cqe(rq, cqe, &xdp, wqe_counter, cqe_bcnt);
 	if (!skb)
 		goto wq_ll_pop;
 
-	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
+	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb, &xdp);
 
 	if (rep->vlan && skb_vlan_tag_present(skb))
 		skb_vlan_pop(skb);
@@ -882,6 +915,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	struct mlx5e_mpw_info *wi = &rq->mpwqe.info[wqe_id];
 	struct mlx5e_rx_wqe  *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
 	struct sk_buff *skb;
+	struct xdp_buff xdp;
 	u16 cqe_bcnt;
 
 	wi->consumed_strides += cstrides;
@@ -906,9 +940,10 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 
 	prefetch(skb->data);
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
+	mlx5_fill_xdp_rx_cqe(rq, cqe, &xdp);
 
 	mlx5e_mpwqe_fill_rx_skb(rq, cqe, wi, cqe_bcnt, skb);
-	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
+	mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb, &xdp);
 	napi_gro_receive(rq->cq.napi, skb);
 
 mpwrq_cqe_out:
@@ -1043,7 +1078,8 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 static inline void mlx5i_complete_rx_cqe(struct mlx5e_rq *rq,
 					 struct mlx5_cqe64 *cqe,
 					 u32 cqe_bcnt,
-					 struct sk_buff *skb)
+					 struct sk_buff *skb,
+					 struct xdp_buff *xdp)
 {
 	struct net_device *netdev = rq->netdev;
 	u8 *dgid;
@@ -1071,8 +1107,7 @@ static inline void mlx5i_complete_rx_cqe(struct mlx5e_rq *rq,
 
 	skb_record_rx_queue(skb, rq->ix);
 
-	if (likely(netdev->features & NETIF_F_RXHASH))
-		mlx5e_skb_set_hash(cqe, skb);
+	xdp_set_skb_hash(xdp, skb);
 
 	skb_reset_mac_header(skb);
 	skb_pull(skb, MLX5_IPOIB_ENCAP_LEN);
@@ -1088,6 +1123,7 @@ void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_rx_wqe *wqe;
 	__be16 wqe_counter_be;
+	struct xdp_buff xdp;
 	struct sk_buff *skb;
 	u16 wqe_counter;
 	u32 cqe_bcnt;
@@ -1097,11 +1133,11 @@ void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	wqe            = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
 	cqe_bcnt       = be32_to_cpu(cqe->byte_cnt);
 
-	skb = skb_from_cqe(rq, cqe, wqe_counter, cqe_bcnt);
+	skb = skb_from_cqe(rq, cqe, &xdp, wqe_counter, cqe_bcnt);
 	if (!skb)
 		goto wq_ll_pop;
 
-	mlx5i_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
+	mlx5i_complete_rx_cqe(rq, cqe, cqe_bcnt, skb, &xdp);
 	napi_gro_receive(rq->cq.napi, skb);
 
 wq_ll_pop:


* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-18 15:41 ` [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers Jesper Dangaard Brouer
@ 2017-05-19 11:47   ` Jesper Dangaard Brouer
  2017-05-20  3:07   ` Alexei Starovoitov
  2017-05-20 16:16   ` Tom Herbert
  2 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-19 11:47 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, brouer, xdp-newbies

On Thu, 18 May 2017 17:41:48 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> Introducing a new XDP feature and the associated bpf helper
> bpf_xdp_rxhash.
> 
> The rxhash and its type allow filtering packets without touching
> packet memory.  The performance difference on my system with a
> 100 Gbit/s mlx5 NIC is 12 Mpps vs 19 Mpps.

The XDP/bpf program I use (called xdp_rxhash) for testing this feature
is available via my github repo here:

 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_rxhash_kern.c
 https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_rxhash_user.c

The cmdline output looks like:

$ sudo ./xdp_rxhash --dev mlx5p2 --sec 2 --notouch

xdp-action     pps        pps-human-readable mem      
XDP_ABORTED    19694682   19,694,682         2.000205  no_touch
XDP_DROP       0          0                  2.000205  no_touch
XDP_PASS       10         10                 2.000205  no_touch
XDP_TX         0          0                  2.000205  no_touch
rx_total       19694701   19,694,701         2.000205  no_touch

hash_type:L3   pps        pps-human-readable sample-period
Unknown        0          0                  2.000205
IPv4           19694725   19,694,725         2.000205
IPv6           0          0                  2.000205

hash_type:L4   pps        pps-human-readable sample-period
Unknown        10         10                 2.000205
TCP            0          0                  2.000205
UDP            19694697   19,694,697         2.000205

^CInterrupted: Removing XDP program on ifindex:5 device:mlx5p2


The 10 pps XDP_PASS is a ping command I ran at the same time. Notice
how these ping-ICMP packets were categorized as L4=Unknown and L3=IPv4.
The L4 categorization is usually UDP or TCP, but looking at driver code
it seems some HW supports detecting other L4 types, like ICMP, SCTP, etc.


$ sudo taskset -c 4 ping -i 0.1 -c 10000 198.18.100.1 -c 100 -q
PING 198.18.100.1 (198.18.100.1) 56(84) bytes of data.

--- 198.18.100.1 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 10294ms
rtt min/avg/max/mdev = 0.010/0.071/0.113/0.043 ms


P.S. Via this commit, xdp-newbies can find the links back to the netdev
kernel RFC patches/mails:
 https://github.com/netoptimizer/prototype-kernel/commit/9647e1b563970

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC net-next PATCH 1/5] samples/bpf: xdp_tx_iptunnel make use of map_data[]
  2017-05-18 15:41 ` [RFC net-next PATCH 1/5] samples/bpf: xdp_tx_iptunnel make use of map_data[] Jesper Dangaard Brouer
@ 2017-05-19 15:45   ` Daniel Borkmann
  0 siblings, 0 replies; 31+ messages in thread
From: Daniel Borkmann @ 2017-05-19 15:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev

On 05/18/2017 05:41 PM, Jesper Dangaard Brouer wrote:
> There is no reason to use a compile-time constant MAX_IPTNL_ENTRIES
> shared between _user.c and _kern.c, when map_data[].def.max_entries
> can tell us dynamically what max_entries the bpf loader used when
> creating the ELF map.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Previous code was perhaps a bit more robust in the sense that the
order of the maps wouldn't matter, due to MAX_IPTNL_ENTRIES being
shared. Now you rely on it being in slot 1 (map_data[1].def.max_entries)
of the "maps" section in the ELF.

>   samples/bpf/xdp_tx_iptunnel_common.h |    2 --
>   samples/bpf/xdp_tx_iptunnel_kern.c   |    2 +-
>   samples/bpf/xdp_tx_iptunnel_user.c   |   14 +++++++++-----
>   3 files changed, 10 insertions(+), 8 deletions(-)

Not sure it's worth it, given this actually adds more code and makes
it more fragile at the same time. The only point I could see is to demo
usage of map_data[1].def.max_entries for sample code.

Perhaps, at the very minimum, add a warning comment to
xdp_tx_iptunnel_kern.c that, should the code be further extended with
additional maps, the ordering of struct bpf_map_def entries really
matters here, so as not to break the _user.c part.

Other than that, this should be sent as stand-alone "cleanup" ...


* Re: [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE
  2017-05-18 15:41 ` [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE Jesper Dangaard Brouer
@ 2017-05-19 15:47   ` Daniel Borkmann
  2017-05-19 23:38   ` David Miller
  1 sibling, 0 replies; 31+ messages in thread
From: Daniel Borkmann @ 2017-05-19 15:47 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev, saeedm

On 05/18/2017 05:41 PM, Jesper Dangaard Brouer wrote:
> Masks for extracting parts of the Completion Queue Entry (CQE)
> field rss_hash_type were swapped, namely CQE_RSS_HTYPE_IP and
> CQE_RSS_HTYPE_L4.
>
> The bug resulted in setting skb->l4_hash, even though the
> rss_hash_type indicated that the hash was NOT computed over the
> L4 (UDP or TCP) part of the packet.
>
> Add comments from the datasheet, to make it clearer what
> these masks are selecting.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

Stand-alone fix for -net tree?

>   include/linux/mlx5/device.h |   10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
> index dd9a263ed368..a940ec6a046c 100644
> --- a/include/linux/mlx5/device.h
> +++ b/include/linux/mlx5/device.h
> @@ -787,8 +787,14 @@ enum {
>   };
>
>   enum {
> -	CQE_RSS_HTYPE_IP	= 0x3 << 6,
> -	CQE_RSS_HTYPE_L4	= 0x3 << 2,
> +	CQE_RSS_HTYPE_IP	= 0x3 << 2,
> +	/* cqe->rss_hash_type[3:2] - IP destination selected for hash
> +	 * (00 = none,  01 = IPv4, 10 = IPv6, 11 = Reserved)
> +	 */
> +	CQE_RSS_HTYPE_L4	= 0x3 << 6,
> +	/* cqe->rss_hash_type[7:6] - L4 destination selected for hash
> +	 * (00 = none, 01 = TCP, 10 = UDP, 11 = IPSEC.SPI)
> +	 */
>   };
>
>   enum {
>


* Re: [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-18 15:41 ` [RFC net-next PATCH 3/5] net: introduce XDP driver features interface Jesper Dangaard Brouer
@ 2017-05-19 17:13   ` Daniel Borkmann
  2017-05-19 23:37     ` David Miller
  2017-05-20  7:53     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 31+ messages in thread
From: Daniel Borkmann @ 2017-05-19 17:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov
  Cc: John Fastabend, netdev

On 05/18/2017 05:41 PM, Jesper Dangaard Brouer wrote:
> There is a fundamental difference between normal eBPF programs
> and (XDP) eBPF programs that get attached in a driver. For normal
> eBPF programs it is easy to add a new bpf feature, like a bpf
> helper, because it is strongly tied to the feature being
> available in the current core kernel code.  When drivers invoke a
> bpf_prog, it is not sufficient to simply rely on whether a
> bpf_helper exists or not.  When a driver hasn't implemented a
> given feature yet, it is possible to expose uninitialized parts
> of xdp_buff.  The driver passes in a pointer to xdp_buff, usually
> "allocated" on the stack, which must not be exposed.

When xdp_buff is being extended, we should at least zero-initialize
all in-tree users that don't support or populate this field, so
that it's not uninitialized memory. Better would be to have a way
to reject the prog in the first place until it's implemented (but
further comments on feature bits below).

> Only two user-visible NETIF_F_XDP_* net_device feature flags are
> exposed via ethtool (-k), shown as "xdp" and "xdp-partial".
> The "xdp-partial" flag is set when there is no feature equality
> between kernel and driver, and a netdev_warn is issued.

I think having something like a NETIF_F_XDP_BIT for ethtool to
indicate support as "xdp" is quite useful. It avoids having to grep
the kernel tree for the ndo_xdp callback. ;) An "xdp-partial" would
still leave it unclear/confusing to the user whether his program loads
or doesn't, which is the only thing a user (or some loading infra)
eventually cares about, so one still needs to go try to load the
XDP code to see whether that fails for the native case.

> The idea is that XDP_DRV_* feature bits define a contract between
> the driver and the kernel, giving a reliable way to know which XDP
> features a driver has promised to implement, and thus which bpf
> side features are safe to allow.
>
> There are 3 levels of features: "required", "devel" and "optional".
>
> The motivation is pushing driver vendors forward to support all
> the new XDP features.  Once a given feature bit is moved into
> the "required" features, the kernel will reject loading an XDP
> program if the feature isn't implemented by the driver.  Features
> under development require help from the bpf infrastructure to
> detect when a given helper or direct access is used, using a
> bpf_prog bit to mark a need for the feature and pulling this bit
> into xdp_features_check().  When all drivers have implemented a
> "devel" feature, it can be moved to the "required" features and

The problem is that once you add bit markers to bpf_prog like we
used to do in the past, then, as you do in patch 4/5 with the
xdp_rxhash_needed bit, they will need to be turned /on/ unconditionally
when a prog has tail calls.

Meaning, xdp_features_check() would then bail out when you have
XDP_DRV_F_RXHASH set. This has the effect that while XDP prog X
was running fine on kernel Y, it will suddenly get rejected on a
later kernel (Y + 1), without using the feature (only tail calls).
And more complex networking progs are likely to use tail calls,
so that would unfortunately break all of them, as they cannot load
anymore.

> the bpf_prog bit can be reused.  The "optional" features are for
> things that are handled safely at runtime, but drivers will still
> get flagged as "xdp-partial" if they don't implement those.


* Re: [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-19 17:13   ` Daniel Borkmann
@ 2017-05-19 23:37     ` David Miller
  2017-05-20  7:53     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 31+ messages in thread
From: David Miller @ 2017-05-19 23:37 UTC (permalink / raw)
  To: daniel; +Cc: brouer, borkmann, alexei.starovoitov, john.r.fastabend, netdev

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Fri, 19 May 2017 19:13:29 +0200

> The problem is that once you add bit markers to bpf_prog like we
> used to do in the past, then, as you do in patch 4/5 with the
> xdp_rxhash_needed bit, they will need to be turned /on/
> unconditionally when a prog has tail calls.

Yeah, that's the problem with feature checks: once you have tail calls
involved, we have to say the "entire universe" of features is possible,
because it is (intentionally) not possible to track all paths
reachable via tail calls, and in fact these paths can dynamically and
arbitrarily change after the program using tail calls has been loaded
and verified completely.

For example, let's assume we have eBPF program A that uses tail calls
via slots in bpf MAP "M".  At verification time, sure, we could see
the MAP "M" points to programs B and C, which don't use tail calls and
look at what features they use.

But after loading "A", anyone with access to bpf MAP "M" can change
the tail call slots to point to bpf programs other than "B" and "C".
And maybe those new programs use features outside of the set we tested
for when "A" was verified.

So it is impossible to test feature "sets" with eBPF like this.
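
A sketch of that scenario in sample-style code (map and program
names invented):

	/* bpf side: program A tail calls through prog array M */
	struct bpf_map_def SEC("maps") M = {
		.type = BPF_MAP_TYPE_PROG_ARRAY,
		.key_size = sizeof(__u32),
		.value_size = sizeof(__u32),
		.max_entries = 2,
	};

	SEC("xdp")
	int prog_A(struct xdp_md *ctx)
	{
		bpf_tail_call(ctx, &M, 0);	/* slot 0 may be B today... */
		return XDP_PASS;
	}

	/* user space: any time after A was verified and attached */
	__u32 key = 0;
	bpf_map_update_elem(map_fd, &key, &new_prog_fd, BPF_ANY);
						/* ...and something else tomorrow */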


* Re: [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE
  2017-05-18 15:41 ` [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE Jesper Dangaard Brouer
  2017-05-19 15:47   ` Daniel Borkmann
@ 2017-05-19 23:38   ` David Miller
  2017-05-22 18:27     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 31+ messages in thread
From: David Miller @ 2017-05-19 23:38 UTC (permalink / raw)
  To: brouer; +Cc: borkmann, alexei.starovoitov, john.r.fastabend, netdev

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Thu, 18 May 2017 17:41:38 +0200

> Masks for extracting parts of the Completion Queue Entry (CQE)
> field rss_hash_type were swapped, namely CQE_RSS_HTYPE_IP and
> CQE_RSS_HTYPE_L4.
> 
> The bug resulted in setting skb->l4_hash, even though the
> rss_hash_type indicated that the hash was NOT computed over the
> L4 (UDP or TCP) part of the packet.
> 
> Add comments from the datasheet, to make it clearer what
> these masks are selecting.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

Please pass this along to the mlx5 developers as a standalone
bug fix for 'net', thank you.


* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-18 15:41 ` [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers Jesper Dangaard Brouer
  2017-05-19 11:47   ` Jesper Dangaard Brouer
@ 2017-05-20  3:07   ` Alexei Starovoitov
  2017-05-20  3:21     ` Jakub Kicinski
  2017-05-21 15:55     ` Jesper Dangaard Brouer
  2017-05-20 16:16   ` Tom Herbert
  2 siblings, 2 replies; 31+ messages in thread
From: Alexei Starovoitov @ 2017-05-20  3:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Daniel Borkmann, John Fastabend, netdev

On Thu, May 18, 2017 at 05:41:48PM +0200, Jesper Dangaard Brouer wrote:
>  
> +/* The XDP rxhash has an associated type, which is related to the RSS
> + * (Receive Side Scaling) standard, but NIC HW differs in mapping
> + * and support.  Thus, create a mapping that is interesting for XDP.
> + * XDP would primarily want insight into L3 and L4 protocol info.
> + *
> + * TODO: Likely needs to be extended with "L3_IPV6_EX" due to the RSS standard
> + *
> + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> + */
> +#define XDP_HASH(x)		((x) & ((1ULL << 32)-1))
> +#define XDP_HASH_TYPE(x)	((x) >> 32)
> +
> +#define XDP_HASH_TYPE_L3_SHIFT	0
> +#define XDP_HASH_TYPE_L3_BITS	3
> +#define XDP_HASH_TYPE_L3_MASK	((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> +#define XDP_HASH_TYPE_L3(x)	((x) & XDP_HASH_TYPE_L3_MASK)
> +enum {
> +	XDP_HASH_TYPE_L3_IPV4 = 1,
> +	XDP_HASH_TYPE_L3_IPV6,
> +};
> +
> +#define XDP_HASH_TYPE_L4_SHIFT	XDP_HASH_TYPE_L3_BITS
> +#define XDP_HASH_TYPE_L4_BITS	5
> +#define XDP_HASH_TYPE_L4_MASK						\
> +	(((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> +#define XDP_HASH_TYPE_L4(x)	((x) & XDP_HASH_TYPE_L4_MASK)
> +enum {
> +	_XDP_HASH_TYPE_L4_TCP = 1,
> +	_XDP_HASH_TYPE_L4_UDP,
> +};
> +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
> +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)

imo this is dangerous territory.
As far as I can see, this information doesn't exist in the current drivers
at all, and you're enabling it in patch 5 via the fancy:
+       u32 ht = (mlx5_htype_l4_to_xdp[((cht & CQE_RSS_HTYPE_L4) >> 6)] | \
+                 mlx5_htype_l3_to_xdp[((cht & CQE_RSS_HTYPE_IP) >> 2)]);

It's pretty cool that you've discovered this hidden mlx5 feature.
Did you find it in some hw spec?
And it looks useful to me, but
1. i'm worried that we'd be relying on something that mellanox didn't
 implement in their drivers before. Was it tested and guaranteed to exist
 in future revisions of the firmware? Is it a cx4, cx4-lx or cx5 feature?
2. the main concern is that it is a mellanox-only feature. At least I
cannot see anything like this in broadcom and intel nics.

In the very beginning we discussed that XDP programs should be as generic
as possible and HW-independent, while at the same time we want to expose
HW-specific features to XDP programs.
So I'm totally fine with exposing this fancy hw hash and ipv4 vs v6 and
tcp vs udp flags to xdp programs somehow, but I'm completely against
making it into uapi.

How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
We can make sure that XDP program does read only access into it and
it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
in there, but this will not be uapi and it will be pretty obvious
to program authors that their programs are vendor specific.
'not uapi' here means that mellanox is free to change their HW descriptor
and its contents as they wish.
Also no copies and bit conversions will be necessary, so the cost will
be zero to programs that don't use it and we wouldn't need to change
verifier to discover access to this stuff.

Note that bpf programs already have access to all in-kernel data structures
on the tracing side, so this is nothing new and tracing program authors
got used to structures changing from kernel to kernel.
XDP program authors can do the same for vendor specific bits while we
keep core XDP uapi generic across all nics.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-20  3:07   ` Alexei Starovoitov
@ 2017-05-20  3:21     ` Jakub Kicinski
  2017-05-20  3:34       ` Alexei Starovoitov
  2017-05-21 15:55     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 31+ messages in thread
From: Jakub Kicinski @ 2017-05-20  3:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Daniel Borkmann, John Fastabend, netdev

On Fri, 19 May 2017 20:07:52 -0700, Alexei Starovoitov wrote:
> How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> We can make sure that XDP program does read only access into it and
> it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> in there, but this will not be uapi and it will be pretty obvious
> to program authors that their programs are vendor specific.
> 'not uapi' here means that mellanox is free to change their HW descriptor
> and its contents as they wish.

Hm..  Would that mean we have to teach the verifier about all possible
drivers and their metadata structures (i.e. sizes thereof)?  And add a
UAPI enum of known drivers?

Another idea I floated in the early days was to standardize the fields but
let the driver "JIT" the accesses to look at the right offset of the
right structure.  Admittedly that would be a lot more work.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-20  3:21     ` Jakub Kicinski
@ 2017-05-20  3:34       ` Alexei Starovoitov
  2017-05-20  4:13         ` Jakub Kicinski
  0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2017-05-20  3:34 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jesper Dangaard Brouer, Daniel Borkmann, John Fastabend, netdev

On Fri, May 19, 2017 at 08:21:47PM -0700, Jakub Kicinski wrote:
> On Fri, 19 May 2017 20:07:52 -0700, Alexei Starovoitov wrote:
> > How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> > We can make sure that XDP program does read only access into it and
> > it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> > in there, but this will not be uapi and it will be pretty obvious
> > to program authors that their programs are vendor specific.
> > 'not uapi' here means that mellanox is free to change their HW descriptor
> > and its contents as they wish.
> 
> Hm..  Would that mean we have to teach the verifier about all possible
> drivers and their metadata structures (i.e. sizes thereof)?  And add a
> UAPI enum of known drivers?

why? no uapi other than a pointer to this hw rx descriptor.
Different sizeof(hw_rx_descriptor) is not a problem.
We deal with it already in tracing. All tracepoints have different
sizeof(*ctx), yet the safety is preserved.

> Another idea I floated in the early days was to standardize the fields but
> let the driver "JIT" the accesses to look at the right offset of the
> right structure.  Admittedly that would be a lot more work.

'standardize the fields' sounds nice, but it has already failed here.
As far as I can see the meaning of packet 'hash' is quite different
across the drivers, and 'hash' is just the beginning.
I hope we can standardize on a 'csum' field and make it checksum_complete,
but so far out of 10+G nics only mlx5 and nfp do it in hw.
We need it at least for mlx4, but it can only fake it via expensive math.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-20  3:34       ` Alexei Starovoitov
@ 2017-05-20  4:13         ` Jakub Kicinski
  0 siblings, 0 replies; 31+ messages in thread
From: Jakub Kicinski @ 2017-05-20  4:13 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Jesper Dangaard Brouer, Daniel Borkmann, netdev

On Fri, 19 May 2017 20:34:00 -0700, Alexei Starovoitov wrote:
> On Fri, May 19, 2017 at 08:21:47PM -0700, Jakub Kicinski wrote:
> > On Fri, 19 May 2017 20:07:52 -0700, Alexei Starovoitov wrote:  
> > > How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> > > We can make sure that XDP program does read only access into it and
> > > it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> > > in there, but this will not be uapi and it will be pretty obvious
> > > to program authors that their programs are vendor specific.
> > > 'not uapi' here means that mellanox is free to change their HW descriptor
> > > and its contents as they wish.  
> > 
> > Hm..  Would that mean we have to teach the verifier about all possible
> > drivers and their metadata structures (i.e. sizes thereof)?  And add a
> > UAPI enum of known drivers?
> 
> why? no uapi other than a pointer to this hw rx descriptor.
> Different sizeof(hw_rx_descriptor) is not a problem.
> We deal with it already in tracing. All tracepoints have different
> sizeof(*ctx), yet the safety is preserved.

Ack, a quick read of the tracing code reveals this indeed should work.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-19 17:13   ` Daniel Borkmann
  2017-05-19 23:37     ` David Miller
@ 2017-05-20  7:53     ` Jesper Dangaard Brouer
  2017-05-21  0:58       ` Daniel Borkmann
  1 sibling, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-20  7:53 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Daniel Borkmann, Alexei Starovoitov, John Fastabend, netdev, brouer

On Fri, 19 May 2017 19:13:29 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 05/18/2017 05:41 PM, Jesper Dangaard Brouer wrote:
> > There is a fundamental difference between normal eBPF programs
> > and (XDP) eBPF programs getting attached in a driver. For normal
> > eBPF programs it is easy to add a new bpf feature, like a bpf
> > helper, because it is strongly tied to the feature being
> > available in the current core kernel code.  When drivers invoke a
> > bpf_prog, then it is not sufficient to simply rely on whether
> > a bpf_helper exists or not.  When a driver hasn't implemented a
> > given feature yet, then it is possible to expose uninitialized
> > parts of xdp_buff.  The driver passes in a pointer to xdp_buff,
> > usually "allocated" on the stack, which must not be exposed.  
> 
> When xdp_buff is being extended, then we should at least zero
> initialize all in-tree users that don't support or populate this
> field, so that it's not uninitialized memory. Better would be
> to have a way to reject the prog in the first place until it's
> implemented (but further comments on feature bits below).

Going down a path where we need to zero out the xdp_buff looks a lot
like the sk_buff zeroing, which is the top perf cost associated with
SKBs see[1].  XDP is is about not repeating the same issue we had with
the SKB...

[1] https://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html#analyzing-build-skb-and-memset


> > Only two user-visible NETIF_F_XDP_* net_device feature flags are
> > exposed via ethtool (-k), seen as "xdp" and "xdp-partial".
> > The "xdp-partial" is detected when there is no feature equality
> > between kernel and driver, and a netdev_warn is given.
> 
> I think having something like a NETIF_F_XDP_BIT for ethtool to
> indicate support as "xdp" is quite useful. Avoids having to grep
> the kernel tree for ndo_xdp callback. ;) A "xdp-partial" would
> still be unclear/confusing to the user whether his program loads
> or doesn't which is the only thing a user (or some loading infra)
> cares about eventually, so one still needs to go trying to load
> the XDP code to see whether that fails for the native case.

Good that we agree on the usefulness of the NETIF_F_XDP_BIT.  The
"xdp-partial" or "xdp-challenged" is an early indication to the user
that they should complain to the vendor.  I tried to keep it simple
towards the user. Do you think every feature bit should be exposed to
userspace?


> > The idea is that XDP_DRV_* feature bits define a contract between
> > the driver and the kernel, giving a reliable way to know what XDP
> > features a driver has promised to implement. Thus, knowing what bpf
> > side features are safe to allow.
> >
> > There are 3 levels of features: "required", "devel" and "optional".
> >
> > The motivation is pushing driver vendors forward to support all
> > the new XDP features.  Once a given feature bit is moved into
> > the "required" features, the kernel will reject loading XDP
> > program if feature isn't implemented by driver.  Features under
> > developement, require help from the bpf infrastrucure to detect
> > when a given helper or direct-access is used, using a bpf_prog
> > bit to mark a need for the feature, and pulling in this bit in
> > the xdp_features_check().  When all drivers have implemented
> > a "devel" feature, it can be moved to the "required" feature and  
> 
The problem is that once you add bit markers to bpf_prog like we
> used to do in the past, then as you do in patch 4/5 with the
> xdp_rxhash_needed bit, they will need to be turned /on/ unconditionally
> when a prog has tail calls.

Yes, with tail calls, we have to enable all features.  But that is a
good thing, as it forces vendors to quickly implement all features.
And it is no different from moving a feature into the "required" bits,
once all drivers implement it.  It is only a limitation for tail calls,
and something we can fix later.

BPF has the nice property that the input program can be evaluated at
load time, which is what I'm taking advantage of as an optimization
here.  It is only tail calls that cannot be evaluated at load time.
Thus, if you care about tail calls supporting intermediate features,
we could later fix that by adding a runtime feature check in the case
of tail calls.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-18 15:41 ` [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers Jesper Dangaard Brouer
  2017-05-19 11:47   ` Jesper Dangaard Brouer
  2017-05-20  3:07   ` Alexei Starovoitov
@ 2017-05-20 16:16   ` Tom Herbert
  2017-05-21 16:04     ` Jesper Dangaard Brouer
  2 siblings, 1 reply; 31+ messages in thread
From: Tom Herbert @ 2017-05-20 16:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, John Fastabend,
	Linux Kernel Network Developers

On Thu, May 18, 2017 at 8:41 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Introducing a new XDP feature and associated bpf helper bpf_xdp_rxhash.
>
> The rxhash and type allow filtering on packets without touching
> packet memory.  The performance difference on my system with a
> 100 Gbit/s mlx5 NIC is 12Mpps to 19Mpps.
>
> TODO: desc RXHASH and associated type, and how XDP choose to map
> and export these to bpf_prog's.
>
> TODO: desc how this interacts with XDP driver features system.
> ---
>  include/linux/filter.h          |   31 ++++++++++++++++-
>  include/linux/netdev_features.h |    4 ++
>  include/uapi/linux/bpf.h        |   56 +++++++++++++++++++++++++++++-
>  kernel/bpf/verifier.c           |    3 ++
>  net/core/dev.c                  |   16 ++++++++-
>  net/core/filter.c               |   73 +++++++++++++++++++++++++++++++++++++++
>  samples/bpf/bpf_helpers.h       |    2 +
>  tools/include/uapi/linux/bpf.h  |   10 +++++
>  8 files changed, 190 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 9a7786db14fa..33a254ccd47d 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -413,7 +413,8 @@ struct bpf_prog {
>                                 locked:1,       /* Program image locked? */
>                                 gpl_compatible:1, /* Is filter GPL compatible? */
>                                 cb_access:1,    /* Is control block accessed? */
> -                               dst_needed:1;   /* Do we need dst entry? */
> +                               dst_needed:1,   /* Do we need dst entry? */
> +                               xdp_rxhash_needed:1;/* Req XDP RXHASH support */
>         kmemcheck_bitfield_end(meta);
>         enum bpf_prog_type      type;           /* Type of BPF program */
>         u32                     len;            /* Number of filter blocks */
> @@ -444,12 +445,40 @@ struct bpf_skb_data_end {
>         void *data_end;
>  };
>
> +/* Kernel internal xdp_buff->flags */
> +#define XDP_CTX_F_RXHASH_TYPE_MASK     XDP_HASH_TYPE_MASK
> +#define XDP_CTX_F_RXHASH_TYPE_BITS     XDP_HASH_TYPE_BITS
> +#define XDP_CTX_F_RXHASH_SW            (1ULL <<  XDP_CTX_F_RXHASH_TYPE_BITS)
> +#define XDP_CTX_F_RXHASH_HW            (1ULL << (XDP_CTX_F_RXHASH_TYPE_BITS+1))
> +
>  struct xdp_buff {
>         void *data;
>         void *data_end;
>         void *data_hard_start;
> +       u64 flags;
> +       u32 rxhash;
>  };
>
> +/* helper functions for driver setting rxhash */
> +static inline void
> +xdp_record_hash(struct xdp_buff *xdp, u32 hash, u32 type)
> +{
> +       xdp->flags |= XDP_CTX_F_RXHASH_HW;
> +       xdp->flags |= type & XDP_CTX_F_RXHASH_TYPE_MASK;
> +       xdp->rxhash = hash;
> +}
> +
> +static inline void
> +xdp_set_skb_hash(struct xdp_buff *xdp, struct sk_buff *skb)
> +{
> +       if (likely(xdp->flags & (XDP_CTX_F_RXHASH_HW|XDP_CTX_F_RXHASH_SW))) {
> +               bool is_sw = !!(xdp->flags & XDP_CTX_F_RXHASH_SW);
> +               bool is_l4 = !!(xdp->flags & XDP_HASH_TYPE_L4_MASK);
> +
> +               __skb_set_hash(skb, xdp->rxhash, is_sw, is_l4);
> +       }
> +}
> +
>  /* compute the linear packet data range [data, data_end) which
>   * will be accessed by cls_bpf, act_bpf and lwt programs
>   */
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index ff81ee231410..4b50e8c606c5 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -219,11 +219,13 @@ enum {
>  /* XDP driver flags */
>  enum {
>         XDP_DRV_F_ENABLED_BIT,
> +       XDP_DRV_F_RXHASH_BIT,
>  };
>
>  #define __XDP_DRV_F_BIT(bit)   ((netdev_features_t)1 << (bit))
>  #define __XDP_DRV_F(name)      __XDP_DRV_F_BIT(XDP_DRV_F_##name##_BIT)
>  #define XDP_DRV_F_ENABLED      __XDP_DRV_F(ENABLED)
> +#define XDP_DRV_F_RXHASH       __XDP_DRV_F(RXHASH)
>
>  /* XDP driver MUST support these features, else kernel MUST reject
>   * bpf_prog to guarantee safe access to data structures
> @@ -233,7 +235,7 @@ enum {
>  /* Some XDP features are under development. Based on bpf_prog loading
>   * detect if kernel feature can be activated.
>   */
> -#define XDP_DRV_FEATURES_DEVEL         0
> +#define XDP_DRV_FEATURES_DEVEL         XDP_DRV_F_RXHASH
>
>  /* Some XDP features are optional, like action return code, as they
>   * are handled safely runtime.
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 945a1f5f63c5..1d9d3a46217d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -482,6 +482,9 @@ union bpf_attr {
>   *     Get the owner uid of the socket stored inside sk_buff.
>   *     @skb: pointer to skb
>   *     Return: uid of the socket owner on success or overflowuid if failed.
> + *
> + * u64 bpf_xdp_rxhash(xdp_md, new_hash, type, flags)
> + *     TODO: MISSING DESC
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -531,7 +534,8 @@ union bpf_attr {
>         FN(xdp_adjust_head),            \
>         FN(probe_read_str),             \
>         FN(get_socket_cookie),          \
> -       FN(get_socket_uid),
> +       FN(get_socket_uid),             \
> +       FN(xdp_rxhash),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -581,6 +585,10 @@ enum bpf_func_id {
>  /* BPF_FUNC_perf_event_output for sk_buff input context. */
>  #define BPF_F_CTXLEN_MASK              (0xfffffULL << 32)
>
> +/* BPF_FUNC_xdp_rxhash flags */
> +#define BPF_F_RXHASH_SET               0ULL
> +#define BPF_F_RXHASH_GET               (1ULL << 0)
> +
>  /* user accessible mirror of in-kernel sk_buff.
>   * new fields can only be added to the end of this structure
>   */
> @@ -660,6 +668,52 @@ enum xdp_action {
>  struct xdp_md {
>         __u32 data;
>         __u32 data_end;
> +       __u32 rxhash;
> +       /* (FIXME delete comment)
> +        * Discussion: If choosing to support direct read, then I
> +        * (believe) having a separate 'rxhash_type' is easier and
> +        * faster to implement. (Else I have to do BPF instruction
> +        * hacks to move the type into upper bits of 'rxhash', which I
> +        * couldn't figure out how to do ;-))
> +       */
> +       __u32 rxhash_type;
>  };
>
> +/* XDP rxhash has an associated type, which is related to the RSS
> + * (Receive Side Scaling) standard, but NIC HW has different mappings
> + * and support. Thus, create a mapping that is interesting for XDP.
> + * XDP would primarily want insight into L3 and L4 protocol info.
> + *
> + * TODO: Likely needs extending with "L3_IPV6_EX" due to the RSS standard
> + *
> + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> + */
> +#define XDP_HASH(x)            ((x) & ((1ULL << 32)-1))
> +#define XDP_HASH_TYPE(x)       ((x) >> 32)
> +
> +#define XDP_HASH_TYPE_L3_SHIFT 0
> +#define XDP_HASH_TYPE_L3_BITS  3
> +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> +#define XDP_HASH_TYPE_L3(x)    ((x) & XDP_HASH_TYPE_L3_MASK)
> +enum {
> +       XDP_HASH_TYPE_L3_IPV4 = 1,
> +       XDP_HASH_TYPE_L3_IPV6,
> +};
> +
> +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> +#define XDP_HASH_TYPE_L4_BITS  5
> +#define XDP_HASH_TYPE_L4_MASK                                          \
> +       (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> +#define XDP_HASH_TYPE_L4(x)    ((x) & XDP_HASH_TYPE_L4_MASK)
> +enum {
> +       _XDP_HASH_TYPE_L4_TCP = 1,
> +       _XDP_HASH_TYPE_L4_UDP,
> +};
> +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
> +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)
> +
Hi Jesper,

Why do we need these indicators for a protocol-specific hash? It seems
like L4 vs L3 is a useful differentiation and protocol agnostic (I'm
still holding out hope that SCTP will be deployed some day ;-) )

Tom

> +#define XDP_HASH_TYPE_BITS   (XDP_HASH_TYPE_L3_BITS + XDP_HASH_TYPE_L4_BITS)
> +#define XDP_HASH_TYPE_MASK   (XDP_HASH_TYPE_L3_MASK | XDP_HASH_TYPE_L4_MASK)
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6f8b6ed690be..248bc113ad18 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3346,6 +3346,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>                         prog->dst_needed = 1;
>                 if (insn->imm == BPF_FUNC_get_prandom_u32)
>                         bpf_user_rnd_init_once();
> +               if (insn->imm == BPF_FUNC_xdp_rxhash)
> +                       prog->xdp_rxhash_needed = 1;
>                 if (insn->imm == BPF_FUNC_tail_call) {
>                         /* If we tail call into other programs, we
>                          * cannot make any assumptions since they can
> @@ -3353,6 +3355,7 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
>                          * the program array.
>                          */
>                         prog->cb_access = 1;
> +                       prog->xdp_rxhash_needed = 1;
>
>                         /* mark bpf_tail_call as different opcode to avoid
>                          * conditional branch in the interpeter for every normal
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b4af5fbbd9da..28082067ac00 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4318,9 +4318,13 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
>         xdp.data_end = xdp.data + hlen;
>         xdp.data_hard_start = skb->data - skb_headroom(skb);
>         orig_data = xdp.data;
> +       xdp.flags  = 0;
> +       xdp.rxhash = skb->hash;
>
>         act = bpf_prog_run_xdp(xdp_prog, &xdp);
>
> +       xdp_set_skb_hash(&xdp, skb);
> +
>         off = xdp.data - orig_data;
>         if (off > 0)
>                 __skb_pull(skb, off);
> @@ -6851,10 +6855,20 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
>  }
>  EXPORT_SYMBOL(dev_change_proto_down);
>
> +netdev_features_t bpf_get_xdp_features(struct bpf_prog *prog)
> +{
> +       netdev_features_t features = XDP_DRV_FEATURES_REQUIRED;
> +
> +       if (prog->xdp_rxhash_needed)
> +               features |= XDP_DRV_F_RXHASH;
> +
> +       return features;
> +}
> +
>  bool xdp_features_check(struct net_device *dev, struct bpf_prog *xdp_prog,
>                         struct netlink_ext_ack *extack, u32 flags)
>  {
> -       netdev_features_t req_features = XDP_DRV_FEATURES_REQUIRED;
> +       netdev_features_t req_features = bpf_get_xdp_features(xdp_prog);
>         netdev_features_t dev_xdp_features;
>
>         /* Generic XDP naturally support all features */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index a253a6197e6b..df04ac73f581 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2272,6 +2272,54 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>         .arg2_type      = ARG_ANYTHING,
>  };
>
> +BPF_CALL_4(bpf_xdp_rxhash, struct xdp_buff *, xdp, u32, new_hash, u32, type,
> +          unsigned int, flags)
> +{
> +       /* Read+write access to xdp_buff->rxhash is safe, because
> +        * fixup_bpf_calls() detects when the helper is used, and drivers
> +        * not implementing rxhash will not be allowed to load the bpf_prog.
> +        */
> +
> +       /* Set hash and type */
> +       if (flags == BPF_F_RXHASH_SET) {
> +               xdp->rxhash = new_hash;
> +               xdp->flags |= XDP_CTX_F_RXHASH_SW; /* Need for skb "is_sw" */
> +               xdp->flags |= type & XDP_CTX_F_RXHASH_TYPE_MASK;
> +       }
> +
> +       /* Get can specify the "type" it is interested in */
> +       if ((flags == BPF_F_RXHASH_GET) &&
> +           (type & XDP_CTX_F_RXHASH_TYPE_MASK)) {
> +               u32 f_type = (xdp->flags & XDP_CTX_F_RXHASH_TYPE_MASK);
> +               bool match = false;
> +
> +               /* Match on either L3 or L4 type rxhash */
> +               if (!((type ^ f_type) & XDP_HASH_TYPE_L3_MASK))
> +                       match = true;
> +               if (!((type ^ f_type) & XDP_HASH_TYPE_L4_MASK))
> +                       match = true;
> +               if (match == false)
> +                       return 0;
> +       }
> +
> +       /* Drivers only call xdp_record_hash() if NETIF_F_RXHASH is enabled */
> +       if (likely(xdp->flags & (XDP_CTX_F_RXHASH_HW|XDP_CTX_F_RXHASH_SW))) {
> +               u64 rxhash_type = xdp->flags & XDP_CTX_F_RXHASH_TYPE_MASK;
> +
> +               return (u64)(xdp->rxhash | (rxhash_type << 32));
> +       }
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_xdp_rxhash_proto = {
> +       .func           = bpf_xdp_rxhash,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER, // Q: how do I say "u64" ?
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_ANYTHING,
> +       .arg3_type      = ARG_ANYTHING,
> +       .arg4_type      = ARG_ANYTHING,
> +};
> +
>  bool bpf_helper_changes_pkt_data(void *func)
>  {
>         if (func == bpf_skb_vlan_push ||
> @@ -2760,6 +2808,8 @@ xdp_func_proto(enum bpf_func_id func_id)
>                 return &bpf_get_smp_processor_id_proto;
>         case BPF_FUNC_xdp_adjust_head:
>                 return &bpf_xdp_adjust_head_proto;
> +       case BPF_FUNC_xdp_rxhash:
> +               return &bpf_xdp_rxhash_proto;
>         default:
>                 return bpf_base_func_proto(func_id);
>         }
> @@ -3308,6 +3358,29 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
>                                       si->dst_reg, si->src_reg,
>                                       offsetof(struct xdp_buff, data_end));
>                 break;
> +       case offsetof(struct xdp_md, rxhash):
> +               /* Direct read-access to rxhash is safe, as drivers
> +                * not implementing will not be allowed to load bpf_prog.
> +                * not implementing it will not be allowed to load the bpf_prog.
> +                * Driver gotchas: Even if NETIF_F_RXHASH is disabled
> +                * drivers must init xdp_buff->rxhash, due to this
> +                * direct read.
> +                */
> +               prog->xdp_rxhash_needed = 1;
> +
> +               BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_buff, rxhash) != 4);
> +               *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->src_reg,
> +                                     offsetof(struct xdp_buff, rxhash));
> +               break;
> +       case offsetof(struct xdp_md, rxhash_type):
> +               /* rxhash_type stored in lower 8-bits of xdp_buff->flags */
> +               prog->xdp_rxhash_needed = 1;
> +
> +               BUILD_BUG_ON(XDP_HASH_TYPE_BITS != 8);
> +               /* Load first 8 bits (BPF_B) of flags */
> +               *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->src_reg,
> +                                     offsetof(struct xdp_buff, flags));
> +               break;
>         }
>
>         return insn - insn_buf;
> diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
> index 9a9c95f2c9fb..634a976a02c6 100644
> --- a/samples/bpf/bpf_helpers.h
> +++ b/samples/bpf/bpf_helpers.h
> @@ -59,6 +59,8 @@ static unsigned long long (*bpf_get_prandom_u32)(void) =
>         (void *) BPF_FUNC_get_prandom_u32;
>  static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
>         (void *) BPF_FUNC_xdp_adjust_head;
> +static unsigned long long (*bpf_xdp_rxhash)(void *ctx, __u32 new_hash, __u32 type, unsigned int flags) =
> +       (void *) BPF_FUNC_xdp_rxhash;
>
>  /* llvm builtin functions that eBPF C program may use to
>   * emit BPF_LD_ABS and BPF_LD_IND instructions
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index e553529929f6..a38c544bf6f0 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -483,6 +483,9 @@ union bpf_attr {
>   *     @skb: pointer to skb
>   *     Return: uid of the socket owner on success or 0 if the socket pointer
>   *     inside sk_buff is NULL
> + *
> + * u64 bpf_xdp_rxhash(xdp_md, new_hash, type, flags)
> + *     FIXME: Copy desc from include/uapi/linux/bpf.h
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -532,7 +535,8 @@ union bpf_attr {
>         FN(xdp_adjust_head),            \
>         FN(probe_read_str),             \
>         FN(get_socket_cookie),          \
> -       FN(get_socket_uid),
> +       FN(get_socket_uid),             \
> +       FN(xdp_rxhash),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -661,6 +665,10 @@ enum xdp_action {
>  struct xdp_md {
>         __u32 data;
>         __u32 data_end;
> +       __u32 rxhash;
> +       __u32 rxhash_type;
>  };
>
> +// FIXME: Sync with include/uapi/linux/bpf.h
> +
>  #endif /* _UAPI__LINUX_BPF_H__ */
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-20  7:53     ` Jesper Dangaard Brouer
@ 2017-05-21  0:58       ` Daniel Borkmann
  2017-05-22 14:49         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Borkmann @ 2017-05-21  0:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, John Fastabend, netdev

On 05/20/2017 09:53 AM, Jesper Dangaard Brouer wrote:
> On Fri, 19 May 2017 19:13:29 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> On 05/18/2017 05:41 PM, Jesper Dangaard Brouer wrote:
>>> There is a fundamental difference between normal eBPF programs
>>> and (XDP) eBPF programs getting attached in a driver. For normal
>>> eBPF programs it is easy to add a new bpf feature, like a bpf
>>> helper, because it is strongly tied to the feature being
>>> available in the current core kernel code.  When drivers invoke a
>>> bpf_prog, then it is not sufficient to simply rely on whether
>>> a bpf_helper exists or not.  When a driver hasn't implemented a
>>> given feature yet, then it is possible to expose uninitialized
>>> parts of xdp_buff.  The driver passes in a pointer to xdp_buff,
>>> usually "allocated" on the stack, which must not be exposed.
>>
>> When xdp_buff is being extended, then we should at least zero
>> initialize all in-tree users that don't support or populate this
>> field, so that it's not uninitialized memory. Better would be
>> to have a way to reject the prog in the first place until it's
>> implemented (but further comments on feature bits below).
>
> Going down a path where we need to zero out the xdp_buff looks a lot
> like the sk_buff zeroing, which is the top perf cost associated with
> SKBs, see [1].  XDP is about not repeating the same issue we had with
> the SKB...

But if we agree on implementing a certain feature that requires adding
a new member, then the basic requirement should be that it is
implementable by all XDP-enabled drivers, so that users eventually
don't need to worry about which driver they run to access an XDP
feature. In that case, zeroing that member until it gets properly
implemented is just temporary (and could be an incentive perhaps).

> [1] https://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html#analyzing-build-skb-and-memset
>
>>> Only two user-visible NETIF_F_XDP_* net_device feature flags are
>>> exposed via ethtool (-k), seen as "xdp" and "xdp-partial".
>>> The "xdp-partial" is detected when there is no feature equality
>>> between kernel and driver, and a netdev_warn is given.
>>
>> I think having something like a NETIF_F_XDP_BIT for ethtool to
>> indicate support as "xdp" is quite useful. Avoids having to grep
>> the kernel tree for ndo_xdp callback. ;) A "xdp-partial" would
>> still be unclear/confusing to the user whether his program loads
>> or doesn't which is the only thing a user (or some loading infra)
>> cares about eventually, so one still needs to go trying to load
>> the XDP code to see whether that fails for the native case.
>
> Good that we agree on the usefulness of the NETIF_F_XDP_BIT.  The
> "xdp-partial" or "xdp-challenged" is an early indication to the user
> that they should complain to the vendor.  I tried to keep it simple
> towards the user. Do you think every feature bit should be exposed to
> userspace?

That would potentially require us to go down that path and expose
feature bits for everything, even something as silly as new flags
for a specific helper that requires some sort of additional support.
We probably rather want to keep such things in the kernel for now
and potentially reject loads instead.

>>> The idea is that XDP_DRV_* feature bits define a contract between
>>> the driver and the kernel, giving a reliable way to know what XDP
>>> features a driver has promised to implement. Thus, knowing what bpf
>>> side features are safe to allow.
>>>
>>> There are 3 levels of features: "required", "devel" and "optional".
>>>
>>> The motivation is pushing driver vendors forward to support all
>>> the new XDP features.  Once a given feature bit is moved into
>>> the "required" features, the kernel will reject loading XDP
>>> program if feature isn't implemented by driver.  Features under
>>> developement, require help from the bpf infrastrucure to detect
>>> when a given helper or direct-access is used, using a bpf_prog
>>> bit to mark a need for the feature, and pulling in this bit in
>>> the xdp_features_check().  When all drivers have implemented
>>> a "devel" feature, it can be moved to the "required" feature and
>>
>> The problem is that once you add bits markers to bpf_prog like we
>> used to do in the past, then as you do in patch 4/5 with the
>> xdp_rxhash_needed bit, they will need to be turned /on/ unconditionally
>> when a prog has tail calls.
>
> Yes, with tail calls, we have to enable all features.  But that is a
> good thing, as it forces vendors to quickly implement all features.
> And it is no different from moving a feature into the "required" bits,
> once all drivers implement it.  It is only a limitation for tail calls,
> and something we can fix later.

But the issue I pointed out with this approach is that XDP programs
using tail calls will suddenly break and get rejected from one
version to another.

That's basically breaking the contract we have today where it must
be guaranteed that older programs keep working on newer kernels,
exactly the same as with uapi. Breaking user programs is a no-go,
not a good thing at all (it will hurt users and they will move on to
something else). So it's not something we can 'fix' at some later
point in time; such feature detection would need to consider this
from day 1.

> BPF has the nice property that the input program can be evaluated at
> load time, which is what I'm taking advantage of as an optimization
> here.  It is only tail calls that cannot be evaluated at load time.
> Thus, if you care about tail calls supporting intermediate features,
> we could later fix that by adding a runtime feature check in the case
> of tail calls.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-20  3:07   ` Alexei Starovoitov
  2017-05-20  3:21     ` Jakub Kicinski
@ 2017-05-21 15:55     ` Jesper Dangaard Brouer
  2017-05-22  3:21       ` Alexei Starovoitov
  1 sibling, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-21 15:55 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Daniel Borkmann, netdev, brouer

On Fri, 19 May 2017 20:07:52 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, May 18, 2017 at 05:41:48PM +0200, Jesper Dangaard Brouer wrote:
> >  
> > +/* XDP rxhash has an associated type, which is related to the RSS
> > + * (Receive Side Scaling) standard, but NIC HW has different mappings
> > + * and support. Thus, create a mapping that is interesting for XDP.
> > + * XDP would primarily want insight into L3 and L4 protocol info.
> > + *
> > + * TODO: Likely needs extending with "L3_IPV6_EX" due to the RSS standard
> > + *
> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> > + */
> > +#define XDP_HASH(x)		((x) & ((1ULL << 32)-1))
> > +#define XDP_HASH_TYPE(x)	((x) >> 32)
> > +
> > +#define XDP_HASH_TYPE_L3_SHIFT	0
> > +#define XDP_HASH_TYPE_L3_BITS	3
> > +#define XDP_HASH_TYPE_L3_MASK	((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> > +#define XDP_HASH_TYPE_L3(x)	((x) & XDP_HASH_TYPE_L3_MASK)
> > +enum {
> > +	XDP_HASH_TYPE_L3_IPV4 = 1,
> > +	XDP_HASH_TYPE_L3_IPV6,
> > +};
> > +
> > +#define XDP_HASH_TYPE_L4_SHIFT	XDP_HASH_TYPE_L3_BITS
> > +#define XDP_HASH_TYPE_L4_BITS	5
> > +#define XDP_HASH_TYPE_L4_MASK						\
> > +	(((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4(x)	((x) & XDP_HASH_TYPE_L4_MASK)
> > +enum {
> > +	_XDP_HASH_TYPE_L4_TCP = 1,
> > +	_XDP_HASH_TYPE_L4_UDP,
> > +};
> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)  
> 
> imo this is dangerous territory.
> As far as I can see this information doesn't exist in the current drivers at all
> and you're enabling it in patch 5 via the fancy:
> +       u32 ht = (mlx5_htype_l4_to_xdp[((cht & CQE_RSS_HTYPE_L4) >> 6)] | \
> +                 mlx5_htype_l3_to_xdp[((cht & CQE_RSS_HTYPE_IP) >> 2)]);
> 
> It's pretty cool that you've discovered this hidden mlx5 feature.
> Did you find it in some hw spec?

The Mellanox ConnectX-4/mlx5 spec is actually open, see:
[1] http://www.mellanox.com/page/products_dyn?product_family=204&mtag=connectx_4_en_card
and follow the link to "Programming Manual (PRM)".


> And it looks useful to me, but

> 1. i'm worried that we'd be relying on something that mellanox didn't
>  implement in their drivers before. Was it tested and guaranteed to
>  exist in future revisions of the firmware? Is it a cx4, cx4-lx or cx5
>  feature?

It is not a hidden or mlx5-specific feature.  Due to the Microsoft RSS
standard/requirements[2] most NICs actually implement this.

[2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types


> 2. but the main concern is that it is a mellanox-only feature. At least I
> cannot see anything like this in broadcom and intel nics.

All the drivers I looked at have support for an RSS hash type,
including Broadcom[3] and Intel. Just grep for NETIF_F_RXHASH and
follow the data structs.  The Intel i40 NIC has the most elaborate rss
type system (it can e.g. tell if this was SCTP).

[3] http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/broadcom/bnx2x/bnx2x_hsi.h#L4198
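
For reference, the driver side of patch 5/5 boils down to two small
lookup tables translating the 2-bit CQE fields into the XDP type
encoding.  Roughly like this (the table contents below are my sketch;
the CQE bit layout is from the PRM):

static const u32 mlx5_htype_l3_to_xdp[4] = {
	[0] = 0,			/* no L3 hash reported */
	[1] = XDP_HASH_TYPE_L3_IPV4,
	[2] = XDP_HASH_TYPE_L3_IPV6,
};
static const u32 mlx5_htype_l4_to_xdp[4] = {
	[0] = 0,			/* no L4 hash reported */
	[1] = XDP_HASH_TYPE_L4_TCP,
	[2] = XDP_HASH_TYPE_L4_UDP,
};

/* then, in the RX path, with cht = cqe->rss_hash_type: */
u32 ht = mlx5_htype_l4_to_xdp[(cht & CQE_RSS_HTYPE_L4) >> 6] |
	 mlx5_htype_l3_to_xdp[(cht & CQE_RSS_HTYPE_IP) >> 2];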
 
> In the very beginning we discussed that XDP programs should be as
> generic as possible and HW independent while at the same time we want
> to expose HW specific features to XDP programs.
>
> So I'm totally fine to expose this fancy hw hash and ipv4 vs v6 and
> tcp vs udp flags to xdp programs somehow, but I'm completely against
> making it into uapi.
> 
> How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> We can make sure that XDP program does read only access into it and
> it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> in there, but this will not be uapi and it will be pretty obvious
> to program authors that their programs are vendor specific.

This sounds EXTREMELY dangerous to me... IMHO this will lead to vendor
lock-in.  As BPF program authors become dependent on vendor-specific
features, their programs are no longer portable to run on
other NICs.

How are you going to avoid vendor lock-in with this model?


> 'not uapi' here means that mellanox is free to change their HW descriptor
> and its contents as they wish.

Hmmm... IMHO directly exposing the HW descriptor to userspace will
limit vendors' ability to change its contents.


> Also no copies and bit conversions will be necessary, so the cost will
> be zero to programs that don't use it and we wouldn't need to change
> verifier to discover access to this stuff.

I'm not sure this would work out well, as we would need to keep the
CQE descriptor memory around longer.


The longer-term plan with having RXHASH in XDP is to allow
implementing RPS (Receive Packet Steering) without touching packet
memory, which plays into my plans for XDP_REDIRECT to another CPU.
I guess I'll go implement XDP_REDIRECT first, and then return to
RXHASH when I need that feature (I have the rxhash PoC code here).
RXHASH does have merits of its own, e.g. for flow-based XDP_TX
load-balancing (without touching memory).
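
To give an idea, a minimal (untested) sketch of such a flow-based
XDP_TX load-balancer, assuming the bpf_xdp_rxhash helper and macros
from this series:

#define MAX_BACKENDS 4

SEC("xdp")
int xdp_tx_lb(struct xdp_md *ctx)
{
	__u64 h = bpf_xdp_rxhash(ctx, 0, 0, BPF_F_RXHASH_GET);
	__u32 backend;

	if (!h)	/* no HW (or SW) hash available */
		return XDP_PASS;
	/* pick a backend purely from the flow hash, without ever
	 * touching packet memory
	 */
	backend = XDP_HASH(h) % MAX_BACKENDS;
	/* ... rewrite dst MAC for 'backend' (omitted) ... */
	return XDP_TX;
}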

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-20 16:16   ` Tom Herbert
@ 2017-05-21 16:04     ` Jesper Dangaard Brouer
  2017-05-21 22:10       ` Tom Herbert
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-21 16:04 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Daniel Borkmann, Alexei Starovoitov,
	Linux Kernel Network Developers, brouer

On Sat, 20 May 2017 09:16:09 -0700
Tom Herbert <tom@herbertland.com> wrote:

> > +/* XDP rxhash has an associated type, which is related to the RSS
> > + * (Receive Side Scaling) standard, but NIC HW has different mappings
> > + * and support. Thus, create a mapping that is interesting for XDP.
> > + * XDP would primarily want insight into L3 and L4 protocol info.
> > + *
> > + * TODO: Likely needs extending with "L3_IPV6_EX" due to the RSS standard
> > + *
> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> > + */
> > +#define XDP_HASH(x)            ((x) & ((1ULL << 32)-1))
> > +#define XDP_HASH_TYPE(x)       ((x) >> 32)
> > +
> > +#define XDP_HASH_TYPE_L3_SHIFT 0
> > +#define XDP_HASH_TYPE_L3_BITS  3
> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> > +#define XDP_HASH_TYPE_L3(x)    ((x) & XDP_HASH_TYPE_L3_MASK)
> > +enum {
> > +       XDP_HASH_TYPE_L3_IPV4 = 1,
> > +       XDP_HASH_TYPE_L3_IPV6,
> > +};
> > +
> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> > +#define XDP_HASH_TYPE_L4_BITS  5
> > +#define XDP_HASH_TYPE_L4_MASK                                          \
> > +       (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4(x)    ((x) & XDP_HASH_TYPE_L4_MASK)
> > +enum {
> > +       _XDP_HASH_TYPE_L4_TCP = 1,
> > +       _XDP_HASH_TYPE_L4_UDP,
> > +};
> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)
> > +  
> Hi Jesper,
> 
> Why do we need these indicators for a protocol-specific hash? It seems
> like L4 vs L3 is a useful differentiation and protocol agnostic (I'm
> still holding out hope that SCTP will be deployed some day ;-) )

I'm not sure I understood the question fully, but let me try to answer
anyway.  To me it seems obvious that you would want to know the
protocol/L4 type, as this helps avoid hash collisions between UDP and
TCP flows.  I can easily imagine someone constructing a UDP packet
that could hash-collide with a given TCP flow.

And yes, the i40 supports matching SCTP, and we will create an
XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
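
To make it concrete, a minimal (untested) sketch that acts on the L4
type without ever reading packet data, using the proposed helper and
macros:

SEC("xdp")
int xdp_udp_guard(struct xdp_md *ctx)
{
	__u64 h = bpf_xdp_rxhash(ctx, 0, 0, BPF_F_RXHASH_GET);
	__u64 type = XDP_HASH_TYPE(h);

	/* drop UDP early; a crafted UDP packet could otherwise
	 * hash-collide with an established TCP flow
	 */
	if (XDP_HASH_TYPE_L4(type) == XDP_HASH_TYPE_L4_UDP)
		return XDP_DROP;
	return XDP_PASS;
}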

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-21 16:04     ` Jesper Dangaard Brouer
@ 2017-05-21 22:10       ` Tom Herbert
  2017-05-22  6:39         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Tom Herbert @ 2017-05-21 22:10 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, Linux Kernel Network Developers

On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Sat, 20 May 2017 09:16:09 -0700
> Tom Herbert <tom@herbertland.com> wrote:
>
>> > +/* XDP rxhash has an associated type, which is related to the RSS
>> > + * (Receive Side Scaling) standard, but NIC HW has different mappings
>> > + * and support. Thus, create a mapping that is interesting for XDP.
>> > + * XDP would primarily want insight into L3 and L4 protocol info.
>> > + *
>> > + * TODO: Likely needs extending with "L3_IPV6_EX" due to the RSS standard
>> > + *
>> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
>> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
>> > + */
>> > +#define XDP_HASH(x)            ((x) & ((1ULL << 32)-1))
>> > +#define XDP_HASH_TYPE(x)       ((x) >> 32)
>> > +
>> > +#define XDP_HASH_TYPE_L3_SHIFT 0
>> > +#define XDP_HASH_TYPE_L3_BITS  3
>> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
>> > +#define XDP_HASH_TYPE_L3(x)    ((x) & XDP_HASH_TYPE_L3_MASK)
>> > +enum {
>> > +       XDP_HASH_TYPE_L3_IPV4 = 1,
>> > +       XDP_HASH_TYPE_L3_IPV6,
>> > +};
>> > +
>> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
>> > +#define XDP_HASH_TYPE_L4_BITS  5
>> > +#define XDP_HASH_TYPE_L4_MASK                                          \
>> > +       (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
>> > +#define XDP_HASH_TYPE_L4(x)    ((x) & XDP_HASH_TYPE_L4_MASK)
>> > +enum {
>> > +       _XDP_HASH_TYPE_L4_TCP = 1,
>> > +       _XDP_HASH_TYPE_L4_UDP,
>> > +};
>> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
>> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)
>> > +
>> Hi Jesper,
>>
>> Why do we need these indicators for a protocol-specific hash? It seems
>> like L4 vs L3 is a useful differentiation and protocol agnostic (I'm
>> still holding out hope that SCTP will be deployed some day ;-) )
>
> I'm not sure I understood the question fully, but let me try to answer
> anyway.  To me it seems obvious that you would want to know the
> protocol/L4 type, as this helps avoid hash collisions between UDP and
> TCP flows.  I can easily imagine someone constructing a UDP packet
> that could hash-collide with a given TCP flow.
>
> And yes, the i40 supports matching SCTP, and we will create an
> XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
>
But where would this information be used? We don't save it in the
skbuff, and don't use it in RPS or RFS. RSS doesn't use it for packet
steering, so the hash collision problem already exists at the device
level. If there is a collision problem between two protocols then
maybe the hash should be over the 5-tuple instead...

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-21 15:55     ` Jesper Dangaard Brouer
@ 2017-05-22  3:21       ` Alexei Starovoitov
  2017-05-22  4:12         ` John Fastabend
  0 siblings, 1 reply; 31+ messages in thread
From: Alexei Starovoitov @ 2017-05-22  3:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Daniel Borkmann, netdev

On Sun, May 21, 2017 at 05:55:50PM +0200, Jesper Dangaard Brouer wrote:
> > And it looks useful to me, but
> 
> > 1. i'm worried that we'd be relying on something that mellanox didn't
> >  implement in their drivers before. Was it tested and guaranteed to
> >  exist in future revisions of the firmware? Is it a cx4, cx4-lx or cx5
> >  feature?
> 
> It is not a hidden or mlx5-specific feature.  Due to the Microsoft RSS
> standard/requirements[2] most NICs actually implement this.
> 
> [2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types

...

> > 2. but the main concern is that it is a mellanox-only feature. At least
> > I cannot see anything like this in broadcom and intel nics.
> 
> All the drivers I looked at have support for an RSS hash type,
> including Broadcom[3] and Intel. Just grep for NETIF_F_RXHASH and
> follow the data structs.  The Intel i40 NIC has the most elaborate rss
> type system (it can e.g. tell if this was SCTP).
> 
> [3] http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/broadcom/bnx2x/bnx2x_hsi.h#L4198

yes and bnxt too.
The msft spec requires RSS to be configured in these different ways, but
it doesn't mean that the HW descriptor will have 'is_v4' and 'is_v6' bits set.
imo this is mlx-specific behavior.
If you want to piggy-back on the msft spec and make linux rss configurable
the same way, I guess that's fine, but imo it's orthogonal to xdp.

> > How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
> > We can make sure that XDP program does read only access into it and
> > it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
> > in there, but this will not be uapi and it will be pretty obvious
> > to program authors that their programs are vendor specific.
> 
> This sounds EXTREMELY dangerous to me... IMHO this will lead to vendor
> lock-in.  As BPF program authors become dependent on vendor-specific
> features, their programs are no longer portable to run on
> other NICs.
> 
> How are you going to avoid vendor lock-in with this model?

It looked to me like that was the intent of your patch set, hence the
counter proposal to make it much simpler.
I'm not going to use vendor-specific features. The proposal
to expose the hw rx descriptor as-is is for people who desperately want
that info without burdening core xdp with it.

> > 'not uapi' here means that mellanox is free to change their HW descriptor
> > and its contents as they wish.
> 
> Hmmm... IMHO directly exposing the HW descriptor to userspace will
> limit vendors' ability to change its contents.

kprobes can already look at the hw rx descriptor.
if somebody really wants to look into it, they have a way to do it already:
- add a kprobe to mlx5e_handle_rx_cqe(), look into the cqe, store the outcome on the side
- use that info in the xdp program
All I proposed is to make it a first-class citizen and avoid the kprobe.
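
Roughly like this (untested sketch in samples/bpf style; the
mlx5e_handle_rx_cqe() signature and cqe field names are from the
current mlx5 code and may of course change):

struct bpf_map_def SEC("maps") cqe_hash_map = {
	.type        = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size    = sizeof(u32),
	.value_size  = sizeof(u64),
	.max_entries = 1,
};

SEC("kprobe/mlx5e_handle_rx_cqe")
int kprobe_rx_cqe(struct pt_regs *regs)
{
	/* arg2 of mlx5e_handle_rx_cqe(rq, cqe) is the cqe */
	struct mlx5_cqe64 *cqe = (void *)PT_REGS_PARM2(regs);
	u32 key = 0, hash = 0;
	u8 type = 0;
	u64 *val = bpf_map_lookup_elem(&cqe_hash_map, &key);

	if (!val)
		return 0;
	bpf_probe_read(&hash, sizeof(hash), &cqe->rss_hash_result);
	bpf_probe_read(&type, sizeof(type), &cqe->rss_hash_type);
	*val = (u64)type << 32 | hash;	/* read later by the XDP prog */
	return 0;
}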

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-22  3:21       ` Alexei Starovoitov
@ 2017-05-22  4:12         ` John Fastabend
  0 siblings, 0 replies; 31+ messages in thread
From: John Fastabend @ 2017-05-22  4:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Jesper Dangaard Brouer; +Cc: Daniel Borkmann, netdev

On 05/21/2017 08:21 PM, Alexei Starovoitov wrote:
> On Sun, May 21, 2017 at 05:55:50PM +0200, Jesper Dangaard Brouer wrote:
>>> And it looks useful to me, but
>>
>>> 1. i'm worried that we'd be relying on something that mellanox didn't
>>>  implement in their drivers before. Was it tested and guaranteed to
>>>  exist in future revisions of the firmware? Is it a cx4, cx4-lx or cx5
>>>  feature?
>>
>> It is not a hidden or mlx5-specific feature.  Due to the Microsoft RSS
>> standard/requirements[2] most NICs actually implement this.
>>
>> [2] https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types
> 
> ...
> 
>>> 2. but the main concern is that it is a mellanox-only feature. At least
>>> I cannot see anything like this in broadcom and intel nics.
>>
>> All the drivers I looked at have support for an RSS hash type,
>> including Broadcom[3] and Intel. Just grep for NETIF_F_RXHASH and
>> follow the data structs.  The Intel i40 NIC has the most elaborate rss
>> type system (it can e.g. tell if this was SCTP).
>>
>> [3] http://elixir.free-electrons.com/linux/latest/source/drivers/net/ethernet/broadcom/bnx2x/bnx2x_hsi.h#L4198
> 
> yes and bnxt too.
> The msft spec requires RSS to be configured in these different ways, but
> it doesn't mean that the HW descriptor will have 'is_v4' and 'is_v6' bits set.
> imo this is mlx-specific behavior.
> If you want to piggy-back on the msft spec and make linux rss configurable
> the same way, I guess that's fine, but imo it's orthogonal to xdp.
> 
>>> How about exposing 'struct mlx5_cqe64 *' to XDP programs as-is?
>>> We can make sure that XDP program does read only access into it and
>>> it will see cqe->rss_hash_result, cqe->rss_hash_type and everything else
>>> in there, but this will not be uapi and it will be pretty obvious
>>> to program authors that their programs are vendor specific.
>>
>> This sounds EXTREMELY dangerous to me... IMHO this will lead to vendor
>> lock-in.  As BPF program authors become dependent on vendor-specific
>> features, their programs are no longer portable to run on
>> other NICs.
>>
>> How are you going to avoid vendor lock-in with this model?
> 
> It looked to me like that was the intent of your patch set, hence the
> counter proposal to make it much simpler.
> I'm not going to use vendor-specific features. The proposal
> to expose the hw rx descriptor as-is is for people who desperately want
> that info without burdening core xdp with it.
> 
>>> 'not uapi' here means that mellanox is free to change their HW descriptor
>>> and its contents as they wish.
>>
>> Hmmm... IMHO directly exposing the HW descriptor to userspace will
>> limit vendors' ability to change its contents.
> 
> kprobes can already look at the hw rx descriptor.
> if somebody really wants to look into it, they have a way to do it already:
> - add a kprobe to mlx5e_handle_rx_cqe(), look into the cqe, store the outcome on the side
> - use that info in the xdp program
> All I proposed is to make it a first-class citizen and avoid the kprobe.
> 

Another solution is to have hardware prepend meta-data to the front of the
packet and have the XDP program read it out. Of course the hardware and
XDP program need to be in sync at this point, but it works today assuming
a mechanism to program the hardware exists.

The nice part of the above is that you push all the complexity of feature
negotiation and hardware initialization out of the XDP core completely.

This would be my preferred solution, except I'm not sure if some hardware
would have issues with this.
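
E.g. something like this on the program side (the hw_md layout here is
purely hypothetical, whatever the hardware was programmed to prepend):

struct hw_md {
	__u32 rxhash;
	__u32 hash_type;
};

SEC("xdp")
int xdp_hw_md(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct hw_md *md = data;

	if ((void *)(md + 1) > data_end)
		return XDP_ABORTED;
	/* ... use md->rxhash / md->hash_type ... */

	/* pop the metadata, so the real packet starts at ctx->data */
	if (bpf_xdp_adjust_head(ctx, (int)sizeof(*md)))
		return XDP_ABORTED;
	return XDP_PASS;
}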

.John

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-21 22:10       ` Tom Herbert
@ 2017-05-22  6:39         ` Jesper Dangaard Brouer
  2017-05-22 20:42           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-22  6:39 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Daniel Borkmann, Alexei Starovoitov,
	Linux Kernel Network Developers, brouer

On Sun, 21 May 2017 15:10:29 -0700
Tom Herbert <tom@herbertland.com> wrote:

> On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > On Sat, 20 May 2017 09:16:09 -0700
> > Tom Herbert <tom@herbertland.com> wrote:
> >  
> >> > +/* XDP rxhash has an associated type, which is related to the RSS
> >> > + * (Receive Side Scaling) standard, but NIC HW has different mappings
> >> > + * and support. Thus, create a mapping that is interesting for XDP.
> >> > + * XDP would primarily want insight into L3 and L4 protocol info.
> >> > + *
> >> > + * TODO: Likely needs extending with "L3_IPV6_EX" due to the RSS standard
> >> > + *
> >> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> >> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> >> > + */
> >> > +#define XDP_HASH(x)            ((x) & ((1ULL << 32)-1))
> >> > +#define XDP_HASH_TYPE(x)       ((x) >> 32)
> >> > +
> >> > +#define XDP_HASH_TYPE_L3_SHIFT 0
> >> > +#define XDP_HASH_TYPE_L3_BITS  3
> >> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> >> > +#define XDP_HASH_TYPE_L3(x)    ((x) & XDP_HASH_TYPE_L3_MASK)
> >> > +enum {
> >> > +       XDP_HASH_TYPE_L3_IPV4 = 1,
> >> > +       XDP_HASH_TYPE_L3_IPV6,
> >> > +};
> >> > +
> >> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> >> > +#define XDP_HASH_TYPE_L4_BITS  5
> >> > +#define XDP_HASH_TYPE_L4_MASK                                          \
> >> > +       (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> >> > +#define XDP_HASH_TYPE_L4(x)    ((x) & XDP_HASH_TYPE_L4_MASK)
> >> > +enum {
> >> > +       _XDP_HASH_TYPE_L4_TCP = 1,
> >> > +       _XDP_HASH_TYPE_L4_UDP,
> >> > +};
> >> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
> >> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)
> >> > +  
> >> Hi Jesper,
> >>
> >> Why do we need these indicators for protocol specific hash? It seems
> >> like L4 and L3 is useful differentiation and protocol agnostic (I'm
> >> still holding out hope that SCTP will be deployed some day ;-) )  
> >
> > I'm not sure I understood the question fully, but let me try to answer
> > anyway.  To me it seems obvious that you would want to know the
> > protocol/L4 type, as this helps avoid hash collisions between UDP and
> > TCP flows.  I can easily imagine someone constructing a UDP packet
> > that could hash collide with a given TCP flow.
> >
> > And yes, i40e supports matching SCTP, and we will create a
> > XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
> >  
> But where would this information be used? We don't save it in skbuff,
> don't use it in RPS, RFS. RSS doesn't use it for packet steering so
> the hash collision problem already exists at the device level. If
> there is a collision problem between two protocols then maybe hash
> should be over 5-tuple instead...

One use-case (I heard at a customer) was that they had (web-)servers
that didn't serve any UDP traffic, thus they simply block/drop all
incoming UDP on the service NIC (as an ACL in the switch). (The servers'
own DNS lookups and NTP go through the management NIC to internal
DNS/NTP servers).

Another use-case: Inside an XDP/bpf program it can be used for
splitting protocol processing into different tail calls, before even
touching packet-data.  I can imagine the bpf TCP handling code is
larger, thus an optimization is to have a separate tail call for the
UDP protocol handling.  One could also transfer/queue all TCP traffic
to other CPU(s) like RPS, just without touching packet memory.
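
A sketch of that split (bpf_xdp_rxhash() is a hypothetical name for
the helper this RFC series would expose; the XDP_HASH_TYPE_* macros
are the ones proposed above):

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Hypothetical helper wrapper; patch 4/5 would add the real
 * declaration to samples/bpf/bpf_helpers.h.  Name, signature and the
 * BPF_FUNC_* value are assumptions for illustration.
 */
static __u64 (*bpf_xdp_rxhash)(void *ctx) =
	(void *) BPF_FUNC_xdp_rxhash;

struct bpf_map_def SEC("maps") l4_progs = {
	.type        = BPF_MAP_TYPE_PROG_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u32),
	.max_entries = 4,
};

SEC("xdp")
int xdp_l4_demux(struct xdp_md *ctx)
{
	__u64 h = bpf_xdp_rxhash(ctx);
	__u64 t = XDP_HASH_TYPE(h);	/* type lives in the top 32 bits */

	if (XDP_HASH_TYPE_L4(t) == XDP_HASH_TYPE_L4_UDP)
		return XDP_DROP;	/* e.g. a server that never serves UDP */
	if (XDP_HASH_TYPE_L4(t) == XDP_HASH_TYPE_L4_TCP)
		bpf_tail_call(ctx, &l4_progs, 0);	/* TCP handler slot */

	return XDP_PASS;	/* note: no packet data touched so far */
}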


This info is saved in the skb, but due to space constraints it is
reduced to a single bit, namely skb->l4_hash, set iff some
RSS-proto/XDP_HASH_TYPE_L4_* bit was set.  And the network stack does
use and react to this.
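
In the kernel this reduction is roughly what a driver RX path does via
skb_set_hash(); a sketch along the lines of the mlx5 code that patch
2/5 touches (cqe and skb come from the driver's RX loop):

	u8 ht = cqe->rss_hash_type;

	/* only the L4-ness of the HW hash type survives into the skb */
	skb_set_hash(skb, be32_to_cpu(cqe->rss_hash_result),
		     (ht & CQE_RSS_HTYPE_L4) ? PKT_HASH_TYPE_L4 :
		     (ht & CQE_RSS_HTYPE_IP) ? PKT_HASH_TYPE_L3 :
					       PKT_HASH_TYPE_NONE);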

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-21  0:58       ` Daniel Borkmann
@ 2017-05-22 14:49         ` Jesper Dangaard Brouer
  2017-05-22 17:07           ` Daniel Borkmann
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-22 14:49 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev, brouer

On Sun, 21 May 2017 02:58:19 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 05/20/2017 09:53 AM, Jesper Dangaard Brouer wrote:
> > On Fri, 19 May 2017 19:13:29 +0200
> > Daniel Borkmann <daniel@iogearbox.net> wrote:
> >  
> >> On 05/18/2017 05:41 PM, Jesper Dangaard Brouer wrote:  
> >>> There is a fundamental difference between normal eBPF programs
> >>> and (XDP) eBPF programs getting attached in a driver. For normal
> >>> eBPF programs it is easy to add a new bpf feature, like a bpf
> >>> helper, because it is strongly tied to the feature being
> >>> available in the current core kernel code.  When drivers invoke a
> >>> bpf_prog, then it is not sufficient to simply rely on whether
> >>> a bpf_helper exists or not.  When a driver hasn't implemented a
> >>> given feature yet, then it is possible to expose uninitialized
> >>> parts of xdp_buff.  The driver passes in a pointer to xdp_buff,
> >>> usually "allocated" on the stack, which must not be exposed.
> >>
> >> When xdp_buff is being extended, then we should at least zero
> >> initialize all in-tree users that don't support or populate this
> >> field, thus that it's not uninitialized memory. Better would be
> >> to have a way to reject the prog in the first place until it's
> >> implemented (but further comments on feature bits below).  
> >
> > Going down a path where we need to zero out the xdp_buff looks a lot
> > like the sk_buff zeroing, which is the top perf cost associated with
> > SKBs, see [1].  XDP is about not repeating the same issue we had with
> > the SKB...
> 
> But if we agree on implementing a certain feature that requires to
> add a new member, then basic requirement should be that it needs to
> be implementable by all XDP enabled drivers, so that users eventually
> don't need to worry which driver they run to access a XDP feature.
> In that case, zeroing that member until it gets properly implemented
> is just temporary (and could be an incentive perhaps).
> 
> > [1] https://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html#analyzing-build-skb-and-memset
> >  
> >>> Only two user visible NETIF_F_XDP_* net_device feature flags are
> >>> exposed via ethtool (-k) seen as "xdp" and "xdp-partial".
> >>> The "xdp-partial" is detected when there is not feature equality
> >>> between kernel and driver, and a netdev_warn is given.  
> >>
> >> I think having something like a NETIF_F_XDP_BIT for ethtool to
> >> indicate support as "xdp" is quite useful. Avoids having to grep
> >> the kernel tree for ndo_xdp callback. ;) A "xdp-partial" would
> >> still be unclear/confusing to the user whether his program loads
> >> or doesn't which is the only thing a user (or some loading infra)
> >> cares about eventually, so one still needs to go trying to load
> >> the XDP code to see whether that fails for the native case.  
> >
> > Good that we agree on usefulness of the NETIF_F_XDP_BIT.  The
> > "xdp-partial" or "xdp-challenged" is an early indication to the user
> > that they should complain to the vendor.  I tried to keep it simple
> > towards the user. Do you think every feature bit should be exposed to
> > userspace?  
> 
> That would potentially require us to go down that path and expose
> feature bits for everything, even something as silly as new flags
> for a specific helper that requires some sort of additional support.
> We probably rather want to keep such thing in the kernel for now
> and potentially reject loads instead.
> 
> >>> The idea is that XDP_DRV_* feature bits define a contract between
> >>> the driver and the kernel, giving a reliable way to know that XDP
> >>> features a driver promised to implement. Thus, knowing what bpf
> >>> side features are safe to allow.
> >>>
> >>> There are 3 levels of features: "required", "devel" and "optional".
> >>>
> >>> The motivation is pushing driver vendors forward to support all
> >>> the new XDP features.  Once a given feature bit is moved into
> >>> the "required" features, the kernel will reject loading XDP
> >>> program if feature isn't implemented by driver.  Features under
> >>> development, require help from the bpf infrastructure to detect
> >>> when a given helper or direct-access is used, using a bpf_prog
> >>> bit to mark a need for the feature, and pulling in this bit in
> >>> the xdp_features_check().  When all drivers have implemented
> >>> a "devel" feature, it can be moved to the "required" feature and  
> >>
> >> The problem is that once you add bits markers to bpf_prog like we
> >> used to do in the past, then as you do in patch 4/5 with the
> >> xdp_rxhash_needed bit, they will need to be turned /on/ unconditionally
> >> when a prog has tail calls.  
> >
> > Yes, with tail calls, we have to enable all features.  But that is a
> > good thing, as it forces vendors to quickly implement all features.
> > And it is no different from moving a feature into the "required" bits,
> > once all drivers implement it.  It is only a limitation for tail calls,
> > and something we can fix later (for handling this for tail calls).  
> 
> But the issue I pointed out with this approach is that XDP programs
> using tail calls will suddenly break and get rejected from one
> version to another.
> 
> That's basically breaking the contract we have today where it must
> be guaranteed that older programs keep working on newer kernels,
> exactly the same with uapi. Breaking user programs is a no-go,
> not a good thing at all (it will hurt users and they move on with
> something else). So it's not something we can 'fix' at some later
> point in time; such feature detection would need to consider this
> from day 1.

This is a very good point Daniel, that older programs should be able to
run.  But this will only happen when tail calls are in use and NIC
vendors don't keep their drivers up-to-date.  That creates a strong
incentive for drivers to implement all XDP features (which you have
argued for).  Still, I see your point.


> > BPF have some nice features of evaluating the input program
> > "load-time", which is what I'm taking advantage of as an optimization
> > here (let use this nice bpf property).   It is only tail calls that
> > cannot evaluate this "load-time".  Thus, if you care about tail calls,
> > supporting intermediate features, we could later fix that by adding a
> > runtime feature check in the case of tail calls.  


How do we move forward from here?

Let me try to restate the problem: There is a general disconnect
between bpf and what XDP drivers support.

The bpf model (AFAIK) is that your bpf program will get rejected when
loading your bpf_prog.  This happens if you are using a bpf_helper or
direct-ctx-access which is not supported/avail in the kernel. Right??
(This also happens for tail-call programs, right?)  (I do like this
model of failing the program when the feature is not avail).

I guess, new bpf features are discovered by looking at uapi/bpf.h and
then trying to use the feature and seeing if your program loads.

The disconnect is that when writing XDP-bpf programs, you will not get
any feedback when you are calling a bpf-helper that the XDP driver
doesn't support.  The XDP bpf-program loads and can happily run,
calling core-kernel XDP helpers.  Now, how will the XDP program author
know that his call to the helper is valid?

For rxhash, we could init the fields to zero (in drivers not
supporting it), but how will the XDP bpf_prog author know whether or
not a NIC driver supports XDP-rxhash?  (I guess, he could write an XDP
program that sets the XDP software hash, and then add a match via a TC
meta match or another cls_bpf program that reads the skb->hash value,
to see if changing the rxhash works... it seems clumsy)

For xdp_adjust_head, it was even worse, as a boundary check against
data_end with e.g. zero is no boundary, and would allow unbounded
memory access from the xdp.data_end pointer.  We actually have
released kernels with this bug, which can be triggered with a bpf
tail-call.
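
A sketch of the failure mode, assuming a driver that attaches XDP but
never initializes xdp_buff.data_hard_start:

	/* grow headroom; the helper's bounds check compares against an
	 * uninitialized data_hard_start, so it may wrongly succeed
	 */
	if (bpf_xdp_adjust_head(ctx, -64))
		return XDP_DROP;

	void *data = (void *)(long)ctx->data;	/* may point outside the buffer */
	void *data_end = (void *)(long)ctx->data_end;

	if (data + 64 > data_end)	/* satisfies the verifier... */
		return XDP_DROP;
	/* ...but bounds nothing real: reads below are unbounded */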


I just did a quick check on linux-stable.git for v4.10.17 and it looks
like commit c2002f983767 ("bpf: fix checking xdp_adjust_head on tail
calls") is missing, meaning drivers nfp, mlx5 and virtio_net are
likely subject to the adjust_head unbounded memory access bug via tail
calls. I really wish we had a reliable way of avoiding these kinds of
bpf vs. XDP-driver mismatches.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

p.s. Remember, to verify if a driver supports adjust_head, grep for
"data_hard_start" in the driver code... which is also quite a strange
thing for users to have to know.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-22 14:49         ` Jesper Dangaard Brouer
@ 2017-05-22 17:07           ` Daniel Borkmann
  2017-05-30  9:58             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 31+ messages in thread
From: Daniel Borkmann @ 2017-05-22 17:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev

On 05/22/2017 04:49 PM, Jesper Dangaard Brouer wrote:
> On Sun, 21 May 2017 02:58:19 +0200
> Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 05/20/2017 09:53 AM, Jesper Dangaard Brouer wrote:
>>> On Fri, 19 May 2017 19:13:29 +0200
>>> Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 05/18/2017 05:41 PM, Jesper Dangaard Brouer wrote:
>>>>> There is a fundamental difference between normal eBPF programs
>>>>> and (XDP) eBPF programs getting attached in a driver. For normal
>>>>> eBPF programs it is easy to add a new bpf feature, like a bpf
>>>>> helper, because it is strongly tied to the feature being
>>>>> available in the current core kernel code.  When drivers invoke a
>>>>> bpf_prog, then it is not sufficient to simply rely on whether
>>>>> a bpf_helper exists or not.  When a driver hasn't implemented a
>>>>> given feature yet, then it is possible to expose uninitialized
>>>>> parts of xdp_buff.  The driver passes in a pointer to xdp_buff,
>>>>> usually "allocated" on the stack, which must not be exposed.
>>>>
>>>> When xdp_buff is being extended, then we should at least zero
>>>> initialize all in-tree users that don't support or populate this
>>>> field, thus that it's not uninitialized memory. Better would be
>>>> to have a way to reject the prog in the first place until it's
>>>> implemented (but further comments on feature bits below).
>>>
>>> Going down a path where we need to zero out the xdp_buff looks a lot
>>> like the sk_buff zeroing, which is the top perf cost associated with
>>> SKBs, see [1].  XDP is about not repeating the same issue we had with
>>> the SKB...
>>
>> But if we agree on implementing a certain feature that requires to
>> add a new member, then basic requirement should be that it needs to
>> be implementable by all XDP enabled drivers, so that users eventually
>> don't need to worry which driver they run to access a XDP feature.
>> In that case, zeroing that member until it gets properly implemented
>> is just temporary (and could be an incentive perhaps).
>>
>>> [1] https://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html#analyzing-build-skb-and-memset
>>>
>>>>> Only two user visible NETIF_F_XDP_* net_device feature flags are
>>>>> exposed via ethtool (-k) seen as "xdp" and "xdp-partial".
>>>>> The "xdp-partial" is detected when there is not feature equality
>>>>> between kernel and driver, and a netdev_warn is given.
>>>>
>>>> I think having something like a NETIF_F_XDP_BIT for ethtool to
>>>> indicate support as "xdp" is quite useful. Avoids having to grep
>>>> the kernel tree for ndo_xdp callback. ;) A "xdp-partial" would
>>>> still be unclear/confusing to the user whether his program loads
>>>> or doesn't which is the only thing a user (or some loading infra)
>>>> cares about eventually, so one still needs to go trying to load
>>>> the XDP code to see whether that fails for the native case.
>>>
>>> Good that we agree on usefulness of the NETIF_F_XDP_BIT.  The
>>> "xdp-partial" or "xdp-challenged" is an early indication to the user
>>> that they should complain to the vendor.  I tried to keep it simple
>>> towards the user. Do you think every feature bit should be exposed to
>>> userspace?
>>
>> That would potentially require us to go down that path and expose
>> feature bits for everything, even something as silly as new flags
>> for a specific helper that requires some sort of additional support.
>> We probably rather want to keep such thing in the kernel for now
>> and potentially reject loads instead.
>>
>>>>> The idea is that XDP_DRV_* feature bits define a contract between
>>>>> the driver and the kernel, giving a reliable way to know that XDP
>>>>> features a driver promised to implement. Thus, knowing what bpf
>>>>> side features are safe to allow.
>>>>>
>>>>> There are 3 levels of features: "required", "devel" and "optional".
>>>>>
>>>>> The motivation is pushing driver vendors forward to support all
>>>>> the new XDP features.  Once a given feature bit is moved into
>>>>> the "required" features, the kernel will reject loading XDP
>>>>> program if feature isn't implemented by driver.  Features under
>>>>> development, require help from the bpf infrastructure to detect
>>>>> when a given helper or direct-access is used, using a bpf_prog
>>>>> bit to mark a need for the feature, and pulling in this bit in
>>>>> the xdp_features_check().  When all drivers have implemented
>>>>> a "devel" feature, it can be moved to the "required" feature and
>>>>
>>>> The problem is that once you add bits markers to bpf_prog like we
>>>> used to do in the past, then as you do in patch 4/5 with the
>>>> xdp_rxhash_needed bit, they will need to be turned /on/ unconditionally
>>>> when a prog has tail calls.
>>>
>>> Yes, with tail calls, we have to enable all features.  But that is a
>>> good thing, as it forces vendors to quickly implement all features.
>>> And it is no different from moving a feature into the "required" bits,
>>> once all drivers implement it.  It is only a limitation for tail calls,
>>> and something we can fix later (for handling this for tail calls).
>>
>> But the issue I pointed out with this approach is that XDP programs
>> using tail calls will suddenly break and get rejected from one
>> version to another.
>>
>> That's basically breaking the contract we have today where it must
>> be guaranteed that older programs keep working on newer kernels,
>> exactly the same with uapi. Breaking user programs is a no-go,
>> not a good thing at all (it will hurt users and they move on with
>> something else). So it's not something we can 'fix' at some later
>> point in time; such feature detection would need to consider this
>> from day 1.
>
> This is a very good point Daniel, that older programs should be able to
> run.  But this will only happen when tail calls are in use and NIC
> vendors don't keep their drivers up-to-date.  That creates a strong
> incentive for drivers to implement all XDP features (which you have
> argued for).  Still, I see your point.

The issue is that such a chaining model as presented in [1] might be
quite typical in deployments.

Another strong argument for using tail calls is that some drivers
still don't support atomic updates of BPF programs. Usually a config
reload is needed when transitioning from no XDP -> XDP and XDP -> no XDP
mode, respectively, but not from XDP -> XDP. Last time I checked, however,
there are drivers like qede that would always do a reconfig. So for
that case, having an init prog that then just tail calls into a main
prog would help as a work-around until the driver finally supports
this (see the sketch below).
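
A minimal sketch of such an init/root prog, assuming a one-slot prog
array named jmp_table: the root prog stays attached to the device,
while the real program in slot 0 can be swapped atomically via a map
update:

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") jmp_table = {
	.type        = BPF_MAP_TYPE_PROG_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u32),
	.max_entries = 1,
};

SEC("xdp")
int xdp_root(struct xdp_md *ctx)
{
	bpf_tail_call(ctx, &jmp_table, 0);	/* jump to the real prog */
	return XDP_PASS;	/* slot empty: let traffic pass */
}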

All in all, I'm trying to say that tail calls are not a niche function,
and we shouldn't have such a window where, when a feature update doesn't
make it in time, things not using the feature break.

   [1] slide 11, http://netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf

>>> BPF have some nice features of evaluating the input program
>>> "load-time", which is what I'm taking advantage of as an optimization
>>> here (let use this nice bpf property).   It is only tail calls that
>>> cannot evaluate this "load-time".  Thus, if you care about tail calls,
>>> supporting intermediate features, we could later fix that by adding a
>>> runtime feature check in the case of tail calls.
>
> How do we move forward from here?

If we introduce such feature bits one day, one possibility I see
that more or less could work is to propagate this into tail call
maps in a similar way to what we do today with bpf_prog_array_compatible().
The latter checks the prog type and whether a prog was jited, as both
really cannot be mixed among each other.

So, once you load the program, either we somehow need to tell the
verifier upfront about our requirements, or the verifier detects
them based on the programs that are loaded (not sure about this
one though), and besides prog, it needs to mark the tail call map
with the same mask, such that any programs added later on to this
tail call map are guaranteed to use the same set of features or
just a subset of them. This also means that later updates cannot
use features outside of this set anymore (even if the driver could
support it). Then, the 'root' program which is to be attached to
the device can check against the driver's capabilities eventually,
since any program reachable from the root program would be guaranteed
to not use features outside of this set.

Still big question-mark wrt exposing these feature bits, and how
the set would be determined eventually, e.g. the loader would somehow
need to transparently calc the superset of features based on all
progs residing in the object file, and then pass them into the kernel
on prog load, where verifier checks them against the prog and if the
prog makes use of the same set or a subset, then we mark it and related
maps with the superset.

Just a quick thought, probably we can come up with something better.
The ugly bit is that we would expose bits for each silly thing and
need to determine them in the loader, which gets messy very quickly,
we should try hard to avoid that in the first place.

I guess if we add something similar to xdp_adjust_head() in the
future, we might just add a uapi-hidden dev pointer, which we then
access from the kernel's struct xdp_buff representation inside the
helper, and bail out with an error should the device not support it.
And once all devices support it, we'd remove it again, so that we
wouldn't need such huge complexity with feature bits.

> Let me try to restate the problem: There is a general disconnect
> between bpf and what XDP drivers support.
>
> The bpf model (AFAIK) is that your bpf program will get rejected when
> loading your bpf_prog.  This happens if you are using a bpf_helper or
> direct-ctx-access which is not supported/avail in the kernel. Right??
> (This also happens for tail-call programs, right?)  (I do like this
> model of failing the program when the feature is not avail).
>
> I guess, new bpf features are discovered by looking at uapi/bpf.h and
> then trying to use the feature and seeing if your program loads.
>
> The disconnect is that when writing XDP-bpf programs, you will not get
> any feedback when you are calling a bpf-helper that the XDP driver
> doesn't support.  The XDP bpf-program loads and can happily run,
> calling core-kernel XDP helpers.  Now, how will the XDP program author
> know that his call to the helper is valid?
>
> For rxhash, we could init the fields to zero (in drivers not
> supporting it), but how will the XDP bpf_prog author know whether or
> not a NIC driver supports XDP-rxhash?  (I guess, he could write an XDP
> program that sets the XDP software hash, and then add a match via a TC
> meta match or another cls_bpf program that reads the skb->hash value,
> to see if changing the rxhash works... it seems clumsy)
>
> For xdp_adjust_head, it was even worse, as a boundary check against
> data_end with e.g. zero is no boundary, and would allow unbounded
> memory access from the xdp.data_end pointer.  We actually have
> released kernels with this bug, which can be triggered with a bpf
> tail-call.
>
> I just did a quick check on linux-stable.git for v4.10.17 and it looks
> like commit c2002f983767 ("bpf: fix checking xdp_adjust_head on tail
> calls") is missing, meaning drivers nfp, mlx5 and virtio_net are
> likely subject to the adjust_head unbounded memory access bug via tail
> calls. I really wish we had a reliable way of avoiding these kinds of
> bpf vs. XDP-driver mismatches.

Yeah :/, I think we might need to pass these to 4.10 stable:

   c2002f983767 ("bpf: fix checking xdp_adjust_head on tail calls")
   6b1bb01bcc5b ("bpf: fix cb access in socket filter programs on tail calls")

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE
  2017-05-19 23:38   ` David Miller
@ 2017-05-22 18:27     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-22 18:27 UTC (permalink / raw)
  To: David Miller
  Cc: borkmann, alexei.starovoitov, john.r.fastabend, netdev, brouer

On Fri, 19 May 2017 19:38:18 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Thu, 18 May 2017 17:41:38 +0200
> 
> > Masks for extracting part of the Completion Queue Entry (CQE)
> > field rss_hash_type were swapped, namely CQE_RSS_HTYPE_IP and
> > CQE_RSS_HTYPE_L4.
> > 
> > The bug resulted in setting skb->l4_hash, even though the
> > rss_hash_type indicated that hash was NOT computed over the
> > L4 (UDP or TCP) part of the packet.
> > 
> > Added comments from the datasheet, to make it more clear what
> > these masks are selecting.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>  
> 
> Please pass this along to the mlx5 developers as a standalone
> bug fix for 'net', thank you.

Done, send as separate patch.
 http://patchwork.ozlabs.org/patch/765537/

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-22  6:39         ` Jesper Dangaard Brouer
@ 2017-05-22 20:42           ` Jesper Dangaard Brouer
  2017-05-22 21:32             ` Tom Herbert
  0 siblings, 1 reply; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-22 20:42 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Daniel Borkmann, Alexei Starovoitov,
	Linux Kernel Network Developers, brouer

On Mon, 22 May 2017 08:39:35 +0200
Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> On Sun, 21 May 2017 15:10:29 -0700
> Tom Herbert <tom@herbertland.com> wrote:
> 
> > On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:  
> > > On Sat, 20 May 2017 09:16:09 -0700
> > > Tom Herbert <tom@herbertland.com> wrote:
> > >    
> > >> > +/* XDP rxhash have an associated type, which is related to the RSS
> > >> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
> > >> > + * and support. Thus, create mapping that is interesting for XDP.  XDP
> > >> > + * would primarly want insight into L3 and L4 protocol info.
> > >> > + *
> > >> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS standard
> > >> > + *
> > >> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
> > >> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
> > >> > + */
> > >> > +#define XDP_HASH(x)            ((x) & ((1ULL << 32)-1))
> > >> > +#define XDP_HASH_TYPE(x)       ((x) >> 32)
> > >> > +
> > >> > +#define XDP_HASH_TYPE_L3_SHIFT 0
> > >> > +#define XDP_HASH_TYPE_L3_BITS  3
> > >> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
> > >> > +#define XDP_HASH_TYPE_L3(x)    ((x) & XDP_HASH_TYPE_L3_MASK)
> > >> > +enum {
> > >> > +       XDP_HASH_TYPE_L3_IPV4 = 1,
> > >> > +       XDP_HASH_TYPE_L3_IPV6,
> > >> > +};
> > >> > +
> > >> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
> > >> > +#define XDP_HASH_TYPE_L4_BITS  5
> > >> > +#define XDP_HASH_TYPE_L4_MASK                                          \
> > >> > +       (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
> > >> > +#define XDP_HASH_TYPE_L4(x)    ((x) & XDP_HASH_TYPE_L4_MASK)
> > >> > +enum {
> > >> > +       _XDP_HASH_TYPE_L4_TCP = 1,
> > >> > +       _XDP_HASH_TYPE_L4_UDP,
> > >> > +};
> > >> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
> > >> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)
> > >> > +    
> > >> Hi Jesper,
> > >>
> > >> Why do we need these indicators for protocol specific hash? It seems
> > >> like L4 and L3 is useful differentiation and protocol agnostic (I'm
> > >> still holding out hope that SCTP will be deployed some day ;-) )    
> > >
> > > I'm not sure I understood the question fully, but let me try to answer
> > > anyway.  To me it seems obvious that you would want to know the
> > > protocol/L4 type, as this helps avoid hash collisions between UDP and
> > > TCP flows.  I can easily imagine someone constructing a UDP packet
> > > that could hash collide with a given TCP flow.
> > >
> > > And yes, i40e supports matching SCTP, and we will create a
> > > XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
> > >    
> > But where would this information be used? We don't save it in skbuff,
> > don't use it in RPS, RFS. RSS doesn't use it for packet steering so
> > the hash collision problem already exists at the device level. If
> > there is a collision problem between two protocols then maybe hash
> > should be over 5-tuple instead...  
> 
> One use-case (I heard at a customer) was that they had (web-)servers
> that didn't serve any UDP traffic, thus they simply block/drop all
> incoming UDP on the service NIC (as an ACL in the switch). (The servers'
> own DNS lookups and NTP go through the management NIC to internal
> DNS/NTP servers).
> 
> Another use-case: Inside an XDP/bpf program it can be used for
> splitting protocol processing into different tail calls, before even
> touching packet-data.  I can imagine the bpf TCP handling code is
> larger, thus an optimization is to have a separate tail call for the
> UDP protocol handling.  One could also transfer/queue all TCP traffic
> to other CPU(s) like RPS, just without touching packet memory.
> 
> 
> This info is saved in the skb, but due to space constraints it is
> reduced to a single bit, namely skb->l4_hash, set iff some
> RSS-proto/XDP_HASH_TYPE_L4_* bit was set.  And the network stack does
> use and react to this.
 
I also want to mention another real-customer use-case.  Some
deployments have VXLAN-based networks, but some NICs cannot understand
VXLAN and thus cannot do proper RSS rx-hashing, which results in bad
CPU scaling as all VXLAN packets get delivered to the same CPU.

Thus, I would like to implement recalculation of the RXHASH in XDP, as
that would save me from implementing yet another extension to the flow
dissector, which the kernel would have to carry forever, while this is
just a matter of NIC hashing getting improved.

With the extra L3 and L4 info, I'm assuming that XDP_HASH_TYPE_L3(x)
and XDP_HASH_TYPE_L4(x) will be zero for VXLAN, as the NIC cannot
identify this.  Thus, I can at an early stage know which packets need
to get a new rxhash.
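
A rough sketch of that early check, using the same hypothetical
bpf_xdp_rxhash() helper as in the earlier sketch (the software-rehash
step itself is left out):

	__u64 h = bpf_xdp_rxhash(ctx);

	if (!XDP_HASH_TYPE_L4(XDP_HASH_TYPE(h))) {
		/* NIC could not parse the inner flow (VXLAN, Q-in-Q):
		 * dissect the inner headers and set a software rxhash
		 */
	}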

I've seen a similar problem with Q-in-Q double tagged VLANs, failing
the RSS-hash distribution the same way...

I hope that explains what this can be used for(?)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers
  2017-05-22 20:42           ` Jesper Dangaard Brouer
@ 2017-05-22 21:32             ` Tom Herbert
  0 siblings, 0 replies; 31+ messages in thread
From: Tom Herbert @ 2017-05-22 21:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, Linux Kernel Network Developers

On Mon, May 22, 2017 at 1:42 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Mon, 22 May 2017 08:39:35 +0200
> Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
>> On Sun, 21 May 2017 15:10:29 -0700
>> Tom Herbert <tom@herbertland.com> wrote:
>>
>> > On Sun, May 21, 2017 at 9:04 AM, Jesper Dangaard Brouer
>> > <brouer@redhat.com> wrote:
>> > > On Sat, 20 May 2017 09:16:09 -0700
>> > > Tom Herbert <tom@herbertland.com> wrote:
>> > >
>> > >> > +/* XDP rxhash have an associated type, which is related to the RSS
>> > >> > + * (Receive Side Scaling) standard, but NIC HW have different mapping
>> > >> > + * and support. Thus, create mapping that is interesting for XDP.  XDP
>> > >> > + * would primarly want insight into L3 and L4 protocol info.
>> > >> > + *
>> > >> > + * TODO: Likely need to get extended with "L3_IPV6_EX" due RSS standard
>> > >> > + *
>> > >> > + * The HASH_TYPE will be returned from bpf helper as the top 32-bit of
>> > >> > + * the 64-bit rxhash (internally type stored in xdp_buff->flags).
>> > >> > + */
>> > >> > +#define XDP_HASH(x)            ((x) & ((1ULL << 32)-1))
>> > >> > +#define XDP_HASH_TYPE(x)       ((x) >> 32)
>> > >> > +
>> > >> > +#define XDP_HASH_TYPE_L3_SHIFT 0
>> > >> > +#define XDP_HASH_TYPE_L3_BITS  3
>> > >> > +#define XDP_HASH_TYPE_L3_MASK  ((1ULL << XDP_HASH_TYPE_L3_BITS)-1)
>> > >> > +#define XDP_HASH_TYPE_L3(x)    ((x) & XDP_HASH_TYPE_L3_MASK)
>> > >> > +enum {
>> > >> > +       XDP_HASH_TYPE_L3_IPV4 = 1,
>> > >> > +       XDP_HASH_TYPE_L3_IPV6,
>> > >> > +};
>> > >> > +
>> > >> > +#define XDP_HASH_TYPE_L4_SHIFT XDP_HASH_TYPE_L3_BITS
>> > >> > +#define XDP_HASH_TYPE_L4_BITS  5
>> > >> > +#define XDP_HASH_TYPE_L4_MASK                                          \
>> > >> > +       (((1ULL << XDP_HASH_TYPE_L4_BITS)-1) << XDP_HASH_TYPE_L4_SHIFT)
>> > >> > +#define XDP_HASH_TYPE_L4(x)    ((x) & XDP_HASH_TYPE_L4_MASK)
>> > >> > +enum {
>> > >> > +       _XDP_HASH_TYPE_L4_TCP = 1,
>> > >> > +       _XDP_HASH_TYPE_L4_UDP,
>> > >> > +};
>> > >> > +#define XDP_HASH_TYPE_L4_TCP (_XDP_HASH_TYPE_L4_TCP << XDP_HASH_TYPE_L4_SHIFT)
>> > >> > +#define XDP_HASH_TYPE_L4_UDP (_XDP_HASH_TYPE_L4_UDP << XDP_HASH_TYPE_L4_SHIFT)
>> > >> > +
>> > >> Hi Jesper,
>> > >>
>> > >> Why do we need these indicators for protocol specific hash? It seems
>> > >> like L4 and L3 is useful differentiation and protocol agnostic (I'm
>> > >> still holding out hope that SCTP will be deployed some day ;-) )
>> > >
>> > > I'm not sure I understood the question fully, but let me try to answer
>> > > anyway.  To me it seems obvious that you would want to know the
>> > > protocol/L4 type, as this helps avoid hash collisions between UDP and
>> > > TCP flows.  I can easily imagine someone constructing a UDP packet
>> > > that could hash collide with a given TCP flow.
>> > >
>> > > And yes, i40e supports matching SCTP, and we will create a
>> > > XDP_HASH_TYPE_L4_SCTP when adding XDP rxhash support for that driver.
>> > >
>> > But where would this information be used? We don't save it in skbuff,
>> > don't use it in RPS, RFS. RSS doesn't use it for packet steering so
>> > the hash collision problem already exists at the device level. If
>> > there is a collision problem between two protocols then maybe hash
>> > should be over 5-tuple instead...
>>
>> One use-case (I heard at a customer) was that they had (web-)servers
>> that didn't serve any UDP traffic, thus they simply block/drop all
>> incoming UDP on the service NIC (as an ACL in the switch). (The servers'
>> own DNS lookups and NTP go through the management NIC to internal
>> DNS/NTP servers).
>>
>> Another use-case: Inside an XDP/bpf program it can be used for
>> splitting protocol processing into different tail calls, before even
>> touching packet-data.  I can imagine the bpf TCP handling code is
>> larger, thus an optimization is to have a separate tail call for the
>> UDP protocol handling.  One could also transfer/queue all TCP traffic
>> to other CPU(s) like RPS, just without touching packet memory.
>>
>>
>> This info is saved in the skb, but due to space constraints it is
>> reduced to a single bit, namely skb->l4_hash, set iff some
>> RSS-proto/XDP_HASH_TYPE_L4_* bit was set.  And the network stack does
>> use and react to this.
>
> I also want to mention another real-customer use-case.  Some
> deployments have VXLAN-based networks, but some NICs cannot
> understand VXLAN and thus cannot do proper RSS rx-hashing, which results in
> bad CPU scaling as all VXLAN packets get delivered to the same CPU.
>
They need to turn on RSS for UDP I think.

> Thus, I would like to implement recalculation of the RXHASH in XDP, as
> that would save me from implementing yet another extension to the flow
> dissector, which the kernel would have to carry forever, while this is
> just a matter of NIC hashing getting improved.
>
> With the extra L3 and L4 info, I'm assuming that XDP_HASH_TYPE_L3(x)
> and XDP_HASH_TYPE_L4(x) will be zero for VXLAN, as the NIC cannot
> identify this.  Thus, I can at an early stage know which packets need
> to get a new rxhash.
>
> I've seen a similar problem with Q-in-Q double tagged VLANs, failing
> the RSS-hash distribution the same way...
>
That's fine, but I'm still not seeing how differentiating the hash
created from different protocols really helps. The most correct way
moving forward to generate a flow hash is to use the IPv6 flow label.
This eliminates the need for either the device or software to do DPI
into protocols, over extension headers, etc. In this case though we
declare a hash based on flow label as L4, but have no idea and really
don't care what protocol it's for (and in fact for something like
IPsec we can't even know).

The idea that the world is composed of just TCP and UDP is a delusion
espoused by NIC implementations; in SW we can and should do better
than that with protocol-agnostic mechanisms.

Tom

> I hope that explains what this can be used for(?)
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC net-next PATCH 3/5] net: introduce XDP driver features interface
  2017-05-22 17:07           ` Daniel Borkmann
@ 2017-05-30  9:58             ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 31+ messages in thread
From: Jesper Dangaard Brouer @ 2017-05-30  9:58 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev, brouer


On Mon, 22 May 2017 19:07:44 +0200 Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 05/22/2017 04:49 PM, Jesper Dangaard Brouer wrote:
> >
> > How do we move forward from here?  
> 
> If we introduce such feature bits one day, one possibility I see
> that more or less could work is to propagate this into tail call
> maps in a similar way like we do today with bpf_prog_array_compatible().
> Latter checks the prog type and whether a prog was jited, as both
> really cannot be mixed among each other.

I went down this path, extending bpf_prog_array_compatible() and
traversing the tail call maps when the bpf_prog gets attached in the
XDP driver.  It is only at XDP driver attachment time that the
features can be checked and locked down.

See patch below (on top of this RFC patchset).
Tested with:
 https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf
 bpf_tail_calls01_{kern,user}.c

> So, once you load the program, either we somehow need to tell the
> verifier upfront about our requirements, or the verifier detects
> them based on the programs that are loaded (not sure about this
> one though), 

IMHO "optional" features are *detected* by the bpf verifier and filter.c
rewriter.  And "required" features are defined/detected/verified at the
NIC driver level.  The user (XDP bpf programmer) does NOT supply "features".

> and besides prog, it needs to mark the tail call map
> with the same mask, such that any programs added later on to this
> tail call map are guaranteed to use the same set of features or
> just a subset of them. This also means that later updates cannot
> use features outside of this set anymore (even if the driver could
> support it). Then, the 'root' program which is to be attached to
> the device can check against the driver's capabilities eventually,
> since any program reachable from the root program would be guaranteed
> to not use features outside of this set.

I've also solved this one by storing the max features supported by the
driver, after validating that the "used" features passed the driver
check.  Thus, later runtime progs can still use all features supported
by the driver, even though they were not "used" at XDP init/attach time.
 
> Still big question-mark wrt exposing these feature bits, and how
> the set would be determined eventually, e.g. the loader would somehow
> need to transparently calc the superset of features based on all
> progs residing in the object file, and then pass them into the kernel
> on prog load, where verifier checks them against the prog and if the
> prog makes use of the same set or a subset, then we mark it and related
> maps with the superset.

I've managed to keep the bpf side of the feature bits completely
flexible and not exposed.  It is only the NIC drivers' XDP_DRV_F_*
defines that get exposed, which is what we want, to easily "see" and
enforce what drivers (must) support.  For the bpf side, the bits can be
removed and recycled once an XDP_DRV_F_XXX feature moves to the required
set XDP_DRV_FEATURES_REQUIRED.

I've also managed to keep feature validation to "setup-time", thus no
additional runtime overhead is added for tail calls.


[PATCH RFC] bpf: handle XDP features for bpf tail calls

From: Jesper Dangaard Brouer <brouer@redhat.com>

XDP is challenging the bpf infrastructure assumption that a
bpf_helper implies that a feature is available to the bpf program.
This is no longer true for XDP NIC drivers, as the feature behind the
bpf_helper needs to be implemented on a per-driver basis.

Disregarding the handling of bpf tail calls, it is fairly easy to
implement and detect feature mismatches, by leveraging the bpf verifier
and insn rewriter, which know what helpers and direct accesses a given
bpf_prog is activating, as demonstrated in previous patches.

This patch handle tail calls (BPF_MAP_TYPE_PROG_ARRAY) by traversing
the tail call prog array, and collecting the features.

When attaching the XDP bpf_prog to a given XDP driver, the prog arrays
are traversed, and used features are verified against what the XDP
driver supports.  On success the supported NIC XDP features are locked
down (for the traversed bpf prog_array maps).  Later, when adding
bpf_prog's to the map at runtime, the features are checked.

The supported feature set is locked down to the maximum features
supported by the driver, to allow adding tail calls at runtime that
need more features, but within the driver's capabilities.

When adding a tail call that itself has tail calls, the same
traversal verification is performed.  On success, these prog_array's
are also feature locked based on inheriting the supported features of
the array being inserted into.

At XDP driver attachment time, the bpf_prog has already been made
read-only.  Thus, the needed info are stored in struct bpf_array.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/bpf.h    |   26 ++++++
 include/linux/filter.h |    5 +
 kernel/bpf/core.c      |  226 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/verifier.c  |    1 
 net/core/dev.c         |   18 ++--
 5 files changed, 262 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6bb38d76faf4..842409105925 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -167,6 +167,23 @@ struct bpf_verifier_ops {
 			union bpf_attr __user *uattr);
 };
 
+/* These bpf_feature bits are 100% internal to the bpf infrastructure
+ * They mirror some of the bpf_prog bits, related to features.  Over
+ * time these features bits will get removed when the subsystem using
+ * them, like XDP, will support the feature from all call points
+ * (e.g. XDP drivers).
+ */
+struct bpf_features {
+	union {
+		struct {
+			u32	cb_access:1,
+				xdp_rxhash_needed:1;
+		};
+		u32	flags;
+	};
+};
+typedef u32	bpf_features_t;
+
 struct bpf_prog_aux {
 	atomic_t refcnt;
 	u32 used_map_cnt;
@@ -193,6 +210,15 @@ struct bpf_array {
 	 */
 	enum bpf_prog_type owner_prog_type;
 	bool owner_jited;
+
+	/* Restrict features allowed */
+	bpf_features_t features_supported;
+	bool features_locked;
+
+	/* Members needed when traversing prog_array tail calls */
+	struct list_head traversed_node;
+	bool is_traversed;
+
 	union {
 		char value[0] __aligned(8);
 		void *ptrs[0] __aligned(8);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 33a254ccd47d..113914b7ac28 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -641,6 +641,11 @@ static inline int sk_filter(struct sock *sk, struct sk_buff *skb)
 	return sk_filter_trim_cap(sk, skb, 1);
 }
 
+netdev_features_t bpf_get_xdp_features(struct bpf_prog *xdp_prog);
+bool bpf_lock_xdp_features(struct bpf_prog *prog,
+			   netdev_features_t xdp_approved_f,
+			   netdev_features_t xdp_dev_support_f);
+
 struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err);
 void bpf_prog_free(struct bpf_prog *fp);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 6f81e0f5a0fa..d3dbce365993 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1224,21 +1224,241 @@ static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
 }
 STACK_FRAME_NON_STANDARD(__bpf_prog_run); /* jump table */
 
+/* convert bpf_prog bits into bpf_features bits */
+static bpf_features_t __bpf_prog_to_features(const struct bpf_prog *prog)
+{
+	struct bpf_features f = { 0 };
+
+	if (prog->cb_access)
+		f.cb_access = 1;
+
+	if (prog->xdp_rxhash_needed)
+		f.xdp_rxhash_needed = 1;
+
+	return f.flags;
+}
+
+/* convert bpf_features bits into net_device xdp_features */
+static netdev_features_t __bpf_features_to_xdp_features(bpf_features_t f)
+{
+	netdev_features_t features = XDP_DRV_FEATURES_REQUIRED;
+	struct bpf_features bf;
+
+	bf.flags = f;
+	if (bf.xdp_rxhash_needed)
+		features |= XDP_DRV_F_RXHASH;
+
+	return features;
+}
+
+/* Extend bpf_features with extra features based on xdp_features input */
+bpf_features_t __bpf_features_extend_from_xdp_features(bpf_features_t bpf_f,
+						       netdev_features_t xdp_f)
+{
+	struct bpf_features bf;
+
+	bf.flags = bpf_f;
+	if (xdp_f & XDP_DRV_F_RXHASH)
+		bf.xdp_rxhash_needed = 1;
+
+	return bf.flags;
+}
+
+static DEFINE_MUTEX(prog_array_traversal_mutex);
+static LIST_HEAD(prog_array_traversal_q);
+
+static bpf_features_t
+__bpf_features_via_prog_array(const struct bpf_prog *top_prog,
+			      bpf_features_t features)
+{
+	struct bpf_prog_aux *aux = top_prog->aux;
+	int i;
+
+	/* First extract features from bpf_prog's in known prog_array's */
+	for (i = 0; i < aux->used_map_cnt; i++) {
+		struct bpf_map *map = aux->used_maps[i];
+		struct bpf_array *array;
+		int j;
+
+		/* Walk all prog_array's */
+		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
+			continue;
+		array = container_of(map, struct bpf_array, map);
+
+		/* Look at features in each active bpf_prog in prog_array */
+		for (j = 0; j < array->map.max_entries; j++) {
+			const struct bpf_prog *prog;
+
+			prog = array->ptrs[j];
+			if (!prog)
+				continue;
+
+			features |= __bpf_prog_to_features(prog);
+		}
+	}
+
+	/* Now recursive visit bpf_prog's for containing prog_array's */
+	for (i = 0; i < aux->used_map_cnt; i++) {
+		struct bpf_map *map = aux->used_maps[i];
+		struct bpf_array *array;
+		int j;
+
+		/* Walk all prog_array's again */
+		if (map->map_type != BPF_MAP_TYPE_PROG_ARRAY)
+			continue;
+		array = container_of(map, struct bpf_array, map);
+
+		/* Avoid traversal loops and record prog_array's */
+		if (array->is_traversed)
+			continue;
+		array->is_traversed = true;
+		list_add_tail(&array->traversed_node, &prog_array_traversal_q);
+
+		/* Recurse into bpf_prog in prog_array */
+		for (j = 0; j < array->map.max_entries; j++) {
+			const struct bpf_prog *p;
+
+			p = array->ptrs[j];
+			if (!p)
+				continue;
+
+			features |= __bpf_features_via_prog_array(p, features);
+		}
+	}
+
+	return features;
+}
+
+/* Find superset of features traversing tail call maps */
+static bpf_features_t bpf_features_via_prog_array(const struct bpf_prog *prog,
+						  bpf_features_t features)
+{
+	struct bpf_array *prog_array, *tmp;
+
+	features |= __bpf_features_via_prog_array(prog, features);
+	list_for_each_entry_safe(prog_array, tmp, &prog_array_traversal_q,
+				 traversed_node)
+	{
+		list_del(&prog_array->traversed_node);
+		prog_array->is_traversed = false;
+	}
+
+	return features;
+}
+
+netdev_features_t bpf_get_xdp_features(struct bpf_prog *prog)
+{
+	bpf_features_t bpf_features;
+
+	mutex_lock(&prog_array_traversal_mutex);
+	bpf_features = __bpf_prog_to_features(prog);
+	bpf_features = bpf_features_via_prog_array(prog, bpf_features);
+	mutex_unlock(&prog_array_traversal_mutex);
+
+	return __bpf_features_to_xdp_features(bpf_features);
+}
+
+/* Caller have checked xdp features are approved */
+bool bpf_lock_xdp_features(struct bpf_prog *prog,
+			   netdev_features_t xdp_approved_f,
+			   netdev_features_t xdp_dev_support_f)
+{
+	struct bpf_array *prog_array, *tmp;
+	netdev_features_t xdp_f_in_use;
+	bpf_features_t bpf_f_in_use;
+	bool lock_features = true;
+	bpf_features_t max;
+
+	mutex_lock(&prog_array_traversal_mutex);
+
+	/* Get and detect if bpf_features changed */
+	bpf_f_in_use  = __bpf_prog_to_features(prog);
+	bpf_f_in_use |= __bpf_features_via_prog_array(prog, bpf_f_in_use);
+	xdp_f_in_use  = __bpf_features_to_xdp_features(bpf_f_in_use);
+	if (xdp_f_in_use != xdp_approved_f)
+		lock_features = false;
+
+	/* XDP driver might support more features than in-use, allow
+	 * later added bpf_prog's to still use-these extra features
+	 * later added bpf_prog's to still use these extra features
+	max = __bpf_features_extend_from_xdp_features(bpf_f_in_use,
+						      xdp_dev_support_f);
+	list_for_each_entry_safe(prog_array, tmp, &prog_array_traversal_q,
+				 traversed_node)
+	{
+		list_del(&prog_array->traversed_node);
+		prog_array->is_traversed = false;
+		if (lock_features) {
+			/* Handle when already locked by another driver.
+			 * Find smallest common feature set (via simple AND)
+			 */
+			if (prog_array->features_locked)
+				prog_array->features_supported &= max;
+			else
+				prog_array->features_supported = max;
+			prog_array->features_locked = true;
+		}
+	}
+	mutex_unlock(&prog_array_traversal_mutex);
+	return lock_features;
+}
+
 bool bpf_prog_array_compatible(struct bpf_array *array,
 			       const struct bpf_prog *fp)
 {
+	bool compat = false;
+
+	mutex_lock(&prog_array_traversal_mutex);
+
 	if (!array->owner_prog_type) {
 		/* There's no owner yet where we could check for
 		 * compatibility.
 		 */
 		array->owner_prog_type = fp->type;
 		array->owner_jited = fp->jited;
+		array->features_locked = false;
+		array->is_traversed = false;
+		compat = true;
+		goto out;
+	}
+
+	/* Features can be locked, e.g. when XDP prog is attached to net_device */
+	if (array->features_locked)
+	{
+		bpf_features_t f = __bpf_prog_to_features(fp);
+		struct bpf_array *pa, *tmp;
+		bpf_features_t max;
 
-		return true;
+		f |= __bpf_features_via_prog_array(fp, f);
+
+		/* Detect any feature bit set, which is not supported */
+		if (f & ~(array->features_supported)) {
+			compat = false;
+			goto out;
+		}
+		/* If fp contained tail call's itself, they need to be
+		 * locked down, to this array->features_supported.
+		 */
+		max = array->features_supported;
+		list_for_each_entry_safe(pa, tmp, &prog_array_traversal_q,
+					 traversed_node)
+		{
+			list_del(&pa->traversed_node);
+			pa->is_traversed = false;
+			/* Handle when already locked by another driver */
+			if (pa->features_locked)
+				pa->features_supported &= max;
+			else
+				pa->features_supported = max;
+			pa->features_locked = true;
+		}
 	}
 
-	return array->owner_prog_type == fp->type &&
-	       array->owner_jited == fp->jited;
+	compat = (array->owner_prog_type == fp->type &&
+		  array->owner_jited == fp->jited);
+out:
+	mutex_unlock(&prog_array_traversal_mutex);
+	return compat;
 }
 
 static int bpf_check_tail_call(const struct bpf_prog *fp)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 248bc113ad18..df9d08a79ac6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3355,7 +3355,6 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
 			 * the program array.
 			 */
 			prog->cb_access = 1;
-			prog->xdp_rxhash_needed = 1;
 
 			/* mark bpf_tail_call as different opcode to avoid
 			 * conditional branch in the interpeter for every normal
diff --git a/net/core/dev.c b/net/core/dev.c
index 28082067ac00..b45e8114b84c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6855,16 +6855,6 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down)
 }
 EXPORT_SYMBOL(dev_change_proto_down);
 
-netdev_features_t bpf_get_xdp_features(struct bpf_prog *prog)
-{
-	netdev_features_t features = XDP_DRV_FEATURES_REQUIRED;
-
-	if (prog->xdp_rxhash_needed)
-		features |= XDP_DRV_F_RXHASH;
-
-	return features;
-}
-
 bool xdp_features_check(struct net_device *dev, struct bpf_prog *xdp_prog,
 			struct netlink_ext_ack *extack, u32 flags)
 {
@@ -6881,6 +6871,14 @@ bool xdp_features_check(struct net_device *dev, struct bpf_prog *xdp_prog,
 			       "Required XDP feature not supported by device");
 		return false;
 	}
+	/* Ask BPF infra to limit runtime added bpf_prog's (tail calls)
+	 * to features supported by XDP driver.
+	 */
+	if (!bpf_lock_xdp_features(xdp_prog, req_features, dev_xdp_features)) {
+		NL_SET_ERR_MSG(extack,
+			"Couldn't lock XDP features supported by device");
+		return false;
+	}
 	return true;
 }
 


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply related	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-05-30  9:58 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-18 15:41 [RFC net-next PATCH 0/5] XDP driver feature API and handling change to xdp_buff Jesper Dangaard Brouer
2017-05-18 15:41 ` [RFC net-next PATCH 1/5] samples/bpf: xdp_tx_iptunnel make use of map_data[] Jesper Dangaard Brouer
2017-05-19 15:45   ` Daniel Borkmann
2017-05-18 15:41 ` [RFC net-next PATCH 2/5] mlx5: fix bug reading rss_hash_type from CQE Jesper Dangaard Brouer
2017-05-19 15:47   ` Daniel Borkmann
2017-05-19 23:38   ` David Miller
2017-05-22 18:27     ` Jesper Dangaard Brouer
2017-05-18 15:41 ` [RFC net-next PATCH 3/5] net: introduce XDP driver features interface Jesper Dangaard Brouer
2017-05-19 17:13   ` Daniel Borkmann
2017-05-19 23:37     ` David Miller
2017-05-20  7:53     ` Jesper Dangaard Brouer
2017-05-21  0:58       ` Daniel Borkmann
2017-05-22 14:49         ` Jesper Dangaard Brouer
2017-05-22 17:07           ` Daniel Borkmann
2017-05-30  9:58             ` Jesper Dangaard Brouer
2017-05-18 15:41 ` [RFC net-next PATCH 4/5] net: new XDP feature for reading HW rxhash from drivers Jesper Dangaard Brouer
2017-05-19 11:47   ` Jesper Dangaard Brouer
2017-05-20  3:07   ` Alexei Starovoitov
2017-05-20  3:21     ` Jakub Kicinski
2017-05-20  3:34       ` Alexei Starovoitov
2017-05-20  4:13         ` Jakub Kicinski
2017-05-21 15:55     ` Jesper Dangaard Brouer
2017-05-22  3:21       ` Alexei Starovoitov
2017-05-22  4:12         ` John Fastabend
2017-05-20 16:16   ` Tom Herbert
2017-05-21 16:04     ` Jesper Dangaard Brouer
2017-05-21 22:10       ` Tom Herbert
2017-05-22  6:39         ` Jesper Dangaard Brouer
2017-05-22 20:42           ` Jesper Dangaard Brouer
2017-05-22 21:32             ` Tom Herbert
2017-05-18 15:41 ` [RFC net-next PATCH 5/5] mlx5: add XDP rxhash feature for driver mlx5 Jesper Dangaard Brouer
