netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf-next 00/12] Add support for XDP in egress path
@ 2020-01-23  1:41 David Ahern
  2020-01-23  1:41 ` [PATCH bpf-next 01/12] net: Add new XDP setup and query commands David Ahern
                   ` (11 more replies)
  0 siblings, 12 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:41 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

From: David Ahern <dahern@digitalocean.com>

This series adds support for XDP in the egress path by introducing
a new XDP attachment type, BPF_XDP_EGRESS, and adding an if_link API
for attaching the program to a netdevice and reporting the program.
The intent is to emulate the current RX path for XDP as much as
possible to maintain consistency in the 2 paths.

This set adds initial support for XDP in the egress path to the tun
driver, mostly for XDP_DROP (e.g., ACL on the ingress path to the
VM). XDP_REDIRECT will be handled later because we need to come up
with proper way to handle it in tx path. veth should be an easy
follow on for use with containers once this initial API change is in.

Some of the patches in this set were originally part of Jason
Wang's work "XDP offload with virtio-net" [1]. The work overlaps
work done by me, so we decided to consolidate the work into the
egress path first with the offload option expected to be a follow
on.

With respect to comments on RFC v2:

The kernel side implementation requires 2 UAPI attributes:
1. attach type - BPF_XDP_EGRESS, and
2. attach to a device and report such - IFLA_XDP_EGRESS.

The attach type uapi is needed as a way for the verifier to restrict
access to Rx entries in the xdp context for example and is consistent
with other programs and the concept of attach type. The if_link changes
should mirror the current XDP API for Rx (i.e., IFLA_XDP and its nested
attributes).

The expectation is that the current mode levels (skb, driver and hardware)
will also apply to the egress path. The current patch set focuses on the
driver mode with tun. Attempting to tag the EGRESS path as yet another
mode is inconsistent on a number of levels - from the current usage of
XDP_FLAGS to options passed to the verifier for restricting xdp_md
accesses. Using the API as listed above maintains consistency with all
existing code.

As for libbpf, we believe that an API that mirrors and is consistent
with the existing xdp functions for the ingress path is the right way
to go for the egress path. Meaning, the libbpf API is replicated with
_egress in the name, but the libbpf implementation shares common code
to the existing xdp to the degree possible.

v3:
- reworked the patches - splitting patch 1 from RFC v2 into 3, combining
  patch 2 from RFC v3 into the first 3, combining patches 6 and 7 from
  RFC v2 into 1 since both did a trivial rename and export. Reordered
  the patches such that kernel changes are first followed by libbpf and
  an enhancement to a sample.

- moved small xdp related helper functions from tun.c to tun.h to make
  tun_ptr_free usable from the tap code. This is needed to handle the
  case of tap builtin and tun built as a module.

- pkt_ptrs added to `struct tun_file` and passed to tun_consume_packets
  rather than declaring pkts as an array on the stack.

v2:
- New XDP attachment type: Jesper, Toke and Alexei discussed whether
  to introduce a new program type. Since this set adds a way to attach
  regular XDP program to the tx path, as per Alexei's suggestion, a
  new attachment type BPF_XDP_EGRESS is introduced.

- libbpf API changes:
  Alexei had suggested _opts() style of API extension. Considering it
  two new libbpf APIs are introduced which are equivalent to existing
  APIs. New ones can be extended easily. Please see individual patches
  for details. xdp1 sample program is modified to use new APIs.

- tun: Some patches from previous set are removed as they are
  irrelevant in this series. They will in introduced later.

[1]: https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net

David Ahern (4):
  net: Add new XDP setup and query commands
  net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  tun: set egress XDP program
  libbpf: Add egress XDP support

Jason Wang (1):
  net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core()

Prashant Bhole (7):
  net: Add BPF_XDP_EGRESS as a bpf_attach_type
  tuntap: check tun_msg_ctl type at necessary places
  tun: move shared functions to if_tun.h
  vhost_net: user tap recvmsg api to access ptr ring
  tuntap: remove usage of ptr ring in vhost_net
  tun: run XDP program in tx path
  samples/bpf: xdp1, add egress XDP support

 drivers/net/tap.c                  |  42 +++--
 drivers/net/tun.c                  | 257 ++++++++++++++++++++++-------
 drivers/vhost/net.c                |  77 ++++-----
 include/linux/if_tap.h             |   5 -
 include/linux/if_tun.h             |  57 ++++++-
 include/linux/netdevice.h          |   6 +-
 include/uapi/linux/bpf.h           |   1 +
 include/uapi/linux/if_link.h       |   1 +
 net/core/dev.c                     |  44 +++--
 net/core/filter.c                  |   8 +
 net/core/rtnetlink.c               | 114 ++++++++++++-
 samples/bpf/xdp1_user.c            |  30 +++-
 tools/include/uapi/linux/bpf.h     |   1 +
 tools/include/uapi/linux/if_link.h |   1 +
 tools/lib/bpf/libbpf.c             |   2 +
 tools/lib/bpf/libbpf.h             |   6 +
 tools/lib/bpf/libbpf.map           |   3 +
 tools/lib/bpf/netlink.c            |  52 +++++-
 18 files changed, 555 insertions(+), 152 deletions(-)

-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 01/12] net: Add new XDP setup and query commands
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
@ 2020-01-23  1:41 ` David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type David Ahern
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:41 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

From: David Ahern <dahern@digitalocean.com>

Add new netdev commands, XDP_SETUP_PROG_EGRESS and
XDP_QUERY_PROG_EGRESS.

Update dev_change_xdp_fd and dev_xdp_install to take an egress argument.
If egress bool is set, then use XDP_SETUP_PROG_EGRESS in dev_xdp_install
and XDP_QUERY_PROG_EGRESS in dev_change_xdp_fd.

Update dev_xdp_uninstall to query for XDP_QUERY_PROG_EGRESS and if a
program is installed call dev_xdp_install with the egress argument set
to true.

This enables existing infrastructure to be used for XDP programs in
the egress path.

Signed-off-by: David Ahern <dahern@digitalocean.com>
Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 include/linux/netdevice.h |  4 +++-
 net/core/dev.c            | 30 +++++++++++++++++++++---------
 net/core/rtnetlink.c      |  2 +-
 3 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5ec3537fbdb1..e9cc326086f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -865,8 +865,10 @@ enum bpf_netdev_command {
 	 */
 	XDP_SETUP_PROG,
 	XDP_SETUP_PROG_HW,
+	XDP_SETUP_PROG_EGRESS,
 	XDP_QUERY_PROG,
 	XDP_QUERY_PROG_HW,
+	XDP_QUERY_PROG_EGRESS,
 	/* BPF program for offload callbacks, invoked at program load time. */
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
@@ -3729,7 +3731,7 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
-		      int fd, u32 flags);
+		      int fd, u32 flags, bool egress);
 u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
 		    enum bpf_netdev_command cmd);
 int xdp_umem_query(struct net_device *dev, u16 queue_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index e7802a41ae7f..04cbcc930bc2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8538,7 +8538,7 @@ u32 __dev_xdp_query(struct net_device *dev, bpf_op_t bpf_op,
 
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
-			   struct bpf_prog *prog)
+			   struct bpf_prog *prog, bool egress)
 {
 	bool non_hw = !(flags & XDP_FLAGS_HW_MODE);
 	struct bpf_prog *prev_prog = NULL;
@@ -8556,7 +8556,7 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 	if (flags & XDP_FLAGS_HW_MODE)
 		xdp.command = XDP_SETUP_PROG_HW;
 	else
-		xdp.command = XDP_SETUP_PROG;
+		xdp.command = egress ? XDP_SETUP_PROG_EGRESS : XDP_SETUP_PROG;
 	xdp.extack = extack;
 	xdp.flags = flags;
 	xdp.prog = prog;
@@ -8577,7 +8577,8 @@ static void dev_xdp_uninstall(struct net_device *dev)
 	bpf_op_t ndo_bpf;
 
 	/* Remove generic XDP */
-	WARN_ON(dev_xdp_install(dev, generic_xdp_install, NULL, 0, NULL));
+	WARN_ON(dev_xdp_install(dev, generic_xdp_install, NULL, 0, NULL,
+				false));
 
 	/* Remove from the driver */
 	ndo_bpf = dev->netdev_ops->ndo_bpf;
@@ -8589,14 +8590,20 @@ static void dev_xdp_uninstall(struct net_device *dev)
 	WARN_ON(ndo_bpf(dev, &xdp));
 	if (xdp.prog_id)
 		WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags,
-					NULL));
+					NULL, false));
 
 	/* Remove HW offload */
 	memset(&xdp, 0, sizeof(xdp));
 	xdp.command = XDP_QUERY_PROG_HW;
 	if (!ndo_bpf(dev, &xdp) && xdp.prog_id)
 		WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags,
-					NULL));
+					NULL, false));
+	/* Remove egress program */
+	memset(&xdp, 0, sizeof(xdp));
+	xdp.command = XDP_QUERY_PROG_EGRESS;
+	if (!ndo_bpf(dev, &xdp) && xdp.prog_id)
+		WARN_ON(dev_xdp_install(dev, ndo_bpf, NULL, xdp.prog_flags,
+					NULL, true));
 }
 
 /**
@@ -8609,7 +8616,7 @@ static void dev_xdp_uninstall(struct net_device *dev)
  *	Set or clear a bpf program for a device
  */
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
-		      int fd, u32 flags)
+		      int fd, u32 flags, bool egress)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 	enum bpf_netdev_command query;
@@ -8621,7 +8628,11 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	ASSERT_RTNL();
 
 	offload = flags & XDP_FLAGS_HW_MODE;
-	query = offload ? XDP_QUERY_PROG_HW : XDP_QUERY_PROG;
+	if (egress)
+		query = XDP_QUERY_PROG_EGRESS;
+	else
+		query = offload ? XDP_QUERY_PROG_HW : XDP_QUERY_PROG;
+
 
 	bpf_op = bpf_chk = ops->ndo_bpf;
 	if (!bpf_op && (flags & (XDP_FLAGS_DRV_MODE | XDP_FLAGS_HW_MODE))) {
@@ -8636,7 +8647,8 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	if (fd >= 0) {
 		u32 prog_id;
 
-		if (!offload && __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG)) {
+		if (!offload && !egress &&
+		    __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG)) {
 			NL_SET_ERR_MSG(extack, "native and generic XDP can't be active at the same time");
 			return -EEXIST;
 		}
@@ -8668,7 +8680,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 			return 0;
 	}
 
-	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
+	err = dev_xdp_install(dev, bpf_op, extack, flags, prog, egress);
 	if (err < 0 && prog)
 		bpf_prog_put(prog);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 20bc406f3871..ed0c069ef187 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2801,7 +2801,7 @@ static int do_setlink(const struct sk_buff *skb,
 		if (xdp[IFLA_XDP_FD]) {
 			err = dev_change_xdp_fd(dev, extack,
 						nla_get_s32(xdp[IFLA_XDP_FD]),
-						xdp_flags);
+						xdp_flags, false);
 			if (err)
 				goto errout;
 			status |= DO_SETLINK_NOTIFY;
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
  2020-01-23  1:41 ` [PATCH bpf-next 01/12] net: Add new XDP setup and query commands David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23 11:34   ` Toke Høiland-Jørgensen
  2020-01-24  7:33   ` Martin Lau
  2020-01-23  1:42 ` [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path David Ahern
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

From: Prashant Bhole <prashantbhole.linux@gmail.com>

Add new bpf_attach_type, BPF_XDP_EGRESS, for BPF programs attached
at the XDP layer, but the egress path.

Since egress path does not have rx_queue_index and ingress_ifindex set,
update xdp_is_valid_access to block access to these entries in the xdp
context when a program is attached to egress path.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: David Ahern <dahern@digitalocean.com>
---
 include/uapi/linux/bpf.h       | 1 +
 net/core/filter.c              | 8 ++++++++
 tools/include/uapi/linux/bpf.h | 1 +
 3 files changed, 10 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 033d90a2282d..72f2a9a4621e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -209,6 +209,7 @@ enum bpf_attach_type {
 	BPF_TRACE_RAW_TP,
 	BPF_TRACE_FENTRY,
 	BPF_TRACE_FEXIT,
+	BPF_XDP_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 17de6747d9e3..a903f3a15d74 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6803,6 +6803,14 @@ static bool xdp_is_valid_access(int off, int size,
 		return false;
 	}
 
+	if (prog->expected_attach_type == BPF_XDP_EGRESS) {
+		switch (off) {
+		case offsetof(struct xdp_md, rx_queue_index):
+		case offsetof(struct xdp_md, ingress_ifindex):
+			return false;
+		}
+	}
+
 	switch (off) {
 	case offsetof(struct xdp_md, data):
 		info->reg_type = PTR_TO_PACKET;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 033d90a2282d..72f2a9a4621e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -209,6 +209,7 @@ enum bpf_attach_type {
 	BPF_TRACE_RAW_TP,
 	BPF_TRACE_FENTRY,
 	BPF_TRACE_FEXIT,
+	BPF_XDP_EGRESS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
  2020-01-23  1:41 ` [PATCH bpf-next 01/12] net: Add new XDP setup and query commands David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23 11:35   ` Toke Høiland-Jørgensen
  2020-01-23  1:42 ` [PATCH bpf-next 04/12] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() David Ahern
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

From: David Ahern <dahern@digitalocean.com>

Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
the egress counterpart to the existing rtnl_xdp_fill. The expectation
is that going forward egress path will acquire the various levels of
attach - generic, driver and hardware.

Signed-off-by: David Ahern <dahern@digitalocean.com>
Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 include/uapi/linux/if_link.h       |   1 +
 net/core/dev.c                     |   6 ++
 net/core/rtnetlink.c               | 112 ++++++++++++++++++++++++++++-
 tools/include/uapi/linux/if_link.h |   1 +
 4 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 1d69f637c5d6..c760aa54315c 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -170,6 +170,7 @@ enum {
 	IFLA_PROP_LIST,
 	IFLA_ALT_IFNAME, /* Alternative ifname */
 	IFLA_PERM_ADDRESS,
+	IFLA_XDP_EGRESS, /* nested attribute with 1 or more IFLA_XDP_ attrs */
 	__IFLA_MAX
 };
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 04cbcc930bc2..bf76dbee9d2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8664,6 +8664,12 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
 
+		if (egress && prog->expected_attach_type != BPF_XDP_EGRESS) {
+			NL_SET_ERR_MSG(extack, "XDP program in egress path must use BPF_XDP_EGRESS attach type");
+			bpf_prog_put(prog);
+			return -EINVAL;
+		}
+
 		if (!offload && bpf_prog_is_dev_bound(prog->aux)) {
 			NL_SET_ERR_MSG(extack, "using device-bound program without HW_MODE flag is not supported");
 			bpf_prog_put(prog);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ed0c069ef187..2179de9350b2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1030,7 +1030,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
-	       + rtnl_xdp_size() /* IFLA_XDP */
+	       + rtnl_xdp_size() * 2 /* IFLA_XDP and IFLA_XDP_EGRESS */
 	       + nla_total_size(4)  /* IFLA_EVENT */
 	       + nla_total_size(4)  /* IFLA_NEW_NETNSID */
 	       + nla_total_size(4)  /* IFLA_NEW_IFINDEX */
@@ -1395,6 +1395,36 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 	return 0;
 }
 
+static u32 rtnl_xdp_egress_prog_drv(struct net_device *dev)
+{
+	return __dev_xdp_query(dev, dev->netdev_ops->ndo_bpf,
+			       XDP_QUERY_PROG_EGRESS);
+}
+
+static int rtnl_xdp_egress_report(struct sk_buff *skb, struct net_device *dev,
+				  u32 *prog_id, u8 *mode, u8 tgt_mode, u32 attr,
+				  u32 (*get_prog_id)(struct net_device *dev))
+{
+	u32 curr_id;
+	int err;
+
+	curr_id = get_prog_id(dev);
+	if (!curr_id)
+		return 0;
+
+	*prog_id = curr_id;
+	err = nla_put_u32(skb, attr, curr_id);
+	if (err)
+		return err;
+
+	if (*mode != XDP_ATTACHED_NONE)
+		*mode = XDP_ATTACHED_MULTI;
+	else
+		*mode = tgt_mode;
+
+	return 0;
+}
+
 static u32 rtnl_xdp_prog_skb(struct net_device *dev)
 {
 	const struct bpf_prog *generic_xdp_prog;
@@ -1486,6 +1516,41 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
 	return err;
 }
 
+static int rtnl_xdp_egress_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	u8 mode = XDP_ATTACHED_NONE;
+	struct nlattr *xdp;
+	u32 prog_id = 0;
+	int err;
+
+	xdp = nla_nest_start_noflag(skb, IFLA_XDP_EGRESS);
+	if (!xdp)
+		return -EMSGSIZE;
+
+	err = rtnl_xdp_egress_report(skb, dev, &prog_id, &mode,
+				     XDP_ATTACHED_DRV, IFLA_XDP_DRV_PROG_ID,
+				     rtnl_xdp_egress_prog_drv);
+	if (err)
+		goto err_cancel;
+
+	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, mode);
+	if (err)
+		goto err_cancel;
+
+	if (prog_id && mode != XDP_ATTACHED_MULTI) {
+		err = nla_put_u32(skb, IFLA_XDP_PROG_ID, prog_id);
+		if (err)
+			goto err_cancel;
+	}
+
+	nla_nest_end(skb, xdp);
+	return 0;
+
+err_cancel:
+	nla_nest_cancel(skb, xdp);
+	return err;
+}
+
 static u32 rtnl_get_event(unsigned long event)
 {
 	u32 rtnl_event_type = IFLA_EVENT_NONE;
@@ -1743,6 +1808,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb,
 	if (rtnl_xdp_fill(skb, dev))
 		goto nla_put_failure;
 
+	if (rtnl_xdp_egress_fill(skb, dev))
+		goto nla_put_failure;
+
 	if (dev->rtnl_link_ops || rtnl_have_link_slave_info(dev)) {
 		if (rtnl_link_fill(skb, dev) < 0)
 			goto nla_put_failure;
@@ -1827,6 +1895,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
 	[IFLA_ALT_IFNAME]	= { .type = NLA_STRING,
 				    .len = ALTIFNAMSIZ - 1 },
 	[IFLA_PERM_ADDRESS]	= { .type = NLA_REJECT },
+	[IFLA_XDP_EGRESS]	= { .type = NLA_NESTED },
 };
 
 static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
@@ -2808,6 +2877,47 @@ static int do_setlink(const struct sk_buff *skb,
 		}
 	}
 
+	if (tb[IFLA_XDP_EGRESS]) {
+		struct nlattr *xdp[IFLA_XDP_MAX + 1];
+		u32 xdp_flags = 0;
+
+		err = nla_parse_nested_deprecated(xdp, IFLA_XDP_MAX,
+						  tb[IFLA_XDP_EGRESS],
+						  ifla_xdp_policy, NULL);
+		if (err < 0)
+			goto errout;
+
+		if (xdp[IFLA_XDP_ATTACHED] || xdp[IFLA_XDP_PROG_ID]) {
+			err = -EINVAL;
+			goto errout;
+		}
+
+		if (xdp[IFLA_XDP_FLAGS]) {
+			xdp_flags = nla_get_u32(xdp[IFLA_XDP_FLAGS]);
+			if (xdp_flags & XDP_FLAGS_HW_MODE) {
+				err = -EINVAL;
+				goto errout;
+			}
+			if (xdp_flags & ~XDP_FLAGS_MASK) {
+				err = -EINVAL;
+				goto errout;
+			}
+			if (hweight32(xdp_flags & XDP_FLAGS_MODES) > 1) {
+				err = -EINVAL;
+				goto errout;
+			}
+		}
+
+		if (xdp[IFLA_XDP_FD]) {
+			err = dev_change_xdp_fd(dev, extack,
+						nla_get_s32(xdp[IFLA_XDP_FD]),
+						xdp_flags, true);
+			if (err)
+				goto errout;
+			status |= DO_SETLINK_NOTIFY;
+		}
+	}
+
 errout:
 	if (status & DO_SETLINK_MODIFIED) {
 		if ((status & DO_SETLINK_NOTIFY) == DO_SETLINK_NOTIFY)
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 1d69f637c5d6..c760aa54315c 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -170,6 +170,7 @@ enum {
 	IFLA_PROP_LIST,
 	IFLA_ALT_IFNAME, /* Alternative ifname */
 	IFLA_PERM_ADDRESS,
+	IFLA_XDP_EGRESS, /* nested attribute with 1 or more IFLA_XDP_ attrs */
 	__IFLA_MAX
 };
 
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 04/12] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core()
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (2 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 05/12] tuntap: check tun_msg_ctl type at necessary places David Ahern
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern

From: Jason Wang <jasowang@redhat.com>

In skb generic path, we need a way to run XDP program on skb but
to have customized handling of XDP actions. netif_receive_generic_xdp
will be more helpful in such cases than do_xdp_generic.

This patch prepares netif_receive_generic_xdp() to be used as general
purpose function by renaming it and exporting as a general purpose
function. It will just run XDP program on skb but will not handle XDP
actions.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 include/linux/netdevice.h | 2 ++
 net/core/dev.c            | 8 ++++----
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e9cc326086f4..9c219796691c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3665,6 +3665,8 @@ static inline void dev_consume_skb_any(struct sk_buff *skb)
 
 void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog);
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb);
+u32 do_xdp_generic_core(struct sk_buff *skb, struct xdp_buff *xdp,
+			struct bpf_prog *xdp_prog);
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index bf76dbee9d2a..3ba207b4ed79 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4461,9 +4461,8 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 	return rxqueue;
 }
 
-static u32 netif_receive_generic_xdp(struct sk_buff *skb,
-				     struct xdp_buff *xdp,
-				     struct bpf_prog *xdp_prog)
+u32 do_xdp_generic_core(struct sk_buff *skb, struct xdp_buff *xdp,
+			struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	void *orig_data, *orig_data_end;
@@ -4574,6 +4573,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 
 	return act;
 }
+EXPORT_SYMBOL_GPL(do_xdp_generic_core);
 
 /* When doing generic XDP we have to bypass the qdisc layer and the
  * network taps in order to match in-driver-XDP behavior.
@@ -4610,7 +4610,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 		u32 act;
 		int err;
 
-		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
+		act = do_xdp_generic_core(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
 			case XDP_REDIRECT:
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 05/12] tuntap: check tun_msg_ctl type at necessary places
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (3 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 04/12] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 06/12] tun: move shared functions to if_tun.h David Ahern
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern

From: Prashant Bhole <prashantbhole.linux@gmail.com>

tun_msg_ctl is used by vhost_net to communicate with tuntap. We will
introduce another type in soon. As a preparation this patch adds
conditions to check tun_msg_ctl type at necessary places.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tap.c | 7 +++++--
 drivers/net/tun.c | 6 +++++-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index a6d63665ad03..a0a5dc18109a 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1203,6 +1203,7 @@ static int tap_sendmsg(struct socket *sock, struct msghdr *m,
 	struct tap_queue *q = container_of(sock, struct tap_queue, sock);
 	struct tun_msg_ctl *ctl = m->msg_control;
 	struct xdp_buff *xdp;
+	void *ptr = NULL;
 	int i;
 
 	if (ctl && (ctl->type == TUN_MSG_PTR)) {
@@ -1213,8 +1214,10 @@ static int tap_sendmsg(struct socket *sock, struct msghdr *m,
 		return 0;
 	}
 
-	return tap_get_user(q, ctl ? ctl->ptr : NULL, &m->msg_iter,
-			    m->msg_flags & MSG_DONTWAIT);
+	if (ctl && ctl->type == TUN_MSG_UBUF)
+		ptr = ctl->ptr;
+
+	return tap_get_user(q, ptr, &m->msg_iter, m->msg_flags & MSG_DONTWAIT);
 }
 
 static int tap_recvmsg(struct socket *sock, struct msghdr *m,
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3a5a6c655dda..c3155bc3fc7f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2529,6 +2529,7 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	struct tun_struct *tun = tun_get(tfile);
 	struct tun_msg_ctl *ctl = m->msg_control;
 	struct xdp_buff *xdp;
+	void *ptr = NULL;
 
 	if (!tun)
 		return -EBADFD;
@@ -2560,7 +2561,10 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		goto out;
 	}
 
-	ret = tun_get_user(tun, tfile, ctl ? ctl->ptr : NULL, &m->msg_iter,
+	if (ctl && ctl->type == TUN_MSG_UBUF)
+		ptr = ctl->ptr;
+
+	ret = tun_get_user(tun, tfile, ptr, &m->msg_iter,
 			   m->msg_flags & MSG_DONTWAIT,
 			   m->msg_flags & MSG_MORE);
 out:
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 06/12] tun: move shared functions to if_tun.h
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (4 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 05/12] tuntap: check tun_msg_ctl type at necessary places David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 07/12] vhost_net: user tap recvmsg api to access ptr ring David Ahern
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern

From: Prashant Bhole <prashantbhole.linux@gmail.com>

This patch moves tun_ptr_free and related functions to if_tun.h. We
will use this function in tap.c in further patches. This was need to
avoid a scenario when tap is built into kernel image, tun is module
but not loaded. Below functions are moved to if_tun.h
- tun_ptr_free
- tun_xdp_to_ptr
- tun_ptr_to_xdp
- tun_ptr_free

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c      | 33 ---------------------------------
 include/linux/if_tun.h | 34 ++++++++++++++++++++++++++++++----
 2 files changed, 30 insertions(+), 37 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c3155bc3fc7f..6f12c32df346 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -64,7 +64,6 @@
 #include <net/xdp.h>
 #include <linux/seq_file.h>
 #include <linux/uio.h>
-#include <linux/skb_array.h>
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
 #include <linux/mutex.h>
@@ -249,24 +248,6 @@ struct veth {
 	__be16 h_vlan_TCI;
 };
 
-bool tun_is_xdp_frame(void *ptr)
-{
-	return (unsigned long)ptr & TUN_XDP_FLAG;
-}
-EXPORT_SYMBOL(tun_is_xdp_frame);
-
-void *tun_xdp_to_ptr(void *ptr)
-{
-	return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
-}
-EXPORT_SYMBOL(tun_xdp_to_ptr);
-
-void *tun_ptr_to_xdp(void *ptr)
-{
-	return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
-}
-EXPORT_SYMBOL(tun_ptr_to_xdp);
-
 static int tun_napi_receive(struct napi_struct *napi, int budget)
 {
 	struct tun_file *tfile = container_of(napi, struct tun_file, napi);
@@ -649,20 +630,6 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile)
 	return tun;
 }
 
-void tun_ptr_free(void *ptr)
-{
-	if (!ptr)
-		return;
-	if (tun_is_xdp_frame(ptr)) {
-		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
-
-		xdp_return_frame(xdpf);
-	} else {
-		__skb_array_destroy_skb(ptr);
-	}
-}
-EXPORT_SYMBOL_GPL(tun_ptr_free);
-
 static void tun_queue_purge(struct tun_file *tfile)
 {
 	void *ptr;
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 5bda8cf457b6..49ca20063a35 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -6,6 +6,7 @@
 #ifndef __IF_TUN_H
 #define __IF_TUN_H
 
+#include <linux/skb_array.h>
 #include <uapi/linux/if_tun.h>
 #include <uapi/linux/virtio_net.h>
 
@@ -27,10 +28,35 @@ struct tun_xdp_hdr {
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
-bool tun_is_xdp_frame(void *ptr);
-void *tun_xdp_to_ptr(void *ptr);
-void *tun_ptr_to_xdp(void *ptr);
-void tun_ptr_free(void *ptr);
+
+static inline bool tun_is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & TUN_XDP_FLAG;
+}
+
+static inline void *tun_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
+}
+
+static inline void *tun_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
+}
+
+static inline void tun_ptr_free(void *ptr)
+{
+	if (!ptr)
+		return;
+	if (tun_is_xdp_frame(ptr)) {
+		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
+
+		xdp_return_frame(xdpf);
+	} else {
+		__skb_array_destroy_skb(ptr);
+	}
+}
+
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 07/12] vhost_net: user tap recvmsg api to access ptr ring
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (5 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 06/12] tun: move shared functions to if_tun.h David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  8:26   ` Michael S. Tsirkin
  2020-01-23  1:42 ` [PATCH bpf-next 08/12] tuntap: remove usage of ptr ring in vhost_net David Ahern
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern

From: Prashant Bhole <prashantbhole.linux@gmail.com>

Currently vhost_net directly accesses ptr ring of tap driver to
fetch Rx packet pointers. In order to avoid it this patch modifies
tap driver's recvmsg api to do additional task of fetching Rx packet
pointers.

A special struct tun_msg_ctl is already being passed via msg_control
for tun Rx XDP batching. This patch extends tun_msg_ctl usage to
send sub commands to recvmsg api. Now tun_recvmsg will handle commands
to consume and unconsume packet pointers from ptr ring.

This will be useful in implementation of tx path XDP in tun driver,
where XDP program will process the packet before it is passed to
vhost_net.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tap.c      | 22 ++++++++++++++++++-
 drivers/net/tun.c      | 24 ++++++++++++++++++++-
 drivers/vhost/net.c    | 48 +++++++++++++++++++++++++++++++-----------
 include/linux/if_tun.h | 18 ++++++++++++++++
 4 files changed, 98 insertions(+), 14 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index a0a5dc18109a..a5ce44db11a3 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1224,8 +1224,28 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
 		       size_t total_len, int flags)
 {
 	struct tap_queue *q = container_of(sock, struct tap_queue, sock);
-	struct sk_buff *skb = m->msg_control;
+	struct tun_msg_ctl *ctl = m->msg_control;
+	struct sk_buff *skb = NULL;
 	int ret;
+
+	if (ctl) {
+		switch (ctl->type) {
+		case TUN_MSG_PKT:
+			skb = ctl->ptr;
+			break;
+		case TUN_MSG_CONSUME_PKTS:
+			return ptr_ring_consume_batched(&q->ring,
+							ctl->ptr,
+							ctl->num);
+		case TUN_MSG_UNCONSUME_PKTS:
+			ptr_ring_unconsume(&q->ring, ctl->ptr, ctl->num,
+					   tun_ptr_free);
+			return 0;
+		default:
+			return -EINVAL;
+		}
+	}
+
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) {
 		kfree_skb(skb);
 		return -EINVAL;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6f12c32df346..197bde748c09 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2544,7 +2544,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 {
 	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
 	struct tun_struct *tun = tun_get(tfile);
-	void *ptr = m->msg_control;
+	struct tun_msg_ctl *ctl = m->msg_control;
+	void *ptr = NULL;
 	int ret;
 
 	if (!tun) {
@@ -2552,6 +2553,27 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 		goto out_free;
 	}
 
+	if (ctl) {
+		switch (ctl->type) {
+		case TUN_MSG_PKT:
+			ptr = ctl->ptr;
+			break;
+		case TUN_MSG_CONSUME_PKTS:
+			ret = ptr_ring_consume_batched(&tfile->tx_ring,
+						       ctl->ptr,
+						       ctl->num);
+			goto out;
+		case TUN_MSG_UNCONSUME_PKTS:
+			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,
+					   ctl->num, tun_ptr_free);
+			ret = 0;
+			goto out;
+		default:
+			ret = -EINVAL;
+			goto out_put_tun;
+		}
+	}
+
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
 		ret = -EINVAL;
 		goto out_put_tun;
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e158159671fa..482548d00105 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -175,24 +175,44 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
 
 static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
 {
+	struct vhost_virtqueue *vq = &nvq->vq;
+	struct socket *sock = vq->private_data;
 	struct vhost_net_buf *rxq = &nvq->rxq;
+	struct tun_msg_ctl ctl = {
+		.type = TUN_MSG_CONSUME_PKTS,
+		.ptr = (void *) rxq->queue,
+		.num = VHOST_NET_BATCH,
+	};
+	struct msghdr msg = {
+		.msg_control = &ctl,
+	};
 
 	rxq->head = 0;
-	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
-					      VHOST_NET_BATCH);
+	rxq->tail = sock->ops->recvmsg(sock, &msg, 0, 0);
+	if (WARN_ON_ONCE(rxq->tail < 0))
+		rxq->tail = 0;
+
 	return rxq->tail;
 }
 
 static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
 {
+	struct vhost_virtqueue *vq = &nvq->vq;
+	struct socket *sock = vq->private_data;
 	struct vhost_net_buf *rxq = &nvq->rxq;
+	struct tun_msg_ctl ctl = {
+		.type = TUN_MSG_UNCONSUME_PKTS,
+		.ptr = (void *) (rxq->queue + rxq->head),
+		.num = vhost_net_buf_get_size(rxq),
+	};
+	struct msghdr msg = {
+		.msg_control = &ctl,
+	};
 
-	if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
-		ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
-				   vhost_net_buf_get_size(rxq),
-				   tun_ptr_free);
-		rxq->head = rxq->tail = 0;
-	}
+	if (!vhost_net_buf_is_empty(rxq))
+		sock->ops->recvmsg(sock, &msg, 0, 0);
+
+	rxq->head = rxq->tail = 0;
 }
 
 static int vhost_net_buf_peek_len(void *ptr)
@@ -1109,6 +1129,7 @@ static void handle_rx(struct vhost_net *net)
 		.flags = 0,
 		.gso_type = VIRTIO_NET_HDR_GSO_NONE
 	};
+	struct tun_msg_ctl ctl;
 	size_t total_len = 0;
 	int err, mergeable;
 	s16 headcount;
@@ -1166,8 +1187,11 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		busyloop_intr = false;
-		if (nvq->rx_ring)
-			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+		if (nvq->rx_ring) {
+			ctl.type = TUN_MSG_PKT;
+			ctl.ptr = vhost_net_buf_consume(&nvq->rxq);
+			msg.msg_control = &ctl;
+		}
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
 			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -1346,8 +1370,8 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
 	vhost_net_disable_vq(n, vq);
-	vq->private_data = NULL;
 	vhost_net_buf_unproduce(nvq);
+	vq->private_data = NULL;
 	nvq->rx_ring = NULL;
 	mutex_unlock(&vq->mutex);
 	return sock;
@@ -1538,8 +1562,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		}
 
 		vhost_net_disable_vq(n, vq);
-		vq->private_data = sock;
 		vhost_net_buf_unproduce(nvq);
+		vq->private_data = sock;
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 49ca20063a35..9184e3f177b8 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -12,8 +12,26 @@
 
 #define TUN_XDP_FLAG 0x1UL
 
+/*
+ * tun_msg_ctl types
+ */
+
 #define TUN_MSG_UBUF 1
 #define TUN_MSG_PTR  2
+/*
+ * Used for passing a packet pointer from vhost to tun
+ */
+#define TUN_MSG_PKT  3
+/*
+ * Used for passing an array of pointer from vhost to tun.
+ * tun consumes packets from ptr ring and stores in pointer array.
+ */
+#define TUN_MSG_CONSUME_PKTS    4
+/*
+ * Used for passing an array of pointer from vhost to tun.
+ * tun consumes get pointer from array and puts back into ptr ring.
+ */
+#define TUN_MSG_UNCONSUME_PKTS  5
 struct tun_msg_ctl {
 	unsigned short type;
 	unsigned short num;
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 08/12] tuntap: remove usage of ptr ring in vhost_net
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (6 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 07/12] vhost_net: user tap recvmsg api to access ptr ring David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 09/12] tun: set egress XDP program David Ahern
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern

From: Prashant Bhole <prashantbhole.linux@gmail.com>

Remove usage of ptr ring of tuntap in vhost_net and remove the
functions exported from tuntap drivers to get ptr ring.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tap.c      | 13 -------------
 drivers/net/tun.c      | 13 -------------
 drivers/vhost/net.c    | 31 ++++---------------------------
 include/linux/if_tap.h |  5 -----
 include/linux/if_tun.h |  5 -----
 5 files changed, 4 insertions(+), 63 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index a5ce44db11a3..fe816a99275d 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1288,19 +1288,6 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
-struct ptr_ring *tap_get_ptr_ring(struct file *file)
-{
-	struct tap_queue *q;
-
-	if (file->f_op != &tap_fops)
-		return ERR_PTR(-EINVAL);
-	q = file->private_data;
-	if (!q)
-		return ERR_PTR(-EBADFD);
-	return &q->ring;
-}
-EXPORT_SYMBOL_GPL(tap_get_ptr_ring);
-
 int tap_queue_resize(struct tap_dev *tap)
 {
 	struct net_device *dev = tap->dev;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 197bde748c09..d4a0e59fcf5c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3717,19 +3717,6 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
-struct ptr_ring *tun_get_tx_ring(struct file *file)
-{
-	struct tun_file *tfile;
-
-	if (file->f_op != &tun_fops)
-		return ERR_PTR(-EINVAL);
-	tfile = file->private_data;
-	if (!tfile)
-		return ERR_PTR(-EBADFD);
-	return &tfile->tx_ring;
-}
-EXPORT_SYMBOL_GPL(tun_get_tx_ring);
-
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 482548d00105..30b5c68193c9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -122,7 +122,6 @@ struct vhost_net_virtqueue {
 	/* Reference counting for outstanding ubufs.
 	 * Protected by vq mutex. Writers must also take device mutex. */
 	struct vhost_net_ubuf_ref *ubufs;
-	struct ptr_ring *rx_ring;
 	struct vhost_net_buf rxq;
 	/* Batched XDP buffs */
 	struct xdp_buff *xdp;
@@ -997,8 +996,9 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	int len = 0;
 	unsigned long flags;
 
-	if (rvq->rx_ring)
-		return vhost_net_buf_peek(rvq);
+	len = vhost_net_buf_peek(rvq);
+	if (len)
+		return len;
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
@@ -1187,7 +1187,7 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		busyloop_intr = false;
-		if (nvq->rx_ring) {
+		if (!vhost_net_buf_is_empty(&nvq->rxq)) {
 			ctl.type = TUN_MSG_PKT;
 			ctl.ptr = vhost_net_buf_consume(&nvq->rxq);
 			msg.msg_control = &ctl;
@@ -1343,7 +1343,6 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].batched_xdp = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
-		n->vqs[i].rx_ring = NULL;
 		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
@@ -1372,7 +1371,6 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 	vhost_net_disable_vq(n, vq);
 	vhost_net_buf_unproduce(nvq);
 	vq->private_data = NULL;
-	nvq->rx_ring = NULL;
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -1468,25 +1466,6 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
-static struct ptr_ring *get_tap_ptr_ring(int fd)
-{
-	struct ptr_ring *ring;
-	struct file *file = fget(fd);
-
-	if (!file)
-		return NULL;
-	ring = tun_get_tx_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = tap_get_ptr_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = NULL;
-out:
-	fput(file);
-	return ring;
-}
-
 static struct socket *get_tap_socket(int fd)
 {
 	struct file *file = fget(fd);
@@ -1570,8 +1549,6 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = vhost_net_enable_vq(n, vq);
 		if (r)
 			goto err_used;
-		if (index == VHOST_NET_VQ_RX)
-			nvq->rx_ring = get_tap_ptr_ring(fd);
 
 		oldubufs = nvq->ubufs;
 		nvq->ubufs = ubufs;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 915a187cfabd..68fe366fb185 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -4,7 +4,6 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
-struct ptr_ring *tap_get_ptr_ring(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -14,10 +13,6 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
 #endif /* CONFIG_TAP */
 
 #include <net/sock.h>
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 9184e3f177b8..41d72bb0eaf1 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -45,7 +45,6 @@ struct tun_xdp_hdr {
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
-struct ptr_ring *tun_get_tx_ring(struct file *file);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -84,10 +83,6 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 09/12] tun: set egress XDP program
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (7 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 08/12] tuntap: remove usage of ptr ring in vhost_net David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 10/12] tun: run XDP program in tx path David Ahern
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

From: David Ahern <dahern@digitalocean.com>

This patch adds a way to set tx path XDP program in tun driver
by handling XDP_SETUP_PROG_EGRESS and XDP_QUERY_PROG_EGRESS in
ndo_bpf handler.

Signed-off-by: David Ahern <dahern@digitalocean.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d4a0e59fcf5c..b6bac773f2a0 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -238,6 +238,7 @@ struct tun_struct {
 	u32 rx_batched;
 	struct tun_pcpu_stats __percpu *pcpu_stats;
 	struct bpf_prog __rcu *xdp_prog;
+	struct bpf_prog __rcu *xdp_egress_prog;
 	struct tun_prog __rcu *steering_prog;
 	struct tun_prog __rcu *filter_prog;
 	struct ethtool_link_ksettings link_ksettings;
@@ -1156,15 +1157,21 @@ tun_net_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 }
 
 static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
-		       struct netlink_ext_ack *extack)
+		       bool egress, struct netlink_ext_ack *extack)
 {
 	struct tun_struct *tun = netdev_priv(dev);
 	struct tun_file *tfile;
 	struct bpf_prog *old_prog;
 	int i;
 
-	old_prog = rtnl_dereference(tun->xdp_prog);
-	rcu_assign_pointer(tun->xdp_prog, prog);
+	if (egress) {
+		old_prog = rtnl_dereference(tun->xdp_egress_prog);
+		rcu_assign_pointer(tun->xdp_egress_prog, prog);
+	} else {
+		old_prog = rtnl_dereference(tun->xdp_prog);
+		rcu_assign_pointer(tun->xdp_prog, prog);
+	}
+
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
@@ -1185,12 +1192,16 @@ static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	return 0;
 }
 
-static u32 tun_xdp_query(struct net_device *dev)
+static u32 tun_xdp_query(struct net_device *dev, bool egress)
 {
 	struct tun_struct *tun = netdev_priv(dev);
 	const struct bpf_prog *xdp_prog;
 
-	xdp_prog = rtnl_dereference(tun->xdp_prog);
+	if (egress)
+		xdp_prog = rtnl_dereference(tun->xdp_egress_prog);
+	else
+		xdp_prog = rtnl_dereference(tun->xdp_prog);
+
 	if (xdp_prog)
 		return xdp_prog->aux->id;
 
@@ -1201,13 +1212,20 @@ static int tun_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
-		return tun_xdp_set(dev, xdp->prog, xdp->extack);
+		return tun_xdp_set(dev, xdp->prog, false, xdp->extack);
+	case XDP_SETUP_PROG_EGRESS:
+		return tun_xdp_set(dev, xdp->prog, true, xdp->extack);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = tun_xdp_query(dev);
-		return 0;
+		xdp->prog_id = tun_xdp_query(dev, false);
+		break;
+	case XDP_QUERY_PROG_EGRESS:
+		xdp->prog_id = tun_xdp_query(dev, true);
+		break;
 	default:
 		return -EINVAL;
 	}
+
+	return 0;
 }
 
 static int tun_net_change_carrier(struct net_device *dev, bool new_carrier)
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 10/12] tun: run XDP program in tx path
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (8 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 09/12] tun: set egress XDP program David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  8:23   ` Michael S. Tsirkin
  2020-01-23  1:42 ` [PATCH bpf-next 11/12] libbpf: Add egress XDP support David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 12/12] samples/bpf: xdp1, add " David Ahern
  11 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern

From: Prashant Bhole <prashantbhole.linux@gmail.com>

Run the XDP program as soon as packet is removed from the ptr
ring. Since this is XDP in tx path, the traditional handling of
XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
just runs the program and leaves the action handling to us.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c | 153 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 150 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b6bac773f2a0..71bcd4ec2571 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -130,6 +130,7 @@ struct tap_filter {
 /* MAX_TAP_QUEUES 256 is chosen to allow rx/tx queues to be equal
  * to max number of VCPUs in guest. */
 #define MAX_TAP_QUEUES 256
+#define MAX_TAP_BATCH 64
 #define MAX_TAP_FLOWS  4096
 
 #define TUN_FLOW_EXPIRE (3 * HZ)
@@ -175,6 +176,7 @@ struct tun_file {
 	struct tun_struct *detached;
 	struct ptr_ring tx_ring;
 	struct xdp_rxq_info xdp_rxq;
+	void *pkt_ptrs[MAX_TAP_BATCH];
 };
 
 struct tun_page {
@@ -2140,6 +2142,107 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
+static struct sk_buff *tun_prepare_xdp_skb(struct sk_buff *skb)
+{
+	struct sk_buff *nskb;
+
+	if (skb_shared(skb) || skb_cloned(skb)) {
+		nskb = skb_copy(skb, GFP_ATOMIC);
+		consume_skb(skb);
+		return nskb;
+	}
+
+	return skb;
+}
+
+static u32 tun_do_xdp_tx_generic(struct tun_struct *tun,
+				 struct sk_buff *skb)
+{
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+	u32 act = XDP_PASS;
+
+	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
+	if (xdp_prog) {
+		skb = tun_prepare_xdp_skb(skb);
+		if (!skb) {
+			act = XDP_DROP;
+			kfree_skb(skb);
+			goto drop;
+		}
+
+		act = do_xdp_generic_core(skb, &xdp, xdp_prog);
+		switch (act) {
+		case XDP_TX:
+			/* Rx path generic XDP will be called in this path
+			 */
+			local_bh_disable();
+			netif_receive_skb(skb);
+			local_bh_enable();
+			break;
+		case XDP_PASS:
+			break;
+		case XDP_REDIRECT:
+			/* Since we are not handling this case yet, let's free
+			 * skb here. In case of XDP_DROP/XDP_ABORTED, the skb
+			 * was already freed in do_xdp_generic_core()
+			 */
+			kfree_skb(skb);
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog, act);
+			/* fall through */
+		case XDP_DROP:
+			goto drop;
+		}
+	}
+
+	return act;
+drop:
+	this_cpu_inc(tun->pcpu_stats->tx_dropped);
+	return act;
+}
+
+static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
+			 struct xdp_frame *frame)
+{
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+	u32 act = XDP_PASS;
+
+	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
+	if (xdp_prog) {
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = xdp.data + frame->len;
+		xdp.data_meta = xdp.data - frame->metasize;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		switch (act) {
+		case XDP_PASS:
+			break;
+		case XDP_TX:
+			/* fall through */
+		case XDP_REDIRECT:
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog, act);
+			/* fall through */
+		case XDP_DROP:
+			xdp_return_frame_rx_napi(frame);
+			break;
+		}
+	}
+
+	return act;
+}
+
 static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 {
 	DECLARE_WAITQUEUE(wait, current);
@@ -2557,6 +2660,52 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	return ret;
 }
 
+static int tun_consume_packets(struct tun_file *tfile, void **ptr_array, int n)
+{
+	void **pkts = tfile->pkt_ptrs;
+	struct xdp_frame *frame;
+	struct tun_struct *tun;
+	int i, num_ptrs;
+	int pkt_cnt = 0;
+	void *ptr;
+	u32 act;
+	int batchsz;
+
+	if (unlikely(!tfile))
+		return 0;
+
+	rcu_read_lock();
+	tun = rcu_dereference(tfile->tun);
+	if (unlikely(!tun)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	while (n) {
+		batchsz = (n > MAX_TAP_BATCH) ? MAX_TAP_BATCH : n;
+		n -= batchsz;
+		num_ptrs = ptr_ring_consume_batched(&tfile->tx_ring, pkts,
+						    batchsz);
+		if (!num_ptrs)
+			break;
+		for (i = 0; i < num_ptrs; i++) {
+			ptr = pkts[i];
+			if (tun_is_xdp_frame(ptr)) {
+				frame = tun_ptr_to_xdp(ptr);
+				act = tun_do_xdp_tx(tun, tfile, frame);
+			} else {
+				act = tun_do_xdp_tx_generic(tun, ptr);
+			}
+
+			if (act == XDP_PASS)
+				ptr_array[pkt_cnt++] = ptr;
+		}
+	}
+
+	rcu_read_unlock();
+	return pkt_cnt;
+}
+
 static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 		       int flags)
 {
@@ -2577,9 +2726,7 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 			ptr = ctl->ptr;
 			break;
 		case TUN_MSG_CONSUME_PKTS:
-			ret = ptr_ring_consume_batched(&tfile->tx_ring,
-						       ctl->ptr,
-						       ctl->num);
+			ret = tun_consume_packets(tfile, ctl->ptr, ctl->num);
 			goto out;
 		case TUN_MSG_UNCONSUME_PKTS:
 			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 11/12] libbpf: Add egress XDP support
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (9 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 10/12] tun: run XDP program in tx path David Ahern
@ 2020-01-23  1:42 ` David Ahern
  2020-01-23  1:42 ` [PATCH bpf-next 12/12] samples/bpf: xdp1, add " David Ahern
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

From: David Ahern <dahern@digitalocean.com>

Patch adds egress XDP support in libbpf.

First, new section name hint, xdp_egress, is added to set expected attach
type at program load. Programs can use xdp_egress as the prefix in
the SEC statement to load the program with the BPF_XDP_EGRESS
attach type set.

Second, new APIs are added that parallel the existing xdp ones which
can be changed:
        bpf_set_link_xdp_egress_fd - attach program at fd to device as
                                     xdp egress
        bpf_get_link_xdp_egress_id - get id for xdp egress program
        bpf_get_link_xdp_egress_info - get info for xdp egress program

Internally, the libbpf code is re-factored to be common for both
XDP use cases with a new egress argument to specify which netlink
attribute to use.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Co-developed-by: David Ahern <dahern@digitalocean.com>
Signed-off-by: David Ahern <dahern@digitalocean.com>
---
 tools/lib/bpf/libbpf.c   |  2 ++
 tools/lib/bpf/libbpf.h   |  6 +++++
 tools/lib/bpf/libbpf.map |  3 +++
 tools/lib/bpf/netlink.c  | 52 ++++++++++++++++++++++++++++++++++------
 4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index faab96a42141..0dd4cd535ef1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -6265,6 +6265,8 @@ static const struct bpf_sec_def section_defs[] = {
 		.expected_attach_type = BPF_TRACE_FEXIT,
 		.is_attach_btf = true,
 		.attach_fn = attach_trace),
+	BPF_EAPROG_SEC("xdp_egress",		BPF_PROG_TYPE_XDP,
+						BPF_XDP_EGRESS),
 	BPF_PROG_SEC("xdp",			BPF_PROG_TYPE_XDP),
 	BPF_PROG_SEC("perf_event",		BPF_PROG_TYPE_PERF_EVENT),
 	BPF_PROG_SEC("lwt_in",			BPF_PROG_TYPE_LWT_IN),
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 01639f9a1062..ed9a92e6e0ef 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -433,6 +433,12 @@ LIBBPF_API int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
 LIBBPF_API int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags);
 LIBBPF_API int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
 				     size_t info_size, __u32 flags);
+LIBBPF_API int bpf_set_link_xdp_egress_fd(int ifindex, int fd, __u32 flags);
+LIBBPF_API int bpf_get_link_xdp_egress_id(int ifindex, __u32 *prog_id,
+					  __u32 flags);
+LIBBPF_API int bpf_get_link_xdp_egress_info(int ifindex,
+					    struct xdp_link_info *info,
+					    size_t info_size, __u32 flags);
 
 struct perf_buffer;
 
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 64ec71ba41f1..8a2d7e3a5b22 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -232,4 +232,7 @@ LIBBPF_0.0.7 {
 		bpf_program__set_struct_ops;
 		btf__align_of;
 		libbpf_find_kernel_btf;
+		bpf_set_link_xdp_egress_fd;
+		bpf_get_link_xdp_egress_id;
+		bpf_get_link_xdp_egress_info;
 } LIBBPF_0.0.6;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 431bd25c6cdb..3c53c5dff122 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -28,6 +28,7 @@ typedef int (*__dump_nlmsg_t)(struct nlmsghdr *nlmsg, libbpf_dump_nlmsg_t,
 struct xdp_id_md {
 	int ifindex;
 	__u32 flags;
+	bool egress;
 	struct xdp_link_info info;
 };
 
@@ -132,7 +133,7 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
 	return ret;
 }
 
-int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+static int __bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags, bool egress)
 {
 	int sock, seq = 0, ret;
 	struct nlattr *nla, *nla_xdp;
@@ -159,7 +160,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 	/* started nested attribute for XDP */
 	nla = (struct nlattr *)(((char *)&req)
 				+ NLMSG_ALIGN(req.nh.nlmsg_len));
-	nla->nla_type = NLA_F_NESTED | IFLA_XDP;
+	nla->nla_type = NLA_F_NESTED | (egress ? IFLA_XDP_EGRESS : IFLA_XDP);
 	nla->nla_len = NLA_HDRLEN;
 
 	/* add XDP fd */
@@ -191,6 +192,16 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 	return ret;
 }
 
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+	return __bpf_set_link_xdp_fd(ifindex, fd, flags, false);
+}
+
+int bpf_set_link_xdp_egress_fd(int ifindex, int fd, __u32 flags)
+{
+	return __bpf_set_link_xdp_fd(ifindex, fd, flags, true);
+}
+
 static int __dump_link_nlmsg(struct nlmsghdr *nlh,
 			     libbpf_dump_nlmsg_t dump_link_nlmsg, void *cookie)
 {
@@ -211,15 +222,17 @@ static int get_xdp_info(void *cookie, void *msg, struct nlattr **tb)
 	struct nlattr *xdp_tb[IFLA_XDP_MAX + 1];
 	struct xdp_id_md *xdp_id = cookie;
 	struct ifinfomsg *ifinfo = msg;
+	unsigned int atype;
 	int ret;
 
 	if (xdp_id->ifindex && xdp_id->ifindex != ifinfo->ifi_index)
 		return 0;
 
-	if (!tb[IFLA_XDP])
+	atype = xdp_id->egress ? IFLA_XDP_EGRESS : IFLA_XDP;
+	if (!tb[atype])
 		return 0;
 
-	ret = libbpf_nla_parse_nested(xdp_tb, IFLA_XDP_MAX, tb[IFLA_XDP], NULL);
+	ret = libbpf_nla_parse_nested(xdp_tb, IFLA_XDP_MAX, tb[atype], NULL);
 	if (ret)
 		return ret;
 
@@ -251,10 +264,10 @@ static int get_xdp_info(void *cookie, void *msg, struct nlattr **tb)
 	return 0;
 }
 
-int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
-			  size_t info_size, __u32 flags)
+static int __bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
+				   size_t info_size, __u32 flags, bool egress)
 {
-	struct xdp_id_md xdp_id = {};
+	struct xdp_id_md xdp_id;
 	int sock, ret;
 	__u32 nl_pid;
 	__u32 mask;
@@ -274,6 +287,7 @@ int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
 
 	xdp_id.ifindex = ifindex;
 	xdp_id.flags = flags;
+	xdp_id.egress = egress;
 
 	ret = libbpf_nl_get_link(sock, nl_pid, get_xdp_info, &xdp_id);
 	if (!ret) {
@@ -287,6 +301,18 @@ int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
 	return ret;
 }
 
+int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
+			  size_t info_size, __u32 flags)
+{
+	return __bpf_get_link_xdp_info(ifindex, info, info_size, flags, false);
+}
+
+int bpf_get_link_xdp_egress_info(int ifindex, struct xdp_link_info *info,
+				 size_t info_size, __u32 flags)
+{
+	return __bpf_get_link_xdp_info(ifindex, info, info_size, flags, true);
+}
+
 static __u32 get_xdp_id(struct xdp_link_info *info, __u32 flags)
 {
 	if (info->attach_mode != XDP_ATTACHED_MULTI)
@@ -313,6 +339,18 @@ int bpf_get_link_xdp_id(int ifindex, __u32 *prog_id, __u32 flags)
 	return ret;
 }
 
+int bpf_get_link_xdp_egress_id(int ifindex, __u32 *prog_id, __u32 flags)
+{
+	struct xdp_link_info info;
+	int ret;
+
+	ret = bpf_get_link_xdp_egress_info(ifindex, &info, sizeof(info), flags);
+	if (!ret)
+		*prog_id = get_xdp_id(&info, flags);
+
+	return ret;
+}
+
 int libbpf_nl_get_link(int sock, unsigned int nl_pid,
 		       libbpf_dump_nlmsg_t dump_link_nlmsg, void *cookie)
 {
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH bpf-next 12/12] samples/bpf: xdp1, add egress XDP support
  2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
                   ` (10 preceding siblings ...)
  2020-01-23  1:42 ` [PATCH bpf-next 11/12] libbpf: Add egress XDP support David Ahern
@ 2020-01-23  1:42 ` David Ahern
  11 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-23  1:42 UTC (permalink / raw)
  To: netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	toke, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern

From: Prashant Bhole <prashantbhole.linux@gmail.com>

xdp1 and xdp2 now accept -E flag to set XDP program in the egress
path.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 samples/bpf/xdp1_user.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/samples/bpf/xdp1_user.c b/samples/bpf/xdp1_user.c
index c447ad9e3a1d..72c5bedbd030 100644
--- a/samples/bpf/xdp1_user.c
+++ b/samples/bpf/xdp1_user.c
@@ -21,17 +21,27 @@
 static int ifindex;
 static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
 static __u32 prog_id;
+static bool egress;
 
 static void int_exit(int sig)
 {
 	__u32 curr_prog_id = 0;
+	int err;
 
-	if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, xdp_flags)) {
+	if (egress)
+		err = bpf_get_link_xdp_egress_id(ifindex, &curr_prog_id,
+						 xdp_flags);
+	else
+		err = bpf_get_link_xdp_id(ifindex, &curr_prog_id, xdp_flags);
+	if (err) {
 		printf("bpf_get_link_xdp_id failed\n");
 		exit(1);
 	}
 	if (prog_id == curr_prog_id)
-		bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);
+		if (egress)
+			bpf_set_link_xdp_egress_fd(ifindex, -1, xdp_flags);
+		else
+			bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);
 	else if (!curr_prog_id)
 		printf("couldn't find a prog id on a given interface\n");
 	else
@@ -73,7 +83,8 @@ static void usage(const char *prog)
 		"OPTS:\n"
 		"    -S    use skb-mode\n"
 		"    -N    enforce native mode\n"
-		"    -F    force loading prog\n",
+		"    -F    force loading prog\n"
+		"    -E	   egress path program\n",
 		prog);
 }
 
@@ -85,7 +96,7 @@ int main(int argc, char **argv)
 	};
 	struct bpf_prog_info info = {};
 	__u32 info_len = sizeof(info);
-	const char *optstr = "FSN";
+	const char *optstr = "FSNE";
 	int prog_fd, map_fd, opt;
 	struct bpf_object *obj;
 	struct bpf_map *map;
@@ -103,6 +114,9 @@ int main(int argc, char **argv)
 		case 'F':
 			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
 			break;
+		case 'E':
+			egress = true;
+			break;
 		default:
 			usage(basename(argv[0]));
 			return 1;
@@ -130,6 +144,8 @@ int main(int argc, char **argv)
 
 	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
 	prog_load_attr.file = filename;
+	if (egress)
+		prog_load_attr.expected_attach_type = BPF_XDP_EGRESS;
 
 	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
 		return 1;
@@ -149,7 +165,11 @@ int main(int argc, char **argv)
 	signal(SIGINT, int_exit);
 	signal(SIGTERM, int_exit);
 
-	if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
+	if (egress)
+		err = bpf_set_link_xdp_egress_fd(ifindex, prog_fd, xdp_flags);
+	else
+		err = bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
+	if (err < 0) {
 		printf("link set xdp fd failed\n");
 		return 1;
 	}
-- 
2.21.1 (Apple Git-122.3)


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 10/12] tun: run XDP program in tx path
  2020-01-23  1:42 ` [PATCH bpf-next 10/12] tun: run XDP program in tx path David Ahern
@ 2020-01-23  8:23   ` Michael S. Tsirkin
  2020-01-24 13:36     ` Prashant Bhole
  2020-01-24 13:44     ` Prashant Bhole
  0 siblings, 2 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-01-23  8:23 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, prashantbhole.linux, jasowang, davem, jakub.kicinski,
	jbrouer, toke, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, dsahern

On Wed, Jan 22, 2020 at 06:42:08PM -0700, David Ahern wrote:
> From: Prashant Bhole <prashantbhole.linux@gmail.com>
> 
> Run the XDP program as soon as packet is removed from the ptr
> ring. Since this is XDP in tx path, the traditional handling of
> XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
> do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
> just runs the program and leaves the action handling to us.
> 
> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> ---
>  drivers/net/tun.c | 153 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 150 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index b6bac773f2a0..71bcd4ec2571 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -130,6 +130,7 @@ struct tap_filter {
>  /* MAX_TAP_QUEUES 256 is chosen to allow rx/tx queues to be equal
>   * to max number of VCPUs in guest. */
>  #define MAX_TAP_QUEUES 256
> +#define MAX_TAP_BATCH 64
>  #define MAX_TAP_FLOWS  4096
>  
>  #define TUN_FLOW_EXPIRE (3 * HZ)
> @@ -175,6 +176,7 @@ struct tun_file {
>  	struct tun_struct *detached;
>  	struct ptr_ring tx_ring;
>  	struct xdp_rxq_info xdp_rxq;
> +	void *pkt_ptrs[MAX_TAP_BATCH];
>  };
>  
>  struct tun_page {
> @@ -2140,6 +2142,107 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>  	return total;
>  }
>  
> +static struct sk_buff *tun_prepare_xdp_skb(struct sk_buff *skb)
> +{
> +	struct sk_buff *nskb;
> +
> +	if (skb_shared(skb) || skb_cloned(skb)) {
> +		nskb = skb_copy(skb, GFP_ATOMIC);
> +		consume_skb(skb);
> +		return nskb;
> +	}
> +
> +	return skb;
> +}
> +
> +static u32 tun_do_xdp_tx_generic(struct tun_struct *tun,
> +				 struct sk_buff *skb)
> +{
> +	struct bpf_prog *xdp_prog;
> +	struct xdp_buff xdp;
> +	u32 act = XDP_PASS;
> +
> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
> +	if (xdp_prog) {
> +		skb = tun_prepare_xdp_skb(skb);
> +		if (!skb) {
> +			act = XDP_DROP;
> +			kfree_skb(skb);
> +			goto drop;
> +		}
> +
> +		act = do_xdp_generic_core(skb, &xdp, xdp_prog);
> +		switch (act) {
> +		case XDP_TX:
> +			/* Rx path generic XDP will be called in this path
> +			 */
> +			local_bh_disable();
> +			netif_receive_skb(skb);
> +			local_bh_enable();
> +			break;
> +		case XDP_PASS:
> +			break;
> +		case XDP_REDIRECT:
> +			/* Since we are not handling this case yet, let's free
> +			 * skb here. In case of XDP_DROP/XDP_ABORTED, the skb
> +			 * was already freed in do_xdp_generic_core()
> +			 */
> +			kfree_skb(skb);
> +			/* fall through */
> +		default:
> +			bpf_warn_invalid_xdp_action(act);
> +			/* fall through */
> +		case XDP_ABORTED:
> +			trace_xdp_exception(tun->dev, xdp_prog, act);
> +			/* fall through */
> +		case XDP_DROP:
> +			goto drop;
> +		}
> +	}
> +
> +	return act;
> +drop:
> +	this_cpu_inc(tun->pcpu_stats->tx_dropped);
> +	return act;
> +}
> +
> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
> +			 struct xdp_frame *frame)
> +{
> +	struct bpf_prog *xdp_prog;
> +	struct xdp_buff xdp;
> +	u32 act = XDP_PASS;
> +
> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
> +	if (xdp_prog) {
> +		xdp.data_hard_start = frame->data - frame->headroom;
> +		xdp.data = frame->data;
> +		xdp.data_end = xdp.data + frame->len;
> +		xdp.data_meta = xdp.data - frame->metasize;
> +
> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +		switch (act) {
> +		case XDP_PASS:
> +			break;
> +		case XDP_TX:
> +			/* fall through */
> +		case XDP_REDIRECT:
> +			/* fall through */
> +		default:
> +			bpf_warn_invalid_xdp_action(act);
> +			/* fall through */
> +		case XDP_ABORTED:
> +			trace_xdp_exception(tun->dev, xdp_prog, act);
> +			/* fall through */
> +		case XDP_DROP:
> +			xdp_return_frame_rx_napi(frame);
> +			break;
> +		}
> +	}
> +
> +	return act;
> +}
> +
>  static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>  {
>  	DECLARE_WAITQUEUE(wait, current);
> @@ -2557,6 +2660,52 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>  	return ret;
>  }
>  
> +static int tun_consume_packets(struct tun_file *tfile, void **ptr_array, int n)
> +{
> +	void **pkts = tfile->pkt_ptrs;
> +	struct xdp_frame *frame;
> +	struct tun_struct *tun;
> +	int i, num_ptrs;
> +	int pkt_cnt = 0;
> +	void *ptr;
> +	u32 act;
> +	int batchsz;
> +
> +	if (unlikely(!tfile))
> +		return 0;
> +
> +	rcu_read_lock();
> +	tun = rcu_dereference(tfile->tun);
> +	if (unlikely(!tun)) {
> +		rcu_read_unlock();
> +		return 0;
> +	}
> +
> +	while (n) {
> +		batchsz = (n > MAX_TAP_BATCH) ? MAX_TAP_BATCH : n;
> +		n -= batchsz;
> +		num_ptrs = ptr_ring_consume_batched(&tfile->tx_ring, pkts,
> +						    batchsz);
> +		if (!num_ptrs)
> +			break;

Can't we avoid looping over the packets in the current case
where there are no xdp programs at all?


> +		for (i = 0; i < num_ptrs; i++) {
> +			ptr = pkts[i];
> +			if (tun_is_xdp_frame(ptr)) {
> +				frame = tun_ptr_to_xdp(ptr);
> +				act = tun_do_xdp_tx(tun, tfile, frame);
> +			} else {
> +				act = tun_do_xdp_tx_generic(tun, ptr);
> +			}
> +
> +			if (act == XDP_PASS)
> +				ptr_array[pkt_cnt++] = ptr;
> +		}
> +	}
> +
> +	rcu_read_unlock();
> +	return pkt_cnt;
> +}
> +
>  static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>  		       int flags)
>  {
> @@ -2577,9 +2726,7 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>  			ptr = ctl->ptr;
>  			break;
>  		case TUN_MSG_CONSUME_PKTS:
> -			ret = ptr_ring_consume_batched(&tfile->tx_ring,
> -						       ctl->ptr,
> -						       ctl->num);
> +			ret = tun_consume_packets(tfile, ctl->ptr, ctl->num);
>  			goto out;
>  		case TUN_MSG_UNCONSUME_PKTS:
>  			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,
> -- 
> 2.21.1 (Apple Git-122.3)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 07/12] vhost_net: user tap recvmsg api to access ptr ring
  2020-01-23  1:42 ` [PATCH bpf-next 07/12] vhost_net: user tap recvmsg api to access ptr ring David Ahern
@ 2020-01-23  8:26   ` Michael S. Tsirkin
  0 siblings, 0 replies; 58+ messages in thread
From: Michael S. Tsirkin @ 2020-01-23  8:26 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, prashantbhole.linux, jasowang, davem, jakub.kicinski,
	jbrouer, toke, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, dsahern

On Wed, Jan 22, 2020 at 06:42:05PM -0700, David Ahern wrote:
> From: Prashant Bhole <prashantbhole.linux@gmail.com>
> 
> Currently vhost_net directly accesses ptr ring of tap driver to
> fetch Rx packet pointers. In order to avoid it this patch modifies
> tap driver's recvmsg api to do additional task of fetching Rx packet
> pointers.
> 
> A special struct tun_msg_ctl is already being passed via msg_control
> for tun Rx XDP batching. This patch extends tun_msg_ctl usage to
> send sub commands to recvmsg api. Now tun_recvmsg will handle commands
> to consume and unconsume packet pointers from ptr ring.
> 
> This will be useful in implementation of tx path XDP in tun driver,
> where XDP program will process the packet before it is passed to
> vhost_net.
> 
> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> ---
>  drivers/net/tap.c      | 22 ++++++++++++++++++-
>  drivers/net/tun.c      | 24 ++++++++++++++++++++-
>  drivers/vhost/net.c    | 48 +++++++++++++++++++++++++++++++-----------
>  include/linux/if_tun.h | 18 ++++++++++++++++
>  4 files changed, 98 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/tap.c b/drivers/net/tap.c
> index a0a5dc18109a..a5ce44db11a3 100644
> --- a/drivers/net/tap.c
> +++ b/drivers/net/tap.c
> @@ -1224,8 +1224,28 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
>  		       size_t total_len, int flags)
>  {
>  	struct tap_queue *q = container_of(sock, struct tap_queue, sock);
> -	struct sk_buff *skb = m->msg_control;
> +	struct tun_msg_ctl *ctl = m->msg_control;
> +	struct sk_buff *skb = NULL;
>  	int ret;
> +
> +	if (ctl) {
> +		switch (ctl->type) {
> +		case TUN_MSG_PKT:
> +			skb = ctl->ptr;
> +			break;
> +		case TUN_MSG_CONSUME_PKTS:
> +			return ptr_ring_consume_batched(&q->ring,
> +							ctl->ptr,
> +							ctl->num);
> +		case TUN_MSG_UNCONSUME_PKTS:
> +			ptr_ring_unconsume(&q->ring, ctl->ptr, ctl->num,
> +					   tun_ptr_free);
> +			return 0;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
>  	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) {
>  		kfree_skb(skb);
>  		return -EINVAL;
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 6f12c32df346..197bde748c09 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2544,7 +2544,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,

Hmm what about tap_recvmsg then?

>  {
>  	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
>  	struct tun_struct *tun = tun_get(tfile);
> -	void *ptr = m->msg_control;
> +	struct tun_msg_ctl *ctl = m->msg_control;
> +	void *ptr = NULL;
>  	int ret;
>  
>  	if (!tun) {
> @@ -2552,6 +2553,27 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>  		goto out_free;
>  	}
>  

there's an extra tun_get/tun_put above.
Do we need them?
And if yes isn't  tun_ptr_free(ptr) on error a problem?



> +	if (ctl) {
> +		switch (ctl->type) {
> +		case TUN_MSG_PKT:
> +			ptr = ctl->ptr;
> +			break;
> +		case TUN_MSG_CONSUME_PKTS:
> +			ret = ptr_ring_consume_batched(&tfile->tx_ring,
> +						       ctl->ptr,
> +						       ctl->num);
> +			goto out;
> +		case TUN_MSG_UNCONSUME_PKTS:
> +			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,
> +					   ctl->num, tun_ptr_free);
> +			ret = 0;
> +			goto out;
> +		default:
> +			ret = -EINVAL;
> +			goto out_put_tun;
> +		}
> +	}
> +
>  	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
>  		ret = -EINVAL;
>  		goto out_put_tun;
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index e158159671fa..482548d00105 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -175,24 +175,44 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
>  
>  static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
>  {
> +	struct vhost_virtqueue *vq = &nvq->vq;
> +	struct socket *sock = vq->private_data;
>  	struct vhost_net_buf *rxq = &nvq->rxq;
> +	struct tun_msg_ctl ctl = {
> +		.type = TUN_MSG_CONSUME_PKTS,
> +		.ptr = (void *) rxq->queue,
> +		.num = VHOST_NET_BATCH,
> +	};
> +	struct msghdr msg = {
> +		.msg_control = &ctl,
> +	};
>  
>  	rxq->head = 0;
> -	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
> -					      VHOST_NET_BATCH);
> +	rxq->tail = sock->ops->recvmsg(sock, &msg, 0, 0);
> +	if (WARN_ON_ONCE(rxq->tail < 0))
> +		rxq->tail = 0;
> +
>  	return rxq->tail;
>  }

Hmm isn't there a way to avoid an indirect call here on data path?
ptr ring is a tun/tap thing anyway ...

>  
>  static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
>  {
> +	struct vhost_virtqueue *vq = &nvq->vq;
> +	struct socket *sock = vq->private_data;
>  	struct vhost_net_buf *rxq = &nvq->rxq;
> +	struct tun_msg_ctl ctl = {
> +		.type = TUN_MSG_UNCONSUME_PKTS,
> +		.ptr = (void *) (rxq->queue + rxq->head),

You don't really need to cast to void. An assignment will do.

> +		.num = vhost_net_buf_get_size(rxq),
> +	};
> +	struct msghdr msg = {
> +		.msg_control = &ctl,
> +	};
>  
> -	if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
> -		ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
> -				   vhost_net_buf_get_size(rxq),
> -				   tun_ptr_free);
> -		rxq->head = rxq->tail = 0;
> -	}
> +	if (!vhost_net_buf_is_empty(rxq))
> +		sock->ops->recvmsg(sock, &msg, 0, 0);
> +
> +	rxq->head = rxq->tail = 0;
>  }
>  
>  static int vhost_net_buf_peek_len(void *ptr)
> @@ -1109,6 +1129,7 @@ static void handle_rx(struct vhost_net *net)
>  		.flags = 0,
>  		.gso_type = VIRTIO_NET_HDR_GSO_NONE
>  	};
> +	struct tun_msg_ctl ctl;
>  	size_t total_len = 0;
>  	int err, mergeable;
>  	s16 headcount;
> @@ -1166,8 +1187,11 @@ static void handle_rx(struct vhost_net *net)
>  			goto out;
>  		}
>  		busyloop_intr = false;
> -		if (nvq->rx_ring)
> -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
> +		if (nvq->rx_ring) {
> +			ctl.type = TUN_MSG_PKT;
> +			ctl.ptr = vhost_net_buf_consume(&nvq->rxq);
> +			msg.msg_control = &ctl;
> +		}
>  		/* On overrun, truncate and discard */
>  		if (unlikely(headcount > UIO_MAXIOV)) {
>  			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
> @@ -1346,8 +1370,8 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
>  	mutex_lock(&vq->mutex);
>  	sock = vq->private_data;
>  	vhost_net_disable_vq(n, vq);
> -	vq->private_data = NULL;
>  	vhost_net_buf_unproduce(nvq);
> +	vq->private_data = NULL;
>  	nvq->rx_ring = NULL;


So is rx_ring still in use?

>  	mutex_unlock(&vq->mutex);
>  	return sock;
> @@ -1538,8 +1562,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
>  		}
>  
>  		vhost_net_disable_vq(n, vq);
> -		vq->private_data = sock;
>  		vhost_net_buf_unproduce(nvq);
> +		vq->private_data = sock;
>  		r = vhost_vq_init_access(vq);
>  		if (r)
>  			goto err_used;
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index 49ca20063a35..9184e3f177b8 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -12,8 +12,26 @@
>  
>  #define TUN_XDP_FLAG 0x1UL
>  
> +/*
> + * tun_msg_ctl types
> + */
> +
>  #define TUN_MSG_UBUF 1
>  #define TUN_MSG_PTR  2
> +/*
> + * Used for passing a packet pointer from vhost to tun
> + */
> +#define TUN_MSG_PKT  3
> +/*
> + * Used for passing an array of pointer from vhost to tun.
> + * tun consumes packets from ptr ring and stores in pointer array.
> + */
> +#define TUN_MSG_CONSUME_PKTS    4
> +/*
> + * Used for passing an array of pointer from vhost to tun.
> + * tun consumes get pointer from array and puts back into ptr ring.
> + */
> +#define TUN_MSG_UNCONSUME_PKTS  5
>  struct tun_msg_ctl {
>  	unsigned short type;
>  	unsigned short num;
> -- 
> 2.21.1 (Apple Git-122.3)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type
  2020-01-23  1:42 ` [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type David Ahern
@ 2020-01-23 11:34   ` Toke Høiland-Jørgensen
  2020-01-23 21:32     ` David Ahern
  2020-01-24  7:33   ` Martin Lau
  1 sibling, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-23 11:34 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

David Ahern <dsahern@kernel.org> writes:

> From: Prashant Bhole <prashantbhole.linux@gmail.com>
>
> Add new bpf_attach_type, BPF_XDP_EGRESS, for BPF programs attached
> at the XDP layer, but the egress path.
>
> Since egress path does not have rx_queue_index and ingress_ifindex set,
> update xdp_is_valid_access to block access to these entries in the xdp
> context when a program is attached to egress path.

Isn't the whole point of this to be able to use unchanged XDP programs?
But now you're introducing a semantic difference. Since supposedly only
point-to-point links are going to be using this attach type, don't they
know enough about their peer device to be able to populate those fields
with meaningful values, instead of restricting access to them?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-23  1:42 ` [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path David Ahern
@ 2020-01-23 11:35   ` Toke Høiland-Jørgensen
  2020-01-23 21:33     ` David Ahern
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-23 11:35 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, David Ahern

David Ahern <dsahern@kernel.org> writes:

> From: David Ahern <dahern@digitalocean.com>
>
> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> is that going forward egress path will acquire the various levels of
> attach - generic, driver and hardware.

How would a 'hardware' attach work for this? As I said in my reply to
the previous patch, isn't this explicitly for emulating XDP on the other
end of a point-to-point link? How would that work with offloaded
programs?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type
  2020-01-23 11:34   ` Toke Høiland-Jørgensen
@ 2020-01-23 21:32     ` David Ahern
  2020-01-24  9:49       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-01-23 21:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, David Ahern, netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On 1/23/20 4:34 AM, Toke Høiland-Jørgensen wrote:
> David Ahern <dsahern@kernel.org> writes:
> 
>> From: Prashant Bhole <prashantbhole.linux@gmail.com>
>>
>> Add new bpf_attach_type, BPF_XDP_EGRESS, for BPF programs attached
>> at the XDP layer, but the egress path.
>>
>> Since egress path does not have rx_queue_index and ingress_ifindex set,
>> update xdp_is_valid_access to block access to these entries in the xdp
>> context when a program is attached to egress path.
> 
> Isn't the whole point of this to be able to use unchanged XDP programs?

See patch 12. Only the userspace code was changed to load the same
program with the egress attach type set.

The verifier needs to check the egress program does not access Rx only
entries in xdp_md context. The attach type allows that check.

> But now you're introducing a semantic difference. Since supposedly only
> point-to-point links are going to be using this attach type, don't they
> know enough about their peer device to be able to populate those fields
> with meaningful values, instead of restricting access to them?
> 

You are conflating use cases. Don't assume point to point or peer devices.

This could be a REDIRECT from eth0 to eth1 and then an EGRESS program on
eth1 to do something. In the current test scenario it is REDIRECT from
eth0 to tapN and then on tapN run an egress program (Tx for a tap is
ingress to the VM).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-23 11:35   ` Toke Høiland-Jørgensen
@ 2020-01-23 21:33     ` David Ahern
  2020-01-24 15:21       ` Jakub Kicinski
  0 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-01-23 21:33 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, David Ahern, netdev
  Cc: prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
> David Ahern <dsahern@kernel.org> writes:
> 
>> From: David Ahern <dahern@digitalocean.com>
>>
>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>> is that going forward egress path will acquire the various levels of
>> attach - generic, driver and hardware.
> 
> How would a 'hardware' attach work for this? As I said in my reply to
> the previous patch, isn't this explicitly for emulating XDP on the other
> end of a point-to-point link? How would that work with offloaded
> programs?
> 
> -Toke
> 

Nothing about this patch set is limited to point-to-point links.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type
  2020-01-23  1:42 ` [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type David Ahern
  2020-01-23 11:34   ` Toke Høiland-Jørgensen
@ 2020-01-24  7:33   ` Martin Lau
  1 sibling, 0 replies; 58+ messages in thread
From: Martin Lau @ 2020-01-24  7:33 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, prashantbhole.linux, jasowang, davem, jakub.kicinski,
	jbrouer, toke, mst, toshiaki.makita1, daniel, john.fastabend,
	ast, Song Liu, Yonghong Song, Andrii Nakryiko, dsahern,
	David Ahern

On Wed, Jan 22, 2020 at 06:42:00PM -0700, David Ahern wrote:
> From: Prashant Bhole <prashantbhole.linux@gmail.com>
> 
> Add new bpf_attach_type, BPF_XDP_EGRESS, for BPF programs attached
> at the XDP layer, but the egress path.
> 
> Since egress path does not have rx_queue_index and ingress_ifindex set,
> update xdp_is_valid_access to block access to these entries in the xdp
> context when a program is attached to egress path.
> 
> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> Signed-off-by: David Ahern <dahern@digitalocean.com>
> ---
>  include/uapi/linux/bpf.h       | 1 +
>  net/core/filter.c              | 8 ++++++++
>  tools/include/uapi/linux/bpf.h | 1 +
>  3 files changed, 10 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 033d90a2282d..72f2a9a4621e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -209,6 +209,7 @@ enum bpf_attach_type {
>  	BPF_TRACE_RAW_TP,
>  	BPF_TRACE_FENTRY,
>  	BPF_TRACE_FEXIT,
> +	BPF_XDP_EGRESS,
>  	__MAX_BPF_ATTACH_TYPE
>  };
>  
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 17de6747d9e3..a903f3a15d74 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6803,6 +6803,14 @@ static bool xdp_is_valid_access(int off, int size,
>  		return false;
>  	}
>  
> +	if (prog->expected_attach_type == BPF_XDP_EGRESS) {
For BPF_PROG_TYPE_XDP, the expected_attach_type is currently not
enforced to be 0 in bpf_prog_load_check_attach().  Not sure if it
is ok to test it here and also return false in some of the
following switch cases.

> +		switch (off) {
> +		case offsetof(struct xdp_md, rx_queue_index):
> +		case offsetof(struct xdp_md, ingress_ifindex):
> +			return false;
> +		}
> +	}
> +
>  	switch (off) {
>  	case offsetof(struct xdp_md, data):
>  		info->reg_type = PTR_TO_PACKET;

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type
  2020-01-23 21:32     ` David Ahern
@ 2020-01-24  9:49       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-24  9:49 UTC (permalink / raw)
  To: David Ahern, David Ahern, netdev
  Cc: prashantbhole.linux, jasowang, davem, jakub.kicinski, jbrouer,
	mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

David Ahern <dsahern@gmail.com> writes:

> On 1/23/20 4:34 AM, Toke Høiland-Jørgensen wrote:
>> David Ahern <dsahern@kernel.org> writes:
>> 
>>> From: Prashant Bhole <prashantbhole.linux@gmail.com>
>>>
>>> Add new bpf_attach_type, BPF_XDP_EGRESS, for BPF programs attached
>>> at the XDP layer, but the egress path.
>>>
>>> Since egress path does not have rx_queue_index and ingress_ifindex set,
>>> update xdp_is_valid_access to block access to these entries in the xdp
>>> context when a program is attached to egress path.
>> 
>> Isn't the whole point of this to be able to use unchanged XDP programs?
>
> See patch 12. Only the userspace code was changed to load the same
> program with the egress attach type set.
>
> The verifier needs to check the egress program does not access Rx only
> entries in xdp_md context. The attach type allows that check.
>
>> But now you're introducing a semantic difference. Since supposedly only
>> point-to-point links are going to be using this attach type, don't they
>> know enough about their peer device to be able to populate those fields
>> with meaningful values, instead of restricting access to them?
>> 
>
> You are conflating use cases. Don't assume point to point or peer devices.
>
> This could be a REDIRECT from eth0 to eth1 and then an EGRESS program on
> eth1 to do something. In the current test scenario it is REDIRECT from
> eth0 to tapN and then on tapN run an egress program (Tx for a tap is
> ingress to the VM).

But why would any hardware driver implement this program type, instead
of the proper TX hook? I thought the whole idea of this "third"
not-quite-TX program type was the virtualisation offload case. If you
just want to be able to run eBPF code in the TX path, why not implement
the proper TX hook straight away?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 10/12] tun: run XDP program in tx path
  2020-01-23  8:23   ` Michael S. Tsirkin
@ 2020-01-24 13:36     ` Prashant Bhole
  2020-01-24 13:44     ` Prashant Bhole
  1 sibling, 0 replies; 58+ messages in thread
From: Prashant Bhole @ 2020-01-24 13:36 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Ahern
  Cc: netdev, jasowang, davem, jakub.kicinski, jbrouer, toke,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, dsahern, Prashant Bhole



On 1/23/2020 5:23 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 22, 2020 at 06:42:08PM -0700, David Ahern wrote:
>> From: Prashant Bhole <prashantbhole.linux@gmail.com>
>>
>> Run the XDP program as soon as packet is removed from the ptr
>> ring. Since this is XDP in tx path, the traditional handling of
>> XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
>> do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
>> just runs the program and leaves the action handling to us.
>>
>> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
>> ---
>>   drivers/net/tun.c | 153 +++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 150 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index b6bac773f2a0..71bcd4ec2571 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -130,6 +130,7 @@ struct tap_filter {
>>   /* MAX_TAP_QUEUES 256 is chosen to allow rx/tx queues to be equal
>>    * to max number of VCPUs in guest. */
>>   #define MAX_TAP_QUEUES 256
>> +#define MAX_TAP_BATCH 64
>>   #define MAX_TAP_FLOWS  4096
>>   
>>   #define TUN_FLOW_EXPIRE (3 * HZ)
>> @@ -175,6 +176,7 @@ struct tun_file {
>>   	struct tun_struct *detached;
>>   	struct ptr_ring tx_ring;
>>   	struct xdp_rxq_info xdp_rxq;
>> +	void *pkt_ptrs[MAX_TAP_BATCH];
>>   };
>>   
>>   struct tun_page {
>> @@ -2140,6 +2142,107 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>   	return total;
>>   }
>>   
>> +static struct sk_buff *tun_prepare_xdp_skb(struct sk_buff *skb)
>> +{
>> +	struct sk_buff *nskb;
>> +
>> +	if (skb_shared(skb) || skb_cloned(skb)) {
>> +		nskb = skb_copy(skb, GFP_ATOMIC);
>> +		consume_skb(skb);
>> +		return nskb;
>> +	}
>> +
>> +	return skb;
>> +}
>> +
>> +static u32 tun_do_xdp_tx_generic(struct tun_struct *tun,
>> +				 struct sk_buff *skb)
>> +{
>> +	struct bpf_prog *xdp_prog;
>> +	struct xdp_buff xdp;
>> +	u32 act = XDP_PASS;
>> +
>> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
>> +	if (xdp_prog) {
>> +		skb = tun_prepare_xdp_skb(skb);
>> +		if (!skb) {
>> +			act = XDP_DROP;
>> +			kfree_skb(skb);
>> +			goto drop;
>> +		}
>> +
>> +		act = do_xdp_generic_core(skb, &xdp, xdp_prog);
>> +		switch (act) {
>> +		case XDP_TX:
>> +			/* Rx path generic XDP will be called in this path
>> +			 */
>> +			local_bh_disable();
>> +			netif_receive_skb(skb);
>> +			local_bh_enable();
>> +			break;
>> +		case XDP_PASS:
>> +			break;
>> +		case XDP_REDIRECT:
>> +			/* Since we are not handling this case yet, let's free
>> +			 * skb here. In case of XDP_DROP/XDP_ABORTED, the skb
>> +			 * was already freed in do_xdp_generic_core()
>> +			 */
>> +			kfree_skb(skb);
>> +			/* fall through */
>> +		default:
>> +			bpf_warn_invalid_xdp_action(act);
>> +			/* fall through */
>> +		case XDP_ABORTED:
>> +			trace_xdp_exception(tun->dev, xdp_prog, act);
>> +			/* fall through */
>> +		case XDP_DROP:
>> +			goto drop;
>> +		}
>> +	}
>> +
>> +	return act;
>> +drop:
>> +	this_cpu_inc(tun->pcpu_stats->tx_dropped);
>> +	return act;
>> +}
>> +
>> +static u32 tun_do_xdp_tx(struct tun_struct *tun, struct tun_file *tfile,
>> +			 struct xdp_frame *frame)
>> +{
>> +	struct bpf_prog *xdp_prog;
>> +	struct xdp_buff xdp;
>> +	u32 act = XDP_PASS;
>> +
>> +	xdp_prog = rcu_dereference(tun->xdp_egress_prog);
>> +	if (xdp_prog) {
>> +		xdp.data_hard_start = frame->data - frame->headroom;
>> +		xdp.data = frame->data;
>> +		xdp.data_end = xdp.data + frame->len;
>> +		xdp.data_meta = xdp.data - frame->metasize;
>> +
>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +		switch (act) {
>> +		case XDP_PASS:
>> +			break;
>> +		case XDP_TX:
>> +			/* fall through */
>> +		case XDP_REDIRECT:
>> +			/* fall through */
>> +		default:
>> +			bpf_warn_invalid_xdp_action(act);
>> +			/* fall through */
>> +		case XDP_ABORTED:
>> +			trace_xdp_exception(tun->dev, xdp_prog, act);
>> +			/* fall through */
>> +		case XDP_DROP:
>> +			xdp_return_frame_rx_napi(frame);
>> +			break;
>> +		}
>> +	}
>> +
>> +	return act;
>> +}
>> +
>>   static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
>>   {
>>   	DECLARE_WAITQUEUE(wait, current);
>> @@ -2557,6 +2660,52 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>>   	return ret;
>>   }
>>   
>> +static int tun_consume_packets(struct tun_file *tfile, void **ptr_array, int n)
>> +{
>> +	void **pkts = tfile->pkt_ptrs;
>> +	struct xdp_frame *frame;
>> +	struct tun_struct *tun;
>> +	int i, num_ptrs;
>> +	int pkt_cnt = 0;
>> +	void *ptr;
>> +	u32 act;
>> +	int batchsz;
>> +
>> +	if (unlikely(!tfile))
>> +		return 0;
>> +
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (unlikely(!tun)) {
>> +		rcu_read_unlock();
>> +		return 0;
>> +	}
>> +
>> +	while (n) {
>> +		batchsz = (n > MAX_TAP_BATCH) ? MAX_TAP_BATCH : n;
>> +		n -= batchsz;
>> +		num_ptrs = ptr_ring_consume_batched(&tfile->tx_ring, pkts,
>> +						    batchsz);
>> +		if (!num_ptrs)
>> +			break;
> 
> Can't we avoid looping over the packets in the current case
> where there are no xdp programs at all?

That's doable. Thanks.

Prashant
> 
> 
>> +		for (i = 0; i < num_ptrs; i++) {
>> +			ptr = pkts[i];
>> +			if (tun_is_xdp_frame(ptr)) {
>> +				frame = tun_ptr_to_xdp(ptr);
>> +				act = tun_do_xdp_tx(tun, tfile, frame);
>> +			} else {
>> +				act = tun_do_xdp_tx_generic(tun, ptr);
>> +			}
>> +
>> +			if (act == XDP_PASS)
>> +				ptr_array[pkt_cnt++] = ptr;
>> +		}
>> +	}
>> +
>> +	rcu_read_unlock();
>> +	return pkt_cnt;
>> +}
>> +
>>   static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>>   		       int flags)
>>   {
>> @@ -2577,9 +2726,7 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
>>   			ptr = ctl->ptr;
>>   			break;
>>   		case TUN_MSG_CONSUME_PKTS:
>> -			ret = ptr_ring_consume_batched(&tfile->tx_ring,
>> -						       ctl->ptr,
>> -						       ctl->num);
>> +			ret = tun_consume_packets(tfile, ctl->ptr, ctl->num);
>>   			goto out;
>>   		case TUN_MSG_UNCONSUME_PKTS:
>>   			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,
>> -- 
>> 2.21.1 (Apple Git-122.3)
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 10/12] tun: run XDP program in tx path
  2020-01-23  8:23   ` Michael S. Tsirkin
  2020-01-24 13:36     ` Prashant Bhole
@ 2020-01-24 13:44     ` Prashant Bhole
  1 sibling, 0 replies; 58+ messages in thread
From: Prashant Bhole @ 2020-01-24 13:44 UTC (permalink / raw)
  To: Michael S. Tsirkin, David Ahern
  Cc: netdev, prashantbhole.linux, jasowang, davem, jakub.kicinski,
	jbrouer, toke, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, dsahern



On 1/23/2020 5:23 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 22, 2020 at 06:42:08PM -0700, David Ahern wrote:
>> From: Prashant Bhole <prashantbhole.linux@gmail.com>
>>
>> Run the XDP program as soon as packet is removed from the ptr
>> ring. Since this is XDP in tx path, the traditional handling of
>> XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
>> do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
>> just runs the program and leaves the action handling to us.
>>
>> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
>> ---
>>   drivers/net/tun.c | 153 +++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 150 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c

[...]

>> +static int tun_consume_packets(struct tun_file *tfile, void **ptr_array, int n)
>> +{
>> +	void **pkts = tfile->pkt_ptrs;
>> +	struct xdp_frame *frame;
>> +	struct tun_struct *tun;
>> +	int i, num_ptrs;
>> +	int pkt_cnt = 0;
>> +	void *ptr;
>> +	u32 act;
>> +	int batchsz;
>> +
>> +	if (unlikely(!tfile))
>> +		return 0;
>> +
>> +	rcu_read_lock();
>> +	tun = rcu_dereference(tfile->tun);
>> +	if (unlikely(!tun)) {
>> +		rcu_read_unlock();
>> +		return 0;
>> +	}
>> +
>> +	while (n) {
>> +		batchsz = (n > MAX_TAP_BATCH) ? MAX_TAP_BATCH : n;
>> +		n -= batchsz;
>> +		num_ptrs = ptr_ring_consume_batched(&tfile->tx_ring, pkts,
>> +						    batchsz);
>> +		if (!num_ptrs)
>> +			break;
> 
> Can't we avoid looping over the packets in the current case
> where there are no xdp programs at all?

That's doable. Thanks.

Prashant

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-23 21:33     ` David Ahern
@ 2020-01-24 15:21       ` Jakub Kicinski
  2020-01-24 15:36         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2020-01-24 15:21 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
> > David Ahern <dsahern@kernel.org> writes:
> >> From: David Ahern <dahern@digitalocean.com>
> >>
> >> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> >> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> >> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> >> is that going forward egress path will acquire the various levels of
> >> attach - generic, driver and hardware.  
> > 
> > How would a 'hardware' attach work for this? As I said in my reply to
> > the previous patch, isn't this explicitly for emulating XDP on the other
> > end of a point-to-point link? How would that work with offloaded
> > programs?
> 
> Nothing about this patch set is limited to point-to-point links.

I struggle to understand of what the expected semantics of this new
hook are. Is this going to be run on all frames sent to the device
from the stack? All frames from the stack and from XDP_REDIRECT?

A little hard to figure out the semantics when we start from a funky
device like tun :S

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-24 15:21       ` Jakub Kicinski
@ 2020-01-24 15:36         ` Toke Høiland-Jørgensen
  2020-01-26  1:43           ` David Ahern
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-01-24 15:36 UTC (permalink / raw)
  To: Jakub Kicinski, David Ahern
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem,
	jbrouer, mst, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, David Ahern

Jakub Kicinski <kuba@kernel.org> writes:

> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
>> > David Ahern <dsahern@kernel.org> writes:
>> >> From: David Ahern <dahern@digitalocean.com>
>> >>
>> >> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>> >> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>> >> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>> >> is that going forward egress path will acquire the various levels of
>> >> attach - generic, driver and hardware.  
>> > 
>> > How would a 'hardware' attach work for this? As I said in my reply to
>> > the previous patch, isn't this explicitly for emulating XDP on the other
>> > end of a point-to-point link? How would that work with offloaded
>> > programs?
>> 
>> Nothing about this patch set is limited to point-to-point links.
>
> I struggle to understand of what the expected semantics of this new
> hook are. Is this going to be run on all frames sent to the device
> from the stack? All frames from the stack and from XDP_REDIRECT?
>
> A little hard to figure out the semantics when we start from a funky
> device like tun :S

Yes, that is also why I found this a bit weird. We have discussed plans
for an XDP TX hook before:
https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx

That TX hook would run for everything at TX, but it would be a separate
program type with its own metadata access. Whereas the idea with this
series (seemed to me) to be just to be able to "emulate" run a regular
RX-side XDP program on egress for devices where this makes sense.

If this series is not meant to implement that "emulation", but rather be
usable for all devices, I really think we should go straight for the
full TX hook as discussed earlier...

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-24 15:36         ` Toke Høiland-Jørgensen
@ 2020-01-26  1:43           ` David Ahern
  2020-01-26  4:54             ` Alexei Starovoitov
                               ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: David Ahern @ 2020-01-26  1:43 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jakub Kicinski
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem,
	jbrouer, mst, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, David Ahern

On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> 
>> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
>>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
>>>> David Ahern <dsahern@kernel.org> writes:
>>>>> From: David Ahern <dahern@digitalocean.com>
>>>>>
>>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>>>>> is that going forward egress path will acquire the various levels of
>>>>> attach - generic, driver and hardware.  
>>>>
>>>> How would a 'hardware' attach work for this? As I said in my reply to
>>>> the previous patch, isn't this explicitly for emulating XDP on the other
>>>> end of a point-to-point link? How would that work with offloaded
>>>> programs?
>>>
>>> Nothing about this patch set is limited to point-to-point links.
>>
>> I struggle to understand of what the expected semantics of this new
>> hook are. Is this going to be run on all frames sent to the device
>> from the stack? All frames from the stack and from XDP_REDIRECT?
>>
>> A little hard to figure out the semantics when we start from a funky
>> device like tun :S
> 
> Yes, that is also why I found this a bit weird. We have discussed plans
> for an XDP TX hook before:
> https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
> 
> That TX hook would run for everything at TX, but it would be a separate
> program type with its own metadata access. Whereas the idea with this
> series (seemed to me) to be just to be able to "emulate" run a regular
> RX-side XDP program on egress for devices where this makes sense.
> 
> If this series is not meant to implement that "emulation", but rather be
> usable for all devices, I really think we should go straight for the
> full TX hook as discussed earlier...
> 

The first patch set from Jason and Prashant started from the perspective
of offloading XDP programs for a guest. Independently, I was looking at
XDP in the TX path (now referred to as egress to avoid confusion with
the XDP_TX return type). Jason and Prashant were touching some of the
same code paths in the tun driver that I needed for XDP in the Tx path,
so we decided to consolidate and have XDP egress done first and then
offload of VMs as a followup. Offload in virtio_net can be done very
similar to how it is done in nfp -- the program is passed to the host as
a hardware level attach mode, and the driver verifies the program can be
offloaded (e.g., does not contain helpers that expose host specific data
like the fib lookup helper).

At this point, you need to stop thinking solely from the perspective of
tun or tap and VM offload; think about this from the ability to run an
XDP program on egress path at an appropriate place in the NIC driver
that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has
been discussed before as a need (e.g, Toke's reference above), and I am
trying to get this initial support done.

I very much wanted to avoid copy-paste-modify for the entire XDP API for
this. For the most part XDP means ebpf at the NIC driver / hardware
level (obviously with the exception of generic mode). The goal is
tempered with the need for the verifier to reject rx entries in the
xdp_md context. Hence the reason for use of an attach_type - existing
infrastructure to test and reject the accesses.

That said, Martin's comment throws a wrench in the goal: if the existing
code does not enforce expected_attach_type then that option can not be
used in which case I guess I have to go with a new program type
(BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
has different return codes, etc.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26  1:43           ` David Ahern
@ 2020-01-26  4:54             ` Alexei Starovoitov
  2020-02-02 17:59               ` David Ahern
  2020-01-26 12:49             ` Jesper Dangaard Brouer
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 58+ messages in thread
From: Alexei Starovoitov @ 2020-01-26  4:54 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, Jakub Kicinski, David Ahern,
	netdev, prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Sat, Jan 25, 2020 at 06:43:36PM -0700, David Ahern wrote:
> 
> That said, Martin's comment throws a wrench in the goal: if the existing
> code does not enforce expected_attach_type then that option can not be
> used in which case I guess I have to go with a new program type
> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
> has different return codes, etc.

This is acceptable risk. We did such thing in the past. The chances of
user space breakage are extremely low.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26  1:43           ` David Ahern
  2020-01-26  4:54             ` Alexei Starovoitov
@ 2020-01-26 12:49             ` Jesper Dangaard Brouer
  2020-01-26 16:38               ` David Ahern
  2020-01-26 22:17               ` Jakub Kicinski
  2020-01-26 22:11             ` Jakub Kicinski
  2020-02-01 15:59             ` Toke Høiland-Jørgensen
  3 siblings, 2 replies; 58+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-26 12:49 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, Jakub Kicinski, David Ahern,
	netdev, prashantbhole.linux, jasowang, davem, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Sat, 25 Jan 2020 18:43:36 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
> > Jakub Kicinski <kuba@kernel.org> writes:
> >   
> >> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:  
> >>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:  
> >>>> David Ahern <dsahern@kernel.org> writes:  
> >>>>> From: David Ahern <dahern@digitalocean.com>
> >>>>>
> >>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> >>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> >>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> >>>>> is that going forward egress path will acquire the various levels of
> >>>>> attach - generic, driver and hardware.    
> >>>>
> >>>> How would a 'hardware' attach work for this? As I said in my reply to
> >>>> the previous patch, isn't this explicitly for emulating XDP on the other
> >>>> end of a point-to-point link? How would that work with offloaded
> >>>> programs?  
> >>>
> >>> Nothing about this patch set is limited to point-to-point links.  
> >>
> >> I struggle to understand of what the expected semantics of this new
> >> hook are. Is this going to be run on all frames sent to the device
> >> from the stack? All frames from the stack and from XDP_REDIRECT?
> >>
> >> A little hard to figure out the semantics when we start from a funky
> >> device like tun :S  
> > 
> > Yes, that is also why I found this a bit weird. We have discussed plans
> > for an XDP TX hook before:
> > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
> > 
> > That TX hook would run for everything at TX, but it would be a separate
> > program type with its own metadata access. Whereas the idea with this
> > series (seemed to me) to be just to be able to "emulate" run a regular
> > RX-side XDP program on egress for devices where this makes sense.
> > 
> > If this series is not meant to implement that "emulation", but rather be
> > usable for all devices, I really think we should go straight for the
> > full TX hook as discussed earlier...
> >   
> 
> The first patch set from Jason and Prashant started from the perspective
> of offloading XDP programs for a guest. Independently, I was looking at
> XDP in the TX path (now referred to as egress to avoid confusion with
> the XDP_TX return type). Jason and Prashant were touching some of the
> same code paths in the tun driver that I needed for XDP in the Tx path,
> so we decided to consolidate and have XDP egress done first and then
> offload of VMs as a followup. Offload in virtio_net can be done very
> similar to how it is done in nfp -- the program is passed to the host as
> a hardware level attach mode, and the driver verifies the program can be
> offloaded (e.g., does not contain helpers that expose host specific data
> like the fib lookup helper).
> 
> At this point, you need to stop thinking solely from the perspective of
> tun or tap and VM offload; think about this from the ability to run an
> XDP program on egress path at an appropriate place in the NIC driver
> that covers both skbs and xdp_frames (e.g., on a REDIRECT).

Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.


> This has
> been discussed before as a need (e.g, Toke's reference above), and I am
> trying to get this initial support done.
> 
> I very much wanted to avoid copy-paste-modify for the entire XDP API for
> this. For the most part XDP means ebpf at the NIC driver / hardware
> level (obviously with the exception of generic mode). The goal is
> tempered with the need for the verifier to reject rx entries in the
> xdp_md context. Hence the reason for use of an attach_type - existing
> infrastructure to test and reject the accesses.
> 
> That said, Martin's comment throws a wrench in the goal: if the existing
> code does not enforce expected_attach_type then that option can not be
> used in which case I guess I have to go with a new program type
> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
> has different return codes, etc.

Taking about return codes.  Does XDP the return codes make sense for
this EGRESS hook? (if thinking about this being egress on the real NIC).

E.g. XDP_REDIRECT would have to be supported, which is interesting, but
also have implications (like looping packets).

E.g. what is the semantics/action of XDP_TX return code?

E.g. I'm considering adding a XDP_CONGESTED return code that can cause
backpressure towards qdisc layer.

Also think about that if this EGRESS hook uses standard prog type for
XDP (BPF_PROG_TYPE_XDP), then we need to convert xdp_frame to xdp_buff
(and also convert SKBs to xdp_buff).

Are we sure that reusing the same bpf prog type is the right choice?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26 12:49             ` Jesper Dangaard Brouer
@ 2020-01-26 16:38               ` David Ahern
  2020-01-26 22:17               ` Jakub Kicinski
  1 sibling, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-01-26 16:38 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toke Høiland-Jørgensen, Jakub Kicinski, David Ahern,
	netdev, prashantbhole.linux, jasowang, davem, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On 1/26/20 5:49 AM, Jesper Dangaard Brouer wrote:
>> This has
>> been discussed before as a need (e.g, Toke's reference above), and I am
>> trying to get this initial support done.
>>
>> I very much wanted to avoid copy-paste-modify for the entire XDP API for
>> this. For the most part XDP means ebpf at the NIC driver / hardware
>> level (obviously with the exception of generic mode). The goal is
>> tempered with the need for the verifier to reject rx entries in the
>> xdp_md context. Hence the reason for use of an attach_type - existing
>> infrastructure to test and reject the accesses.
>>
>> That said, Martin's comment throws a wrench in the goal: if the existing
>> code does not enforce expected_attach_type then that option can not be
>> used in which case I guess I have to go with a new program type
>> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
>> has different return codes, etc.
> 
> Taking about return codes.  Does XDP the return codes make sense for
> this EGRESS hook? (if thinking about this being egress on the real NIC).
> 
> E.g. XDP_REDIRECT would have to be supported, which is interesting, but
> also have implications (like looping packets).
> 
> E.g. what is the semantics/action of XDP_TX return code?

This has been discussed. XDP_TX in the EGRESS path could arguably be
equal to XDP_PASS.

> 
> E.g. I'm considering adding a XDP_CONGESTED return code that can cause
> backpressure towards qdisc layer.
> 
> Also think about that if this EGRESS hook uses standard prog type for
> XDP (BPF_PROG_TYPE_XDP), then we need to convert xdp_frame to xdp_buff
> (and also convert SKBs to xdp_buff).

Why? What about the patch set requires that change to be done to have
support for EGRESS path?

> 
> Are we sure that reusing the same bpf prog type is the right choice?
> 

Martin's comment about existing checking on the expected attach type is
the only reason I have seen so far to not have the same program type.

Looking at the helpers for use in XDP programs do you believe any of
those should not be allowed with EGRESS programs? Do you have any reason
to think that existing XDP capabilities should be prohibited or
different for EGRESS? As mentioned earlier the attach type can be used
to have the verifier handle small context differences (and restrict
helpers if needed).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26  1:43           ` David Ahern
  2020-01-26  4:54             ` Alexei Starovoitov
  2020-01-26 12:49             ` Jesper Dangaard Brouer
@ 2020-01-26 22:11             ` Jakub Kicinski
  2020-01-27  4:03               ` David Ahern
  2020-02-01 15:59             ` Toke Høiland-Jørgensen
  3 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2020-01-26 22:11 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Sat, 25 Jan 2020 18:43:36 -0700, David Ahern wrote:
> On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
> > Jakub Kicinski <kuba@kernel.org> writes:
> >> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:  
> >>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:  
> >>>> David Ahern <dsahern@kernel.org> writes:  
> >>>>> From: David Ahern <dahern@digitalocean.com>
> >>>>>
> >>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
> >>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
> >>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
> >>>>> is that going forward egress path will acquire the various levels of
> >>>>> attach - generic, driver and hardware.    
> >>>>
> >>>> How would a 'hardware' attach work for this? As I said in my reply to
> >>>> the previous patch, isn't this explicitly for emulating XDP on the other
> >>>> end of a point-to-point link? How would that work with offloaded
> >>>> programs?  
> >>>
> >>> Nothing about this patch set is limited to point-to-point links.  
> >>
> >> I struggle to understand of what the expected semantics of this new
> >> hook are. Is this going to be run on all frames sent to the device
> >> from the stack? All frames from the stack and from XDP_REDIRECT?
> >>
> >> A little hard to figure out the semantics when we start from a funky
> >> device like tun :S  
> > 
> > Yes, that is also why I found this a bit weird. We have discussed plans
> > for an XDP TX hook before:
> > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
> > 
> > That TX hook would run for everything at TX, but it would be a separate
> > program type with its own metadata access. Whereas the idea with this
> > series (seemed to me) to be just to be able to "emulate" run a regular
> > RX-side XDP program on egress for devices where this makes sense.
> > 
> > If this series is not meant to implement that "emulation", but rather be
> > usable for all devices, I really think we should go straight for the
> > full TX hook as discussed earlier...
> 
> The first patch set from Jason and Prashant started from the perspective
> of offloading XDP programs for a guest. Independently, I was looking at
> XDP in the TX path (now referred to as egress to avoid confusion with
> the XDP_TX return type). 

I looked through the commit message and the cover letter again, and you
never explain why you need the egress hook. Could you please clarify
your needs? If it's container-related maybe what Daniel talked about at
last netconf could be a better solution?

I can't quite square the concept of XDP which started as close to the
metal BPF hook for HW drivers, and this heavily SW-focused addition.

> Jason and Prashant were touching some of the
> same code paths in the tun driver that I needed for XDP in the Tx path,
> so we decided to consolidate and have XDP egress done first and then
> offload of VMs as a followup. Offload in virtio_net can be done very
> similar to how it is done in nfp -- the program is passed to the host as
> a hardware level attach mode, and the driver verifies the program can be
> offloaded (e.g., does not contain helpers that expose host specific data
> like the fib lookup helper).

<rant>

I'd ask to please never compare this work to the nfp offload. Netronome
was able to open up their NIC down to the instruction set level, with
the JIT in tree and rest of the FW open source:

https://github.com/Netronome/nic-firmware/

and that work is now used as precedent for something that risks turning
the kernel into a damn control plane for proprietary clouds?

I can see how they may seem similar in operational terms, but for
people who care about open source they couldn't be more different.

</rant>

> At this point, you need to stop thinking solely from the perspective of
> tun or tap and VM offload; think about this from the ability to run an
> XDP program on egress path at an appropriate place in the NIC driver
> that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has
> been discussed before as a need (e.g, Toke's reference above), and I am
> trying to get this initial support done.

TX hook related to queuing is a very different beast than just a RX
hook flipped. The queuing is a problem that indeed needs work, but just
adding a mirror RX hook does not solve that, and establishes semantics
which may be counter productive. That's why I was asking for clear
semantics.

> I very much wanted to avoid copy-paste-modify for the entire XDP API for
> this. For the most part XDP means ebpf at the NIC driver / hardware
> level (obviously with the exception of generic mode). The goal is
> tempered with the need for the verifier to reject rx entries in the
> xdp_md context. Hence the reason for use of an attach_type - existing
> infrastructure to test and reject the accesses.

For the offload host rx queue == dev PCI tx queue and vice versa.
So other than the name the rejection makes no sense. Just add a union
to xdp_md so both tx and rx names can be used.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26 12:49             ` Jesper Dangaard Brouer
  2020-01-26 16:38               ` David Ahern
@ 2020-01-26 22:17               ` Jakub Kicinski
  2020-01-28 14:13                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2020-01-26 22:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Ahern, Toke Høiland-Jørgensen, David Ahern,
	netdev, prashantbhole.linux, jasowang, davem, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:
> Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.

Any pointers on what for? Unless we see actual use cases there's
a justifiable concern of the entire thing just being an application of
"We can solve any problem by introducing an extra level of indirection."

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26 22:11             ` Jakub Kicinski
@ 2020-01-27  4:03               ` David Ahern
  2020-01-27 14:16                 ` Jakub Kicinski
  0 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-01-27  4:03 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On 1/26/20 3:11 PM, Jakub Kicinski wrote:
> 
> I looked through the commit message and the cover letter again, and you
> never explain why you need the egress hook. Could you please clarify
> your needs? 

XDP is about efficient network processing - ie., bypassing the Linux
stack when it does not make sense for the person deploying some
solution. XDP right now is Rx centric.

I want to run an ebpf program in the Tx path of the NIC regardless of
how the packet arrived at the device -- as an skb or an xdp_frame. There
are options for running programs on skb-based packets (e.g., tc). There
are *zero* options for manipulating/controlling/denying xdp_frames -
e.g., one REDIRECTED from an ingress device.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-27  4:03               ` David Ahern
@ 2020-01-27 14:16                 ` Jakub Kicinski
  2020-01-28  3:43                   ` David Ahern
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2020-01-27 14:16 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Sun, 26 Jan 2020 21:03:15 -0700, David Ahern wrote:
> On 1/26/20 3:11 PM, Jakub Kicinski wrote:
> > I looked through the commit message and the cover letter again, and you
> > never explain why you need the egress hook. Could you please clarify
> > your needs?   
> 
> XDP is about efficient network processing - ie., bypassing the Linux
> stack when it does not make sense for the person deploying some
> solution. XDP right now is Rx centric.

Network hardware is also "Rx centric" and somehow it works..

> I want to run an ebpf program in the Tx path of the NIC regardless of
> how the packet arrived at the device -- as an skb or an xdp_frame. There
> are options for running programs on skb-based packets (e.g., tc). There
> are *zero* options for manipulating/controlling/denying xdp_frames -
> e.g., one REDIRECTED from an ingress device.

Okay - so no precise use case.  You can run the same program at the 
end of whatever is doing the redirect (especially with Alexei's work 
on linking) and from cls_bpf 🤷‍♂️

I'm sure all driver authors can't wait to plough through their TX paths
:/

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-27 14:16                 ` Jakub Kicinski
@ 2020-01-28  3:43                   ` David Ahern
  2020-01-28 13:57                     ` Jakub Kicinski
  0 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-01-28  3:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On 1/27/20 7:16 AM, Jakub Kicinski wrote:
>> I want to run an ebpf program in the Tx path of the NIC regardless of
>> how the packet arrived at the device -- as an skb or an xdp_frame. There
>> are options for running programs on skb-based packets (e.g., tc). There
>> are *zero* options for manipulating/controlling/denying xdp_frames -
>> e.g., one REDIRECTED from an ingress device.
> 
> Okay - so no precise use case.  You can run the same program at the 

For the sake of this discussion, consider a per-VM, per-tap device ebpf
program for firewall / ACL / traffic verification (packet manipulation
is also a possibility). Small, singly focused ebpf programs - attached
at startup, driven by maps, cleaned up when the tap device is destroyed.
(Remember: Tx for tap device is ingress to a VM.)

Small, singly focused programs only run for traffic to be delivered to
the VM. Setup is easy; cleanup automatic. How the traffic got there
could vary - from a bridge (L2 forwarding), the host stack (L3 routing),
or an XDP program on the host's NICs. It could have arrived at the host
with an encap header which is removed and the inner packet forwarded to
the VM.

> end of whatever is doing the redirect (especially with Alexei's work 

There are use cases where they may make sense, but this is not one.

> on linking) and from cls_bpf 🤷‍♂️
> 

cls_bpf is tc based == skb, no? I want to handle any packet, regardless
of how it arrived at the device's xmit function.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-28  3:43                   ` David Ahern
@ 2020-01-28 13:57                     ` Jakub Kicinski
  2020-02-01 16:24                       ` Toke Høiland-Jørgensen
  2020-02-02 17:43                       ` David Ahern
  0 siblings, 2 replies; 58+ messages in thread
From: Jakub Kicinski @ 2020-01-28 13:57 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Mon, 27 Jan 2020 20:43:09 -0700, David Ahern wrote:
> > end of whatever is doing the redirect (especially with Alexei's work   
> 
> There are use cases where they may make sense, but this is not one.
> 
> > on linking) and from cls_bpf 🤷‍♂️
> 
> cls_bpf is tc based == skb, no? I want to handle any packet, regardless
> of how it arrived at the device's xmit function.

Yes, that's why I said you need the same rules in XDP before REDIRECT
and cls_bpf. Sure it's more complex, but (1) it's faster to drop in
the ingress prog before going though the entire redirect code and
without parsing the packet twice and (2) no extra kernel code necessary.

Even the VM "offload" work doesn't need this. Translating an XDP prog
into a cls_bpf one should be trivial. Slap on some prologue to linearize
the skb, move ctx offsets around, slap on an epilogue to convert exit
codes, anything else?

I'm weary of partially implemented XDP features, EGRESS prog does us 
no good when most drivers didn't yet catch up with the REDIRECTs. And
we're adding this before we considered the queuing problem.

But if I'm alone in thinking this, and I'm not convincing anyone we can
move on :)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26 22:17               ` Jakub Kicinski
@ 2020-01-28 14:13                 ` Jesper Dangaard Brouer
  2020-01-30 14:45                   ` Jakub Kicinski
  0 siblings, 1 reply; 58+ messages in thread
From: Jesper Dangaard Brouer @ 2020-01-28 14:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, Toke Høiland-Jørgensen, David Ahern,
	netdev, prashantbhole.linux, jasowang, davem, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Sun, 26 Jan 2020 14:17:01 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:
> > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.  
> 
> Any pointers on what for? Unless we see actual use cases there's
> a justifiable concern of the entire thing just being an application of
> "We can solve any problem by introducing an extra level of indirection."

I have two use-cases:

(1) For XDP easier handling of interface specific setting on egress,
e.g. pushing a VLAN-id, instead of having to figure this out in RX hook.
(I think this is also David Ahern's use-case)


(2) I want this egress XDP hook to have the ability to signal
backpressure. Today we have BQL in most drivers (which is essential to
avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve.

For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(),
or a return code that can stop the queue towards the higher layers.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-28 14:13                 ` Jesper Dangaard Brouer
@ 2020-01-30 14:45                   ` Jakub Kicinski
  2020-02-01 16:03                     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2020-01-30 14:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David Ahern, Toke Høiland-Jørgensen, David Ahern,
	netdev, prashantbhole.linux, jasowang, davem, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Tue, 28 Jan 2020 15:13:43 +0100, Jesper Dangaard Brouer wrote:
> On Sun, 26 Jan 2020 14:17:01 -0800
> Jakub Kicinski <kuba@kernel.org> wrote:
> 
> > On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:  
> > > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.    
> > 
> > Any pointers on what for? Unless we see actual use cases there's
> > a justifiable concern of the entire thing just being an application of
> > "We can solve any problem by introducing an extra level of indirection."  
> 
> I have two use-cases:
> 
> (1) For XDP easier handling of interface specific setting on egress,
> e.g. pushing a VLAN-id, instead of having to figure this out in RX hook.
> (I think this is also David Ahern's use-case)

Is it really useful to have a hook before multi-buffer frames are
possible and perhaps TSO? The local TCP performance is going to tank
with XDP enabled otherwise.

> (2) I want this egress XDP hook to have the ability to signal
> backpressure. Today we have BQL in most drivers (which is essential to
> avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve.
> 
> For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(),
> or a return code that can stop the queue towards the higher layers.

Agreed, although for that use case, I'm not sure if non-XDP frames 
have to pass trough the hook. Hard to tell as the current patches 
don't touch on this use case.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26  1:43           ` David Ahern
                               ` (2 preceding siblings ...)
  2020-01-26 22:11             ` Jakub Kicinski
@ 2020-02-01 15:59             ` Toke Høiland-Jørgensen
  2020-02-02 17:54               ` David Ahern
  3 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-01 15:59 UTC (permalink / raw)
  To: David Ahern, Jakub Kicinski
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem,
	jbrouer, mst, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, David Ahern

David Ahern <dsahern@gmail.com> writes:

> On 1/24/20 8:36 AM, Toke Høiland-Jørgensen wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> 
>>> On Thu, 23 Jan 2020 14:33:42 -0700, David Ahern wrote:
>>>> On 1/23/20 4:35 AM, Toke Høiland-Jørgensen wrote:
>>>>> David Ahern <dsahern@kernel.org> writes:
>>>>>> From: David Ahern <dahern@digitalocean.com>
>>>>>>
>>>>>> Add IFLA_XDP_EGRESS to if_link.h uapi to handle an XDP program attached
>>>>>> to the egress path of a device. Add rtnl_xdp_egress_fill and helpers as
>>>>>> the egress counterpart to the existing rtnl_xdp_fill. The expectation
>>>>>> is that going forward egress path will acquire the various levels of
>>>>>> attach - generic, driver and hardware.  
>>>>>
>>>>> How would a 'hardware' attach work for this? As I said in my reply to
>>>>> the previous patch, isn't this explicitly for emulating XDP on the other
>>>>> end of a point-to-point link? How would that work with offloaded
>>>>> programs?
>>>>
>>>> Nothing about this patch set is limited to point-to-point links.
>>>
>>> I struggle to understand of what the expected semantics of this new
>>> hook are. Is this going to be run on all frames sent to the device
>>> from the stack? All frames from the stack and from XDP_REDIRECT?
>>>
>>> A little hard to figure out the semantics when we start from a funky
>>> device like tun :S
>> 
>> Yes, that is also why I found this a bit weird. We have discussed plans
>> for an XDP TX hook before:
>> https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#xdp-hook-at-tx
>> 
>> That TX hook would run for everything at TX, but it would be a separate
>> program type with its own metadata access. Whereas the idea with this
>> series (seemed to me) to be just to be able to "emulate" run a regular
>> RX-side XDP program on egress for devices where this makes sense.
>> 
>> If this series is not meant to implement that "emulation", but rather be
>> usable for all devices, I really think we should go straight for the
>> full TX hook as discussed earlier...
>> 
>
> The first patch set from Jason and Prashant started from the perspective
> of offloading XDP programs for a guest. Independently, I was looking at
> XDP in the TX path (now referred to as egress to avoid confusion with
> the XDP_TX return type). Jason and Prashant were touching some of the
> same code paths in the tun driver that I needed for XDP in the Tx path,
> so we decided to consolidate and have XDP egress done first and then
> offload of VMs as a followup. Offload in virtio_net can be done very
> similar to how it is done in nfp -- the program is passed to the host as
> a hardware level attach mode, and the driver verifies the program can be
> offloaded (e.g., does not contain helpers that expose host specific data
> like the fib lookup helper).
>
> At this point, you need to stop thinking solely from the perspective of
> tun or tap and VM offload; think about this from the ability to run an
> XDP program on egress path at an appropriate place in the NIC driver
> that covers both skbs and xdp_frames (e.g., on a REDIRECT). This has
> been discussed before as a need (e.g, Toke's reference above), and I am
> trying to get this initial support done.

Right, so what we're discussing here *is* the "full" egress hook we've
discussed earlier. I thought this was still the other thing, which is
why I was confused.

> I very much wanted to avoid copy-paste-modify for the entire XDP API for
> this. For the most part XDP means ebpf at the NIC driver / hardware
> level (obviously with the exception of generic mode). The goal is
> tempered with the need for the verifier to reject rx entries in the
> xdp_md context. Hence the reason for use of an attach_type - existing
> infrastructure to test and reject the accesses.
>
> That said, Martin's comment throws a wrench in the goal: if the existing
> code does not enforce expected_attach_type then that option can not be
> used in which case I guess I have to go with a new program type
> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
> has different return codes, etc.

In any case an egress program will differ in:

- The context object (the RX-related fields will be invalid on egress,
  and we'll probably want to add new TX-related ones, such as HW
  TX-queue occupancy).
  
- The return code semantics (even if XDP_TX becomes equivalent to
  XDP_PASS, that is still a semantic difference from the RX side; and
  it's not necessarily clear whether we'll want to support REDIRECT on
  the egress side either, is it?)

So we'll have to disambiguate between the two different types of
programs. Which means that what we're discussing is really whether that
disambiguation should be encoded in the program type, or in the attach
type. IMO, making it a separate program type is a clearer and more
explicit UAPI. The verifier could still treat the two program types as
basically equivalent except for those cases where there has to be a
difference anyway. So it seems to me that all you are saving by using
attach_type instead of program type is the need for a new enum value and
a bunch of additions to switch statements? Or am I wildly
underestimating the effort to add a new program type?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-30 14:45                   ` Jakub Kicinski
@ 2020-02-01 16:03                     ` Toke Høiland-Jørgensen
  2020-02-02 17:48                       ` David Ahern
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-01 16:03 UTC (permalink / raw)
  To: Jakub Kicinski, Jesper Dangaard Brouer
  Cc: David Ahern, David Ahern, netdev, prashantbhole.linux, jasowang,
	davem, mst, toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 28 Jan 2020 15:13:43 +0100, Jesper Dangaard Brouer wrote:
>> On Sun, 26 Jan 2020 14:17:01 -0800
>> Jakub Kicinski <kuba@kernel.org> wrote:
>> 
>> > On Sun, 26 Jan 2020 13:49:33 +0100, Jesper Dangaard Brouer wrote:  
>> > > Yes, please. I want this NIC TX hook to see both SKBs and xdp_frames.    
>> > 
>> > Any pointers on what for? Unless we see actual use cases there's
>> > a justifiable concern of the entire thing just being an application of
>> > "We can solve any problem by introducing an extra level of indirection."  
>> 
>> I have two use-cases:
>> 
>> (1) For XDP easier handling of interface specific setting on egress,
>> e.g. pushing a VLAN-id, instead of having to figure this out in RX hook.
>> (I think this is also David Ahern's use-case)
>
> Is it really useful to have a hook before multi-buffer frames are
> possible and perhaps TSO? The local TCP performance is going to tank
> with XDP enabled otherwise.

For a software router (i.e., something that mostly forwards packets) it
can still be useful without multi-buffer. But yeah, that is definitely
something we need to solve, regardless of where this goes.

>> (2) I want this egress XDP hook to have the ability to signal
>> backpressure. Today we have BQL in most drivers (which is essential to
>> avoid bufferbloat). For XDP_REDIRECT we don't, which we must solve.
>> 
>> For use-case(2), we likely need a BPF-helper calling netif_tx_stop_queue(),
>> or a return code that can stop the queue towards the higher layers.
>
> Agreed, although for that use case, I'm not sure if non-XDP frames 
> have to pass trough the hook. Hard to tell as the current patches 
> don't touch on this use case.

I think it would be weird and surprising if it *doesn't* see packets
from the stack. On RX, XDP sees everything; the natural expectation
would be that this was also the case on TX, no?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-28 13:57                     ` Jakub Kicinski
@ 2020-02-01 16:24                       ` Toke Høiland-Jørgensen
  2020-02-01 17:08                         ` Jakub Kicinski
  2020-02-02 17:43                       ` David Ahern
  1 sibling, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-01 16:24 UTC (permalink / raw)
  To: Jakub Kicinski, David Ahern
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem,
	jbrouer, mst, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, David Ahern

[ skipping the first part to just comment on the below: ]

> I'm weary of partially implemented XDP features, EGRESS prog does us
> no good when most drivers didn't yet catch up with the REDIRECTs.

I kinda agree with this; but on the other hand, if we have to wait for
all drivers to catch up, that would mean we couldn't add *anything* new
that requires driver changes, which is not ideal either :/

> And we're adding this before we considered the queuing problem.
>
> But if I'm alone in thinking this, and I'm not convincing anyone we
> can move on :)

I do share your concern that this will end up being incompatible with
whatever solution we end up with for queueing. However, I don't
necessarily think it will: I view the XDP egress hook as something that
in any case will run *after* packets are dequeued from whichever
intermediate queueing it has been through (if any). I think such a hook
is missing in any case; for instance, it's currently impossible to
implement something like CoDel (which needs to know how long a packet
spent in the queue) in eBPF.

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-01 16:24                       ` Toke Høiland-Jørgensen
@ 2020-02-01 17:08                         ` Jakub Kicinski
  2020-02-01 20:05                           ` Toke Høiland-Jørgensen
  2020-02-02 17:45                           ` David Ahern
  0 siblings, 2 replies; 58+ messages in thread
From: Jakub Kicinski @ 2020-02-01 17:08 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David Ahern, David Ahern, netdev, prashantbhole.linux, jasowang,
	davem, jbrouer, mst, toshiaki.makita1, daniel, john.fastabend,
	ast, kafai, songliubraving, yhs, andriin, David Ahern

On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:
> > I'm weary of partially implemented XDP features, EGRESS prog does us
> > no good when most drivers didn't yet catch up with the REDIRECTs.  
> 
> I kinda agree with this; but on the other hand, if we have to wait for
> all drivers to catch up, that would mean we couldn't add *anything*
> new that requires driver changes, which is not ideal either :/

If EGRESS is only for XDP frames we could try to hide the handling in
the core (with slight changes to XDP_TX handling in the drivers),
making drivers smaller and XDP feature velocity higher.

I think loading the drivers with complexity is hurting us in so many
ways..

> > And we're adding this before we considered the queuing problem.
> >
> > But if I'm alone in thinking this, and I'm not convincing anyone we
> > can move on :)  
> 
> I do share your concern that this will end up being incompatible with
> whatever solution we end up with for queueing. However, I don't
> necessarily think it will: I view the XDP egress hook as something
> that in any case will run *after* packets are dequeued from whichever
> intermediate queueing it has been through (if any). I think such a
> hook is missing in any case; for instance, it's currently impossible
> to implement something like CoDel (which needs to know how long a
> packet spent in the queue) in eBPF.

Possibly 🤔 I don't have a good mental image of how the XDP queuing
would work.

Maybe once the queuing primitives are defined they can easily be
hooked into the Qdisc layer. With Martin's recent work all we need is 
a fifo that can store skb pointers, really...

It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
layer underneath.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-01 17:08                         ` Jakub Kicinski
@ 2020-02-01 20:05                           ` Toke Høiland-Jørgensen
  2020-02-02  4:15                             ` Jakub Kicinski
  2020-02-02 17:45                           ` David Ahern
  1 sibling, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-01 20:05 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, David Ahern, netdev, prashantbhole.linux, jasowang,
	davem, jbrouer, mst, toshiaki.makita1, daniel, john.fastabend,
	ast, kafai, songliubraving, yhs, andriin, David Ahern

Jakub Kicinski <kuba@kernel.org> writes:

> On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:
>> > I'm weary of partially implemented XDP features, EGRESS prog does us
>> > no good when most drivers didn't yet catch up with the REDIRECTs.  
>> 
>> I kinda agree with this; but on the other hand, if we have to wait for
>> all drivers to catch up, that would mean we couldn't add *anything*
>> new that requires driver changes, which is not ideal either :/
>
> If EGRESS is only for XDP frames we could try to hide the handling in
> the core (with slight changes to XDP_TX handling in the drivers),
> making drivers smaller and XDP feature velocity higher.

But if it's only for XDP frames that are REDIRECTed, then one might as
well perform whatever action the TX hook was doing before REDIRECTing
(as you yourself argued)... :)

> I think loading the drivers with complexity is hurting us in so many
> ways..

Yeah, but having the low-level details available to the XDP program
(such as HW queue occupancy for the egress hook) is one of the benefits
of XDP, isn't it?

Ultimately, I think Jesper's idea of having drivers operate exclusively
on XDP frames and have the skb handling entirely in the core is an
intriguing way to resolve this problem. Though this is obviously a
long-term thing, and one might reasonably doubt we'll ever get there for
existing drivers...

>> > And we're adding this before we considered the queuing problem.
>> >
>> > But if I'm alone in thinking this, and I'm not convincing anyone we
>> > can move on :)  
>> 
>> I do share your concern that this will end up being incompatible with
>> whatever solution we end up with for queueing. However, I don't
>> necessarily think it will: I view the XDP egress hook as something
>> that in any case will run *after* packets are dequeued from whichever
>> intermediate queueing it has been through (if any). I think such a
>> hook is missing in any case; for instance, it's currently impossible
>> to implement something like CoDel (which needs to know how long a
>> packet spent in the queue) in eBPF.
>
> Possibly 🤔 I don't have a good mental image of how the XDP queuing
> would work.
>
> Maybe once the queuing primitives are defined they can easily be
> hooked into the Qdisc layer. With Martin's recent work all we need is 
> a fifo that can store skb pointers, really...
>
> It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
> layer underneath.

Hmm, hooking into the existing qdisc layer is an interesting idea.
Ultimately, I fear it won't be feasible for performance reasons; but
it's certainly something to consider. Maybe at least as an option?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-01 20:05                           ` Toke Høiland-Jørgensen
@ 2020-02-02  4:15                             ` Jakub Kicinski
  2020-02-03 19:56                               ` Toke Høiland-Jørgensen
  2020-02-03 20:13                               ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 58+ messages in thread
From: Jakub Kicinski @ 2020-02-02  4:15 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David Ahern, David Ahern, netdev, prashantbhole.linux, jasowang,
	davem, jbrouer, mst, toshiaki.makita1, daniel, john.fastabend,
	ast, kafai, songliubraving, yhs, andriin, David Ahern

On Sat, 01 Feb 2020 21:05:28 +0100, Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:
> > On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:  
> >> > I'm weary of partially implemented XDP features, EGRESS prog does us
> >> > no good when most drivers didn't yet catch up with the REDIRECTs.    
> >> 
> >> I kinda agree with this; but on the other hand, if we have to wait for
> >> all drivers to catch up, that would mean we couldn't add *anything*
> >> new that requires driver changes, which is not ideal either :/  
> >
> > If EGRESS is only for XDP frames we could try to hide the handling in
> > the core (with slight changes to XDP_TX handling in the drivers),
> > making drivers smaller and XDP feature velocity higher.  
> 
> But if it's only for XDP frames that are REDIRECTed, then one might as
> well perform whatever action the TX hook was doing before REDIRECTing
> (as you yourself argued)... :)

Right, that's why I think the design needs to start from queuing which
can't be done today, and has to be done in context of the destination.
Solving queuing justifies the added complexity if you will :)

> > I think loading the drivers with complexity is hurting us in so many
> > ways..  
> 
> Yeah, but having the low-level details available to the XDP program
> (such as HW queue occupancy for the egress hook) is one of the benefits
> of XDP, isn't it?

I think I glossed over the hope for having access to HW queue occupancy
- what exactly are you after? 

I don't think one can get anything beyond a BQL type granularity.
Reading over PCIe is out of question, device write back on high
granularity would burn through way too much bus throughput.

> Ultimately, I think Jesper's idea of having drivers operate exclusively
> on XDP frames and have the skb handling entirely in the core is an
> intriguing way to resolve this problem. Though this is obviously a
> long-term thing, and one might reasonably doubt we'll ever get there for
> existing drivers...
> 
> >> > And we're adding this before we considered the queuing problem.
> >> >
> >> > But if I'm alone in thinking this, and I'm not convincing anyone we
> >> > can move on :)    
> >> 
> >> I do share your concern that this will end up being incompatible with
> >> whatever solution we end up with for queueing. However, I don't
> >> necessarily think it will: I view the XDP egress hook as something
> >> that in any case will run *after* packets are dequeued from whichever
> >> intermediate queueing it has been through (if any). I think such a
> >> hook is missing in any case; for instance, it's currently impossible
> >> to implement something like CoDel (which needs to know how long a
> >> packet spent in the queue) in eBPF.  
> >
> > Possibly 🤔 I don't have a good mental image of how the XDP queuing
> > would work.
> >
> > Maybe once the queuing primitives are defined they can easily be
> > hooked into the Qdisc layer. With Martin's recent work all we need is 
> > a fifo that can store skb pointers, really...
> >
> > It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
> > layer underneath.  
> 
> Hmm, hooking into the existing qdisc layer is an interesting idea.
> Ultimately, I fear it won't be feasible for performance reasons; but
> it's certainly something to consider. Maybe at least as an option?

For forwarding sure, but for locally generated traffic.. 🤷‍♂️

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-28 13:57                     ` Jakub Kicinski
  2020-02-01 16:24                       ` Toke Høiland-Jørgensen
@ 2020-02-02 17:43                       ` David Ahern
  2020-02-02 19:31                         ` Jakub Kicinski
  1 sibling, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-02-02 17:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin

On 1/28/20 6:57 AM, Jakub Kicinski wrote:
> On Mon, 27 Jan 2020 20:43:09 -0700, David Ahern wrote:
>>> end of whatever is doing the redirect (especially with Alexei's work   
>>
>> There are use cases where they may make sense, but this is not one.
>>
>>> on linking) and from cls_bpf 🤷‍♂️
>>
>> cls_bpf is tc based == skb, no? I want to handle any packet, regardless
>> of how it arrived at the device's xmit function.
> 
> Yes, that's why I said you need the same rules in XDP before REDIRECT
> and cls_bpf. Sure it's more complex, but (1) it's faster to drop in
> the ingress prog before going though the entire redirect code and
> without parsing the packet twice and (2) no extra kernel code necessary.

you are making a lot of assumptions and frankly it's the 'c' word
(complex) that I want to avoid. I do not believe in the OVS style of
packet processing - one gigantic program with a bunch of logic and data
driven flow lookups that affect every packet. I prefer simple, singly
focused programs logically concatenated to make decisions. Simpler
management, simpler lifecycle. The scope, scale and lifecycle management
of VMs/containers is just as important as minimizing the cycles spent
per packet. XDP in the Tx path is a missing primitive to make life simple.

As an example, the program on the ingress NICs to the host can be
nothing more than a demux'er - use the ethernet header and vlan
(preferably with vlan offload enabled) to do a <vlan,mac> lookup. No
packet parsing at all at this level. The lookup returns an index of
where to redirect the packet (e.g., the tap device of a VM). At the same
time, packets can hit the "slow path" - processing support from the full
stack when it is needed and still packets end up at the tap device. *IF*
there is an ingress ACL for that tap device managed by the host, it can
exist in 1 place - a program and map attached to the tap device limited
to that tap device's networking function (VMs can have multiple
connections to different networks with different needs at this point),
and the program gets run for both XDP fast path and skb slow path.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-01 17:08                         ` Jakub Kicinski
  2020-02-01 20:05                           ` Toke Høiland-Jørgensen
@ 2020-02-02 17:45                           ` David Ahern
  2020-02-02 19:12                             ` Jakub Kicinski
  1 sibling, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-02-02 17:45 UTC (permalink / raw)
  To: Jakub Kicinski, Toke Høiland-Jørgensen
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem,
	jbrouer, mst, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, David Ahern

On 2/1/20 10:08 AM, Jakub Kicinski wrote:
> If EGRESS is only for XDP frames we could try to hide the handling in
> the core (with slight changes to XDP_TX handling in the drivers),

It is not. I have said multiple times it is to work on ALL packets that
hit the xmit function, both skbs and xdp_frames.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-01 16:03                     ` Toke Høiland-Jørgensen
@ 2020-02-02 17:48                       ` David Ahern
  0 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-02-02 17:48 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On 2/1/20 9:03 AM, Toke Høiland-Jørgensen wrote:
> I think it would be weird and surprising if it *doesn't* see packets
> from the stack. On RX, XDP sees everything; the natural expectation
> would be that this was also the case on TX, no?
> 

Yes. XDP_EGRESS should be symmetrical with XDP on the Rx side.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-01 15:59             ` Toke Høiland-Jørgensen
@ 2020-02-02 17:54               ` David Ahern
  2020-02-03 20:09                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: David Ahern @ 2020-02-02 17:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jakub Kicinski
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem,
	jbrouer, mst, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, David Ahern

On 2/1/20 8:59 AM, Toke Høiland-Jørgensen wrote:
> 
> In any case an egress program will differ in:
> 
> - The context object (the RX-related fields will be invalid on egress,
>   and we'll probably want to add new TX-related ones, such as HW
>   TX-queue occupancy).

Jakub has suggested that rx_queue_index can be a union with
tx_queue_index; the former for the Rx path and the latter for the egress.

The rest of the fields in xdp_md are legit for either direction.

>   
> - The return code semantics (even if XDP_TX becomes equivalent to
>   XDP_PASS, that is still a semantic difference from the RX side; and
>   it's not necessarily clear whether we'll want to support REDIRECT on
>   the egress side either, is it?)

Why should REDIRECT not be allowed in the egress path? e.g., service
chaining or capturing suspicious packets (e.g., encap with a header and
redirect somewhere for analysis).

> 
> So we'll have to disambiguate between the two different types of
> programs. Which means that what we're discussing is really whether that
> disambiguation should be encoded in the program type, or in the attach
> type. IMO, making it a separate program type is a clearer and more
> explicit UAPI. The verifier could still treat the two program types as
> basically equivalent except for those cases where there has to be a
> difference anyway. So it seems to me that all you are saving by using
> attach_type instead of program type is the need for a new enum value and
> a bunch of additions to switch statements? Or am I wildly
> underestimating the effort to add a new program type?
> 

IMHO that is duplicating code and APIs for no real reason. XDP refers to
fast path processing, the only difference is where the program is
attached - Rx side or Tx side.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-01-26  4:54             ` Alexei Starovoitov
@ 2020-02-02 17:59               ` David Ahern
  0 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-02-02 17:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Toke Høiland-Jørgensen, Jakub Kicinski, David Ahern,
	netdev, prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On 1/25/20 9:54 PM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2020 at 06:43:36PM -0700, David Ahern wrote:
>>
>> That said, Martin's comment throws a wrench in the goal: if the existing
>> code does not enforce expected_attach_type then that option can not be
>> used in which case I guess I have to go with a new program type
>> (BPF_PROG_TYPE_XDP_EGRESS) which takes a new context (xdp_egress_md),
>> has different return codes, etc.
> 
> This is acceptable risk. We did such thing in the past. The chances of
> user space breakage are extremely low.
> 

Ultimately that is a decision for the maintainers. Code wise both
iproute2 and libbpf always initialize bpf_attr to 0 and given the many
uses of that union it seems odd that someone would initialize one field
at a time.

Unless someone comes back with a strong 'hell, no' I am planning to send
the next RFC version with the current API.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-02 17:45                           ` David Ahern
@ 2020-02-02 19:12                             ` Jakub Kicinski
  0 siblings, 0 replies; 58+ messages in thread
From: Jakub Kicinski @ 2020-02-02 19:12 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, jbrouer, mst,
	toshiaki.makita1, daniel, john.fastabend, ast, kafai,
	songliubraving, yhs, andriin, David Ahern

On Sun, 2 Feb 2020 10:45:27 -0700, David Ahern wrote:
> On 2/1/20 10:08 AM, Jakub Kicinski wrote:
> > If EGRESS is only for XDP frames we could try to hide the handling in
> > the core (with slight changes to XDP_TX handling in the drivers),  
> 
> It is not. I have said multiple times it is to work on ALL packets that
> hit the xmit function, both skbs and xdp_frames.

Okay, I should have said "was to be", I guess?

 If EGRESS _was_to_be_ only for XDP frames we could try to hide the
 handling in the core (with slight changes to XDP_TX handling in the
 drivers), making drivers smaller and XDP feature velocity higher.

I understand you'd like a hook for all packets, that is clear.
I'm just trying to highlight the cost and consequences.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-02 17:43                       ` David Ahern
@ 2020-02-02 19:31                         ` Jakub Kicinski
  2020-02-02 21:51                           ` David Ahern
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2020-02-02 19:31 UTC (permalink / raw)
  To: David Ahern
  Cc: Toke Høiland-Jørgensen, netdev, prashantbhole.linux,
	jasowang, davem, jbrouer, mst, toshiaki.makita1, daniel,
	john.fastabend, ast, kafai, songliubraving, yhs, andriin

On Sun, 2 Feb 2020 10:43:58 -0700, David Ahern wrote:
> you are making a lot of assumptions and frankly it's the 'c' word
> (complex) that I want to avoid. I do not believe in the OVS style of
> packet processing - one gigantic program with a bunch of logic and data
> driven flow lookups that affect every packet. I prefer simple, singly
> focused programs logically concatenated to make decisions. Simpler
> management, simpler lifecycle. The scope, scale and lifecycle management
> of VMs/containers is just as important as minimizing the cycles spent
> per packet. XDP in the Tx path is a missing primitive to make life simple.
> 
> As an example, the program on the ingress NICs to the host can be
> nothing more than a demux'er - use the ethernet header and vlan
> (preferably with vlan offload enabled) to do a <vlan,mac> lookup. No
> packet parsing at all at this level. The lookup returns an index of
> where to redirect the packet (e.g., the tap device of a VM). At the same
> time, packets can hit the "slow path" - processing support from the full
> stack when it is needed and still packets end up at the tap device. *IF*
> there is an ingress ACL for that tap device managed by the host, it can
> exist in 1 place - a program and map attached to the tap device limited
> to that tap device's networking function (VMs can have multiple
> connections to different networks with different needs at this point),
> and the program gets run for both XDP fast path and skb slow path.

I think our perspectives will be difficult to reconcile.

My reading is that you look at this from slow software device
perspective. Of which there are few (and already loaded with hacks).
And you want the control plane to be simple rather than performance.

I look at this from HW driver perspective of which there are many.
Saying that it's fine to load TX paths of all drivers with extra 
code, and that it's fine to effectively disable SG in the entire 
system just doesn't compute.

Is that a fair summary of the arguments?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-02 19:31                         ` Jakub Kicinski
@ 2020-02-02 21:51                           ` David Ahern
  0 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2020-02-02 21:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toke Høiland-Jørgensen, netdev, prashantbhole.linux,
	jasowang, davem, jbrouer, mst, toshiaki.makita1, daniel,
	john.fastabend, ast, kafai, songliubraving, yhs, andriin

On 2/2/20 12:31 PM, Jakub Kicinski wrote:
> 
> I think our perspectives will be difficult to reconcile.
> 
> My reading is that you look at this from slow software device
> perspective. Of which there are few (and already loaded with hacks).
> And you want the control plane to be simple rather than performance.
> 
> I look at this from HW driver perspective of which there are many.
> Saying that it's fine to load TX paths of all drivers with extra 
> code, and that it's fine to effectively disable SG in the entire 
> system just doesn't compute.
> 
> Is that a fair summary of the arguments?
> 

I do not believe so.

My argument is about allowing XDP and full stack to work
synergistically: Fast path for known unicast traffic, and slow path for
BUM traffic without duplicating config such as ACLs and forcing a
specific design (such as forcing all programs in the XDP Rx path which
does not work for all traffic patterns). For example, if I am willing to
trade a few cycles of over head (redirect processing) for a simpler
overall solution then I should be to do that. Fast path with XDP is
still faster than the full host stack for known unicast traffic.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-02  4:15                             ` Jakub Kicinski
@ 2020-02-03 19:56                               ` Toke Høiland-Jørgensen
  2020-02-03 20:13                               ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-03 19:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, David Ahern, netdev, prashantbhole.linux, jasowang,
	davem, brouer, mst, toshiaki.makita1, daniel, john.fastabend,
	ast, kafai, songliubraving, yhs, andriin, David Ahern

Jakub Kicinski <kuba@kernel.org> writes:

> On Sat, 01 Feb 2020 21:05:28 +0100, Toke Høiland-Jørgensen wrote:
>> Jakub Kicinski <kuba@kernel.org> writes:
>> > On Sat, 01 Feb 2020 17:24:39 +0100, Toke Høiland-Jørgensen wrote:  
>> >> > I'm weary of partially implemented XDP features, EGRESS prog does us
>> >> > no good when most drivers didn't yet catch up with the REDIRECTs.    
>> >> 
>> >> I kinda agree with this; but on the other hand, if we have to wait for
>> >> all drivers to catch up, that would mean we couldn't add *anything*
>> >> new that requires driver changes, which is not ideal either :/  
>> >
>> > If EGRESS is only for XDP frames we could try to hide the handling in
>> > the core (with slight changes to XDP_TX handling in the drivers),
>> > making drivers smaller and XDP feature velocity higher.  
>> 
>> But if it's only for XDP frames that are REDIRECTed, then one might as
>> well perform whatever action the TX hook was doing before REDIRECTing
>> (as you yourself argued)... :)
>
> Right, that's why I think the design needs to start from queuing which
> can't be done today, and has to be done in context of the destination.
> Solving queuing justifies the added complexity if you will :)

Right, that makes sense. Hmm, I wonder if a TX driver hook is enough?
I.e., a driver callback in the TX path could also just queue that packet
(returning XDP_QUEUED?), without the driver needing any more handling
than it will already have? I'm spitballing a little bit here, but it may
be quite straight-forward? :)

>> > I think loading the drivers with complexity is hurting us in so many
>> > ways..  
>> 
>> Yeah, but having the low-level details available to the XDP program
>> (such as HW queue occupancy for the egress hook) is one of the benefits
>> of XDP, isn't it?
>
> I think I glossed over the hope for having access to HW queue occupancy
> - what exactly are you after? 
>
> I don't think one can get anything beyond a BQL type granularity.
> Reading over PCIe is out of question, device write back on high
> granularity would burn through way too much bus throughput.
>
>> Ultimately, I think Jesper's idea of having drivers operate exclusively
>> on XDP frames and have the skb handling entirely in the core is an
>> intriguing way to resolve this problem. Though this is obviously a
>> long-term thing, and one might reasonably doubt we'll ever get there for
>> existing drivers...
>> 
>> >> > And we're adding this before we considered the queuing problem.
>> >> >
>> >> > But if I'm alone in thinking this, and I'm not convincing anyone we
>> >> > can move on :)    
>> >> 
>> >> I do share your concern that this will end up being incompatible with
>> >> whatever solution we end up with for queueing. However, I don't
>> >> necessarily think it will: I view the XDP egress hook as something
>> >> that in any case will run *after* packets are dequeued from whichever
>> >> intermediate queueing it has been through (if any). I think such a
>> >> hook is missing in any case; for instance, it's currently impossible
>> >> to implement something like CoDel (which needs to know how long a
>> >> packet spent in the queue) in eBPF.  
>> >
>> > Possibly 🤔 I don't have a good mental image of how the XDP queuing
>> > would work.
>> >
>> > Maybe once the queuing primitives are defined they can easily be
>> > hooked into the Qdisc layer. With Martin's recent work all we need is 
>> > a fifo that can store skb pointers, really...
>> >
>> > It'd be good if the BPF queuing could replace TC Qdiscs, rather than 
>> > layer underneath.  
>> 
>> Hmm, hooking into the existing qdisc layer is an interesting idea.
>> Ultimately, I fear it won't be feasible for performance reasons; but
>> it's certainly something to consider. Maybe at least as an option?
>
> For forwarding sure, but for locally generated traffic.. 🤷‍♂️

Right, well, for locally generated traffic we already have the qdisc?
This was kinda the reason why my original thought was to add the
queueing for XDP only at REDIRECT time. I.e., we already have the
xdp_dev_bulk_queue (which now even lives in struct net_device), so we
could "just" extend that and make it into a proper queueing structure,
and call it a day? :)

But from your comments it sounds like that when you're saying "BPF
queueing" you mean that the queueing itself should be programmable using
BPF? To me, the most urgent thing has been to figure out a way to do
*any* kind of queueing with XDP-forwarded frames, so haven't given much
thought to what a "fully programmable" queue would look like... Do you
have any thoughts?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-02 17:54               ` David Ahern
@ 2020-02-03 20:09                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-03 20:09 UTC (permalink / raw)
  To: David Ahern, Jakub Kicinski
  Cc: David Ahern, netdev, prashantbhole.linux, jasowang, davem,
	jbrouer, mst, toshiaki.makita1, daniel, john.fastabend, ast,
	kafai, songliubraving, yhs, andriin, David Ahern

David Ahern <dsahern@gmail.com> writes:

> On 2/1/20 8:59 AM, Toke Høiland-Jørgensen wrote:
>> 
>> In any case an egress program will differ in:
>> 
>> - The context object (the RX-related fields will be invalid on egress,
>>   and we'll probably want to add new TX-related ones, such as HW
>>   TX-queue occupancy).
>
> Jakub has suggested that rx_queue_index can be a union with
> tx_queue_index; the former for the Rx path and the latter for the egress.
>
> The rest of the fields in xdp_md are legit for either direction.

Right, okay, ifindex and queue index could make sense in both
directions. But it would still mean that we would limit ourselves to
only adding new fields that would work on both RX and TX. Maybe that is
OK, though.

>> - The return code semantics (even if XDP_TX becomes equivalent to
>>   XDP_PASS, that is still a semantic difference from the RX side; and
>>   it's not necessarily clear whether we'll want to support REDIRECT on
>>   the egress side either, is it?)
>
> Why should REDIRECT not be allowed in the egress path? e.g., service
> chaining or capturing suspicious packets (e.g., encap with a header and
> redirect somewhere for analysis).

Implementation and deployment complexity, mostly (loops!)? I do agree
that it would be nice to allow it, I'm just not entirely convinced that
it's feasible...

>> So we'll have to disambiguate between the two different types of
>> programs. Which means that what we're discussing is really whether that
>> disambiguation should be encoded in the program type, or in the attach
>> type. IMO, making it a separate program type is a clearer and more
>> explicit UAPI. The verifier could still treat the two program types as
>> basically equivalent except for those cases where there has to be a
>> difference anyway. So it seems to me that all you are saving by using
>> attach_type instead of program type is the need for a new enum value and
>> a bunch of additions to switch statements? Or am I wildly
>> underestimating the effort to add a new program type?
>> 
>
> IMHO that is duplicating code and APIs for no real reason. XDP refers to
> fast path processing, the only difference is where the program is
> attached - Rx side or Tx side.

Sure, if we can guarantee API and semantic equivalence. I.e., if any XDP
program really *can* be attached as both an RX and TX program, I have no
objections to re-using the program type. But if it turns out that there
*is* a difference, we're making implicit subtypes anyway, in which case
I think it's better to be explicit about the difference.

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-02  4:15                             ` Jakub Kicinski
  2020-02-03 19:56                               ` Toke Høiland-Jørgensen
@ 2020-02-03 20:13                               ` Toke Høiland-Jørgensen
  2020-02-03 22:15                                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-03 20:13 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Ahern, David Ahern, netdev, prashantbhole.linux, jasowang,
	davem, brouer, mst, toshiaki.makita1, daniel, john.fastabend,
	ast, kafai, songliubraving, yhs, andriin, David Ahern

Oops, I see I forgot to reply to this bit:

>> Yeah, but having the low-level details available to the XDP program
>> (such as HW queue occupancy for the egress hook) is one of the benefits
>> of XDP, isn't it?
>
> I think I glossed over the hope for having access to HW queue occupancy
> - what exactly are you after? 
>
> I don't think one can get anything beyond a BQL type granularity.
> Reading over PCIe is out of question, device write back on high
> granularity would burn through way too much bus throughput.

This was Jesper's idea originally, so maybe he can explain better; but
as I understood it, he basically wanted to expose the same information
that BQL has to eBPF. Making it possible for an eBPF program to either
(re-)implement BQL with its own custom policy, or react to HWQ pressure
in some other way, such as by load balancing to another interface.

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-03 20:13                               ` Toke Høiland-Jørgensen
@ 2020-02-03 22:15                                 ` Jesper Dangaard Brouer
  2020-02-04 11:00                                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Jesper Dangaard Brouer @ 2020-02-03 22:15 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jakub Kicinski, David Ahern, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, mst, toshiaki.makita1,
	daniel, john.fastabend, ast, kafai, songliubraving, yhs, andriin,
	David Ahern, brouer, Björn Töpel, Karlsson, Magnus

On Mon, 03 Feb 2020 21:13:24 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Oops, I see I forgot to reply to this bit:
> 
> >> Yeah, but having the low-level details available to the XDP program
> >> (such as HW queue occupancy for the egress hook) is one of the benefits
> >> of XDP, isn't it?  
> >
> > I think I glossed over the hope for having access to HW queue occupancy
> > - what exactly are you after? 
> >
> > I don't think one can get anything beyond a BQL type granularity.
> > Reading over PCIe is out of question, device write back on high
> > granularity would burn through way too much bus throughput.  
> 
> This was Jesper's idea originally, so maybe he can explain better; but
> as I understood it, he basically wanted to expose the same information
> that BQL has to eBPF. Making it possible for an eBPF program to either
> (re-)implement BQL with its own custom policy, or react to HWQ pressure
> in some other way, such as by load balancing to another interface.

Yes, and I also have plans that goes beyond BQL. But let me start with
explaining the BQL part, and answer Toke's question below.

On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
> [...] Hmm, I wonder if a TX driver hook is enough?

Short answer is no, a TX driver hook is not enough.  The queue state
info the TX driver hook have access to, needs to be updated once the
hardware have "confirmed" the TX-DMA operation have completed.  For
BQL/DQL this update happens during TX-DMA completion/cleanup (code
see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
point out we cannot query the device directly due to performance
implications).  It doesn't need to be a new BPF hook, just something
that update the queue state info (we could piggy back on the
netdev_tx_completed_queue() call or give TX hook access to
dev_queue->dql).

Regarding "where is the queue": For me the XDP-TX queue is the NIC
hardware queue, that this BPF hook have some visibility into and can do
filtering on. (Imagine that my TX queue is bandwidth limited, then I
can shrink the packet size and still send a "congestion" packet to my
receiver).

The bigger picture is that I envision the XDP-TX/egress hook can
open-up for taking advantage of NIC hardware TX queue features.
This also ties into the queue abstraction work by Björn+Magnus.
Today NIC hardware can do a million TX-queues, and hardware can also do
rate limiting per queue.  Thus, I also envision that the XDP-TX/egress
hook can choose/change the TX queue the packet is queue/sent on (we can
likely just overload the XDP_REDIRECT and have a new bpf map type for
this).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-03 22:15                                 ` Jesper Dangaard Brouer
@ 2020-02-04 11:00                                   ` Toke Høiland-Jørgensen
  2020-02-04 17:09                                     ` Jakub Kicinski
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-04 11:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jakub Kicinski, David Ahern, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, mst, toshiaki.makita1,
	daniel, john.fastabend, ast, kafai, songliubraving, yhs, andriin,
	David Ahern, brouer, Björn Töpel, Karlsson, Magnus

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Mon, 03 Feb 2020 21:13:24 +0100
> Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
>> Oops, I see I forgot to reply to this bit:
>> 
>> >> Yeah, but having the low-level details available to the XDP program
>> >> (such as HW queue occupancy for the egress hook) is one of the benefits
>> >> of XDP, isn't it?  
>> >
>> > I think I glossed over the hope for having access to HW queue occupancy
>> > - what exactly are you after? 
>> >
>> > I don't think one can get anything beyond a BQL type granularity.
>> > Reading over PCIe is out of question, device write back on high
>> > granularity would burn through way too much bus throughput.  
>> 
>> This was Jesper's idea originally, so maybe he can explain better; but
>> as I understood it, he basically wanted to expose the same information
>> that BQL has to eBPF. Making it possible for an eBPF program to either
>> (re-)implement BQL with its own custom policy, or react to HWQ pressure
>> in some other way, such as by load balancing to another interface.
>
> Yes, and I also have plans that goes beyond BQL. But let me start with
> explaining the BQL part, and answer Toke's question below.
>
> On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:
>> [...] Hmm, I wonder if a TX driver hook is enough?
>
> Short answer is no, a TX driver hook is not enough.  The queue state
> info the TX driver hook have access to, needs to be updated once the
> hardware have "confirmed" the TX-DMA operation have completed.  For
> BQL/DQL this update happens during TX-DMA completion/cleanup (code
> see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
> point out we cannot query the device directly due to performance
> implications).  It doesn't need to be a new BPF hook, just something
> that update the queue state info (we could piggy back on the
> netdev_tx_completed_queue() call or give TX hook access to
> dev_queue->dql).

The question is whether this can't simply be done through bpf helpers?
bpf_get_txq_occupancy(ifindex, txqno)?

> Regarding "where is the queue": For me the XDP-TX queue is the NIC
> hardware queue, that this BPF hook have some visibility into and can do
> filtering on. (Imagine that my TX queue is bandwidth limited, then I
> can shrink the packet size and still send a "congestion" packet to my
> receiver).

I'm not sure the hardware queues will be enough, though. Unless I'm
misunderstanding something, hardware queues are (1) fairly short and (2)
FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
still need a software queueing layer on top of the hardware queue.

If the hardware is EDT-aware this may change, I suppose, but I'm not
sure if we can design the XDP queueing primitives with this assumption? :)

> The bigger picture is that I envision the XDP-TX/egress hook can
> open-up for taking advantage of NIC hardware TX queue features. This
> also ties into the queue abstraction work by Björn+Magnus. Today NIC
> hardware can do a million TX-queues, and hardware can also do rate
> limiting per queue. Thus, I also envision that the XDP-TX/egress hook
> can choose/change the TX queue the packet is queue/sent on (we can
> likely just overload the XDP_REDIRECT and have a new bpf map type for
> this).

Yes, I think I mentioned in another email that putting all the queueing
smarts into the redirect map was also something I'd considered (well, I
do think we've discussed this in the past, so maybe not so surprising if
we're thinking along the same lines) :)

But the implication of this is also that an actual TX hook in the driver
need not necessarily incorporate a lot of new functionality, as it can
control the queueing through a combination of BPF helpers and map
updates?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-04 11:00                                   ` Toke Høiland-Jørgensen
@ 2020-02-04 17:09                                     ` Jakub Kicinski
  2020-02-05 15:30                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Jakub Kicinski @ 2020-02-04 17:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jesper Dangaard Brouer, David Ahern, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, mst, toshiaki.makita1,
	daniel, john.fastabend, ast, kafai, songliubraving, yhs, andriin,
	David Ahern, Björn Töpel, Karlsson, Magnus

On Tue, 04 Feb 2020 12:00:40 +0100, Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > On Mon, 03 Feb 2020 21:13:24 +0100
> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >  
> >> Oops, I see I forgot to reply to this bit:
> >>   
> >> >> Yeah, but having the low-level details available to the XDP program
> >> >> (such as HW queue occupancy for the egress hook) is one of the benefits
> >> >> of XDP, isn't it?    
> >> >
> >> > I think I glossed over the hope for having access to HW queue occupancy
> >> > - what exactly are you after? 
> >> >
> >> > I don't think one can get anything beyond a BQL type granularity.
> >> > Reading over PCIe is out of question, device write back on high
> >> > granularity would burn through way too much bus throughput.    
> >> 
> >> This was Jesper's idea originally, so maybe he can explain better; but
> >> as I understood it, he basically wanted to expose the same information
> >> that BQL has to eBPF. Making it possible for an eBPF program to either
> >> (re-)implement BQL with its own custom policy, or react to HWQ pressure
> >> in some other way, such as by load balancing to another interface.  
> >
> > Yes, and I also have plans that goes beyond BQL. But let me start with
> > explaining the BQL part, and answer Toke's question below.
> >
> > On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:  
> >> [...] Hmm, I wonder if a TX driver hook is enough?  
> >
> > Short answer is no, a TX driver hook is not enough.  The queue state
> > info the TX driver hook have access to, needs to be updated once the
> > hardware have "confirmed" the TX-DMA operation have completed.  For
> > BQL/DQL this update happens during TX-DMA completion/cleanup (code
> > see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
> > point out we cannot query the device directly due to performance
> > implications).  It doesn't need to be a new BPF hook, just something
> > that update the queue state info (we could piggy back on the
> > netdev_tx_completed_queue() call or give TX hook access to
> > dev_queue->dql).  

Interesting, that model does make sense to me.

> The question is whether this can't simply be done through bpf helpers?
> bpf_get_txq_occupancy(ifindex, txqno)?

Helper vs dev_queue->dql field access seems like a technicality.
The usual flexibility of implementation vs performance and simplicity
consideration applies.. I guess?

> > Regarding "where is the queue": For me the XDP-TX queue is the NIC
> > hardware queue, that this BPF hook have some visibility into and can do
> > filtering on. (Imagine that my TX queue is bandwidth limited, then I
> > can shrink the packet size and still send a "congestion" packet to my
> > receiver).  
> 
> I'm not sure the hardware queues will be enough, though. Unless I'm
> misunderstanding something, hardware queues are (1) fairly short and (2)
> FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
> still need a software queueing layer on top of the hardware queue.

Jesper makes a very interesting point tho. If all the implementation
wants is FIFO queues which are services in some simple manner (that is
something that can be offloaded) we should support that.

That means REDIRECT can target multiple TX queues, and we need an API
to control the queue allocation..

> If the hardware is EDT-aware this may change, I suppose, but I'm not
> sure if we can design the XDP queueing primitives with this assumption? :)

But I agree with you as well. I think both HW and SW feeding needs to
be supported. The HW implementations are always necessarily behind
ideas people implemented and tested in SW..

> > The bigger picture is that I envision the XDP-TX/egress hook can
> > open-up for taking advantage of NIC hardware TX queue features. This
> > also ties into the queue abstraction work by Björn+Magnus. Today NIC
> > hardware can do a million TX-queues, and hardware can also do rate
> > limiting per queue. Thus, I also envision that the XDP-TX/egress hook
> > can choose/change the TX queue the packet is queue/sent on (we can
> > likely just overload the XDP_REDIRECT and have a new bpf map type for
> > this).  

I wonder what that does to our HW offload model which is based on
TC Qdisc offload today :S Do we use TC API to control configuration 
of XDP queues? :S

> Yes, I think I mentioned in another email that putting all the queueing
> smarts into the redirect map was also something I'd considered (well, I
> do think we've discussed this in the past, so maybe not so surprising if
> we're thinking along the same lines) :)
> 
> But the implication of this is also that an actual TX hook in the driver
> need not necessarily incorporate a lot of new functionality, as it can
> control the queueing through a combination of BPF helpers and map
> updates?

True, it's the dequeuing that's on the TX side, so we could go as far
as putting all the enqueuing logic in the RX prog..

To answer your question from the other email Toke, my basic model was
kind of similar to TC Qdiscs. XDP redirect selects a device, then that
device has an enqueue and dequeue programs. Enqueue program can be run
in the XDP_REDIRECT context, dequeue is run every time NAPI cleaned up
some space on the TX descriptor ring. There is a "queue state" but the
FIFOs etc are sort of internal detail that the enqueue and dequeue
programs only share between each other. To be clear this is not a
suggestion of how things should be, it's what sprung to my mind without
thinking.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path
  2020-02-04 17:09                                     ` Jakub Kicinski
@ 2020-02-05 15:30                                       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2020-02-05 15:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jesper Dangaard Brouer, David Ahern, David Ahern, netdev,
	prashantbhole.linux, jasowang, davem, mst, toshiaki.makita1,
	daniel, john.fastabend, ast, kafai, songliubraving, yhs, andriin,
	David Ahern, Björn Töpel, Karlsson, Magnus

Jakub Kicinski <kuba@kernel.org> writes:

> On Tue, 04 Feb 2020 12:00:40 +0100, Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> > On Mon, 03 Feb 2020 21:13:24 +0100
>> > Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >  
>> >> Oops, I see I forgot to reply to this bit:
>> >>   
>> >> >> Yeah, but having the low-level details available to the XDP program
>> >> >> (such as HW queue occupancy for the egress hook) is one of the benefits
>> >> >> of XDP, isn't it?    
>> >> >
>> >> > I think I glossed over the hope for having access to HW queue occupancy
>> >> > - what exactly are you after? 
>> >> >
>> >> > I don't think one can get anything beyond a BQL type granularity.
>> >> > Reading over PCIe is out of question, device write back on high
>> >> > granularity would burn through way too much bus throughput.    
>> >> 
>> >> This was Jesper's idea originally, so maybe he can explain better; but
>> >> as I understood it, he basically wanted to expose the same information
>> >> that BQL has to eBPF. Making it possible for an eBPF program to either
>> >> (re-)implement BQL with its own custom policy, or react to HWQ pressure
>> >> in some other way, such as by load balancing to another interface.  
>> >
>> > Yes, and I also have plans that goes beyond BQL. But let me start with
>> > explaining the BQL part, and answer Toke's question below.
>> >
>> > On Mon, 03 Feb 2020 20:56:03 +0100 Toke wrote:  
>> >> [...] Hmm, I wonder if a TX driver hook is enough?  
>> >
>> > Short answer is no, a TX driver hook is not enough.  The queue state
>> > info the TX driver hook have access to, needs to be updated once the
>> > hardware have "confirmed" the TX-DMA operation have completed.  For
>> > BQL/DQL this update happens during TX-DMA completion/cleanup (code
>> > see call sites for netdev_tx_completed_queue()).  (As Jakub wisely
>> > point out we cannot query the device directly due to performance
>> > implications).  It doesn't need to be a new BPF hook, just something
>> > that update the queue state info (we could piggy back on the
>> > netdev_tx_completed_queue() call or give TX hook access to
>> > dev_queue->dql).  
>
> Interesting, that model does make sense to me.
>
>> The question is whether this can't simply be done through bpf helpers?
>> bpf_get_txq_occupancy(ifindex, txqno)?
>
> Helper vs dev_queue->dql field access seems like a technicality.
> The usual flexibility of implementation vs performance and simplicity
> consideration applies.. I guess?

Yeah, that was my point: We can do whatever makes the most sense for
that; it doesn't necessarily have to be something that's baked into the
TX hook point.

>> > Regarding "where is the queue": For me the XDP-TX queue is the NIC
>> > hardware queue, that this BPF hook have some visibility into and can do
>> > filtering on. (Imagine that my TX queue is bandwidth limited, then I
>> > can shrink the packet size and still send a "congestion" packet to my
>> > receiver).  
>> 
>> I'm not sure the hardware queues will be enough, though. Unless I'm
>> misunderstanding something, hardware queues are (1) fairly short and (2)
>> FIFO. So, say we wanted to implement fq_codel for XDP forwarding: we'd
>> still need a software queueing layer on top of the hardware queue.
>
> Jesper makes a very interesting point tho. If all the implementation
> wants is FIFO queues which are services in some simple manner (that is
> something that can be offloaded) we should support that.
>
> That means REDIRECT can target multiple TX queues, and we need an API
> to control the queue allocation..

Yes. I think that should be part of the "hwq API" that we presented at
LPC last year[0].

>> If the hardware is EDT-aware this may change, I suppose, but I'm not
>> sure if we can design the XDP queueing primitives with this assumption? :)
>
> But I agree with you as well. I think both HW and SW feeding needs to
> be supported. The HW implementations are always necessarily behind
> ideas people implemented and tested in SW..

Exactly. And even though we mostly think about 10-100G network
interfaces, I believe there is also potential for this to be useful at
the low end (think small embedded routers that are very CPU
constrained). In these sorts of environments software queueing is more
feasible.

>> > The bigger picture is that I envision the XDP-TX/egress hook can
>> > open-up for taking advantage of NIC hardware TX queue features. This
>> > also ties into the queue abstraction work by Björn+Magnus. Today NIC
>> > hardware can do a million TX-queues, and hardware can also do rate
>> > limiting per queue. Thus, I also envision that the XDP-TX/egress hook
>> > can choose/change the TX queue the packet is queue/sent on (we can
>> > likely just overload the XDP_REDIRECT and have a new bpf map type for
>> > this).  
>
> I wonder what that does to our HW offload model which is based on
> TC Qdisc offload today :S Do we use TC API to control configuration 
> of XDP queues? :S

My thought was that it could be baked into the same API[0]. I.e., add a
way to say not just "I want 3 hardware queues for this", but also "I
want 3 hardware queues, with fq_codel (or whatever) on top".

>> Yes, I think I mentioned in another email that putting all the queueing
>> smarts into the redirect map was also something I'd considered (well, I
>> do think we've discussed this in the past, so maybe not so surprising if
>> we're thinking along the same lines) :)
>> 
>> But the implication of this is also that an actual TX hook in the driver
>> need not necessarily incorporate a lot of new functionality, as it can
>> control the queueing through a combination of BPF helpers and map
>> updates?
>
> True, it's the dequeuing that's on the TX side, so we could go as far
> as putting all the enqueuing logic in the RX prog..
>
> To answer your question from the other email Toke, my basic model was
> kind of similar to TC Qdiscs. XDP redirect selects a device, then that
> device has an enqueue and dequeue programs. Enqueue program can be run
> in the XDP_REDIRECT context, dequeue is run every time NAPI cleaned up
> some space on the TX descriptor ring. There is a "queue state" but the
> FIFOs etc are sort of internal detail that the enqueue and dequeue
> programs only share between each other. To be clear this is not a
> suggestion of how things should be, it's what sprung to my mind without
> thinking.

Ah, that's interesting. I do agree that it may be better to bake this
into the existing hooks rather than have an additional set of
(en/de)queue hooks.

-Toke

[0] https://linuxplumbersconf.org/event/4/contributions/462/


^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2020-02-05 15:31 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-23  1:41 [PATCH bpf-next 00/12] Add support for XDP in egress path David Ahern
2020-01-23  1:41 ` [PATCH bpf-next 01/12] net: Add new XDP setup and query commands David Ahern
2020-01-23  1:42 ` [PATCH bpf-next 02/12] net: Add BPF_XDP_EGRESS as a bpf_attach_type David Ahern
2020-01-23 11:34   ` Toke Høiland-Jørgensen
2020-01-23 21:32     ` David Ahern
2020-01-24  9:49       ` Toke Høiland-Jørgensen
2020-01-24  7:33   ` Martin Lau
2020-01-23  1:42 ` [PATCH bpf-next 03/12] net: Add IFLA_XDP_EGRESS for XDP programs in the egress path David Ahern
2020-01-23 11:35   ` Toke Høiland-Jørgensen
2020-01-23 21:33     ` David Ahern
2020-01-24 15:21       ` Jakub Kicinski
2020-01-24 15:36         ` Toke Høiland-Jørgensen
2020-01-26  1:43           ` David Ahern
2020-01-26  4:54             ` Alexei Starovoitov
2020-02-02 17:59               ` David Ahern
2020-01-26 12:49             ` Jesper Dangaard Brouer
2020-01-26 16:38               ` David Ahern
2020-01-26 22:17               ` Jakub Kicinski
2020-01-28 14:13                 ` Jesper Dangaard Brouer
2020-01-30 14:45                   ` Jakub Kicinski
2020-02-01 16:03                     ` Toke Høiland-Jørgensen
2020-02-02 17:48                       ` David Ahern
2020-01-26 22:11             ` Jakub Kicinski
2020-01-27  4:03               ` David Ahern
2020-01-27 14:16                 ` Jakub Kicinski
2020-01-28  3:43                   ` David Ahern
2020-01-28 13:57                     ` Jakub Kicinski
2020-02-01 16:24                       ` Toke Høiland-Jørgensen
2020-02-01 17:08                         ` Jakub Kicinski
2020-02-01 20:05                           ` Toke Høiland-Jørgensen
2020-02-02  4:15                             ` Jakub Kicinski
2020-02-03 19:56                               ` Toke Høiland-Jørgensen
2020-02-03 20:13                               ` Toke Høiland-Jørgensen
2020-02-03 22:15                                 ` Jesper Dangaard Brouer
2020-02-04 11:00                                   ` Toke Høiland-Jørgensen
2020-02-04 17:09                                     ` Jakub Kicinski
2020-02-05 15:30                                       ` Toke Høiland-Jørgensen
2020-02-02 17:45                           ` David Ahern
2020-02-02 19:12                             ` Jakub Kicinski
2020-02-02 17:43                       ` David Ahern
2020-02-02 19:31                         ` Jakub Kicinski
2020-02-02 21:51                           ` David Ahern
2020-02-01 15:59             ` Toke Høiland-Jørgensen
2020-02-02 17:54               ` David Ahern
2020-02-03 20:09                 ` Toke Høiland-Jørgensen
2020-01-23  1:42 ` [PATCH bpf-next 04/12] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() David Ahern
2020-01-23  1:42 ` [PATCH bpf-next 05/12] tuntap: check tun_msg_ctl type at necessary places David Ahern
2020-01-23  1:42 ` [PATCH bpf-next 06/12] tun: move shared functions to if_tun.h David Ahern
2020-01-23  1:42 ` [PATCH bpf-next 07/12] vhost_net: user tap recvmsg api to access ptr ring David Ahern
2020-01-23  8:26   ` Michael S. Tsirkin
2020-01-23  1:42 ` [PATCH bpf-next 08/12] tuntap: remove usage of ptr ring in vhost_net David Ahern
2020-01-23  1:42 ` [PATCH bpf-next 09/12] tun: set egress XDP program David Ahern
2020-01-23  1:42 ` [PATCH bpf-next 10/12] tun: run XDP program in tx path David Ahern
2020-01-23  8:23   ` Michael S. Tsirkin
2020-01-24 13:36     ` Prashant Bhole
2020-01-24 13:44     ` Prashant Bhole
2020-01-23  1:42 ` [PATCH bpf-next 11/12] libbpf: Add egress XDP support David Ahern
2020-01-23  1:42 ` [PATCH bpf-next 12/12] samples/bpf: xdp1, add " David Ahern

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).