KVM Archive on lore.kernel.org
 help / color / Atom feed
* [RFC net-next 00/18] virtio_net XDP offload
@ 2019-11-26 10:07 Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 01/18] bpf: introduce bpf_prog_offload_verifier_setup() Prashant Bhole
                   ` (18 more replies)
  0 siblings, 19 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

Note: This RFC has been sent to netdev as well as qemu-devel lists

This series introduces XDP offloading from virtio_net. It is based on
the following work by Jason Wang:
https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net

Current XDP performance in virtio-net is far from what we can achieve
on host. Several major factors cause the difference:
- Cost of virtualization
- Cost of virtio (populating virtqueue and context switching)
- Cost of vhost, it needs more optimization
- Cost of data copy
Because of above reasons there is a need of offloading XDP program to
host. This set is an attempt to implement XDP offload from the guest.


* High level design:

virtio_net exposes itself as offload capable device and works as a
transport of commands to load the program on the host. When offload is
requested, it sends the program to Qemu. Qemu then loads the program
and attaches to corresponding tap device. Similarly virtio_net sends
control commands to create and control maps. tap device runs the XDP
prog in its Tx path. The fast datapath remains on host whereas slow
path in which user program reads/updates map values remains in the
guest.

When offloading to actual hardware the program needs to be translated
and JITed for the target hardware. In case of offloading from guest
we pass almost raw program to the host. The verifier on the host
verifies the offloaded program.


* Implementation in Kernel


virtio_net
==========
Creates bpf offload device and registers as offload capable device.
It also implements bpf_map_dev_ops to handle the offloaded map. A new
command structure is defined to communicate with qemu.

Map offload:
- In offload sequence maps are always offloaded before the program. In
  map offloading stage, virtio_net sends control commands to qemu to
  create a map and return a map fd which is valid on host. This fd is
  stored in driver specific map structure. A list of such maps is
  maintained.

- Currently BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_HASH are supported.
  Offloading a per cpu array from guest to host doesn't make sense.

Program offload:
- In general the verifier in the guest replaces map fds in the user
  submitted programs with map pointers then bpf_prog_offload_ops
  callbacks are called.

- This set introduces new program offload callback 'setup()' which
  verifier calls before replacing map fds with map pointers. This way
  virtio_net can create a copy of the program with guest map fds. It
  was needed because virtio_net wants to derive driver specific map
  data from guest map fd. Then guest map fd will be replaced with
  host map fd in the copy of the program, hence the copy of the
  program which will be submitted to the host will have valid host map
  fds.

- Alternatively if we can move the prep() call in the verifier before
  map fd replacement happens, there is not need to introduce 'setup()'
  callback.

- As per current implementation of 'setup()' callback in virtio_net,
  it verifies full program for allowed helper functions and performs
  above mentioned map fd replacement.

- A list of allowed helper function is maintained and it is currently
  experimental, it will be updated later as per need. Using this
  list we can filter out most non-XDP type programs to some extent.
  Also we prevent the guest from collecting host specific information
  by not allowing some helper calls.

- XDP_PROG_SETUP_HW is called after successful program verification.
  In this call a control buffer is prepared, program instructions are
  appended to the buffer and it is sent to qemu.

tun
===
This set makes changes in tun to run XDP prog in Tx path. It will be
the offloaded program from the guest. This program can be set using
tun ioctl interface. There were multiple places where this program can
be executed.
- tun_net_xmit
- tun_xdp_xmit
- tun_recvmsg
tun_recvmsg was chosen because it runs in process context. The other
two run in bh context. Running in process context helps in setting up
service chaining using XDP redirect.

XDP_REDIRECT action of offloaded program isn't handled. It is because
target interface's ndo_xdp_xmit is called when we redirect a packet.
In offload case the target interface will be some tap interface. Any
packet redirected towards it will sent back to the guest, which is not
what we expect. Such redirect will need special handling in the kernel

XDP_TX action of offloaded program is handled. Packet is injected into
the Rx path in this case. Care is taken such that the tap's native Rx
path XDP will be executed in such case.


* Implementation in Qemu

Qemu is modified to handle handle control commands from the guest.
When program offload command is received, it loads the program in the
host OS and attaches program fd to tap device. All the program and map
operations are performed using libbpf APIs.


* Performance numbers

Single flow tests were performed. The diagram below shows the setup.
xdp1 and xdp2 sample programs were modified to use BPF_MAP_TYPE_ARRAY
instead of per cpu array and xdp1_user.c was modified to have hardware
offload parameter.

                     (Rx path XDP to drop      (Tx path XDP.
                      XDP_TX'ed pkts from       Program offloaded
                      tun Tx path XDP)          from virtio_net)
                          XDP_DROP ----------.  XDP_DROP/XDP_TX
                                              \   |
                                    (Case 2)   \  |   XDP_DROP/XDP_TX
 pktgen ---> 10G-NIC === 10G-NIC --- bridge --- tun --- virtio-net
|<------ netns ------>|    |                     ^   |<----guest---->|
                           v                     |
                           '---- XDP_REDIRECT----'
                                  (Case 1)

Case 1: Packets XDP_REDIRECT'ed towards tun.
                        Non-offload        Offload
  xdp1 (XDP_DROP)        2.46 Mpps        12.90 Mpps
  xdp2 (XDP_TX)          1.50 Mpps         7.26 Mpps

Case 2: Packets are not redirected. They pass through the bridge.
                        Non-offload        Offload
  xdp1 (XDP_DROP)        1.03 Mpps         1.01 Mpps
  xdp2 (XDP_TX)          1.10 Mpps         0.99 Mpps

  In case 2, the offload performance is low. In this case the
  producer function is tun_net_xmit. It puts single packet in ptr ring
  and spends most of the time in waking up vhost thread. On the other
  hand, each time when vhost thread wakes up, it calls tun_recvmsg.
  Since Tx path XDP runs in tun_recvmsg, vhost doesn't see any packet.
  It sleeps frequently and producer function most spends more time in
  waking it up. vhost polling improves these numbers but in that case
  non-offload performance also improves and remains higher than the
  offload case. Performance in this case can be improved later in a
  separate work.

Since this set makes changes in virtio_net, tun and vhost_net, it was
necessary to measure the performance difference after applying this
set. Performance numbers are in table below:

   Netperf Test         Before      After      Difference
  UDP_STREAM 18byte     89.43       90.74       +1.46%
  UDP_STREAM 1472byte    6882        7026       +2.09%
  TCP_STREAM             9403        9407       +0.04%
  UDP_RR                13520       13478       -0.31%
  TCP_RR                13120       12918       -1.53%


* Points for improvement (TODO)

- In current implementation, qemu passes host map fd to the guest,
  which means guest is poking host information. It can be avoided by
  moving the map fd replacement task from guest to qemu.

- Currently there is no way on the host side to show whether a tap
  interface has offloaded XDP program attached.

- When sending program and map related control commands from guest to
  host, it will be better if we pass metadata about the program, map.
  For example BTF data.

- In future virtio can have feature bit for offloading capability

- TUNGETFEATURES should have a flag to notify about offloading
  capability

- Submit virtio spec patch to describe XDP offloading feature

- When offloading is enabled, it should be a migration blocker.

- DoS: Offloaded map uses host's memory which is other than what has
  been allocated for the guest. Offloading many maps of large size can
  be one of the DoS strategy. Hence qemu should have parameter to
  limit how many maps guest can offload or how much memory offloaded
  maps use.


* Other dependencies

- Loading a bpf program requires CAP_SYS_ADMIN capability. We tested
  this set by running qemu as root OR adding CAP_SYS_ADMIN to the
  qemu binary. In other cases Qemu doesn't have this capability.
  Alexei's recent work CAP_BPF can be a solution to this problem.
  The CAP_BPF work is still being discussed in the mailing list.

Jason Wang (9):
  bpf: introduce bpf_prog_offload_verifier_setup()
  net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core()
  net: core: export do_xdp_generic_core()
  tun: set offloaded xdp program
  virtio-net: store xdp_prog in device
  virtio_net: add XDP prog offload infrastructure
  virtio_net: implement XDP prog offload functionality
  bpf: export function __bpf_map_get
  virtio_net: implment XDP map offload functionality

Prashant Bhole (9):
  tuntap: check tun_msg_ctl type at necessary places
  vhost_net: user tap recvmsg api to access ptr ring
  tuntap: remove usage of ptr ring in vhost_net
  tun: run offloaded XDP program in Tx path
  tun: add a way to inject Tx path packet into Rx path
  tun: handle XDP_TX action of offloaded program
  tun: run xdp prog when tun is read from file interface
  virtio_net: use XDP attachment helpers
  virtio_net: restrict bpf helper calls from offloaded program

 drivers/net/tap.c               |  42 ++-
 drivers/net/tun.c               | 257 +++++++++++++--
 drivers/net/virtio_net.c        | 552 +++++++++++++++++++++++++++++---
 drivers/vhost/net.c             |  77 ++---
 include/linux/bpf.h             |   1 +
 include/linux/bpf_verifier.h    |   1 +
 include/linux/if_tap.h          |   5 -
 include/linux/if_tun.h          |  23 +-
 include/linux/netdevice.h       |   2 +
 include/uapi/linux/if_tun.h     |   1 +
 include/uapi/linux/virtio_net.h |  50 +++
 kernel/bpf/offload.c            |  14 +
 kernel/bpf/syscall.c            |   1 +
 kernel/bpf/verifier.c           |   6 +
 net/core/dev.c                  |   8 +-
 15 files changed, 901 insertions(+), 139 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 01/18] bpf: introduce bpf_prog_offload_verifier_setup()
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 02/18] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() Prashant Bhole
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

Background:
This change was initiated from virtio_net XDP offload work. As per
the implementation plan, a copy of original program with map fds from
guest replaced with map fds from host needs to be offloaded to the
host. To implement this fd replacement, insn_hook() must provide an
insn with map fd intact. bpf_map and driver specific map data can be
derived from map_fd.

Since verifier calls all the offload callbacks after replacing map
fds, it was difficult to implement virtio_net XDP offload feature.
If virtio_net gets only one callback with original bpf program, it
will get a chance to perform the fd replacement in its own copy of the
program.

Solution:
Let's introduce a setup() callback in bpf_prog_offload_ops. It will be
non mandetory. The verifier will call it just before replacing the map
fds.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 include/linux/bpf.h          |  1 +
 include/linux/bpf_verifier.h |  1 +
 kernel/bpf/offload.c         | 14 ++++++++++++++
 kernel/bpf/verifier.c        |  6 ++++++
 4 files changed, 22 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 35903f148be5..1cdba120357c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -361,6 +361,7 @@ struct bpf_prog_offload_ops {
 			    struct bpf_insn *insn);
 	int (*remove_insns)(struct bpf_verifier_env *env, u32 off, u32 cnt);
 	/* program management callbacks */
+	int (*setup)(struct bpf_prog *prog);
 	int (*prepare)(struct bpf_prog *prog);
 	int (*translate)(struct bpf_prog *prog);
 	void (*destroy)(struct bpf_prog *prog);
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 26e40de9ef55..de7028e17c0d 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -418,6 +418,7 @@ static inline struct bpf_reg_state *cur_regs(struct bpf_verifier_env *env)
 	return cur_func(env)->regs;
 }
 
+int bpf_prog_offload_verifier_setup(struct bpf_prog *prog);
 int bpf_prog_offload_verifier_prep(struct bpf_prog *prog);
 int bpf_prog_offload_verify_insn(struct bpf_verifier_env *env,
 				 int insn_idx, int prev_insn_idx);
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 5b9da0954a27..04ca7a31d947 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -124,6 +124,20 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	return err;
 }
 
+int bpf_prog_offload_verifier_setup(struct bpf_prog *prog)
+{
+	struct bpf_prog_offload *offload;
+	int ret = 0;
+
+	down_read(&bpf_devs_lock);
+	offload = prog->aux->offload;
+	if (offload && offload->offdev->ops->setup)
+		ret = offload->offdev->ops->setup(prog);
+	up_read(&bpf_devs_lock);
+
+	return ret;
+}
+
 int bpf_prog_offload_verifier_prep(struct bpf_prog *prog)
 {
 	struct bpf_prog_offload *offload;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a0482e1c4a77..94b43542439e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9737,6 +9737,12 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr,
 
 	env->allow_ptr_leaks = is_priv;
 
+	if (bpf_prog_is_dev_bound(env->prog->aux)) {
+		ret = bpf_prog_offload_verifier_setup(env->prog);
+		if (ret)
+			goto skip_full_check;
+	}
+
 	if (is_priv)
 		env->test_state_freq = attr->prog_flags & BPF_F_TEST_STATE_FREQ;
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 02/18] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core()
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 01/18] bpf: introduce bpf_prog_offload_verifier_setup() Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 03/18] net: core: export do_xdp_generic_core() Prashant Bhole
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

In skb generic path, we need a way to run XDP program on skb but
to have customized handling of XDP actions. netif_receive_generic_xdp
will be more helpful in such cases than do_xdp_generic.

This patch prepares netif_receive_generic_xdp() to be used as general
purpose function by renaming it.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 net/core/dev.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c7fc902ccbdc..5ae647b9914f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4461,9 +4461,9 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 	return rxqueue;
 }
 
-static u32 netif_receive_generic_xdp(struct sk_buff *skb,
-				     struct xdp_buff *xdp,
-				     struct bpf_prog *xdp_prog)
+static u32 do_xdp_generic_core(struct sk_buff *skb,
+			       struct xdp_buff *xdp,
+			       struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	void *orig_data, *orig_data_end;
@@ -4610,7 +4610,7 @@ int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 		u32 act;
 		int err;
 
-		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
+		act = do_xdp_generic_core(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
 			case XDP_REDIRECT:
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 03/18] net: core: export do_xdp_generic_core()
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 01/18] bpf: introduce bpf_prog_offload_verifier_setup() Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 02/18] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 04/18] tuntap: check tun_msg_ctl type at necessary places Prashant Bhole
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

Let's export do_xdp_generic as a general purpose function. It will
just run XDP program on skb but will not handle XDP actions.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 include/linux/netdevice.h | 2 ++
 net/core/dev.c            | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9e6fb8524d91..2b6317ac9795 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3648,6 +3648,8 @@ static inline void dev_consume_skb_any(struct sk_buff *skb)
 
 void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog);
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb);
+u32 do_xdp_generic_core(struct sk_buff *skb, struct xdp_buff *xdp,
+			struct bpf_prog *xdp_prog);
 int netif_rx(struct sk_buff *skb);
 int netif_rx_ni(struct sk_buff *skb);
 int netif_receive_skb(struct sk_buff *skb);
diff --git a/net/core/dev.c b/net/core/dev.c
index 5ae647b9914f..d97c3f35e047 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4461,9 +4461,8 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 	return rxqueue;
 }
 
-static u32 do_xdp_generic_core(struct sk_buff *skb,
-			       struct xdp_buff *xdp,
-			       struct bpf_prog *xdp_prog)
+u32 do_xdp_generic_core(struct sk_buff *skb, struct xdp_buff *xdp,
+			struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	void *orig_data, *orig_data_end;
@@ -4574,6 +4573,7 @@ static u32 do_xdp_generic_core(struct sk_buff *skb,
 
 	return act;
 }
+EXPORT_SYMBOL_GPL(do_xdp_generic_core);
 
 /* When doing generic XDP we have to bypass the qdisc layer and the
  * network taps in order to match in-driver-XDP behavior.
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 04/18] tuntap: check tun_msg_ctl type at necessary places
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (2 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 03/18] net: core: export do_xdp_generic_core() Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 05/18] vhost_net: user tap recvmsg api to access ptr ring Prashant Bhole
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

tun_msg_ctl is used by vhost_net to communicate with tuntap. We will
introduce another type in soon. As a preparation this patch adds
conditions to check tun_msg_ctl type at necessary places.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tap.c | 7 +++++--
 drivers/net/tun.c | 6 +++++-
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 3ae70c7e6860..4df7bf00af66 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1213,6 +1213,7 @@ static int tap_sendmsg(struct socket *sock, struct msghdr *m,
 	struct tap_queue *q = container_of(sock, struct tap_queue, sock);
 	struct tun_msg_ctl *ctl = m->msg_control;
 	struct xdp_buff *xdp;
+	void *ptr = NULL;
 	int i;
 
 	if (ctl && (ctl->type == TUN_MSG_PTR)) {
@@ -1223,8 +1224,10 @@ static int tap_sendmsg(struct socket *sock, struct msghdr *m,
 		return 0;
 	}
 
-	return tap_get_user(q, ctl ? ctl->ptr : NULL, &m->msg_iter,
-			    m->msg_flags & MSG_DONTWAIT);
+	if (ctl && ctl->type == TUN_MSG_UBUF)
+		ptr = ctl->ptr;
+
+	return tap_get_user(q, ptr, &m->msg_iter, m->msg_flags & MSG_DONTWAIT);
 }
 
 static int tap_recvmsg(struct socket *sock, struct msghdr *m,
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 683d371e6e82..1e436d9ec4e1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2529,6 +2529,7 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	struct tun_struct *tun = tun_get(tfile);
 	struct tun_msg_ctl *ctl = m->msg_control;
 	struct xdp_buff *xdp;
+	void *ptr = NULL;
 
 	if (!tun)
 		return -EBADFD;
@@ -2560,7 +2561,10 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 		goto out;
 	}
 
-	ret = tun_get_user(tun, tfile, ctl ? ctl->ptr : NULL, &m->msg_iter,
+	if (ctl && ctl->type == TUN_MSG_UBUF)
+		ptr = ctl->ptr;
+
+	ret = tun_get_user(tun, tfile, ptr, &m->msg_iter,
 			   m->msg_flags & MSG_DONTWAIT,
 			   m->msg_flags & MSG_MORE);
 out:
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 05/18] vhost_net: user tap recvmsg api to access ptr ring
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (3 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 04/18] tuntap: check tun_msg_ctl type at necessary places Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 06/18] tuntap: remove usage of ptr ring in vhost_net Prashant Bhole
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

Currently vhost_net directly accesses ptr ring of tap driver to
fetch Rx packet pointers. In order to avoid it this patch modifies
tap driver's recvmsg api to do additional task of fetching Rx packet
pointers.

A special struct tun_msg_ctl is already being passed via msg_control
for tun Rx XDP batching. This patch extends tun_msg_ctl usage to
send sub commands to recvmsg api. Now tun_recvmsg will handle commands
to consume and unconsume packet pointers from ptr ring.

This will be useful in implementation of virtio-net XDP offload
feature, where packets will be XDP processed before they are passed
to vhost_net.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tap.c      | 22 ++++++++++++++++++-
 drivers/net/tun.c      | 24 ++++++++++++++++++++-
 drivers/vhost/net.c    | 48 +++++++++++++++++++++++++++++++-----------
 include/linux/if_tun.h | 18 ++++++++++++++++
 4 files changed, 98 insertions(+), 14 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 4df7bf00af66..8635cdfd7aa4 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1234,8 +1234,28 @@ static int tap_recvmsg(struct socket *sock, struct msghdr *m,
 		       size_t total_len, int flags)
 {
 	struct tap_queue *q = container_of(sock, struct tap_queue, sock);
-	struct sk_buff *skb = m->msg_control;
+	struct tun_msg_ctl *ctl = m->msg_control;
+	struct sk_buff *skb = NULL;
 	int ret;
+
+	if (ctl) {
+		switch (ctl->type) {
+		case TUN_MSG_PKT:
+			skb = ctl->ptr;
+			break;
+		case TUN_MSG_CONSUME_PKTS:
+			return ptr_ring_consume_batched(&q->ring,
+							ctl->ptr,
+							ctl->num);
+		case TUN_MSG_UNCONSUME_PKTS:
+			ptr_ring_unconsume(&q->ring, ctl->ptr, ctl->num,
+					   tun_ptr_free);
+			return 0;
+		default:
+			return -EINVAL;
+		}
+	}
+
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC)) {
 		kfree_skb(skb);
 		return -EINVAL;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1e436d9ec4e1..4f28f2387435 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2577,7 +2577,8 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 {
 	struct tun_file *tfile = container_of(sock, struct tun_file, socket);
 	struct tun_struct *tun = tun_get(tfile);
-	void *ptr = m->msg_control;
+	struct tun_msg_ctl *ctl = m->msg_control;
+	void *ptr = NULL;
 	int ret;
 
 	if (!tun) {
@@ -2585,6 +2586,27 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 		goto out_free;
 	}
 
+	if (ctl) {
+		switch (ctl->type) {
+		case TUN_MSG_PKT:
+			ptr = ctl->ptr;
+			break;
+		case TUN_MSG_CONSUME_PKTS:
+			ret = ptr_ring_consume_batched(&tfile->tx_ring,
+						       ctl->ptr,
+						       ctl->num);
+			goto out;
+		case TUN_MSG_UNCONSUME_PKTS:
+			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,
+					   ctl->num, tun_ptr_free);
+			ret = 0;
+			goto out;
+		default:
+			ret = -EINVAL;
+			goto out_put_tun;
+		}
+	}
+
 	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC|MSG_ERRQUEUE)) {
 		ret = -EINVAL;
 		goto out_put_tun;
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 1a2dd53caade..0f91b374a558 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -175,24 +175,44 @@ static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
 
 static int vhost_net_buf_produce(struct vhost_net_virtqueue *nvq)
 {
+	struct vhost_virtqueue *vq = &nvq->vq;
+	struct socket *sock = vq->private_data;
 	struct vhost_net_buf *rxq = &nvq->rxq;
+	struct tun_msg_ctl ctl = {
+		.type = TUN_MSG_CONSUME_PKTS,
+		.ptr = (void *) rxq->queue,
+		.num = VHOST_NET_BATCH,
+	};
+	struct msghdr msg = {
+		.msg_control = &ctl,
+	};
 
 	rxq->head = 0;
-	rxq->tail = ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
-					      VHOST_NET_BATCH);
+	rxq->tail = sock->ops->recvmsg(sock, &msg, 0, 0);
+	if (WARN_ON_ONCE(rxq->tail < 0))
+		rxq->tail = 0;
+
 	return rxq->tail;
 }
 
 static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
 {
+	struct vhost_virtqueue *vq = &nvq->vq;
+	struct socket *sock = vq->private_data;
 	struct vhost_net_buf *rxq = &nvq->rxq;
+	struct tun_msg_ctl ctl = {
+		.type = TUN_MSG_UNCONSUME_PKTS,
+		.ptr = (void *) (rxq->queue + rxq->head),
+		.num = vhost_net_buf_get_size(rxq),
+	};
+	struct msghdr msg = {
+		.msg_control = &ctl,
+	};
 
-	if (nvq->rx_ring && !vhost_net_buf_is_empty(rxq)) {
-		ptr_ring_unconsume(nvq->rx_ring, rxq->queue + rxq->head,
-				   vhost_net_buf_get_size(rxq),
-				   tun_ptr_free);
-		rxq->head = rxq->tail = 0;
-	}
+	if (!vhost_net_buf_is_empty(rxq))
+		sock->ops->recvmsg(sock, &msg, 0, 0);
+
+	rxq->head = rxq->tail = 0;
 }
 
 static int vhost_net_buf_peek_len(void *ptr)
@@ -1109,6 +1129,7 @@ static void handle_rx(struct vhost_net *net)
 		.flags = 0,
 		.gso_type = VIRTIO_NET_HDR_GSO_NONE
 	};
+	struct tun_msg_ctl ctl;
 	size_t total_len = 0;
 	int err, mergeable;
 	s16 headcount;
@@ -1166,8 +1187,11 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		busyloop_intr = false;
-		if (nvq->rx_ring)
-			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
+		if (nvq->rx_ring) {
+			ctl.type = TUN_MSG_PKT;
+			ctl.ptr = vhost_net_buf_consume(&nvq->rxq);
+			msg.msg_control = &ctl;
+		}
 		/* On overrun, truncate and discard */
 		if (unlikely(headcount > UIO_MAXIOV)) {
 			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
@@ -1346,8 +1370,8 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 	mutex_lock(&vq->mutex);
 	sock = vq->private_data;
 	vhost_net_disable_vq(n, vq);
-	vq->private_data = NULL;
 	vhost_net_buf_unproduce(nvq);
+	vq->private_data = NULL;
 	nvq->rx_ring = NULL;
 	mutex_unlock(&vq->mutex);
 	return sock;
@@ -1538,8 +1562,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		}
 
 		vhost_net_disable_vq(n, vq);
-		vq->private_data = sock;
 		vhost_net_buf_unproduce(nvq);
+		vq->private_data = sock;
 		r = vhost_vq_init_access(vq);
 		if (r)
 			goto err_used;
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 5bda8cf457b6..bb94843e3829 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -11,8 +11,26 @@
 
 #define TUN_XDP_FLAG 0x1UL
 
+/*
+ * tun_msg_ctl types
+ */
+
 #define TUN_MSG_UBUF 1
 #define TUN_MSG_PTR  2
+/*
+ * Used for passing a packet pointer from vhost to tun
+ */
+#define TUN_MSG_PKT  3
+/*
+ * Used for passing an array of pointer from vhost to tun.
+ * tun consumes packets from ptr ring and stores in pointer array.
+ */
+#define TUN_MSG_CONSUME_PKTS    4
+/*
+ * Used for passing an array of pointer from vhost to tun.
+ * tun consumes get pointer from array and puts back into ptr ring.
+ */
+#define TUN_MSG_UNCONSUME_PKTS  5
 struct tun_msg_ctl {
 	unsigned short type;
 	unsigned short num;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 06/18] tuntap: remove usage of ptr ring in vhost_net
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (4 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 05/18] vhost_net: user tap recvmsg api to access ptr ring Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 07/18] tun: set offloaded xdp program Prashant Bhole
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

Remove usage of ptr ring of tuntap in vhost_net and remove the
functions exported from tuntap drivers to get ptr ring.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tap.c      | 13 -------------
 drivers/net/tun.c      | 13 -------------
 drivers/vhost/net.c    | 31 ++++---------------------------
 include/linux/if_tap.h |  5 -----
 include/linux/if_tun.h |  5 -----
 5 files changed, 4 insertions(+), 63 deletions(-)

diff --git a/drivers/net/tap.c b/drivers/net/tap.c
index 8635cdfd7aa4..6426501b8d0e 100644
--- a/drivers/net/tap.c
+++ b/drivers/net/tap.c
@@ -1298,19 +1298,6 @@ struct socket *tap_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tap_get_socket);
 
-struct ptr_ring *tap_get_ptr_ring(struct file *file)
-{
-	struct tap_queue *q;
-
-	if (file->f_op != &tap_fops)
-		return ERR_PTR(-EINVAL);
-	q = file->private_data;
-	if (!q)
-		return ERR_PTR(-EBADFD);
-	return &q->ring;
-}
-EXPORT_SYMBOL_GPL(tap_get_ptr_ring);
-
 int tap_queue_resize(struct tap_dev *tap)
 {
 	struct net_device *dev = tap->dev;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4f28f2387435..d078b4659897 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -3750,19 +3750,6 @@ struct socket *tun_get_socket(struct file *file)
 }
 EXPORT_SYMBOL_GPL(tun_get_socket);
 
-struct ptr_ring *tun_get_tx_ring(struct file *file)
-{
-	struct tun_file *tfile;
-
-	if (file->f_op != &tun_fops)
-		return ERR_PTR(-EINVAL);
-	tfile = file->private_data;
-	if (!tfile)
-		return ERR_PTR(-EBADFD);
-	return &tfile->tx_ring;
-}
-EXPORT_SYMBOL_GPL(tun_get_tx_ring);
-
 module_init(tun_init);
 module_exit(tun_cleanup);
 MODULE_DESCRIPTION(DRV_DESCRIPTION);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0f91b374a558..2e069d1ef946 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -122,7 +122,6 @@ struct vhost_net_virtqueue {
 	/* Reference counting for outstanding ubufs.
 	 * Protected by vq mutex. Writers must also take device mutex. */
 	struct vhost_net_ubuf_ref *ubufs;
-	struct ptr_ring *rx_ring;
 	struct vhost_net_buf rxq;
 	/* Batched XDP buffs */
 	struct xdp_buff *xdp;
@@ -997,8 +996,9 @@ static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
 	int len = 0;
 	unsigned long flags;
 
-	if (rvq->rx_ring)
-		return vhost_net_buf_peek(rvq);
+	len = vhost_net_buf_peek(rvq);
+	if (len)
+		return len;
 
 	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	head = skb_peek(&sk->sk_receive_queue);
@@ -1187,7 +1187,7 @@ static void handle_rx(struct vhost_net *net)
 			goto out;
 		}
 		busyloop_intr = false;
-		if (nvq->rx_ring) {
+		if (!vhost_net_buf_is_empty(&nvq->rxq)) {
 			ctl.type = TUN_MSG_PKT;
 			ctl.ptr = vhost_net_buf_consume(&nvq->rxq);
 			msg.msg_control = &ctl;
@@ -1343,7 +1343,6 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 		n->vqs[i].batched_xdp = 0;
 		n->vqs[i].vhost_hlen = 0;
 		n->vqs[i].sock_hlen = 0;
-		n->vqs[i].rx_ring = NULL;
 		vhost_net_buf_init(&n->vqs[i].rxq);
 	}
 	vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
@@ -1372,7 +1371,6 @@ static struct socket *vhost_net_stop_vq(struct vhost_net *n,
 	vhost_net_disable_vq(n, vq);
 	vhost_net_buf_unproduce(nvq);
 	vq->private_data = NULL;
-	nvq->rx_ring = NULL;
 	mutex_unlock(&vq->mutex);
 	return sock;
 }
@@ -1468,25 +1466,6 @@ static struct socket *get_raw_socket(int fd)
 	return ERR_PTR(r);
 }
 
-static struct ptr_ring *get_tap_ptr_ring(int fd)
-{
-	struct ptr_ring *ring;
-	struct file *file = fget(fd);
-
-	if (!file)
-		return NULL;
-	ring = tun_get_tx_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = tap_get_ptr_ring(file);
-	if (!IS_ERR(ring))
-		goto out;
-	ring = NULL;
-out:
-	fput(file);
-	return ring;
-}
-
 static struct socket *get_tap_socket(int fd)
 {
 	struct file *file = fget(fd);
@@ -1570,8 +1549,6 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		r = vhost_net_enable_vq(n, vq);
 		if (r)
 			goto err_used;
-		if (index == VHOST_NET_VQ_RX)
-			nvq->rx_ring = get_tap_ptr_ring(fd);
 
 		oldubufs = nvq->ubufs;
 		nvq->ubufs = ubufs;
diff --git a/include/linux/if_tap.h b/include/linux/if_tap.h
index 915a187cfabd..68fe366fb185 100644
--- a/include/linux/if_tap.h
+++ b/include/linux/if_tap.h
@@ -4,7 +4,6 @@
 
 #if IS_ENABLED(CONFIG_TAP)
 struct socket *tap_get_socket(struct file *);
-struct ptr_ring *tap_get_ptr_ring(struct file *file);
 #else
 #include <linux/err.h>
 #include <linux/errno.h>
@@ -14,10 +13,6 @@ static inline struct socket *tap_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline struct ptr_ring *tap_get_ptr_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
 #endif /* CONFIG_TAP */
 
 #include <net/sock.h>
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index bb94843e3829..f01a255e076d 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -44,7 +44,6 @@ struct tun_xdp_hdr {
 
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
-struct ptr_ring *tun_get_tx_ring(struct file *file);
 bool tun_is_xdp_frame(void *ptr);
 void *tun_xdp_to_ptr(void *ptr);
 void *tun_ptr_to_xdp(void *ptr);
@@ -58,10 +57,6 @@ static inline struct socket *tun_get_socket(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
-{
-	return ERR_PTR(-EINVAL);
-}
 static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 07/18] tun: set offloaded xdp program
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (5 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 06/18] tuntap: remove usage of ptr ring in vhost_net Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-12-01 16:35   ` David Ahern
  2019-12-01 16:45   ` David Ahern
  2019-11-26 10:07 ` [RFC net-next 08/18] tun: run offloaded XDP program in Tx path Prashant Bhole
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

This patch introduces an ioctl way to set an offloaded XDP program
to tun driver. This ioctl will be used by qemu to offload XDP program
from virtio_net in the guest.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c           | 19 ++++++++++++++-----
 include/uapi/linux/if_tun.h |  1 +
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d078b4659897..ecb49101b0b5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -241,6 +241,7 @@ struct tun_struct {
 	struct bpf_prog __rcu *xdp_prog;
 	struct tun_prog __rcu *steering_prog;
 	struct tun_prog __rcu *filter_prog;
+	struct tun_prog __rcu *offloaded_xdp_prog;
 	struct ethtool_link_ksettings link_ksettings;
 };
 
@@ -2256,7 +2257,7 @@ static void tun_prog_free(struct rcu_head *rcu)
 {
 	struct tun_prog *prog = container_of(rcu, struct tun_prog, rcu);
 
-	bpf_prog_destroy(prog->prog);
+	bpf_prog_put(prog->prog);
 	kfree(prog);
 }
 
@@ -2301,6 +2302,7 @@ static void tun_free_netdev(struct net_device *dev)
 	security_tun_dev_free_security(tun->security);
 	__tun_set_ebpf(tun, &tun->steering_prog, NULL);
 	__tun_set_ebpf(tun, &tun->filter_prog, NULL);
+	__tun_set_ebpf(tun, &tun->offloaded_xdp_prog, NULL);
 }
 
 static void tun_setup(struct net_device *dev)
@@ -3036,7 +3038,7 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)
 }
 
 static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog **prog_p,
-			void __user *data)
+			void __user *data, int type)
 {
 	struct bpf_prog *prog;
 	int fd;
@@ -3047,7 +3049,7 @@ static int tun_set_ebpf(struct tun_struct *tun, struct tun_prog **prog_p,
 	if (fd == -1) {
 		prog = NULL;
 	} else {
-		prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
+		prog = bpf_prog_get_type(fd, type);
 		if (IS_ERR(prog))
 			return PTR_ERR(prog);
 	}
@@ -3345,11 +3347,18 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		break;
 
 	case TUNSETSTEERINGEBPF:
-		ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
+		ret = tun_set_ebpf(tun, &tun->steering_prog, argp,
+				   BPF_PROG_TYPE_SOCKET_FILTER);
 		break;
 
 	case TUNSETFILTEREBPF:
-		ret = tun_set_ebpf(tun, &tun->filter_prog, argp);
+		ret = tun_set_ebpf(tun, &tun->filter_prog, argp,
+				   BPF_PROG_TYPE_SOCKET_FILTER);
+		break;
+
+	case TUNSETOFFLOADEDXDP:
+		ret = tun_set_ebpf(tun, &tun->offloaded_xdp_prog, argp,
+				   BPF_PROG_TYPE_XDP);
 		break;
 
 	case TUNSETCARRIER:
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 454ae31b93c7..21dbd8db2401 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -61,6 +61,7 @@
 #define TUNSETFILTEREBPF _IOR('T', 225, int)
 #define TUNSETCARRIER _IOW('T', 226, int)
 #define TUNGETDEVNETNS _IO('T', 227)
+#define TUNSETOFFLOADEDXDP _IOW('T', 228, int)
 
 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 08/18] tun: run offloaded XDP program in Tx path
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (6 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 07/18] tun: set offloaded xdp program Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-12-01 16:39   ` David Ahern
  2019-11-26 10:07 ` [RFC net-next 09/18] tun: add a way to inject Tx path packet into Rx path Prashant Bhole
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

run offloaded XDP program as soon as packet is removed from the ptr
ring. Since this is XDP in Tx path, the traditional handling of
XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
just runs the program and leaves the action handling to us.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c | 149 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 146 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ecb49101b0b5..466ea69f00ee 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -131,6 +131,7 @@ struct tap_filter {
 /* MAX_TAP_QUEUES 256 is chosen to allow rx/tx queues to be equal
  * to max number of VCPUs in guest. */
 #define MAX_TAP_QUEUES 256
+#define MAX_TAP_BATCH 64
 #define MAX_TAP_FLOWS  4096
 
 #define TUN_FLOW_EXPIRE (3 * HZ)
@@ -2156,6 +2157,109 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 	return total;
 }
 
+static struct sk_buff *tun_prepare_xdp_skb(struct sk_buff *skb)
+{
+	struct sk_buff *nskb;
+
+	if (skb_shared(skb) || skb_cloned(skb)) {
+		nskb = skb_copy(skb, GFP_ATOMIC);
+		consume_skb(skb);
+		return nskb;
+	}
+
+	return skb;
+}
+
+static u32 tun_do_xdp_offload_generic(struct tun_struct *tun,
+				      struct sk_buff *skb)
+{
+	struct tun_prog *xdp_prog;
+	struct xdp_buff xdp;
+	u32 act = XDP_PASS;
+
+	xdp_prog = rcu_dereference(tun->offloaded_xdp_prog);
+	if (xdp_prog) {
+		skb = tun_prepare_xdp_skb(skb);
+		if (!skb) {
+			act = XDP_DROP;
+			kfree_skb(skb);
+			goto drop;
+		}
+
+		act = do_xdp_generic_core(skb, &xdp, xdp_prog->prog);
+		switch (act) {
+		case XDP_TX:
+			/*
+			 * Rx path generic XDP will be called in this path
+			 */
+			netif_receive_skb(skb);
+			break;
+		case XDP_PASS:
+			break;
+		case XDP_REDIRECT:
+			/*
+			 * Since we are not handling this case yet, let's free
+			 * skb here. In case of XDP_DROP/XDP_ABORTED, the skb
+			 * was already freed in do_xdp_generic_core()
+			 */
+			kfree_skb(skb);
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog->prog, act);
+			/* fall through */
+		case XDP_DROP:
+			goto drop;
+		}
+	}
+
+	return act;
+drop:
+	this_cpu_inc(tun->pcpu_stats->tx_dropped);
+	return act;
+}
+
+static u32 tun_do_xdp_offload(struct tun_struct *tun, struct tun_file *tfile,
+			      struct xdp_frame *frame)
+{
+	struct tun_prog *xdp_prog;
+	struct tun_page tpage;
+	struct xdp_buff xdp;
+	u32 act = XDP_PASS;
+	int flush = 0;
+
+	xdp_prog = rcu_dereference(tun->offloaded_xdp_prog);
+	if (xdp_prog) {
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = xdp.data + frame->len;
+		xdp.data_meta = xdp.data - frame->metasize;
+
+		act = bpf_prog_run_xdp(xdp_prog->prog, &xdp);
+		switch (act) {
+		case XDP_PASS:
+			break;
+		case XDP_TX:
+			/* fall through */
+		case XDP_REDIRECT:
+			/* fall through */
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog->prog, act);
+			/* fall through */
+		case XDP_DROP:
+			xdp_return_frame_rx_napi(frame);
+			break;
+		}
+	}
+
+	return act;
+}
+
 static void *tun_ring_recv(struct tun_file *tfile, int noblock, int *err)
 {
 	DECLARE_WAITQUEUE(wait, current);
@@ -2574,6 +2678,47 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	return ret;
 }
 
+static int tun_consume_packets(struct tun_file *tfile, void **ptr_array, int n)
+{
+	struct tun_prog *xdp_prog;
+	struct xdp_frame *frame;
+	struct tun_struct *tun;
+	int i, num_ptrs;
+	int pkt_cnt = 0;
+	void *pkts[MAX_TAP_BATCH];
+	void *ptr;
+	u32 act;
+
+	if (unlikely(!tfile))
+		return 0;
+
+	if (n > MAX_TAP_BATCH)
+		n = MAX_TAP_BATCH;
+
+	rcu_read_lock();
+	tun = rcu_dereference(tfile->tun);
+	if (unlikely(!tun))
+		return 0;
+	xdp_prog = rcu_dereference(tun->offloaded_xdp_prog);
+
+	num_ptrs = ptr_ring_consume_batched(&tfile->tx_ring, pkts, n);
+	for (i = 0; i < num_ptrs; i++) {
+		ptr = pkts[i];
+		if (tun_is_xdp_frame(ptr)) {
+			frame = tun_ptr_to_xdp(ptr);
+			act = tun_do_xdp_offload(tun, tfile, frame);
+		} else {
+			act = tun_do_xdp_offload_generic(tun, ptr);
+		}
+
+		if (act == XDP_PASS)
+			ptr_array[pkt_cnt++] = ptr;
+	}
+
+	rcu_read_unlock();
+	return pkt_cnt;
+}
+
 static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 		       int flags)
 {
@@ -2594,9 +2739,7 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 			ptr = ctl->ptr;
 			break;
 		case TUN_MSG_CONSUME_PKTS:
-			ret = ptr_ring_consume_batched(&tfile->tx_ring,
-						       ctl->ptr,
-						       ctl->num);
+			ret = tun_consume_packets(tfile, ctl->ptr, ctl->num);
 			goto out;
 		case TUN_MSG_UNCONSUME_PKTS:
 			ptr_ring_unconsume(&tfile->tx_ring, ctl->ptr,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 09/18] tun: add a way to inject Tx path packet into Rx path
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (7 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 08/18] tun: run offloaded XDP program in Tx path Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 10/18] tun: handle XDP_TX action of offloaded program Prashant Bhole
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

In order to support XDP_TX from offloaded XDP program, we need a way
to inject Tx path packet into Rx path. Let's modify the Rx path
function tun_xdp_one() for this purpose.

This patch adds a parameter to pass information whether packet has
virtio_net header. When header isn't present, it is considered as
XDP_TX'ed packet from offloaded program.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 466ea69f00ee..8d6cdd3e5139 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2221,6 +2221,13 @@ static u32 tun_do_xdp_offload_generic(struct tun_struct *tun,
 	return act;
 }
 
+static int tun_xdp_one(struct tun_struct *tun,
+		       struct tun_file *tfile,
+		       struct xdp_buff *xdp, int *flush,
+		       struct tun_page *tpage, int has_hdr);
+
+static void tun_put_page(struct tun_page *tpage);
+
 static u32 tun_do_xdp_offload(struct tun_struct *tun, struct tun_file *tfile,
 			      struct xdp_frame *frame)
 {
@@ -2527,23 +2534,36 @@ static void tun_put_page(struct tun_page *tpage)
 static int tun_xdp_one(struct tun_struct *tun,
 		       struct tun_file *tfile,
 		       struct xdp_buff *xdp, int *flush,
-		       struct tun_page *tpage)
+		       struct tun_page *tpage, int has_hdr)
 {
 	unsigned int datasize = xdp->data_end - xdp->data;
-	struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-	struct virtio_net_hdr *gso = &hdr->gso;
+	struct tun_xdp_hdr *hdr;
+	struct virtio_net_hdr *gso;
 	struct tun_pcpu_stats *stats;
 	struct bpf_prog *xdp_prog;
 	struct sk_buff *skb = NULL;
+	unsigned int headroom;
 	u32 rxhash = 0, act;
-	int buflen = hdr->buflen;
+	int buflen;
 	int err = 0;
 	bool skb_xdp = false;
 	struct page *page;
 
+	if (has_hdr) {
+		hdr = xdp->data_hard_start;
+		gso = &hdr->gso;
+		buflen = hdr->buflen;
+	} else {
+		/* came here from tun tx path */
+		xdp->data_hard_start -= sizeof(struct xdp_frame);
+		headroom = xdp->data - xdp->data_hard_start;
+		buflen = datasize + headroom +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	}
+
 	xdp_prog = rcu_dereference(tun->xdp_prog);
 	if (xdp_prog) {
-		if (gso->gso_type) {
+		if (has_hdr && gso->gso_type) {
 			skb_xdp = true;
 			goto build;
 		}
@@ -2588,7 +2608,8 @@ static int tun_xdp_one(struct tun_struct *tun,
 	skb_reserve(skb, xdp->data - xdp->data_hard_start);
 	skb_put(skb, xdp->data_end - xdp->data);
 
-	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+	if (has_hdr &&
+	    virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
 		this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
 		kfree_skb(skb);
 		err = -EINVAL;
@@ -2652,7 +2673,7 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 
 		for (i = 0; i < n; i++) {
 			xdp = &((struct xdp_buff *)ctl->ptr)[i];
-			tun_xdp_one(tun, tfile, xdp, &flush, &tpage);
+			tun_xdp_one(tun, tfile, xdp, &flush, &tpage, true);
 		}
 
 		if (flush)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 10/18] tun: handle XDP_TX action of offloaded program
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (8 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 09/18] tun: add a way to inject Tx path packet into Rx path Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 11/18] tun: run xdp prog when tun is read from file interface Prashant Bhole
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

When offloaded program returns XDP_TX, we need to inject the packet in
Rx path of tun. This patch injects such packets in Rx path using
tun_xdp_one.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8d6cdd3e5139..084ca95358fe 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2249,7 +2249,13 @@ static u32 tun_do_xdp_offload(struct tun_struct *tun, struct tun_file *tfile,
 		case XDP_PASS:
 			break;
 		case XDP_TX:
-			/* fall through */
+			tpage.page = NULL;
+			tpage.count = 0;
+			tun_xdp_one(tun, tfile, &xdp, &flush, &tpage, false);
+			tun_put_page(&tpage);
+			if (flush)
+				xdp_do_flush_map();
+			break;
 		case XDP_REDIRECT:
 			/* fall through */
 		default:
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 11/18] tun: run xdp prog when tun is read from file interface
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (9 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 10/18] tun: handle XDP_TX action of offloaded program Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 12/18] virtio-net: store xdp_prog in device Prashant Bhole
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

It handles the case when qemu performs read on tun using file
operations.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/tun.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 084ca95358fe..639921c10e32 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2318,8 +2318,10 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			   struct iov_iter *to,
 			   int noblock, void *ptr)
 {
+	struct xdp_frame *frame;
 	ssize_t ret;
 	int err;
+	u32 act;
 
 	tun_debug(KERN_INFO, tun, "tun_do_read\n");
 
@@ -2333,6 +2335,15 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		ptr = tun_ring_recv(tfile, noblock, &err);
 		if (!ptr)
 			return err;
+
+		if (tun_is_xdp_frame(ptr)) {
+			frame = tun_ptr_to_xdp(ptr);
+			act = tun_do_xdp_offload(tun, tfile, frame);
+		} else {
+			act = tun_do_xdp_offload_generic(tun, ptr);
+		}
+		if (act != XDP_PASS)
+			return err;
 	}
 
 	if (tun_is_xdp_frame(ptr)) {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 12/18] virtio-net: store xdp_prog in device
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (10 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 11/18] tun: run xdp prog when tun is read from file interface Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 13/18] virtio_net: use XDP attachment helpers Prashant Bhole
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

This is a preparation for adding XDP offload support in virtio_net
driver. By storing XDP program in virtionet_info will make it
consistent with the offloaded program which will introduce in next
patches.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/virtio_net.c | 62 ++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 37 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4d7d5434cc5d..c8bbb1b90c1c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -137,8 +137,6 @@ struct receive_queue {
 
 	struct napi_struct napi;
 
-	struct bpf_prog __rcu *xdp_prog;
-
 	struct virtnet_rq_stats stats;
 
 	/* Chain pages by the private ptr. */
@@ -229,6 +227,8 @@ struct virtnet_info {
 
 	/* failover when STANDBY feature enabled */
 	struct failover *failover;
+
+	struct bpf_prog __rcu *xdp_prog;
 };
 
 struct padded_vnet_hdr {
@@ -486,7 +486,6 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 			    int n, struct xdp_frame **frames, u32 flags)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	struct receive_queue *rq = vi->rq;
 	struct bpf_prog *xdp_prog;
 	struct send_queue *sq;
 	unsigned int len;
@@ -501,7 +500,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
 	/* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this
 	 * indicate XDP resources have been successfully allocated.
 	 */
-	xdp_prog = rcu_dereference(rq->xdp_prog);
+	xdp_prog = rcu_dereference(vi->xdp_prog);
 	if (!xdp_prog)
 		return -ENXIO;
 
@@ -649,7 +648,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	stats->bytes += len;
 
 	rcu_read_lock();
-	xdp_prog = rcu_dereference(rq->xdp_prog);
+	xdp_prog = rcu_dereference(vi->xdp_prog);
 	if (xdp_prog) {
 		struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
 		struct xdp_frame *xdpf;
@@ -798,7 +797,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	stats->bytes += len - vi->hdr_len;
 
 	rcu_read_lock();
-	xdp_prog = rcu_dereference(rq->xdp_prog);
+	xdp_prog = rcu_dereference(vi->xdp_prog);
 	if (xdp_prog) {
 		struct xdp_frame *xdpf;
 		struct page *xdp_page;
@@ -2060,7 +2059,7 @@ static int virtnet_set_channels(struct net_device *dev,
 	 * also when XDP is loaded all RX queues have XDP programs so we only
 	 * need to check a single RX queue.
 	 */
-	if (vi->rq[0].xdp_prog)
+	if (vi->xdp_prog)
 		return -EINVAL;
 
 	get_online_cpus();
@@ -2441,13 +2440,10 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		return -ENOMEM;
 	}
 
-	old_prog = rtnl_dereference(vi->rq[0].xdp_prog);
+	old_prog = rtnl_dereference(vi->xdp_prog);
 	if (!prog && !old_prog)
 		return 0;
 
-	if (prog)
-		bpf_prog_add(prog, vi->max_queue_pairs - 1);
-
 	/* Make sure NAPI is not using any XDP TX queues for RX. */
 	if (netif_running(dev)) {
 		for (i = 0; i < vi->max_queue_pairs; i++) {
@@ -2457,11 +2453,8 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	}
 
 	if (!prog) {
-		for (i = 0; i < vi->max_queue_pairs; i++) {
-			rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
-			if (i == 0)
-				virtnet_restore_guest_offloads(vi);
-		}
+		rcu_assign_pointer(vi->xdp_prog, prog);
+		virtnet_restore_guest_offloads(vi);
 		synchronize_net();
 	}
 
@@ -2472,16 +2465,12 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	vi->xdp_queue_pairs = xdp_qp;
 
 	if (prog) {
-		for (i = 0; i < vi->max_queue_pairs; i++) {
-			rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
-			if (i == 0 && !old_prog)
-				virtnet_clear_guest_offloads(vi);
-		}
+		rcu_assign_pointer(vi->xdp_prog, prog);
+		if (!old_prog)
+			virtnet_clear_guest_offloads(vi);
 	}
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
-		if (old_prog)
-			bpf_prog_put(old_prog);
 		if (netif_running(dev)) {
 			virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 			virtnet_napi_tx_enable(vi, vi->sq[i].vq,
@@ -2489,13 +2478,15 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		}
 	}
 
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
 	return 0;
 
 err:
 	if (!prog) {
 		virtnet_clear_guest_offloads(vi);
-		for (i = 0; i < vi->max_queue_pairs; i++)
-			rcu_assign_pointer(vi->rq[i].xdp_prog, old_prog);
+		rcu_assign_pointer(vi->xdp_prog, old_prog);
 	}
 
 	if (netif_running(dev)) {
@@ -2514,13 +2505,11 @@ static u32 virtnet_xdp_query(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	const struct bpf_prog *xdp_prog;
-	int i;
 
-	for (i = 0; i < vi->max_queue_pairs; i++) {
-		xdp_prog = rtnl_dereference(vi->rq[i].xdp_prog);
-		if (xdp_prog)
-			return xdp_prog->aux->id;
-	}
+	xdp_prog = rtnl_dereference(vi->xdp_prog);
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
 	return 0;
 }
 
@@ -2657,18 +2646,17 @@ static void virtnet_free_queues(struct virtnet_info *vi)
 
 static void _free_receive_bufs(struct virtnet_info *vi)
 {
-	struct bpf_prog *old_prog;
+	struct bpf_prog *old_prog = rtnl_dereference(vi->xdp_prog);
 	int i;
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
 		while (vi->rq[i].pages)
 			__free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
-
-		old_prog = rtnl_dereference(vi->rq[i].xdp_prog);
-		RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
-		if (old_prog)
-			bpf_prog_put(old_prog);
 	}
+
+	RCU_INIT_POINTER(vi->xdp_prog, NULL);
+	if (old_prog)
+		bpf_prog_put(old_prog);
 }
 
 static void free_receive_bufs(struct virtnet_info *vi)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 13/18] virtio_net: use XDP attachment helpers
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (11 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 12/18] virtio-net: store xdp_prog in device Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 14/18] virtio_net: add XDP prog offload infrastructure Prashant Bhole
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

Next patches will introduce virtio_net XDP offloading. In that case
we need to manage offloaded and non-offload program. In order to make
it consistent this patch introduces use of XDP attachment helpers.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/virtio_net.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c8bbb1b90c1c..cee5c2b15c62 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -229,6 +229,8 @@ struct virtnet_info {
 	struct failover *failover;
 
 	struct bpf_prog __rcu *xdp_prog;
+
+	struct xdp_attachment_info xdp;
 };
 
 struct padded_vnet_hdr {
@@ -2398,15 +2400,19 @@ static int virtnet_restore_guest_offloads(struct virtnet_info *vi)
 	return virtnet_set_guest_offloads(vi, offloads);
 }
 
-static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
-			   struct netlink_ext_ack *extack)
+static int virtnet_xdp_set(struct net_device *dev, struct netdev_bpf *bpf)
 {
 	unsigned long int max_sz = PAGE_SIZE - sizeof(struct padded_vnet_hdr);
+	struct netlink_ext_ack *extack = bpf->extack;
 	struct virtnet_info *vi = netdev_priv(dev);
+	struct bpf_prog *prog = bpf->prog;
 	struct bpf_prog *old_prog;
 	u16 xdp_qp = 0, curr_qp;
 	int i, err;
 
+	if (!xdp_attachment_flags_ok(&vi->xdp, bpf))
+		return -EBUSY;
+
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
 	    && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
 	        virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
@@ -2478,8 +2484,7 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 		}
 	}
 
-	if (old_prog)
-		bpf_prog_put(old_prog);
+	xdp_attachment_setup(&vi->xdp, bpf);
 
 	return 0;
 
@@ -2501,26 +2506,13 @@ static int virtnet_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	return err;
 }
 
-static u32 virtnet_xdp_query(struct net_device *dev)
-{
-	struct virtnet_info *vi = netdev_priv(dev);
-	const struct bpf_prog *xdp_prog;
-
-	xdp_prog = rtnl_dereference(vi->xdp_prog);
-	if (xdp_prog)
-		return xdp_prog->aux->id;
-
-	return 0;
-}
-
 static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
-		return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
+		return virtnet_xdp_set(dev, xdp);
 	case XDP_QUERY_PROG:
-		xdp->prog_id = virtnet_xdp_query(dev);
-		return 0;
+		return xdp_attachment_query(&vi->xdp, xdp);
 	default:
 		return -EINVAL;
 	}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 14/18] virtio_net: add XDP prog offload infrastructure
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (12 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 13/18] virtio_net: use XDP attachment helpers Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality Prashant Bhole
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

This patch prepares virtio_net of XDP offloading. It adds data
structures, blank callback implementations for bpf_prog_offload_ops.
It also implements ndo_init, ndo_uninit operations for setting up
offload related data structures.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/virtio_net.c | 103 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index cee5c2b15c62..a1088d0114f2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -229,8 +229,14 @@ struct virtnet_info {
 	struct failover *failover;
 
 	struct bpf_prog __rcu *xdp_prog;
+	struct bpf_prog __rcu *offload_xdp_prog;
 
 	struct xdp_attachment_info xdp;
+	struct xdp_attachment_info xdp_hw;
+
+	struct bpf_offload_dev *bpf_dev;
+
+	struct list_head bpf_bound_progs;
 };
 
 struct padded_vnet_hdr {
@@ -258,6 +264,14 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
 	return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
 }
 
+struct virtnet_bpf_bound_prog {
+	struct virtnet_info *vi;
+	struct bpf_prog *prog;
+	struct list_head list;
+	u32 len;
+	struct bpf_insn insnsi[0];
+};
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -2506,13 +2520,63 @@ static int virtnet_xdp_set(struct net_device *dev, struct netdev_bpf *bpf)
 	return err;
 }
 
+static int virtnet_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx,
+				   int prev_insn)
+{
+	return 0;
+}
+
+static void virtnet_bpf_destroy_prog(struct bpf_prog *prog)
+{
+}
+
+static int virtnet_xdp_set_offload(struct virtnet_info *vi,
+				   struct netdev_bpf *bpf)
+{
+	return -EBUSY;
+}
+
+static int virtnet_bpf_verifier_setup(struct bpf_prog *prog)
+{
+	return -ENOMEM;
+}
+
+static int virtnet_bpf_verifier_prep(struct bpf_prog *prog)
+{
+	return 0;
+}
+
+static int virtnet_bpf_translate(struct bpf_prog *prog)
+{
+	return 0;
+}
+
+static int virtnet_bpf_finalize(struct bpf_verifier_env *env)
+{
+	return 0;
+}
+
+static const struct bpf_prog_offload_ops virtnet_bpf_dev_ops = {
+	.setup		= virtnet_bpf_verifier_setup,
+	.prepare	= virtnet_bpf_verifier_prep,
+	.insn_hook	= virtnet_bpf_verify_insn,
+	.finalize	= virtnet_bpf_finalize,
+	.translate	= virtnet_bpf_translate,
+	.destroy	= virtnet_bpf_destroy_prog,
+};
+
 static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
+	struct virtnet_info *vi = netdev_priv(dev);
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
 		return virtnet_xdp_set(dev, xdp);
 	case XDP_QUERY_PROG:
 		return xdp_attachment_query(&vi->xdp, xdp);
+	case XDP_SETUP_PROG_HW:
+		return virtnet_xdp_set_offload(vi, xdp);
+	case XDP_QUERY_PROG_HW:
+		return xdp_attachment_query(&vi->xdp_hw, xdp);
 	default:
 		return -EINVAL;
 	}
@@ -2559,7 +2623,46 @@ static int virtnet_set_features(struct net_device *dev,
 	return 0;
 }
 
+static int virtnet_bpf_init(struct virtnet_info *vi)
+{
+	int err;
+
+	vi->bpf_dev = bpf_offload_dev_create(&virtnet_bpf_dev_ops, NULL);
+	err = PTR_ERR_OR_ZERO(vi->bpf_dev);
+	if (err)
+		return err;
+
+	err = bpf_offload_dev_netdev_register(vi->bpf_dev, vi->dev);
+	if (err)
+		goto err_netdev_register;
+
+	INIT_LIST_HEAD(&vi->bpf_bound_progs);
+
+	return 0;
+
+err_netdev_register:
+	bpf_offload_dev_destroy(vi->bpf_dev);
+	return err;
+}
+
+static int virtnet_init(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	return virtnet_bpf_init(vi);
+}
+
+static void virtnet_uninit(struct net_device *dev)
+{
+	struct virtnet_info *vi = netdev_priv(dev);
+
+	bpf_offload_dev_netdev_unregister(vi->bpf_dev, vi->dev);
+	bpf_offload_dev_destroy(vi->bpf_dev);
+}
+
 static const struct net_device_ops virtnet_netdev = {
+	.ndo_init            = virtnet_init,
+	.ndo_uninit          = virtnet_uninit,
 	.ndo_open            = virtnet_open,
 	.ndo_stop   	     = virtnet_close,
 	.ndo_start_xmit      = start_xmit,
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (13 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 14/18] virtio_net: add XDP prog offload infrastructure Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-27 20:42   ` Michael S. Tsirkin
  2019-11-26 10:07 ` [RFC net-next 16/18] bpf: export function __bpf_map_get Prashant Bhole
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

This patch implements bpf_prog_offload_ops callbacks and adds handling
for XDP_SETUP_PROG_HW. Handling of XDP_SETUP_PROG_HW involves setting
up struct virtio_net_ctrl_ebpf_prog and appending program instructions
to it. This control buffer is sent to Qemu with class
VIRTIO_NET_CTRL_EBPF and command VIRTIO_NET_BPF_CMD_SET_OFFLOAD.
The expected behavior from Qemu is that it should try to load the
program in host os and report the status.

It also adds restriction to have either driver or offloaded program
at a time. This restriction can be removed later after proper testing.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/virtio_net.c        | 114 +++++++++++++++++++++++++++++++-
 include/uapi/linux/virtio_net.h |  27 ++++++++
 2 files changed, 139 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index a1088d0114f2..dddfbb2a2075 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -169,6 +169,7 @@ struct control_buf {
 	u8 allmulti;
 	__virtio16 vid;
 	__virtio64 offloads;
+	struct virtio_net_ctrl_ebpf_prog prog_ctrl;
 };
 
 struct virtnet_info {
@@ -272,6 +273,8 @@ struct virtnet_bpf_bound_prog {
 	struct bpf_insn insnsi[0];
 };
 
+#define VIRTNET_EA(extack, msg)	NL_SET_ERR_MSG_MOD((extack), msg)
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -2427,6 +2430,11 @@ static int virtnet_xdp_set(struct net_device *dev, struct netdev_bpf *bpf)
 	if (!xdp_attachment_flags_ok(&vi->xdp, bpf))
 		return -EBUSY;
 
+	if (rtnl_dereference(vi->offload_xdp_prog)) {
+		VIRTNET_EA(bpf->extack, "program already attached in offload mode");
+		return -EINVAL;
+	}
+
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
 	    && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
 	        virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
@@ -2528,17 +2536,114 @@ static int virtnet_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx,
 
 static void virtnet_bpf_destroy_prog(struct bpf_prog *prog)
 {
+	struct virtnet_bpf_bound_prog *state;
+
+	state = prog->aux->offload->dev_priv;
+	list_del(&state->list);
+	kfree(state);
+}
+
+static int virtnet_xdp_offload_check(struct virtnet_info *vi,
+				     struct netdev_bpf *bpf)
+{
+	if (!bpf->prog)
+		return 0;
+
+	if (!bpf->prog->aux->offload) {
+		VIRTNET_EA(bpf->extack, "xdpoffload of non-bound program");
+		return -EINVAL;
+	}
+	if (bpf->prog->aux->offload->netdev != vi->dev) {
+		VIRTNET_EA(bpf->extack, "program bound to different dev");
+		return -EINVAL;
+	}
+
+	if (rtnl_dereference(vi->xdp_prog)) {
+		VIRTNET_EA(bpf->extack, "program already attached in driver mode");
+		return -EINVAL;
+	}
+
+	return 0;
 }
 
 static int virtnet_xdp_set_offload(struct virtnet_info *vi,
 				   struct netdev_bpf *bpf)
 {
-	return -EBUSY;
+	struct virtio_net_ctrl_ebpf_prog *ctrl;
+	struct virtnet_bpf_bound_prog *bound_prog = NULL;
+	struct virtio_device *vdev = vi->vdev;
+	struct bpf_prog *prog = bpf->prog;
+	void *ctrl_buf = NULL;
+	struct scatterlist sg;
+	int prog_len;
+	int err = 0;
+
+	if (!xdp_attachment_flags_ok(&vi->xdp_hw, bpf))
+		return -EBUSY;
+
+	if (prog) {
+		if (prog->type != BPF_PROG_TYPE_XDP)
+			return -EOPNOTSUPP;
+		bound_prog = prog->aux->offload->dev_priv;
+		prog_len = prog->len * sizeof(bound_prog->insnsi[0]);
+
+		ctrl_buf = kmalloc(GFP_KERNEL, sizeof(*ctrl) + prog_len);
+		if (!ctrl_buf)
+			return -ENOMEM;
+		ctrl = ctrl_buf;
+		ctrl->cmd = cpu_to_virtio32(vi->vdev,
+					    VIRTIO_NET_BPF_CMD_SET_OFFLOAD);
+		ctrl->len = cpu_to_virtio32(vi->vdev, prog_len);
+		ctrl->gpl_compatible = cpu_to_virtio16(vi->vdev,
+						       prog->gpl_compatible);
+		memcpy(ctrl->insns, bound_prog->insnsi,
+		       prog->len * sizeof(bound_prog->insnsi[0]));
+		sg_init_one(&sg, ctrl_buf, sizeof(*ctrl) + prog_len);
+	} else {
+		ctrl = &vi->ctrl->prog_ctrl;
+		ctrl->cmd  = cpu_to_virtio32(vi->vdev,
+					     VIRTIO_NET_BPF_CMD_UNSET_OFFLOAD);
+		sg_init_one(&sg, ctrl, sizeof(*ctrl));
+	}
+
+	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_EBPF,
+				  VIRTIO_NET_CTRL_EBPF_PROG,
+				  &sg)) {
+		dev_warn(&vdev->dev, "Failed to set bpf offload prog\n");
+		err = -EFAULT;
+		goto out;
+	}
+
+	rcu_assign_pointer(vi->offload_xdp_prog, prog);
+
+	xdp_attachment_setup(&vi->xdp_hw, bpf);
+
+out:
+	kfree(ctrl_buf);
+	return err;
 }
 
 static int virtnet_bpf_verifier_setup(struct bpf_prog *prog)
 {
-	return -ENOMEM;
+	struct virtnet_info *vi = netdev_priv(prog->aux->offload->netdev);
+	size_t insn_len = prog->len * sizeof(struct bpf_insn);
+	struct virtnet_bpf_bound_prog *state;
+
+	state = kzalloc(sizeof(*state) + insn_len, GFP_KERNEL);
+	if (!state)
+		return -ENOMEM;
+
+	memcpy(&state->insnsi[0], prog->insnsi, insn_len);
+
+	state->vi = vi;
+	state->prog = prog;
+	state->len = prog->len;
+
+	list_add_tail(&state->list, &vi->bpf_bound_progs);
+
+	prog->aux->offload->dev_priv = state;
+
+	return 0;
 }
 
 static int virtnet_bpf_verifier_prep(struct bpf_prog *prog)
@@ -2568,12 +2673,17 @@ static const struct bpf_prog_offload_ops virtnet_bpf_dev_ops = {
 static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
+	int err;
+
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
 		return virtnet_xdp_set(dev, xdp);
 	case XDP_QUERY_PROG:
 		return xdp_attachment_query(&vi->xdp, xdp);
 	case XDP_SETUP_PROG_HW:
+		err = virtnet_xdp_offload_check(vi, xdp);
+		if (err)
+			return err;
 		return virtnet_xdp_set_offload(vi, xdp);
 	case XDP_QUERY_PROG_HW:
 		return xdp_attachment_query(&vi->xdp_hw, xdp);
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index a3715a3224c1..0ea2f7910a5a 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -261,4 +261,31 @@ struct virtio_net_ctrl_mq {
 #define VIRTIO_NET_CTRL_GUEST_OFFLOADS   5
 #define VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET        0
 
+/*
+ * Control XDP offloads offloads
+ *
+ * When guest wants to offload XDP program to the device, it calls
+ * VIRTIO_NET_CTRL_EBPF_PROG along with VIRTIO_NET_BPF_CMD_SET_OFFLOAD
+ * subcommands. When offloading is successful, the device runs offloaded
+ * XDP program for each packet before sending it to the guest.
+ *
+ * VIRTIO_NET_BPF_CMD_UNSET_OFFLOAD removes the the offloaded program from
+ * the device, if exists.
+ */
+
+struct virtio_net_ctrl_ebpf_prog {
+	/* program length in bytes */
+	__virtio32 len;
+	__virtio16 cmd;
+	__virtio16 gpl_compatible;
+	__u8 insns[0];
+};
+
+#define VIRTIO_NET_CTRL_EBPF 6
+ #define VIRTIO_NET_CTRL_EBPF_PROG 1
+
+/* Commands for VIRTIO_NET_CTRL_EBPF_PROG */
+#define VIRTIO_NET_BPF_CMD_SET_OFFLOAD 1
+#define VIRTIO_NET_BPF_CMD_UNSET_OFFLOAD 2
+
 #endif /* _UAPI_LINUX_VIRTIO_NET_H */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 16/18] bpf: export function __bpf_map_get
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (14 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 17/18] virtio_net: implment XDP map offload functionality Prashant Bhole
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

__bpf_map_get is necessary to get verify whether an fd corresponds
to a bpf map, without adding a refcount on that map. After exporting
it can be used by a kernel module.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 kernel/bpf/syscall.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e3461ec59570..e524ab1e7c64 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -737,6 +737,7 @@ struct bpf_map *__bpf_map_get(struct fd f)
 
 	return f.file->private_data;
 }
+EXPORT_SYMBOL(__bpf_map_get);
 
 void bpf_map_inc(struct bpf_map *map)
 {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 17/18] virtio_net: implment XDP map offload functionality
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (15 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 16/18] bpf: export function __bpf_map_get Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 10:07 ` [RFC net-next 18/18] virtio_net: restrict bpf helper calls from offloaded program Prashant Bhole
  2019-11-26 20:35 ` [RFC net-next 00/18] virtio_net XDP offload Jakub Kicinski
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm, Prashant Bhole

From: Jason Wang <jasowang@redhat.com>

This patch implements:
* Handling of BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE:
  Allocate driver specific map data structure. Set up struct
  virtio_net_ctrl_ebpf_map and send the control buffer to Qemu with
  class VIRTIO_NET_CTRL_EBPF, cmd VIRTIO_NET_CTRL_EBPF_MAP. The cmd
  in the control buffer is set to VIRTIO_NET_BPF_CMD_CREATE_MAP. The
  expected behavior from Qemu is that it should perform the action
  as per command and return the status (and map data). In case of
  create map command, Qemu should set the map_fd in the control buffer

* bpf_map_dev_ops operations:
  Common map operations are implemented with use of above mentioned
  struct virtio_net_ctrl_ebpf_map. This control buffer has space for
  storing: key + key or key + value.

* map_fd replacement in a copy of the program:
  Since map are created before the verification of program begins,
  we have map fds from the host side for each offloaded map when
  program verification begins. map fds in the copy of the program are
  replaced with map fds from host side. This copy of program is used
  for offloading.

Signed-off-by: Jason Wang <jasowang@redhat.com>
Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/virtio_net.c        | 241 ++++++++++++++++++++++++++++++++
 include/uapi/linux/virtio_net.h |  23 +++
 2 files changed, 264 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dddfbb2a2075..91a94b787c64 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -238,6 +238,7 @@ struct virtnet_info {
 	struct bpf_offload_dev *bpf_dev;
 
 	struct list_head bpf_bound_progs;
+	struct list_head map_list;
 };
 
 struct padded_vnet_hdr {
@@ -275,6 +276,13 @@ struct virtnet_bpf_bound_prog {
 
 #define VIRTNET_EA(extack, msg)	NL_SET_ERR_MSG_MOD((extack), msg)
 
+struct virtnet_bpf_map {
+	struct bpf_offloaded_map *offmap;
+	struct virtnet_info *vi;
+	struct virtio_net_ctrl_ebpf_map *ctrl;
+	struct list_head list;
+};
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -2528,6 +2536,19 @@ static int virtnet_xdp_set(struct net_device *dev, struct netdev_bpf *bpf)
 	return err;
 }
 
+static struct virtnet_bpf_map *virtnet_get_bpf_map(struct virtnet_info *vi,
+						   struct bpf_map *map)
+{
+	struct virtnet_bpf_map *virtnet_map;
+
+	list_for_each_entry(virtnet_map, &vi->map_list, list) {
+		if (&virtnet_map->offmap->map == map)
+			return virtnet_map;
+	}
+
+	return NULL;
+}
+
 static int virtnet_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx,
 				   int prev_insn)
 {
@@ -2623,11 +2644,194 @@ static int virtnet_xdp_set_offload(struct virtnet_info *vi,
 	return err;
 }
 
+static int virtnet_bpf_ctrl_map(struct bpf_offloaded_map *offmap,
+				int cmd, u8 *key, u8 *value, u64 flags,
+				u8 *out_key, u8 *out_value)
+{
+	struct virtio_net_ctrl_ebpf_map *ctrl;
+	struct virtnet_bpf_map *virtnet_map;
+	struct bpf_map *map = &offmap->map;
+	unsigned char *keyptr, *valptr;
+	struct virtnet_info *vi;
+	struct scatterlist sg;
+
+	virtnet_map = offmap->dev_priv;
+	vi = virtnet_map->vi;
+	ctrl = virtnet_map->ctrl;
+
+	keyptr = ctrl->buf;
+	valptr = ctrl->buf + ctrl->key_size;
+
+	if (key)
+		memcpy(keyptr, key, map->key_size);
+	if (value)
+		memcpy(valptr, value, map->value_size);
+
+	ctrl->cmd = cpu_to_virtio32(vi->vdev, cmd);
+	ctrl->flags = cpu_to_virtio64(vi->vdev, flags);
+
+	sg_init_one(&sg, ctrl, sizeof(*ctrl) + ctrl->buf_len);
+	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_EBPF,
+				  VIRTIO_NET_CTRL_EBPF_MAP,
+				  &sg))
+		return -EFAULT;
+
+	if (out_key)
+		memcpy(out_key, valptr, map->key_size);
+	if (out_value)
+		memcpy(out_value, valptr, map->value_size);
+	return 0;
+}
+
+static int virtnet_bpf_map_update_entry(struct bpf_offloaded_map *offmap,
+					void *key, void *value, u64 flags)
+{
+	return virtnet_bpf_ctrl_map(offmap,
+				    VIRTIO_NET_BPF_CMD_UPDATE_ELEM,
+				    key, value, flags, NULL, NULL);
+}
+
+static int virtnet_bpf_map_delete_elem(struct bpf_offloaded_map *offmap,
+				       void *key)
+{
+	return virtnet_bpf_ctrl_map(offmap,
+				    VIRTIO_NET_BPF_CMD_DELETE_ELEM,
+				    key, NULL, 0, NULL, NULL);
+}
+
+static int virtnet_bpf_map_lookup_entry(struct bpf_offloaded_map *offmap,
+					void *key, void *value)
+{
+	return virtnet_bpf_ctrl_map(offmap,
+				    VIRTIO_NET_BPF_CMD_LOOKUP_ELEM,
+				    key, NULL, 0, NULL, value);
+}
+
+static int virtnet_bpf_map_get_first_key(struct bpf_offloaded_map *offmap,
+					 void *next_key)
+{
+	return virtnet_bpf_ctrl_map(offmap,
+				    VIRTIO_NET_BPF_CMD_GET_FIRST,
+				    NULL, NULL, 0, next_key, NULL);
+}
+
+static int virtnet_bpf_map_get_next_key(struct bpf_offloaded_map *offmap,
+					void *key, void *next_key)
+{
+	if (!key)
+		return virtnet_bpf_map_get_first_key(offmap, next_key);
+
+	return virtnet_bpf_ctrl_map(offmap,
+				    VIRTIO_NET_BPF_CMD_GET_NEXT,
+				    key, NULL, 0, next_key, NULL);
+}
+
+static const struct bpf_map_dev_ops virtnet_bpf_map_ops = {
+	.map_get_next_key	= virtnet_bpf_map_get_next_key,
+	.map_lookup_elem	= virtnet_bpf_map_lookup_entry,
+	.map_update_elem	= virtnet_bpf_map_update_entry,
+	.map_delete_elem	= virtnet_bpf_map_delete_elem,
+};
+
+static int virtnet_bpf_map_alloc(struct virtnet_info *vi,
+				 struct bpf_offloaded_map *offmap)
+{
+	struct virtnet_bpf_map *virtnet_map = NULL;
+	struct virtio_net_ctrl_ebpf_map *ctrl = NULL;
+	struct bpf_map *map = &offmap->map;
+	struct scatterlist sg;
+	int buf_len;
+
+	if (map->map_type != BPF_MAP_TYPE_ARRAY &&
+	    map->map_type != BPF_MAP_TYPE_HASH)
+		goto err;
+
+	virtnet_map = kzalloc(sizeof(*virtnet_map), GFP_KERNEL);
+	if (!virtnet_map)
+		goto err;
+
+	/* allocate buffer size to fit
+	 * - sizeof (struct virio_net_ctrl_map_buf)
+	 * - key_size
+	 * - max(key_size + value_size, key_size + key_size)
+	 */
+	buf_len = map->key_size;
+	buf_len += (map->key_size > map->value_size) ?
+		  map->key_size : map->value_size;
+	ctrl = kzalloc(sizeof(*ctrl) + buf_len, GFP_KERNEL);
+	if (!ctrl)
+		goto err;
+
+	ctrl->buf_len = cpu_to_virtio32(vi->vdev, buf_len);
+	ctrl->key_size = cpu_to_virtio32(vi->vdev, map->key_size);
+	ctrl->value_size = cpu_to_virtio32(vi->vdev, map->value_size);
+	ctrl->max_entries = cpu_to_virtio32(vi->vdev, map->max_entries);
+	ctrl->map_type = cpu_to_virtio32(vi->vdev, map->map_type);
+	ctrl->map_flags = 0;
+	ctrl->cmd = cpu_to_virtio32(vi->vdev, VIRTIO_NET_BPF_CMD_CREATE_MAP);
+
+	sg_init_one(&sg, ctrl, sizeof(*ctrl) + ctrl->buf_len);
+
+	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_EBPF,
+				  VIRTIO_NET_CTRL_EBPF_MAP,
+				  &sg)) {
+		dev_warn(&vi->vdev->dev, "Failed to create ebpf map\n");
+		goto err;
+	}
+
+	offmap->dev_ops = &virtnet_bpf_map_ops;
+	offmap->dev_priv = virtnet_map;
+
+	virtnet_map->offmap = offmap;
+	virtnet_map->vi = vi;
+	virtnet_map->ctrl = ctrl;
+
+	list_add_tail(&virtnet_map->list, &vi->map_list);
+
+	return 0;
+err:
+	kfree(virtnet_map);
+	kfree(ctrl);
+	return -EFAULT;
+}
+
+static int virtnet_bpf_map_free(struct virtnet_info *vi,
+				struct bpf_offloaded_map *offmap)
+{
+	struct bpf_map *map = &offmap->map;
+	struct virtnet_bpf_map *virtnet_map = virtnet_get_bpf_map(vi, map);
+	struct virtio_net_ctrl_ebpf_map *ctrl = virtnet_map->ctrl;
+	struct scatterlist sg;
+
+	if (!virtnet_map)
+		return -EINVAL;
+
+	ctrl->cmd = cpu_to_virtio32(vi->vdev, VIRTIO_NET_BPF_CMD_FREE_MAP);
+
+	sg_init_one(&sg, ctrl, sizeof(*ctrl) + ctrl->buf_len);
+
+	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_EBPF,
+				  VIRTIO_NET_CTRL_EBPF_MAP,
+				  &sg)) {
+		dev_warn(&vi->vdev->dev, "Failed to free ebpf map\n");
+		return -EFAULT;
+	}
+
+	list_del(&virtnet_map->list);
+	kfree(virtnet_map->ctrl);
+	kfree(virtnet_map);
+	return 0;
+}
+
 static int virtnet_bpf_verifier_setup(struct bpf_prog *prog)
 {
 	struct virtnet_info *vi = netdev_priv(prog->aux->offload->netdev);
 	size_t insn_len = prog->len * sizeof(struct bpf_insn);
 	struct virtnet_bpf_bound_prog *state;
+	struct virtnet_bpf_map *virtnet_map;
+	struct bpf_map *map;
+	struct fd mapfd;
+	int i, err = 0;
 
 	state = kzalloc(sizeof(*state) + insn_len, GFP_KERNEL);
 	if (!state)
@@ -2639,11 +2843,43 @@ static int virtnet_bpf_verifier_setup(struct bpf_prog *prog)
 	state->prog = prog;
 	state->len = prog->len;
 
+	for (i = 0; i < state->len; i++) {
+		struct bpf_insn *insn = &state->insnsi[i];
+
+		if (insn->code != (BPF_LD | BPF_IMM | BPF_DW))
+			continue;
+
+		mapfd = fdget(insn->imm);
+		map = __bpf_map_get(mapfd);
+		if (IS_ERR(map)) {
+			pr_warn("fd %d is not pointing to valid bpf_map\n",
+				insn->imm);
+			err = -EINVAL;
+			goto err_replace;
+		}
+
+		virtnet_map = virtnet_get_bpf_map(vi, map);
+		if (!virtnet_map) {
+			pr_warn("could not get a offloaded map fd %d\n",
+				insn->imm);
+			err = -EINVAL;
+			goto err_replace;
+		}
+
+		/* Note: no need of virtio32_to_cpu */
+		insn->imm = virtnet_map->ctrl->map_fd;
+		fdput(mapfd);
+	}
+
 	list_add_tail(&state->list, &vi->bpf_bound_progs);
 
 	prog->aux->offload->dev_priv = state;
 
 	return 0;
+
+err_replace:
+	kfree(state);
+	return err;
 }
 
 static int virtnet_bpf_verifier_prep(struct bpf_prog *prog)
@@ -2676,6 +2912,10 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	int err;
 
 	switch (xdp->command) {
+	case BPF_OFFLOAD_MAP_ALLOC:
+		return virtnet_bpf_map_alloc(vi, xdp->offmap);
+	case BPF_OFFLOAD_MAP_FREE:
+		return virtnet_bpf_map_free(vi, xdp->offmap);
 	case XDP_SETUP_PROG:
 		return virtnet_xdp_set(dev, xdp);
 	case XDP_QUERY_PROG:
@@ -2747,6 +2987,7 @@ static int virtnet_bpf_init(struct virtnet_info *vi)
 		goto err_netdev_register;
 
 	INIT_LIST_HEAD(&vi->bpf_bound_progs);
+	INIT_LIST_HEAD(&vi->map_list);
 
 	return 0;
 
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 0ea2f7910a5a..1d330a0883ac 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -281,11 +281,34 @@ struct virtio_net_ctrl_ebpf_prog {
 	__u8 insns[0];
 };
 
+struct virtio_net_ctrl_ebpf_map {
+	__virtio32 buf_len;
+	__virtio32 cmd;
+	__virtio32 map_type;
+	__virtio32 key_size;
+	__virtio32 value_size;
+	__virtio32 max_entries;
+	__virtio32 map_flags;
+	__virtio32 map_fd;
+	__virtio64 flags;
+	__u8 buf[0];
+};
+
 #define VIRTIO_NET_CTRL_EBPF 6
  #define VIRTIO_NET_CTRL_EBPF_PROG 1
+ #define VIRTIO_NET_CTRL_EBPF_MAP 2
 
 /* Commands for VIRTIO_NET_CTRL_EBPF_PROG */
 #define VIRTIO_NET_BPF_CMD_SET_OFFLOAD 1
 #define VIRTIO_NET_BPF_CMD_UNSET_OFFLOAD 2
 
+/* Commands for VIRTIO_NET_CTRL_EBPF_MAP */
+#define VIRTIO_NET_BPF_CMD_CREATE_MAP 1
+#define VIRTIO_NET_BPF_CMD_FREE_MAP 2
+#define VIRTIO_NET_BPF_CMD_UPDATE_ELEM 3
+#define VIRTIO_NET_BPF_CMD_LOOKUP_ELEM 4
+#define VIRTIO_NET_BPF_CMD_DELETE_ELEM 5
+#define VIRTIO_NET_BPF_CMD_GET_FIRST 6
+#define VIRTIO_NET_BPF_CMD_GET_NEXT 7
+
 #endif /* _UAPI_LINUX_VIRTIO_NET_H */
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [RFC net-next 18/18] virtio_net: restrict bpf helper calls from offloaded program
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (16 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 17/18] virtio_net: implment XDP map offload functionality Prashant Bhole
@ 2019-11-26 10:07 ` Prashant Bhole
  2019-11-26 20:35 ` [RFC net-next 00/18] virtio_net XDP offload Jakub Kicinski
  18 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-26 10:07 UTC (permalink / raw)
  To: David S . Miller, Michael S . Tsirkin
  Cc: Prashant Bhole, Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

Since we are offloading this program to the host, some of the helper
calls will not make sense. For example get_numa_node_id. Some helpers
can not be used because we don't handle them yet.

So let's allow a small set of helper calls for now.

Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
---
 drivers/net/virtio_net.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 91a94b787c64..ab5be6b95bbd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2549,6 +2549,25 @@ static struct virtnet_bpf_map *virtnet_get_bpf_map(struct virtnet_info *vi,
 	return NULL;
 }
 
+static int virtnet_bpf_check_helper_call(struct bpf_insn *insn)
+{
+	switch (insn->imm) {
+	case BPF_FUNC_map_lookup_elem:
+	case BPF_FUNC_map_update_elem:
+	case BPF_FUNC_map_delete_elem:
+	case BPF_FUNC_ktime_get_ns:
+	case BPF_FUNC_get_prandom_u32:
+	case BPF_FUNC_csum_update:
+	case BPF_FUNC_xdp_adjust_head:
+	case BPF_FUNC_xdp_adjust_meta:
+	case BPF_FUNC_xdp_adjust_tail:
+	case BPF_FUNC_strtol:
+	case BPF_FUNC_strtoul:
+		return 0;
+	}
+	return -EOPNOTSUPP;
+}
+
 static int virtnet_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx,
 				   int prev_insn)
 {
@@ -2830,6 +2849,7 @@ static int virtnet_bpf_verifier_setup(struct bpf_prog *prog)
 	struct virtnet_bpf_bound_prog *state;
 	struct virtnet_bpf_map *virtnet_map;
 	struct bpf_map *map;
+	u8 opcode, class;
 	struct fd mapfd;
 	int i, err = 0;
 
@@ -2846,6 +2866,16 @@ static int virtnet_bpf_verifier_setup(struct bpf_prog *prog)
 	for (i = 0; i < state->len; i++) {
 		struct bpf_insn *insn = &state->insnsi[i];
 
+		opcode = BPF_OP(insn->code);
+		class = BPF_CLASS(insn->code);
+
+		if ((class == BPF_JMP || class == BPF_JMP32) &&
+		    opcode == BPF_CALL && insn->src_reg != BPF_PSEUDO_CALL) {
+			if (virtnet_bpf_check_helper_call(insn))
+				return -EOPNOTSUPP;
+			continue;
+		}
+
 		if (insn->code != (BPF_LD | BPF_IMM | BPF_DW))
 			continue;
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
                   ` (17 preceding siblings ...)
  2019-11-26 10:07 ` [RFC net-next 18/18] virtio_net: restrict bpf helper calls from offloaded program Prashant Bhole
@ 2019-11-26 20:35 ` Jakub Kicinski
  2019-11-27  2:59   ` Jason Wang
                     ` (2 more replies)
  18 siblings, 3 replies; 43+ messages in thread
From: Jakub Kicinski @ 2019-11-26 20:35 UTC (permalink / raw)
  To: Prashant Bhole
  Cc: David S . Miller, Michael S . Tsirkin, Jason Wang,
	Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer,
	John Fastabend, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, netdev, qemu-devel, kvm

On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> Note: This RFC has been sent to netdev as well as qemu-devel lists
> 
> This series introduces XDP offloading from virtio_net. It is based on
> the following work by Jason Wang:
> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> 
> Current XDP performance in virtio-net is far from what we can achieve
> on host. Several major factors cause the difference:
> - Cost of virtualization
> - Cost of virtio (populating virtqueue and context switching)
> - Cost of vhost, it needs more optimization
> - Cost of data copy
> Because of above reasons there is a need of offloading XDP program to
> host. This set is an attempt to implement XDP offload from the guest.

This turns the guest kernel into a uAPI proxy.

BPF uAPI calls related to the "offloaded" BPF objects are forwarded 
to the hypervisor, they pop up in QEMU which makes the requested call
to the hypervisor kernel. Today it's the Linux kernel tomorrow it may 
be someone's proprietary "SmartNIC" implementation.

Why can't those calls be forwarded at the higher layer? Why do they
have to go through the guest kernel?

If kernel performs no significant work (or "adds value", pardon the
expression), and problem can easily be solved otherwise we shouldn't 
do the work of maintaining the mechanism.

The approach of kernel generating actual machine code which is then
loaded into a sandbox on the hypervisor/SmartNIC is another story.

I'd appreciate if others could chime in.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-26 20:35 ` [RFC net-next 00/18] virtio_net XDP offload Jakub Kicinski
@ 2019-11-27  2:59   ` Jason Wang
  2019-11-27 19:49     ` Jakub Kicinski
  2019-11-27 20:32   ` Michael S. Tsirkin
  2019-11-28  3:32   ` Alexei Starovoitov
  2 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2019-11-27  2:59 UTC (permalink / raw)
  To: Jakub Kicinski, Prashant Bhole
  Cc: Song Liu, Jesper Dangaard Brouer, Daniel Borkmann,
	Michael S . Tsirkin, qemu-devel, netdev, John Fastabend,
	Alexei Starovoitov, Martin KaFai Lau, kvm, Yonghong Song,
	Andrii Nakryiko, David S . Miller

Hi Jakub:

On 2019/11/27 上午4:35, Jakub Kicinski wrote:
> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>> Note: This RFC has been sent to netdev as well as qemu-devel lists
>>
>> This series introduces XDP offloading from virtio_net. It is based on
>> the following work by Jason Wang:
>> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
>>
>> Current XDP performance in virtio-net is far from what we can achieve
>> on host. Several major factors cause the difference:
>> - Cost of virtualization
>> - Cost of virtio (populating virtqueue and context switching)
>> - Cost of vhost, it needs more optimization
>> - Cost of data copy
>> Because of above reasons there is a need of offloading XDP program to
>> host. This set is an attempt to implement XDP offload from the guest.
> This turns the guest kernel into a uAPI proxy.
>
> BPF uAPI calls related to the "offloaded" BPF objects are forwarded
> to the hypervisor, they pop up in QEMU which makes the requested call
> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
> be someone's proprietary "SmartNIC" implementation.
>
> Why can't those calls be forwarded at the higher layer? Why do they
> have to go through the guest kernel?


I think doing forwarding at higher layer have the following issues:

- Need a dedicated library (probably libbpf) but application may choose 
to do eBPF syscall directly
- Depends on guest agent to work
- Can't work for virtio-net hardware, since it still requires a hardware 
interface for carrying  offloading information
- Implement at the level of kernel may help for future extension like 
BPF object pinning and eBPF helper etc.

Basically, this series is trying to have an implementation of 
transporting eBPF through virtio, so it's not necessarily a guest to 
host but driver and device. For device, it could be either a virtual one 
(as done in qemu) or a real hardware.


>
> If kernel performs no significant work (or "adds value", pardon the
> expression), and problem can easily be solved otherwise we shouldn't
> do the work of maintaining the mechanism.


My understanding is that it should not be much difference compared to 
other offloading technology.


>
> The approach of kernel generating actual machine code which is then
> loaded into a sandbox on the hypervisor/SmartNIC is another story.


We've considered such way, but actual machine code is not as portable as 
eBPF bytecode consider we may want:

- Support migration
- Further offload the program to smart NIC (e.g through macvtap 
passthrough mode etc).

Thanks


> I'd appreciate if others could chime in.
>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-27  2:59   ` Jason Wang
@ 2019-11-27 19:49     ` Jakub Kicinski
  2019-11-28  3:41       ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2019-11-27 19:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: Prashant Bhole, Song Liu, Jesper Dangaard Brouer,
	Daniel Borkmann, Michael S . Tsirkin, qemu-devel, netdev,
	John Fastabend, Alexei Starovoitov, Martin KaFai Lau, kvm,
	Yonghong Song, Andrii Nakryiko, David S . Miller

On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
> On 2019/11/27 上午4:35, Jakub Kicinski wrote:
> > On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:  
> >> Note: This RFC has been sent to netdev as well as qemu-devel lists
> >>
> >> This series introduces XDP offloading from virtio_net. It is based on
> >> the following work by Jason Wang:
> >> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> >>
> >> Current XDP performance in virtio-net is far from what we can achieve
> >> on host. Several major factors cause the difference:
> >> - Cost of virtualization
> >> - Cost of virtio (populating virtqueue and context switching)
> >> - Cost of vhost, it needs more optimization
> >> - Cost of data copy
> >> Because of above reasons there is a need of offloading XDP program to
> >> host. This set is an attempt to implement XDP offload from the guest.  
> > This turns the guest kernel into a uAPI proxy.
> >
> > BPF uAPI calls related to the "offloaded" BPF objects are forwarded
> > to the hypervisor, they pop up in QEMU which makes the requested call
> > to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
> > be someone's proprietary "SmartNIC" implementation.
> >
> > Why can't those calls be forwarded at the higher layer? Why do they
> > have to go through the guest kernel?  
> 
> 
> I think doing forwarding at higher layer have the following issues:
> 
> - Need a dedicated library (probably libbpf) but application may choose 
>   to do eBPF syscall directly
> - Depends on guest agent to work

This can be said about any user space functionality.

> - Can't work for virtio-net hardware, since it still requires a hardware 
> interface for carrying  offloading information

The HW virtio-net presumably still has a PF and hopefully reprs for
VFs, so why can't it attach the program there?

> - Implement at the level of kernel may help for future extension like 
>   BPF object pinning and eBPF helper etc.

No idea what you mean by this.

> Basically, this series is trying to have an implementation of 
> transporting eBPF through virtio, so it's not necessarily a guest to 
> host but driver and device. For device, it could be either a virtual one 
> (as done in qemu) or a real hardware.

SmartNIC with a multi-core 64bit ARM CPUs is as much of a host as 
is the x86 hypervisor side. This set turns the kernel into a uAPI
forwarder.

3 years ago my answer to this proposal would have been very different.
Today after all the CPU bugs it seems like the SmartNICs (which are 
just another CPU running proprietary code) may just take off..

> > If kernel performs no significant work (or "adds value", pardon the
> > expression), and problem can easily be solved otherwise we shouldn't
> > do the work of maintaining the mechanism.  
> 
> My understanding is that it should not be much difference compared to 
> other offloading technology.

I presume you mean TC offloads? In virtualization there is inherently a
hypervisor which will receive the request, be it an IO hub/SmartNIC or
the traditional hypervisor on the same CPU.

The ACL/routing offloads differ significantly, because it's either the 
driver that does all the HW register poking directly or the complexity
of programming a rule into a HW table is quite low.

Same is true for the NFP BPF offload, BTW, the driver does all the
heavy lifting and compiles the final machine code image.

You can't say verifying and JITing BPF code into machine code entirely
in the hypervisor is similarly simple.

So no, there is a huge difference.

> > The approach of kernel generating actual machine code which is then
> > loaded into a sandbox on the hypervisor/SmartNIC is another story.  
> 
> We've considered such way, but actual machine code is not as portable as 
> eBPF bytecode consider we may want:
> 
> - Support migration
> - Further offload the program to smart NIC (e.g through macvtap 
>   passthrough mode etc).

You can re-JIT or JIT for SmartNIC..? Having the BPF bytecode does not
guarantee migration either, if the environment is expected to be
running different version of HW and SW. But yes, JITing in the guest
kernel when you don't know what to JIT for may be hard, I was just
saying that I don't mean to discourage people from implementing
sandboxes which run JITed code on SmartNICs. My criticism is (as
always?) against turning the kernel into a one-to-one uAPI forwarder
into unknown platform code.

For cloud use cases I believe the higher layer should solve this.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-26 20:35 ` [RFC net-next 00/18] virtio_net XDP offload Jakub Kicinski
  2019-11-27  2:59   ` Jason Wang
@ 2019-11-27 20:32   ` Michael S. Tsirkin
  2019-11-27 23:40     ` Jakub Kicinski
  2019-11-28  3:32   ` Alexei Starovoitov
  2 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2019-11-27 20:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Prashant Bhole, David S . Miller, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
> > Note: This RFC has been sent to netdev as well as qemu-devel lists
> > 
> > This series introduces XDP offloading from virtio_net. It is based on
> > the following work by Jason Wang:
> > https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> > 
> > Current XDP performance in virtio-net is far from what we can achieve
> > on host. Several major factors cause the difference:
> > - Cost of virtualization
> > - Cost of virtio (populating virtqueue and context switching)
> > - Cost of vhost, it needs more optimization
> > - Cost of data copy
> > Because of above reasons there is a need of offloading XDP program to
> > host. This set is an attempt to implement XDP offload from the guest.
> 
> This turns the guest kernel into a uAPI proxy.
> 
> BPF uAPI calls related to the "offloaded" BPF objects are forwarded 
> to the hypervisor, they pop up in QEMU which makes the requested call
> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may 
> be someone's proprietary "SmartNIC" implementation.
> 
> Why can't those calls be forwarded at the higher layer? Why do they
> have to go through the guest kernel?

Well everyone is writing these programs and attaching them to NICs.

For better or worse that's how userspace is written.

Yes, in the simple case where everything is passed through, it could
instead be passed through some other channel just as well, but then
userspace would need significant changes just to make it work with
virtio.



> If kernel performs no significant work (or "adds value", pardon the
> expression), and problem can easily be solved otherwise we shouldn't 
> do the work of maintaining the mechanism.
> 
> The approach of kernel generating actual machine code which is then
> loaded into a sandbox on the hypervisor/SmartNIC is another story.

But that's transparent to guest userspace. Making userspace care whether
it's a SmartNIC or a software device breaks part of virtualization's
appeal, which is that it looks like a hardware box to the guest.

> I'd appreciate if others could chime in.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality
  2019-11-26 10:07 ` [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality Prashant Bhole
@ 2019-11-27 20:42   ` Michael S. Tsirkin
  2019-11-28  2:53     ` Prashant Bhole
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2019-11-27 20:42 UTC (permalink / raw)
  To: Prashant Bhole
  Cc: David S . Miller, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, netdev, qemu-devel, kvm

On Tue, Nov 26, 2019 at 07:07:41PM +0900, Prashant Bhole wrote:
> From: Jason Wang <jasowang@redhat.com>
> 
> This patch implements bpf_prog_offload_ops callbacks and adds handling
> for XDP_SETUP_PROG_HW. Handling of XDP_SETUP_PROG_HW involves setting
> up struct virtio_net_ctrl_ebpf_prog and appending program instructions
> to it. This control buffer is sent to Qemu with class
> VIRTIO_NET_CTRL_EBPF and command VIRTIO_NET_BPF_CMD_SET_OFFLOAD.
> The expected behavior from Qemu is that it should try to load the
> program in host os and report the status.

That's not great e.g. for migration: different hosts might have
a different idea about what's allowed.
Device capabilities should be preferably exported through
feature bits or config space such that it's easy to
intercept and limit these as needed.

Also, how are we going to handle e.g. endian-ness here?

> 
> It also adds restriction to have either driver or offloaded program
> at a time.

I'm not sure I understand what does the above say.
virtnet_xdp_offload_check?
Please add code comments so we remember what to do and when.

> This restriction can be removed later after proper testing.

What kind of testing is missing here?

> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>

Any UAPI changes need to be copied to virtio-dev@lists.oasis-open.org
(subscriber only) list please.

> ---
>  drivers/net/virtio_net.c        | 114 +++++++++++++++++++++++++++++++-
>  include/uapi/linux/virtio_net.h |  27 ++++++++
>  2 files changed, 139 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index a1088d0114f2..dddfbb2a2075 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -169,6 +169,7 @@ struct control_buf {
>  	u8 allmulti;
>  	__virtio16 vid;
>  	__virtio64 offloads;
> +	struct virtio_net_ctrl_ebpf_prog prog_ctrl;
>  };
>  
>  struct virtnet_info {
> @@ -272,6 +273,8 @@ struct virtnet_bpf_bound_prog {
>  	struct bpf_insn insnsi[0];
>  };
>  
> +#define VIRTNET_EA(extack, msg)	NL_SET_ERR_MSG_MOD((extack), msg)
> +
>  /* Converting between virtqueue no. and kernel tx/rx queue no.
>   * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
>   */
> @@ -2427,6 +2430,11 @@ static int virtnet_xdp_set(struct net_device *dev, struct netdev_bpf *bpf)
>  	if (!xdp_attachment_flags_ok(&vi->xdp, bpf))
>  		return -EBUSY;
>  
> +	if (rtnl_dereference(vi->offload_xdp_prog)) {
> +		VIRTNET_EA(bpf->extack, "program already attached in offload mode");
> +		return -EINVAL;
> +	}
> +
>  	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)
>  	    && (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO4) ||
>  	        virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_TSO6) ||
> @@ -2528,17 +2536,114 @@ static int virtnet_bpf_verify_insn(struct bpf_verifier_env *env, int insn_idx,
>  
>  static void virtnet_bpf_destroy_prog(struct bpf_prog *prog)
>  {
> +	struct virtnet_bpf_bound_prog *state;
> +
> +	state = prog->aux->offload->dev_priv;
> +	list_del(&state->list);
> +	kfree(state);
> +}
> +
> +static int virtnet_xdp_offload_check(struct virtnet_info *vi,
> +				     struct netdev_bpf *bpf)
> +{
> +	if (!bpf->prog)
> +		return 0;
> +
> +	if (!bpf->prog->aux->offload) {
> +		VIRTNET_EA(bpf->extack, "xdpoffload of non-bound program");
> +		return -EINVAL;
> +	}
> +	if (bpf->prog->aux->offload->netdev != vi->dev) {
> +		VIRTNET_EA(bpf->extack, "program bound to different dev");
> +		return -EINVAL;
> +	}
> +
> +	if (rtnl_dereference(vi->xdp_prog)) {
> +		VIRTNET_EA(bpf->extack, "program already attached in driver mode");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
>  }
>  
>  static int virtnet_xdp_set_offload(struct virtnet_info *vi,
>  				   struct netdev_bpf *bpf)
>  {
> -	return -EBUSY;
> +	struct virtio_net_ctrl_ebpf_prog *ctrl;
> +	struct virtnet_bpf_bound_prog *bound_prog = NULL;
> +	struct virtio_device *vdev = vi->vdev;
> +	struct bpf_prog *prog = bpf->prog;
> +	void *ctrl_buf = NULL;
> +	struct scatterlist sg;
> +	int prog_len;
> +	int err = 0;
> +
> +	if (!xdp_attachment_flags_ok(&vi->xdp_hw, bpf))
> +		return -EBUSY;
> +
> +	if (prog) {
> +		if (prog->type != BPF_PROG_TYPE_XDP)
> +			return -EOPNOTSUPP;
> +		bound_prog = prog->aux->offload->dev_priv;
> +		prog_len = prog->len * sizeof(bound_prog->insnsi[0]);
> +
> +		ctrl_buf = kmalloc(GFP_KERNEL, sizeof(*ctrl) + prog_len);
> +		if (!ctrl_buf)
> +			return -ENOMEM;
> +		ctrl = ctrl_buf;
> +		ctrl->cmd = cpu_to_virtio32(vi->vdev,
> +					    VIRTIO_NET_BPF_CMD_SET_OFFLOAD);
> +		ctrl->len = cpu_to_virtio32(vi->vdev, prog_len);
> +		ctrl->gpl_compatible = cpu_to_virtio16(vi->vdev,
> +						       prog->gpl_compatible);
> +		memcpy(ctrl->insns, bound_prog->insnsi,
> +		       prog->len * sizeof(bound_prog->insnsi[0]));
> +		sg_init_one(&sg, ctrl_buf, sizeof(*ctrl) + prog_len);
> +	} else {
> +		ctrl = &vi->ctrl->prog_ctrl;
> +		ctrl->cmd  = cpu_to_virtio32(vi->vdev,
> +					     VIRTIO_NET_BPF_CMD_UNSET_OFFLOAD);
> +		sg_init_one(&sg, ctrl, sizeof(*ctrl));
> +	}
> +
> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_EBPF,
> +				  VIRTIO_NET_CTRL_EBPF_PROG,
> +				  &sg)) {
> +		dev_warn(&vdev->dev, "Failed to set bpf offload prog\n");
> +		err = -EFAULT;
> +		goto out;
> +	}
> +
> +	rcu_assign_pointer(vi->offload_xdp_prog, prog);
> +
> +	xdp_attachment_setup(&vi->xdp_hw, bpf);
> +
> +out:
> +	kfree(ctrl_buf);
> +	return err;
>  }
>  
>  static int virtnet_bpf_verifier_setup(struct bpf_prog *prog)
>  {
> -	return -ENOMEM;
> +	struct virtnet_info *vi = netdev_priv(prog->aux->offload->netdev);
> +	size_t insn_len = prog->len * sizeof(struct bpf_insn);
> +	struct virtnet_bpf_bound_prog *state;
> +
> +	state = kzalloc(sizeof(*state) + insn_len, GFP_KERNEL);
> +	if (!state)
> +		return -ENOMEM;
> +
> +	memcpy(&state->insnsi[0], prog->insnsi, insn_len);
> +
> +	state->vi = vi;
> +	state->prog = prog;
> +	state->len = prog->len;
> +
> +	list_add_tail(&state->list, &vi->bpf_bound_progs);
> +
> +	prog->aux->offload->dev_priv = state;
> +
> +	return 0;
>  }
>  
>  static int virtnet_bpf_verifier_prep(struct bpf_prog *prog)
> @@ -2568,12 +2673,17 @@ static const struct bpf_prog_offload_ops virtnet_bpf_dev_ops = {
>  static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
> +	int err;
> +
>  	switch (xdp->command) {
>  	case XDP_SETUP_PROG:
>  		return virtnet_xdp_set(dev, xdp);
>  	case XDP_QUERY_PROG:
>  		return xdp_attachment_query(&vi->xdp, xdp);
>  	case XDP_SETUP_PROG_HW:
> +		err = virtnet_xdp_offload_check(vi, xdp);
> +		if (err)
> +			return err;
>  		return virtnet_xdp_set_offload(vi, xdp);
>  	case XDP_QUERY_PROG_HW:
>  		return xdp_attachment_query(&vi->xdp_hw, xdp);
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index a3715a3224c1..0ea2f7910a5a 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -261,4 +261,31 @@ struct virtio_net_ctrl_mq {
>  #define VIRTIO_NET_CTRL_GUEST_OFFLOADS   5
>  #define VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET        0
>  
> +/*
> + * Control XDP offloads offloads
> + *
> + * When guest wants to offload XDP program to the device, it calls
> + * VIRTIO_NET_CTRL_EBPF_PROG along with VIRTIO_NET_BPF_CMD_SET_OFFLOAD
> + * subcommands. When offloading is successful, the device runs offloaded
> + * XDP program for each packet before sending it to the guest.
> + *
> + * VIRTIO_NET_BPF_CMD_UNSET_OFFLOAD removes the the offloaded program from
> + * the device, if exists.
> + */
> +
> +struct virtio_net_ctrl_ebpf_prog {
> +	/* program length in bytes */
> +	__virtio32 len;
> +	__virtio16 cmd;
> +	__virtio16 gpl_compatible;
> +	__u8 insns[0];
> +};
> +
> +#define VIRTIO_NET_CTRL_EBPF 6
> + #define VIRTIO_NET_CTRL_EBPF_PROG 1
> +
> +/* Commands for VIRTIO_NET_CTRL_EBPF_PROG */
> +#define VIRTIO_NET_BPF_CMD_SET_OFFLOAD 1
> +#define VIRTIO_NET_BPF_CMD_UNSET_OFFLOAD 2
> +
>  #endif /* _UAPI_LINUX_VIRTIO_NET_H */
> -- 
> 2.20.1


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-27 20:32   ` Michael S. Tsirkin
@ 2019-11-27 23:40     ` Jakub Kicinski
  2019-12-02 15:29       ` Michael S. Tsirkin
  0 siblings, 1 reply; 43+ messages in thread
From: Jakub Kicinski @ 2019-11-27 23:40 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Prashant Bhole, David S . Miller, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

On Wed, 27 Nov 2019 15:32:17 -0500, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> > On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:  
> > > Note: This RFC has been sent to netdev as well as qemu-devel lists
> > > 
> > > This series introduces XDP offloading from virtio_net. It is based on
> > > the following work by Jason Wang:
> > > https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
> > > 
> > > Current XDP performance in virtio-net is far from what we can achieve
> > > on host. Several major factors cause the difference:
> > > - Cost of virtualization
> > > - Cost of virtio (populating virtqueue and context switching)
> > > - Cost of vhost, it needs more optimization
> > > - Cost of data copy
> > > Because of above reasons there is a need of offloading XDP program to
> > > host. This set is an attempt to implement XDP offload from the guest.  
> > 
> > This turns the guest kernel into a uAPI proxy.
> > 
> > BPF uAPI calls related to the "offloaded" BPF objects are forwarded 
> > to the hypervisor, they pop up in QEMU which makes the requested call
> > to the hypervisor kernel. Today it's the Linux kernel tomorrow it may 
> > be someone's proprietary "SmartNIC" implementation.
> > 
> > Why can't those calls be forwarded at the higher layer? Why do they
> > have to go through the guest kernel?  
> 
> Well everyone is writing these programs and attaching them to NICs.

Who's everyone?

> For better or worse that's how userspace is written.

HW offload requires modifying the user space, too. The offload is not
transparent. Do you know that?

> Yes, in the simple case where everything is passed through, it could
> instead be passed through some other channel just as well, but then
> userspace would need significant changes just to make it work with
> virtio.

There is a recently spawned effort to create an "XDP daemon" or
otherwise a control application which would among other things link
separate XDP apps to share a NIC attachment point.

Making use of cloud APIs would make a perfect addition to that.

Obviously if one asks a kernel guy to solve a problem one'll get kernel
code as an answer. And writing higher layer code requires companies to
actually organize their teams and have "full stack" strategies.

We've seen this story already with net_failover wart. At least that
time we weren't risking building a proxy to someone's proprietary FW.

> > If kernel performs no significant work (or "adds value", pardon the
> > expression), and problem can easily be solved otherwise we shouldn't 
> > do the work of maintaining the mechanism.
> > 
> > The approach of kernel generating actual machine code which is then
> > loaded into a sandbox on the hypervisor/SmartNIC is another story.  
> 
> But that's transparent to guest userspace. Making userspace care whether
> it's a SmartNIC or a software device breaks part of virtualization's
> appeal, which is that it looks like a hardware box to the guest.

It's not hardware unless you JITed machine code for it, it's just
someone else's software.

I'm not arguing with the appeal. I'm arguing the risk/benefit ratio
doesn't justify opening this can of worms.

> > I'd appreciate if others could chime in.  

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality
  2019-11-27 20:42   ` Michael S. Tsirkin
@ 2019-11-28  2:53     ` Prashant Bhole
  0 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-11-28  2:53 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David S . Miller, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Martin KaFai Lau, Song Liu, Yonghong Song,
	Andrii Nakryiko, netdev, qemu-devel, kvm



On 11/28/19 5:42 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 26, 2019 at 07:07:41PM +0900, Prashant Bhole wrote:
>> From: Jason Wang <jasowang@redhat.com>
>>
>> This patch implements bpf_prog_offload_ops callbacks and adds handling
>> for XDP_SETUP_PROG_HW. Handling of XDP_SETUP_PROG_HW involves setting
>> up struct virtio_net_ctrl_ebpf_prog and appending program instructions
>> to it. This control buffer is sent to Qemu with class
>> VIRTIO_NET_CTRL_EBPF and command VIRTIO_NET_BPF_CMD_SET_OFFLOAD.
>> The expected behavior from Qemu is that it should try to load the
>> program in host os and report the status.
> 
> That's not great e.g. for migration: different hosts might have
> a different idea about what's allowed.
> Device capabilities should be preferably exported through
> feature bits or config space such that it's easy to
> intercept and limit these as needed.

These things are mentioned in the TODO section of cover letter.
Having offload feature enabled should be a migration blocker.
A feature bit in virtio for offloading capability needs to be added.

> 
> Also, how are we going to handle e.g. endian-ness here?

For now I feel we should block offloading in case of cross endian
virtualization. Further to support cross endian-ness, the requests for 
offloading a map or program should include metadata such as BTF info.
Qemu needs to handle the conversion.

> 
>>
>> It also adds restriction to have either driver or offloaded program
>> at a time.
> 
> I'm not sure I understand what does the above say.
> virtnet_xdp_offload_check?
> Please add code comments so we remember what to do and when.
> 
>> This restriction can be removed later after proper testing.
> 
> What kind of testing is missing here?

This restriction mentioned above is about having multiple programs
attached to the same device. It can be possible to have a program
attached in driver mode and other in offload mode, but in current code
only one mode at a time is supported because I wasn't aware whether
bpf tooling supports the case. I will add a comment or remove the
restriction in the next revision.

> 
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> Co-developed-by: Prashant Bhole <prashantbhole.linux@gmail.com>
>> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> 
> Any UAPI changes need to be copied to virtio-dev@lists.oasis-open.org
> (subscriber only) list please.

Sure.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-26 20:35 ` [RFC net-next 00/18] virtio_net XDP offload Jakub Kicinski
  2019-11-27  2:59   ` Jason Wang
  2019-11-27 20:32   ` Michael S. Tsirkin
@ 2019-11-28  3:32   ` Alexei Starovoitov
  2019-11-28  4:18     ` Jason Wang
  2 siblings, 1 reply; 43+ messages in thread
From: Alexei Starovoitov @ 2019-11-28  3:32 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Prashant Bhole, David S . Miller, Michael S . Tsirkin,
	Jason Wang, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm

On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
> 
> I'd appreciate if others could chime in.

The performance improvements are quite appealing.
In general offloading from higher layers into lower layers is necessary long term.

But the approach taken by patches 15 and 17 is a dead end. I don't see how it
can ever catch up with the pace of bpf development. As presented this approach
works for the most basic programs and simple maps. No line info, no BTF, no
debuggability. There are no tail_calls either. I don't think I've seen a single
production XDP program that doesn't use tail calls. Static and dynamic linking
is coming. Wraping one bpf feature at a time with virtio api is never going to
be complete. How FDs are going to be passed back? OBJ_GET_INFO_BY_FD ?
OBJ_PIN/GET ? Where bpffs is going to live ? Any realistic XDP application will
be using a lot more than single self contained XDP prog with hash and array
maps. It feels that the whole sys_bpf needs to be forwarded as a whole from
guest into host. In case of true hw offload the host is managing HW. So it
doesn't forward syscalls into the driver. The offload from guest into host is
different. BPF can be seen as a resource that host provides and guest kernel
plus qemu would be forwarding requests between guest user space and host
kernel. Like sys_bpf(BPF_MAP_CREATE) can passthrough into the host directly.
The FD that hosts sees would need a corresponding mirror FD in the guest. There
are still questions about bpffs paths, but the main issue of
one-feature-at-a-time will be addressed in such approach. There could be other
solutions, of course.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-27 19:49     ` Jakub Kicinski
@ 2019-11-28  3:41       ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2019-11-28  3:41 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Song Liu, Jesper Dangaard Brouer, Daniel Borkmann,
	Michael S . Tsirkin, netdev, John Fastabend, qemu-devel,
	Alexei Starovoitov, David S . Miller, Prashant Bhole, kvm,
	Yonghong Song, Andrii Nakryiko, Martin KaFai Lau


On 2019/11/28 上午3:49, Jakub Kicinski wrote:
> On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
>> On 2019/11/27 上午4:35, Jakub Kicinski wrote:
>>> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>>>> Note: This RFC has been sent to netdev as well as qemu-devel lists
>>>>
>>>> This series introduces XDP offloading from virtio_net. It is based on
>>>> the following work by Jason Wang:
>>>> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
>>>>
>>>> Current XDP performance in virtio-net is far from what we can achieve
>>>> on host. Several major factors cause the difference:
>>>> - Cost of virtualization
>>>> - Cost of virtio (populating virtqueue and context switching)
>>>> - Cost of vhost, it needs more optimization
>>>> - Cost of data copy
>>>> Because of above reasons there is a need of offloading XDP program to
>>>> host. This set is an attempt to implement XDP offload from the guest.
>>> This turns the guest kernel into a uAPI proxy.
>>>
>>> BPF uAPI calls related to the "offloaded" BPF objects are forwarded
>>> to the hypervisor, they pop up in QEMU which makes the requested call
>>> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
>>> be someone's proprietary "SmartNIC" implementation.
>>>
>>> Why can't those calls be forwarded at the higher layer? Why do they
>>> have to go through the guest kernel?
>>
>> I think doing forwarding at higher layer have the following issues:
>>
>> - Need a dedicated library (probably libbpf) but application may choose
>>    to do eBPF syscall directly
>> - Depends on guest agent to work
> This can be said about any user space functionality.


Yes but the feature may have too much unnecessary dependencies: 
dedicated library, guest agent, host agent etc. This can only work for 
some specific setups and will lead vendor specific implementations.


>
>> - Can't work for virtio-net hardware, since it still requires a hardware
>> interface for carrying  offloading information
> The HW virtio-net presumably still has a PF and hopefully reprs for
> VFs, so why can't it attach the program there?


Then you still need a interface for carrying such information? It will 
work like assuming we had a virtio-net VF with reprs:

libbpf(guest) -> guest agent -> host agent -> libbpf(host) -> BPF 
syscall -> VF reprs/PF drvier -> VF/PF reprs -> virtio-net VF

Still need a vendor specific way for passing eBPF commands from driver 
to reprs/PF, and possibility, it could still be a virtio interface there.

In this proposal it will work out of box as simple as:

libbpf(guest) -> guest kernel -> virtio-net driver -> virtio-net VF

If the request comes from host (e.g flow offloading, configuration etc), 
VF reprs make perfect fit. But if the request comes from guest, having 
much longer journey looks quite like a burden (dependencies, bugs etc) .

What's more important, we can not assume the how virtio-net HW is 
structured, it could even not a SRIOV or PCI card.


>
>> - Implement at the level of kernel may help for future extension like
>>    BPF object pinning and eBPF helper etc.
> No idea what you mean by this.


My understanding is, we should narrow the gap between non-offloaded eBPF 
program and offloaded eBPF program. Making maps or progs to be visible 
to kernel may help to persist a unified API e.g object pinning through 
sysfs, tracepoint, debug etc.


>
>> Basically, this series is trying to have an implementation of
>> transporting eBPF through virtio, so it's not necessarily a guest to
>> host but driver and device. For device, it could be either a virtual one
>> (as done in qemu) or a real hardware.
> SmartNIC with a multi-core 64bit ARM CPUs is as much of a host as
> is the x86 hypervisor side. This set turns the kernel into a uAPI
> forwarder.


Not necessarily, as what has been done by NFP, driver filter out the 
features that is not supported, and the bpf object is still visible in 
the kernel (and see above comment).


>
> 3 years ago my answer to this proposal would have been very different.
> Today after all the CPU bugs it seems like the SmartNICs (which are
> just another CPU running proprietary code) may just take off..
>

That's interesting but vendor may choose to use FPGA other than SoC in 
this case. Anyhow discussion like this is somehow out of the scope of 
the series.


>>> If kernel performs no significant work (or "adds value", pardon the
>>> expression), and problem can easily be solved otherwise we shouldn't
>>> do the work of maintaining the mechanism.
>> My understanding is that it should not be much difference compared to
>> other offloading technology.
> I presume you mean TC offloads? In virtualization there is inherently a
> hypervisor which will receive the request, be it an IO hub/SmartNIC or
> the traditional hypervisor on the same CPU.
>
> The ACL/routing offloads differ significantly, because it's either the
> driver that does all the HW register poking directly or the complexity
> of programming a rule into a HW table is quite low.
>
> Same is true for the NFP BPF offload, BTW, the driver does all the
> heavy lifting and compiles the final machine code image.


Yes and this series benefit from the infrastructure invented from NFP. 
But I'm not sure this is a good point since, technically the machine 
code could be generated by smart NIC as well.


>
> You can't say verifying and JITing BPF code into machine code entirely
> in the hypervisor is similarly simple.


Yes and that's why we choose to do in on the device (host) to simplify 
things.


>
> So no, there is a huge difference.
>

>>> The approach of kernel generating actual machine code which is then
>>> loaded into a sandbox on the hypervisor/SmartNIC is another story.
>> We've considered such way, but actual machine code is not as portable as
>> eBPF bytecode consider we may want:
>>
>> - Support migration
>> - Further offload the program to smart NIC (e.g through macvtap
>>    passthrough mode etc).
> You can re-JIT or JIT for SmartNIC..? Having the BPF bytecode does not
> guarantee migration either,


Yes, but it's more portable than machine code.


> if the environment is expected to be
> running different version of HW and SW.


Right, we plan to have feature negotiation.


> But yes, JITing in the guest
> kernel when you don't know what to JIT for may be hard,


Yes.


> I was just
> saying that I don't mean to discourage people from implementing
> sandboxes which run JITed code on SmartNICs. My criticism is (as
> always?) against turning the kernel into a one-to-one uAPI forwarder
> into unknown platform code.


We have FUSE and I think it's not only the forwarder, and we may do much 
more work on top in the future. For unknown platform code, I'm not sure 
why we need care about that. There's no way for us to prevent such 
implementation and if we try to formalize it through a specification 
(virtio spec and probably eBPF spec), it may help actually.


>
> For cloud use cases I believe the higher layer should solve this.
>

Technically possible, but have lots of drawbacks.

Thanks


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-28  3:32   ` Alexei Starovoitov
@ 2019-11-28  4:18     ` Jason Wang
  2019-12-01 16:54       ` David Ahern
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2019-11-28  4:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Jakub Kicinski
  Cc: Song Liu, Jesper Dangaard Brouer, Daniel Borkmann,
	Michael S . Tsirkin, qemu-devel, netdev, John Fastabend,
	Alexei Starovoitov, Martin KaFai Lau, Prashant Bhole, kvm,
	Yonghong Song, Andrii Nakryiko, David S . Miller


On 2019/11/28 上午11:32, Alexei Starovoitov wrote:
> On Tue, Nov 26, 2019 at 12:35:14PM -0800, Jakub Kicinski wrote:
>> I'd appreciate if others could chime in.
> The performance improvements are quite appealing.
> In general offloading from higher layers into lower layers is necessary long term.
>
> But the approach taken by patches 15 and 17 is a dead end. I don't see how it
> can ever catch up with the pace of bpf development.


This applies for any hardware offloading features, isn't it?


>   As presented this approach
> works for the most basic programs and simple maps. No line info, no BTF, no
> debuggability. There are no tail_calls either.


If I understand correctly, neither of above were implemented in NFP. We 
can collaborate to find solution for all of those.


>   I don't think I've seen a single
> production XDP program that doesn't use tail calls.


It looks to me we can manage to add this support.


> Static and dynamic linking
> is coming. Wraping one bpf feature at a time with virtio api is never going to
> be complete.


It's a common problem for hardware that want to implement eBPF 
offloading, not a virtio specific one.


> How FDs are going to be passed back? OBJ_GET_INFO_BY_FD ?
> OBJ_PIN/GET ? Where bpffs is going to live ?


If we want pinning work in the case of virt, it should live in both host 
and guest probably.


>   Any realistic XDP application will
> be using a lot more than single self contained XDP prog with hash and array
> maps.


It's possible if we want to use XDP offloading to accelerate VNF which 
often has simple logic.


> It feels that the whole sys_bpf needs to be forwarded as a whole from
> guest into host. In case of true hw offload the host is managing HW. So it
> doesn't forward syscalls into the driver. The offload from guest into host is
> different. BPF can be seen as a resource that host provides and guest kernel
> plus qemu would be forwarding requests between guest user space and host
> kernel. Like sys_bpf(BPF_MAP_CREATE) can passthrough into the host directly.
> The FD that hosts sees would need a corresponding mirror FD in the guest. There
> are still questions about bpffs paths, but the main issue of
> one-feature-at-a-time will be addressed in such approach.


We try to follow what NFP did by starting from a fraction of the whole 
eBPF features. It would be very hard to have all eBPF features 
implemented from the start.  It would be helpful to clarify what's the 
minimal set of features that you want to have from the start.


> There could be other
> solutions, of course.
>
>

Suggestions are welcomed.

Thanks


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 07/18] tun: set offloaded xdp program
  2019-11-26 10:07 ` [RFC net-next 07/18] tun: set offloaded xdp program Prashant Bhole
@ 2019-12-01 16:35   ` David Ahern
  2019-12-02  2:44     ` Jason Wang
  2019-12-01 16:45   ` David Ahern
  1 sibling, 1 reply; 43+ messages in thread
From: David Ahern @ 2019-12-01 16:35 UTC (permalink / raw)
  To: Prashant Bhole, David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm

On 11/26/19 4:07 AM, Prashant Bhole wrote:
> From: Jason Wang <jasowang@redhat.com>
> 
> This patch introduces an ioctl way to set an offloaded XDP program
> to tun driver. This ioctl will be used by qemu to offload XDP program
> from virtio_net in the guest.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
> ---
>  drivers/net/tun.c           | 19 ++++++++++++++-----
>  include/uapi/linux/if_tun.h |  1 +
>  2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index d078b4659897..ecb49101b0b5 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -241,6 +241,7 @@ struct tun_struct {
>  	struct bpf_prog __rcu *xdp_prog;
>  	struct tun_prog __rcu *steering_prog;
>  	struct tun_prog __rcu *filter_prog;
> +	struct tun_prog __rcu *offloaded_xdp_prog;

I have been looking into running XDP pograms in the TX path of a tap
device [1] where the program is installed and managed by a process in
the host. The code paths are the same as what you are doing with XDP
offload, so how about calling this xdp_prog_tx?

[1]
https://github.com/dsahern/linux/commit/f2303d05187c8a604cdb70b288338e9b1d1b0db6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 08/18] tun: run offloaded XDP program in Tx path
  2019-11-26 10:07 ` [RFC net-next 08/18] tun: run offloaded XDP program in Tx path Prashant Bhole
@ 2019-12-01 16:39   ` David Ahern
  2019-12-01 20:56     ` David Miller
  2019-12-02  2:45     ` Jason Wang
  0 siblings, 2 replies; 43+ messages in thread
From: David Ahern @ 2019-12-01 16:39 UTC (permalink / raw)
  To: Prashant Bhole, David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm

On 11/26/19 4:07 AM, Prashant Bhole wrote:
> run offloaded XDP program as soon as packet is removed from the ptr
> ring. Since this is XDP in Tx path, the traditional handling of
> XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
> do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
> just runs the program and leaves the action handling to us.

What happens if an offloaded program returns XDP_REDIRECT?

Below you just drop the packet which is going to be a bad user
experience. A better user experience is to detect XDP return codes a
program uses, catch those that are not supported for this use case and
fail the install of the program.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 07/18] tun: set offloaded xdp program
  2019-11-26 10:07 ` [RFC net-next 07/18] tun: set offloaded xdp program Prashant Bhole
  2019-12-01 16:35   ` David Ahern
@ 2019-12-01 16:45   ` David Ahern
  2019-12-02  2:47     ` Jason Wang
  1 sibling, 1 reply; 43+ messages in thread
From: David Ahern @ 2019-12-01 16:45 UTC (permalink / raw)
  To: Prashant Bhole, David S . Miller, Michael S . Tsirkin
  Cc: Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm

On 11/26/19 4:07 AM, Prashant Bhole wrote:
> From: Jason Wang <jasowang@redhat.com>
> 
> This patch introduces an ioctl way to set an offloaded XDP program
> to tun driver. This ioctl will be used by qemu to offload XDP program
> from virtio_net in the guest.
> 

Seems like you need to set / reset the SOCK_XDP flag on tfile->sk since
this is an XDP program.

Also, why not add this program using netlink instead of ioctl? e.g., as
part of a generic XDP in the egress path like I am looking into for the
host side.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-28  4:18     ` Jason Wang
@ 2019-12-01 16:54       ` David Ahern
  2019-12-02  2:48         ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: David Ahern @ 2019-12-01 16:54 UTC (permalink / raw)
  To: Jason Wang, Alexei Starovoitov, Jakub Kicinski
  Cc: Song Liu, Jesper Dangaard Brouer, Daniel Borkmann,
	Michael S . Tsirkin, qemu-devel, netdev, John Fastabend,
	Alexei Starovoitov, Martin KaFai Lau, Prashant Bhole, kvm,
	Yonghong Song, Andrii Nakryiko, David S . Miller

On 11/27/19 10:18 PM, Jason Wang wrote:
> We try to follow what NFP did by starting from a fraction of the whole
> eBPF features. It would be very hard to have all eBPF features
> implemented from the start.  It would be helpful to clarify what's the
> minimal set of features that you want to have from the start.

Offloading guest programs needs to prevent a guest XDP program from
running bpf helpers that access host kernel data. e.g., bpf_fib_lookup

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 08/18] tun: run offloaded XDP program in Tx path
  2019-12-01 16:39   ` David Ahern
@ 2019-12-01 20:56     ` David Miller
  2019-12-01 21:40       ` Michael S. Tsirkin
  2019-12-02  2:45     ` Jason Wang
  1 sibling, 1 reply; 43+ messages in thread
From: David Miller @ 2019-12-01 20:56 UTC (permalink / raw)
  To: dsahern
  Cc: prashantbhole.linux, mst, jasowang, ast, daniel, jakub.kicinski,
	hawk, john.fastabend, kafai, songliubraving, yhs, andriin,
	netdev, qemu-devel, kvm

From: David Ahern <dsahern@gmail.com>
Date: Sun, 1 Dec 2019 09:39:54 -0700

> Below you just drop the packet which is going to be a bad user
> experience. A better user experience is to detect XDP return codes a
> program uses, catch those that are not supported for this use case and
> fail the install of the program.

This is not universally possible.

Return codes can be calculated dynamically, come from maps potentially
shared with other bpf programs, etc.

So unfortunately this suggestion is not tenable.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 08/18] tun: run offloaded XDP program in Tx path
  2019-12-01 20:56     ` David Miller
@ 2019-12-01 21:40       ` Michael S. Tsirkin
  2019-12-01 21:54         ` David Miller
  0 siblings, 1 reply; 43+ messages in thread
From: Michael S. Tsirkin @ 2019-12-01 21:40 UTC (permalink / raw)
  To: David Miller
  Cc: dsahern, prashantbhole.linux, jasowang, ast, daniel,
	jakub.kicinski, hawk, john.fastabend, kafai, songliubraving, yhs,
	andriin, netdev, qemu-devel, kvm

On Sun, Dec 01, 2019 at 12:56:21PM -0800, David Miller wrote:
> From: David Ahern <dsahern@gmail.com>
> Date: Sun, 1 Dec 2019 09:39:54 -0700
> 
> > Below you just drop the packet which is going to be a bad user
> > experience. A better user experience is to detect XDP return codes a
> > program uses, catch those that are not supported for this use case and
> > fail the install of the program.
> 
> This is not universally possible.
> 
> Return codes can be calculated dynamically, come from maps potentially
> shared with other bpf programs, etc.
> 
> So unfortunately this suggestion is not tenable.

Right. But it is helpful to expose the supported functionality
to guest in some way, if nothing else then so that
guests can be moved between different hosts.

Also, we need a way to report this kind of event to guest
so it's possible to figure out what went wrong.

-- 
MST


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 08/18] tun: run offloaded XDP program in Tx path
  2019-12-01 21:40       ` Michael S. Tsirkin
@ 2019-12-01 21:54         ` David Miller
  2019-12-02  2:56           ` Jason Wang
  0 siblings, 1 reply; 43+ messages in thread
From: David Miller @ 2019-12-01 21:54 UTC (permalink / raw)
  To: mst
  Cc: dsahern, prashantbhole.linux, jasowang, ast, daniel,
	jakub.kicinski, hawk, john.fastabend, kafai, songliubraving, yhs,
	andriin, netdev, qemu-devel, kvm

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Sun, 1 Dec 2019 16:40:22 -0500

> Right. But it is helpful to expose the supported functionality
> to guest in some way, if nothing else then so that
> guests can be moved between different hosts.
> 
> Also, we need a way to report this kind of event to guest
> so it's possible to figure out what went wrong.

On the contrary, this is why it is of utmost importance that all
XDP implementations support the full suite of XDP facilities from
the very beginning.

This is why we keep giving people a hard time when they add support
only for some of the XDP return values and semantics.  Users will get
killed by this, and it makes XDP a poor technology to use because
behavior is not consistent across device types.

That's not acceptable and I'll push back on anything that continues
this trend.

If you can't HW offload it, kick it to software.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 07/18] tun: set offloaded xdp program
  2019-12-01 16:35   ` David Ahern
@ 2019-12-02  2:44     ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2019-12-02  2:44 UTC (permalink / raw)
  To: David Ahern, Prashant Bhole, David S . Miller, Michael S . Tsirkin
  Cc: Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm


On 2019/12/2 上午12:35, David Ahern wrote:
> On 11/26/19 4:07 AM, Prashant Bhole wrote:
>> From: Jason Wang <jasowang@redhat.com>
>>
>> This patch introduces an ioctl way to set an offloaded XDP program
>> to tun driver. This ioctl will be used by qemu to offload XDP program
>> from virtio_net in the guest.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> Signed-off-by: Prashant Bhole <prashantbhole.linux@gmail.com>
>> ---
>>   drivers/net/tun.c           | 19 ++++++++++++++-----
>>   include/uapi/linux/if_tun.h |  1 +
>>   2 files changed, 15 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index d078b4659897..ecb49101b0b5 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -241,6 +241,7 @@ struct tun_struct {
>>   	struct bpf_prog __rcu *xdp_prog;
>>   	struct tun_prog __rcu *steering_prog;
>>   	struct tun_prog __rcu *filter_prog;
>> +	struct tun_prog __rcu *offloaded_xdp_prog;
> I have been looking into running XDP pograms in the TX path of a tap
> device [1] where the program is installed and managed by a process in
> the host. The code paths are the same as what you are doing with XDP
> offload, so how about calling this xdp_prog_tx?
>
> [1]
> https://github.com/dsahern/linux/commit/f2303d05187c8a604cdb70b288338e9b1d1b0db6
>

I think it's fine, btw, except for the netlink part there should be no 
much difference.

Thanks


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 08/18] tun: run offloaded XDP program in Tx path
  2019-12-01 16:39   ` David Ahern
  2019-12-01 20:56     ` David Miller
@ 2019-12-02  2:45     ` Jason Wang
  1 sibling, 0 replies; 43+ messages in thread
From: Jason Wang @ 2019-12-02  2:45 UTC (permalink / raw)
  To: David Ahern, Prashant Bhole, David S . Miller, Michael S . Tsirkin
  Cc: Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm


On 2019/12/2 上午12:39, David Ahern wrote:
> On 11/26/19 4:07 AM, Prashant Bhole wrote:
>> run offloaded XDP program as soon as packet is removed from the ptr
>> ring. Since this is XDP in Tx path, the traditional handling of
>> XDP actions XDP_TX/REDIRECT isn't valid. For this reason we call
>> do_xdp_generic_core instead of do_xdp_generic. do_xdp_generic_core
>> just runs the program and leaves the action handling to us.
> What happens if an offloaded program returns XDP_REDIRECT?
>
> Below you just drop the packet which is going to be a bad user
> experience. A better user experience is to detect XDP return codes a
> program uses, catch those that are not supported for this use case and
> fail the install of the program.


Yes, this could be done in the guest.

Thanks


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 07/18] tun: set offloaded xdp program
  2019-12-01 16:45   ` David Ahern
@ 2019-12-02  2:47     ` Jason Wang
  2019-12-09  0:24       ` Prashant Bhole
  0 siblings, 1 reply; 43+ messages in thread
From: Jason Wang @ 2019-12-02  2:47 UTC (permalink / raw)
  To: David Ahern, Prashant Bhole, David S . Miller, Michael S . Tsirkin
  Cc: Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Martin KaFai Lau,
	Song Liu, Yonghong Song, Andrii Nakryiko, netdev, qemu-devel,
	kvm


On 2019/12/2 上午12:45, David Ahern wrote:
> On 11/26/19 4:07 AM, Prashant Bhole wrote:
>> From: Jason Wang <jasowang@redhat.com>
>>
>> This patch introduces an ioctl way to set an offloaded XDP program
>> to tun driver. This ioctl will be used by qemu to offload XDP program
>> from virtio_net in the guest.
>>
> Seems like you need to set / reset the SOCK_XDP flag on tfile->sk since
> this is an XDP program.
>
> Also, why not add this program using netlink instead of ioctl? e.g., as
> part of a generic XDP in the egress path like I am looking into for the
> host side.


Maybe both, otherwise, qemu may need netlink as a dependency.

Thanks


>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-12-01 16:54       ` David Ahern
@ 2019-12-02  2:48         ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2019-12-02  2:48 UTC (permalink / raw)
  To: David Ahern, Alexei Starovoitov, Jakub Kicinski
  Cc: Song Liu, Jesper Dangaard Brouer, Daniel Borkmann,
	Michael S . Tsirkin, qemu-devel, netdev, John Fastabend,
	Alexei Starovoitov, Martin KaFai Lau, Prashant Bhole, kvm,
	Yonghong Song, Andrii Nakryiko, David S . Miller


On 2019/12/2 上午12:54, David Ahern wrote:
> On 11/27/19 10:18 PM, Jason Wang wrote:
>> We try to follow what NFP did by starting from a fraction of the whole
>> eBPF features. It would be very hard to have all eBPF features
>> implemented from the start.  It would be helpful to clarify what's the
>> minimal set of features that you want to have from the start.
> Offloading guest programs needs to prevent a guest XDP program from
> running bpf helpers that access host kernel data. e.g., bpf_fib_lookup


Right, so we probably need a new type of eBPF program on the host and 
filter out the unsupported helpers there.

Thanks


>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 08/18] tun: run offloaded XDP program in Tx path
  2019-12-01 21:54         ` David Miller
@ 2019-12-02  2:56           ` Jason Wang
  0 siblings, 0 replies; 43+ messages in thread
From: Jason Wang @ 2019-12-02  2:56 UTC (permalink / raw)
  To: David Miller, mst
  Cc: dsahern, prashantbhole.linux, ast, daniel, jakub.kicinski, hawk,
	john.fastabend, kafai, songliubraving, yhs, andriin, netdev,
	qemu-devel, kvm


On 2019/12/2 上午5:54, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Sun, 1 Dec 2019 16:40:22 -0500
>
>> Right. But it is helpful to expose the supported functionality
>> to guest in some way, if nothing else then so that
>> guests can be moved between different hosts.
>>
>> Also, we need a way to report this kind of event to guest
>> so it's possible to figure out what went wrong.
> On the contrary, this is why it is of utmost importance that all
> XDP implementations support the full suite of XDP facilities from
> the very beginning.
>
> This is why we keep giving people a hard time when they add support
> only for some of the XDP return values and semantics.  Users will get
> killed by this, and it makes XDP a poor technology to use because
> behavior is not consistent across device types.
>
> That's not acceptable and I'll push back on anything that continues
> this trend.
>
> If you can't HW offload it, kick it to software.


We can try to work out a solution for XDP_REDIRECT.

Thanks


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 00/18] virtio_net XDP offload
  2019-11-27 23:40     ` Jakub Kicinski
@ 2019-12-02 15:29       ` Michael S. Tsirkin
  0 siblings, 0 replies; 43+ messages in thread
From: Michael S. Tsirkin @ 2019-12-02 15:29 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Prashant Bhole, David S . Miller, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm

On Wed, Nov 27, 2019 at 03:40:14PM -0800, Jakub Kicinski wrote:
> > For better or worse that's how userspace is written.
> 
> HW offload requires modifying the user space, too. The offload is not
> transparent. Do you know that?

It's true, offload of program itself isn't transparent. Adding a 3rd
interface (software/hardware/host) isn't welcome though, IMHO.

-- 
MST


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC net-next 07/18] tun: set offloaded xdp program
  2019-12-02  2:47     ` Jason Wang
@ 2019-12-09  0:24       ` Prashant Bhole
  0 siblings, 0 replies; 43+ messages in thread
From: Prashant Bhole @ 2019-12-09  0:24 UTC (permalink / raw)
  To: Jason Wang, David Ahern, David S . Miller, Michael S . Tsirkin,
	Alexei Starovoitov, Jakub Kicinski
  Cc: Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Martin KaFai Lau, Song Liu, Yonghong Song, Andrii Nakryiko,
	netdev, qemu-devel, kvm



On 12/2/19 11:47 AM, Jason Wang wrote:
> 
> On 2019/12/2 上午12:45, David Ahern wrote:
>> On 11/26/19 4:07 AM, Prashant Bhole wrote:
>>> From: Jason Wang <jasowang@redhat.com>
>>>
>>> This patch introduces an ioctl way to set an offloaded XDP program
>>> to tun driver. This ioctl will be used by qemu to offload XDP program
>>> from virtio_net in the guest.
>>>
>> Seems like you need to set / reset the SOCK_XDP flag on tfile->sk since
>> this is an XDP program.
>>
>> Also, why not add this program using netlink instead of ioctl? e.g., as
>> part of a generic XDP in the egress path like I am looking into for the
>> host side.
> 
> 
> Maybe both, otherwise, qemu may need netlink as a dependency.
> 
> Thanks
> 

Thank you all for reviewing. We will continue to improve this set.

If we split this work, Tx path XDP is one of the necessary part
which can be developed first. As suggested by David Ahern it will be
a netlink way but we will still need ioctl way for tap. I will try
to come up with Tx path XDP set next time.

Thanks.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, back to index

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 01/18] bpf: introduce bpf_prog_offload_verifier_setup() Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 02/18] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 03/18] net: core: export do_xdp_generic_core() Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 04/18] tuntap: check tun_msg_ctl type at necessary places Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 05/18] vhost_net: user tap recvmsg api to access ptr ring Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 06/18] tuntap: remove usage of ptr ring in vhost_net Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 07/18] tun: set offloaded xdp program Prashant Bhole
2019-12-01 16:35   ` David Ahern
2019-12-02  2:44     ` Jason Wang
2019-12-01 16:45   ` David Ahern
2019-12-02  2:47     ` Jason Wang
2019-12-09  0:24       ` Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 08/18] tun: run offloaded XDP program in Tx path Prashant Bhole
2019-12-01 16:39   ` David Ahern
2019-12-01 20:56     ` David Miller
2019-12-01 21:40       ` Michael S. Tsirkin
2019-12-01 21:54         ` David Miller
2019-12-02  2:56           ` Jason Wang
2019-12-02  2:45     ` Jason Wang
2019-11-26 10:07 ` [RFC net-next 09/18] tun: add a way to inject Tx path packet into Rx path Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 10/18] tun: handle XDP_TX action of offloaded program Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 11/18] tun: run xdp prog when tun is read from file interface Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 12/18] virtio-net: store xdp_prog in device Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 13/18] virtio_net: use XDP attachment helpers Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 14/18] virtio_net: add XDP prog offload infrastructure Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality Prashant Bhole
2019-11-27 20:42   ` Michael S. Tsirkin
2019-11-28  2:53     ` Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 16/18] bpf: export function __bpf_map_get Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 17/18] virtio_net: implment XDP map offload functionality Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 18/18] virtio_net: restrict bpf helper calls from offloaded program Prashant Bhole
2019-11-26 20:35 ` [RFC net-next 00/18] virtio_net XDP offload Jakub Kicinski
2019-11-27  2:59   ` Jason Wang
2019-11-27 19:49     ` Jakub Kicinski
2019-11-28  3:41       ` Jason Wang
2019-11-27 20:32   ` Michael S. Tsirkin
2019-11-27 23:40     ` Jakub Kicinski
2019-12-02 15:29       ` Michael S. Tsirkin
2019-11-28  3:32   ` Alexei Starovoitov
2019-11-28  4:18     ` Jason Wang
2019-12-01 16:54       ` David Ahern
2019-12-02  2:48         ` Jason Wang

KVM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvm/0 kvm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvm kvm/ https://lore.kernel.org/kvm \
		kvm@vger.kernel.org
	public-inbox-index kvm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.kvm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git