* [RFC PATCH 00/14] Introducing AF_PACKET V4 support
@ 2017-10-31 12:41 Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API Björn Töpel
                   ` (15 more replies)
  0 siblings, 16 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
optimized for high performance packet processing and zero-copy
semantics. Throughput improvements can be up to 40x compared to V2 and
V3 for the included micro benchmarks. It would be great to get your
feedback on it.

The main difference between V4 and V2/V3 is that TX and RX descriptors
are separated from packet buffers. An RX or TX descriptor points to a
data buffer in a packet buffer area. RX and TX can share the same
packet buffer so that a packet does not have to be copied between RX
and TX. Moreover, if a packet needs to be kept for a while due to a
possible retransmit, then the descriptor that points to that packet
buffer can be changed to point to another buffer and reused right
away. This again avoids copying data.
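
To make the descriptor/buffer split concrete, an RX or TX descriptor
(patch 1 in this series) refers to a frame in the registered packet
buffer area by index instead of embedding the data; the values below
are made up for illustration:

    struct tpacket4_desc d = {
            .idx    = 42,    /* frame number 42 in the packet buffer area */
            .len    = 1500,  /* bytes of packet data in that frame        */
            .offset = 256,   /* data starts at this offset in the frame   */
            .flags  = 0,     /* currently owned by user space             */
    };

With frames laid out back to back, the data lives at
umem_base + 42 * frame_size + 256, so an RX descriptor and a TX
descriptor can refer to the same frame without copying the data.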

The RX and TX descriptor rings are registered with the setsockopts
PACKET_RX_RING and PACKET_TX_RING, as usual. The packet buffer area is
allocated by user space and registered with the kernel using the new
PACKET_MEMREG setsockopt. All three of these areas are shared between
user space and kernel space.
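
As a rough sketch of the setup flow from user space (structure and
setsockopt names are from patch 1; buffer sizes are arbitrary, error
handling is omitted, and the mmap layout of the descriptor rings is not
shown in the quoted patches):

    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>        /* patched header with the V4 additions */
    #include <sys/mman.h>
    #include <sys/socket.h>

    static int v4_rx_setup(void)
    {
            int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
            int ver = TPACKET_V4;
            size_t len = 16 << 20;              /* 16 MiB packet buffer area */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            struct tpacket_memreg_req mr = {
                    .addr = (unsigned long)buf, /* page aligned */
                    .len = len,
                    .frame_size = 2048,         /* power of two, 2048..PAGE_SIZE */
                    .data_headroom = 0,
            };
            struct tpacket_req4 req = {
                    .mr_fd = fd,                /* umem registered on this socket */
                    .desc_nr = 1024,            /* RX descriptors, power of two */
            };

            setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
            setsockopt(fd, SOL_PACKET, PACKET_MEMREG, &mr, sizeof(mr));
            setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

            /* The descriptor ring itself is then shared via mmap() on the
             * socket and the socket is bound to an interface with bind(),
             * as with V2/V3; both steps are omitted here.
             */
            return fd;
    }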

By default, V4 executes in "copy-mode". Each packet is sent to the
Linux stack and a copy of it is sent to user space, so V4 behaves in
the same way as V2 and V3. All syscalls operating on file descriptors
should just work as if it were V2 or V3. However, when the new
PACKET_ZEROCOPY setsockopt is called,
V4 starts to operate in true zero-copy mode. In this mode, the
networking HW (or SW driver if it is a virtual driver like veth)
DMAs/puts packets straight into the packet buffer that is shared
between user space and kernel space. The RX and TX descriptor queues
of the networking HW are NOT shared with user space. Only the kernel
can read and write these, and it is the kernel driver's responsibility to
translate these HW specific descriptors to the HW agnostic ones in the
V4 virtual descriptor rings that user space sees. This way, a
malicious user space program cannot mess with the networking HW.

The PACKET_ZEROCOPY setsockopt acts on a queue pair (channel in
ethtool speak), so one needs to steer the traffic to the zero-copy
enabled queue pair. Which queue to use is up to the user.

For an untrusted application, HW packet steering to a specific queue
pair (the one associated with the application) is a requirement, as
the application would otherwise be able to see other user space
processes' packets. If the HW cannot support the required packet
steering, packets need to be DMA'd into kernel buffers not visible to
user space and from there copied out to user space. This RFC only
addresses NIC HW with packet steering capabilities.

PACKET_ZEROCOPY comes with "XDP batteries included", so XDP programs
will be executed for zero-copy enabled queues. We're also suggesting
adding a new XDP action, XDP_PASS_TO_KERNEL, to pass copies to the
kernel stack instead of the V4 user space queue in zero-copy mode.
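
A minimal sketch of what such a program could look like (this is not
code from the patch set; it assumes a uapi header patched with the
proposed XDP_PASS_TO_KERNEL action and the libbpf helper headers):

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int zc_steer(struct xdp_md *ctx)
    {
            void *data_end = (void *)(long)ctx->data_end;
            void *data = (void *)(long)ctx->data;
            struct ethhdr *eth = data;

            if ((void *)(eth + 1) > data_end)
                    return XDP_DROP;

            /* Copy ARP to the kernel stack (proposed new action)... */
            if (eth->h_proto == bpf_htons(ETH_P_ARP))
                    return XDP_PASS_TO_KERNEL;

            /* ...while everything else, in zero-copy mode, ends up in
             * the V4 user-space queue.
             */
            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";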

There's a tpbench benchmarking/test application included. Say that
you'd like your UDP traffic from port 4242 to end up in queue 16, the
queue we'll enable zero-copy on. Here, we use ethtool to steer it:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the benchmark in zero-copy mode can then be done using:

      tpbench -i p3p2 --rxdrop --zerocopy 17

Note that the --zerocopy command-line argument is one-based, and not
zero-based.

We've run some benchmarks on a dual-socket system with two Broadwell
E5 2660 CPUs @ 2.0 GHz and hyperthreading turned off. Each socket has
14 cores, which gives a total of 28, but only two cores are used in
these experiments: one for Tx/Rx and one for the user space
application. The memory is DDR4 @ 1067 MT/s; each DIMM is 8192 MB, and
with 8 of those DIMMs in the system we have 64 GB of total memory. The
compiler used is gcc version 5.4.0 20160609. The NIC is an Intel I40E
40 Gbit/s adapter using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for
64-byte packets, generated by commercial packet generator HW running at
full 40 Gbit/s line rate.

Benchmark   V2     V3     V4     V4+ZC
rxdrop      0.67   0.73   0.74   33.7
txpush      0.98   0.98   0.91   19.6
l2fwd       0.66   0.71   0.67   15.5

The results are generated using the "bench_all.sh" script.

We'll give a presentation on AF_PACKET V4 at NetDev 2.2 [1] in Seoul,
Korea, and our paper with complete benchmarks will be published shortly
on the NetDev 2.2 site.

We based this patch set on net-next commit e1ea2f9856b7 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").

Please focus your review on:

* The V4 user space interface
* PACKET_ZEROCOPY and its semantics
* Packet array interface
* XDP semantics when executing in zero-copy mode (user space passed
  buffers)
* XDP_PASS_TO_KERNEL semantics

To do:

* Investigate the user-space ring structure’s performance problems
* Continue the XDP integration into packet arrays
* Optimize performance
* SKB <-> V4 conversions in tp4a_populate & tp4a_flush
* Packet buffer is unnecessarily pinned for virtual devices
* Support shared packet buffers
* Unify V4 and SKB receive path in I40E driver
* Support for packets spanning multiple frames
* Disassociate the packet array implementation from the V4 queue
  structure

We would really like to thank the reviewers of the limited
distribution RFC for all their comments that have helped improve the
interfaces and the code significantly: Alexei Starovoitov, Alexander
Duyck, Jesper Dangaard Brouer, and John Fastabend. We would also like
to thank the internal team at Intel that has been helping out reviewing
code, writing tests, and sanity-checking our ideas: Rami Rosen, Jeff
Shaw, Ferruh Yigit, and Qi Zhang. Your participation has really helped.

Thanks: Björn and Magnus

[1] https://www.netdevconf.org/2.2/

Björn Töpel (7):
  packet: introduce AF_PACKET V4 userspace API
  packet: implement PACKET_MEMREG setsockopt
  packet: enable AF_PACKET V4 rings
  packet: wire up zerocopy for AF_PACKET V4
  i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
  i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
  samples/tpacket4: added tpbench

Magnus Karlsson (7):
  packet: enable Rx for AF_PACKET V4
  packet: enable Tx support for AF_PACKET V4
  netdevice: add AF_PACKET V4 zerocopy ops
  veth: added support for PACKET_ZEROCOPY
  samples/tpacket4: added veth support
  i40e: added XDP support for TP4 enabled queue pairs
  xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use

 drivers/net/ethernet/intel/i40e/i40e.h         |    3 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    9 +
 drivers/net/ethernet/intel/i40e/i40e_main.c    |  837 ++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  582 ++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   38 +
 drivers/net/veth.c                             |  174 +++
 include/linux/netdevice.h                      |   16 +
 include/linux/tpacket4.h                       | 1502 ++++++++++++++++++++++++
 include/uapi/linux/bpf.h                       |    1 +
 include/uapi/linux/if_packet.h                 |   65 +-
 net/packet/af_packet.c                         | 1252 +++++++++++++++++---
 net/packet/internal.h                          |    9 +
 samples/tpacket4/Makefile                      |   12 +
 samples/tpacket4/bench_all.sh                  |   28 +
 samples/tpacket4/tpbench.c                     | 1390 ++++++++++++++++++++++
 15 files changed, 5674 insertions(+), 244 deletions(-)
 create mode 100644 include/linux/tpacket4.h
 create mode 100644 samples/tpacket4/Makefile
 create mode 100755 samples/tpacket4/bench_all.sh
 create mode 100644 samples/tpacket4/tpbench.c

-- 
2.11.0


* [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-11-02  1:45   ` Willem de Bruijn
  2017-11-15 22:34   ` chet l
  2017-10-31 12:41 ` [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt Björn Töpel
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This patch adds the necessary AF_PACKET V4 structures for use from
userspace. AF_PACKET V4 is a new interface optimized for high
performance packet processing.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/uapi/linux/if_packet.h | 65 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 1 deletion(-)
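
The descriptor ownership convention implied by TP4_DESC_KERNEL can be
sketched from the user-space side roughly as below. This is our reading
of the new structures, not code from the series; memory barriers and
index bookkeeping are only hinted at, and frames are assumed to be laid
out back to back from the start of the registered packet buffer area.

    #include <linux/if_packet.h>    /* patched header with the V4 additions */

    /* Drain filled RX descriptors and hand the buffers back to the kernel. */
    static void rx_drain(struct tpacket4_queue *q, unsigned char *umem,
                         unsigned int frame_size)
    {
            unsigned int idx = q->last_used_idx;

            for (;;) {
                    struct tpacket4_desc *d = &q->ring[idx & q->ring_mask];

                    if (d->flags & TP4_DESC_KERNEL)
                            break;          /* still owned by the kernel */

                    /* a read barrier belongs here before touching the data */
                    if (!d->error) {
                            unsigned char *pkt = umem +
                                    (unsigned long)d->idx * frame_size + d->offset;
                            /* ... process d->len bytes at pkt ... */
                    }

                    /* a write barrier belongs here, then return the buffer */
                    d->flags = TP4_DESC_KERNEL;
                    idx++;
            }
            q->last_used_idx = idx;
    }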

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 4df96a7dd4fa..8eabcd1b370a 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -56,6 +56,8 @@ struct sockaddr_ll {
 #define PACKET_QDISC_BYPASS		20
 #define PACKET_ROLLOVER_STATS		21
 #define PACKET_FANOUT_DATA		22
+#define PACKET_MEMREG			23
+#define PACKET_ZEROCOPY			24
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -243,13 +245,35 @@ struct tpacket_block_desc {
 	union tpacket_bd_header_u hdr;
 };
 
+#define TP4_DESC_KERNEL	0x0080 /* The descriptor is owned by the kernel */
+#define TP4_PKT_CONT	1 /* The packet continues in the next descriptor */
+
+struct tpacket4_desc {
+	__u32 idx;
+	__u32 len;
+	__u16 offset;
+	__u8  error; /* an errno */
+	__u8  flags;
+	__u8  padding[4];
+};
+
+struct tpacket4_queue {
+	struct tpacket4_desc *ring;
+
+	unsigned int avail_idx;
+	unsigned int last_used_idx;
+	unsigned int num_free;
+	unsigned int ring_mask;
+};
+
 #define TPACKET2_HDRLEN		(TPACKET_ALIGN(sizeof(struct tpacket2_hdr)) + sizeof(struct sockaddr_ll))
 #define TPACKET3_HDRLEN		(TPACKET_ALIGN(sizeof(struct tpacket3_hdr)) + sizeof(struct sockaddr_ll))
 
 enum tpacket_versions {
 	TPACKET_V1,
 	TPACKET_V2,
-	TPACKET_V3
+	TPACKET_V3,
+	TPACKET_V4
 };
 
 /*
@@ -282,9 +306,26 @@ struct tpacket_req3 {
 	unsigned int	tp_feature_req_word;
 };
 
+/* V4 frame structure
+ *
+ * The v4 frame is contained within a frame defined by
+ * PACKET_MEMREG/struct tpacket_memreg_req. Each frame is frame_size
+ * bytes, and laid out as following:
+ *
+ * - Start.
+ * - Gap, at least data_headroom (from struct tpacket_memreg_req),
+ *   chosen so that packet data (Start+data) is at least 64B aligned.
+ */
+
+struct tpacket_req4 {
+	int		mr_fd;	 /* File descriptor for registered buffers */
+	unsigned int	desc_nr; /* Number of entries in descriptor ring */
+};
+
 union tpacket_req_u {
 	struct tpacket_req	req;
 	struct tpacket_req3	req3;
+	struct tpacket_req4	req4;
 };
 
 struct packet_mreq {
@@ -294,6 +335,28 @@ struct packet_mreq {
 	unsigned char	mr_address[8];
 };
 
+/*
+ * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
+ * to register user memory which should be used to store the packet
+ * data.
+ *
+ * There are some constraints for the memory being registered:
+ * - The memory area has to be memory page size aligned.
+ * - The frame size has to be a power of 2.
+ * - The frame size cannot be smaller than 2048B.
+ * - The frame size cannot be larger than the memory page size.
+ *
+ * Corollary: The number of frames that can be stored is
+ * len / frame_size.
+ *
+ */
+struct tpacket_memreg_req {
+	unsigned long	addr;		/* Start of packet data area */
+	unsigned long	len;		/* Length of packet data area */
+	unsigned int	frame_size;	/* Frame size */
+	unsigned int	data_headroom;	/* Frame head room */
+};
+
 #define PACKET_MR_MULTICAST	0
 #define PACKET_MR_PROMISC	1
 #define PACKET_MR_ALLMULTI	2
-- 
2.11.0


* [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-11-03  3:00   ` Willem de Bruijn
  2017-10-31 12:41 ` [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings Björn Töpel
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Here, the PACKET_MEMREG setsockopt is implemented for the AF_PACKET
protocol family. PACKET_MEMREG allows the user to register memory
regions that can be used by AF_PACKET V4 as packet data buffers.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/tpacket4.h | 101 +++++++++++++++++++++++++++++
 net/packet/af_packet.c   | 163 +++++++++++++++++++++++++++++++++++++++++++++++
 net/packet/internal.h    |   4 ++
 3 files changed, 268 insertions(+)
 create mode 100644 include/linux/tpacket4.h
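
As a worked example of what tp4q_umem_new() below computes, take a
hypothetical registration of a 16 MiB area with 2048-byte frames and a
requested data_headroom of 10 bytes (4 KiB pages assumed):

    nframes         = len / frame_size        = 16 MiB / 2048  = 8192
    frame_size_log2 = ilog2(2048)             = 11
    nfpplog2        = ilog2(PAGE_SIZE / 2048) = 1   (2 frames per page)
    data_headroom   = ALIGN(10, 64)           = 64

Each frame must also leave room for TP4_KERNEL_HEADROOM (256 bytes), so
a later patch computes the maximum packet data per frame in this
example as 2048 - 64 - 256 = 1728 bytes.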

diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
new file mode 100644
index 000000000000..fcf4c333c78d
--- /dev/null
+++ b/include/linux/tpacket4.h
@@ -0,0 +1,101 @@
+/*
+ *  tpacket v4
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_TPACKET4_H
+#define _LINUX_TPACKET4_H
+
+#define TP4_UMEM_MIN_FRAME_SIZE 2048
+#define TP4_KERNEL_HEADROOM 256 /* Headroom for XDP */
+
+struct tp4_umem {
+	struct pid *pid;
+	struct page **pgs;
+	unsigned int npgs;
+	size_t size;
+	unsigned long address;
+	unsigned int frame_size;
+	unsigned int frame_size_log2;
+	unsigned int nframes;
+	unsigned int nfpplog2; /* num frames per page in log2 */
+	unsigned int data_headroom;
+};
+
+/*************** V4 QUEUE OPERATIONS *******************************/
+
+/**
+ * tp4q_umem_new - Creates a new umem (packet buffer)
+ *
+ * @addr: The address to the umem
+ * @size: The size of the umem
+ * @frame_size: The size of each frame, between 2K and PAGE_SIZE
+ * @data_headroom: The desired data headroom before start of the packet
+ *
+ * Returns a pointer to the new umem or NULL for failure
+ **/
+static inline struct tp4_umem *tp4q_umem_new(unsigned long addr, size_t size,
+					     unsigned int frame_size,
+					     unsigned int data_headroom)
+{
+	struct tp4_umem *umem;
+	unsigned int nframes;
+
+	if (frame_size < TP4_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+		/* Strictly speaking we could support this, if:
+		 * - huge pages, or*
+		 * - using an IOMMU, or
+		 * - making sure the memory area is consecutive
+		 * but for now, we simply say "computer says no".
+		 */
+		return ERR_PTR(-EINVAL);
+	}
+
+	if (!is_power_of_2(frame_size))
+		return ERR_PTR(-EINVAL);
+
+	if (!PAGE_ALIGNED(addr)) {
+		/* Memory area has to be page size aligned. For
+		 * simplicity, this might change.
+		 */
+		return ERR_PTR(-EINVAL);
+	}
+
+	if ((addr + size) < addr)
+		return ERR_PTR(-EINVAL);
+
+	nframes = size / frame_size;
+	if (nframes == 0)
+		return ERR_PTR(-EINVAL);
+
+	data_headroom =	ALIGN(data_headroom, 64);
+
+	if (frame_size - data_headroom - TP4_KERNEL_HEADROOM < 0)
+		return ERR_PTR(-EINVAL);
+
+	umem = kzalloc(sizeof(*umem), GFP_KERNEL);
+	if (!umem)
+		return ERR_PTR(-ENOMEM);
+
+	umem->pid = get_task_pid(current, PIDTYPE_PID);
+	umem->size = size;
+	umem->address = addr;
+	umem->frame_size = frame_size;
+	umem->frame_size_log2 = ilog2(frame_size);
+	umem->nframes = nframes;
+	umem->nfpplog2 = ilog2(PAGE_SIZE / frame_size);
+	umem->data_headroom = data_headroom;
+
+	return umem;
+}
+
+#endif /* _LINUX_TPACKET4_H */
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 9603f6ff17a4..b39be424ec0e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -89,11 +89,15 @@
 #include <linux/errqueue.h>
 #include <linux/net_tstamp.h>
 #include <linux/percpu.h>
+#include <linux/log2.h>
 #ifdef CONFIG_INET
 #include <net/inet_common.h>
 #endif
 #include <linux/bpf.h>
 #include <net/compat.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/task.h>
+#include <linux/sched/signal.h>
 
 #include "internal.h"
 
@@ -2975,6 +2979,132 @@ static int packet_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 		return packet_snd(sock, msg, len);
 }
 
+static void
+packet_umem_unpin_pages(struct tp4_umem *umem)
+{
+	unsigned int i;
+
+	for (i = 0; i < umem->npgs; i++) {
+		struct page *page = umem->pgs[i];
+
+		set_page_dirty_lock(page);
+		put_page(page);
+	}
+	kfree(umem->pgs);
+	umem->pgs = NULL;
+}
+
+static void
+packet_umem_free(struct tp4_umem *umem)
+{
+	struct mm_struct *mm;
+	struct task_struct *task;
+	unsigned long diff;
+
+	packet_umem_unpin_pages(umem);
+
+	task = get_pid_task(umem->pid, PIDTYPE_PID);
+	put_pid(umem->pid);
+	if (!task)
+		goto out;
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	if (!mm)
+		goto out;
+
+	diff = umem->size >> PAGE_SHIFT;
+
+	down_write(&mm->mmap_sem);
+	mm->pinned_vm -= diff;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+out:
+	kfree(umem);
+}
+
+static struct tp4_umem *
+packet_umem_new(unsigned long addr, size_t size, unsigned int frame_size,
+		unsigned int data_headroom)
+{
+	unsigned long lock_limit, locked, npages;
+	unsigned int gup_flags = FOLL_WRITE;
+	int need_release = 0, j = 0, i, ret;
+	struct page **page_list;
+	struct tp4_umem *umem;
+
+	if (!can_do_mlock())
+		return ERR_PTR(-EPERM);
+
+	umem = tp4q_umem_new(addr, size, frame_size, data_headroom);
+	if (IS_ERR(umem))
+		return umem;
+
+	page_list = (struct page **)__get_free_page(GFP_KERNEL);
+	if (!page_list) {
+		put_pid(umem->pid);
+		kfree(umem);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	npages = PAGE_ALIGN(umem->nframes * umem->frame_size) >> PAGE_SHIFT;
+
+	down_write(&current->mm->mmap_sem);
+
+	locked = npages + current->mm->pinned_vm;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (npages == 0 || npages > UINT_MAX) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	umem->pgs = kcalloc(npages, sizeof(*umem->pgs), GFP_KERNEL);
+	if (!umem->pgs) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	need_release = 1;
+	while (npages) {
+		ret = get_user_pages(addr,
+				     min_t(unsigned long, npages,
+					   PAGE_SIZE / sizeof(struct page *)),
+				     gup_flags, page_list, NULL);
+
+		if (ret < 0)
+			goto out;
+
+		umem->npgs += ret;
+		addr += ret * PAGE_SIZE;
+		npages -= ret;
+
+		for (i = 0; i < ret; i++)
+			umem->pgs[j++] = page_list[i];
+	}
+
+	ret = 0;
+
+out:
+	if (ret < 0) {
+		if (need_release)
+			packet_umem_unpin_pages(umem);
+		put_pid(umem->pid);
+		kfree(umem);
+	} else {
+		current->mm->pinned_vm = locked;
+	}
+
+	up_write(&current->mm->mmap_sem);
+	free_page((unsigned long)page_list);
+
+	return ret < 0 ? ERR_PTR(ret) : umem;
+}
+
 /*
  *	Close a PACKET socket. This is fairly simple. We immediately go
  *	to 'closed' state and remove our protocol entry in the device list.
@@ -3024,6 +3154,11 @@ static int packet_release(struct socket *sock)
 		packet_set_ring(sk, &req_u, 1, 1);
 	}
 
+	if (po->umem) {
+		packet_umem_free(po->umem);
+		po->umem = NULL;
+	}
+
 	f = fanout_release(sk);
 
 	synchronize_net();
@@ -3828,6 +3963,31 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+	case PACKET_MEMREG:
+	{
+		struct tpacket_memreg_req req;
+		struct tp4_umem *umem;
+
+		if (optlen < sizeof(req))
+			return -EINVAL;
+		if (copy_from_user(&req, optval, sizeof(req)))
+			return -EFAULT;
+
+		umem = packet_umem_new(req.addr, req.len, req.frame_size,
+				       req.data_headroom);
+		if (IS_ERR(umem))
+			return PTR_ERR(umem);
+
+		lock_sock(sk);
+		if (po->umem) {
+			release_sock(sk);
+			packet_umem_free(umem);
+			return -EBUSY;
+		}
+		po->umem = umem;
+		release_sock(sk);
+		return 0;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -4245,6 +4405,9 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		case TPACKET_V3:
 			po->tp_hdrlen = TPACKET3_HDRLEN;
 			break;
+		default:
+			err = -EINVAL;
+			goto out;
 		}
 
 		err = -EINVAL;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 94d1d405a116..9c07cfe1b8a3 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -2,6 +2,7 @@
 #define __PACKET_INTERNAL_H__
 
 #include <linux/refcount.h>
+#include <linux/tpacket4.h>
 
 struct packet_mclist {
 	struct packet_mclist	*next;
@@ -109,6 +110,9 @@ struct packet_sock {
 	union  tpacket_stats_u	stats;
 	struct packet_ring_buffer	rx_ring;
 	struct packet_ring_buffer	tx_ring;
+
+	struct tp4_umem			*umem;
+
 	int			copy_thresh;
 	spinlock_t		bind_lock;
 	struct mutex		pg_vec_lock;
-- 
2.11.0


* [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-11-03  4:16   ` Willem de Bruijn
  2017-10-31 12:41 ` [RFC PATCH 04/14] packet: enable Rx for AF_PACKET V4 Björn Töpel
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Allow creation of AF_PACKET V4 rings. Tx and Rx are still disabled.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/tpacket4.h | 391 +++++++++++++++++++++++++++++++++++++++++++++++
 net/packet/af_packet.c   | 262 +++++++++++++++++++++++++++++--
 net/packet/internal.h    |   4 +
 3 files changed, 641 insertions(+), 16 deletions(-)
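
A condensed view of how the packet array helpers added below fit
together (illustrative only; the first statement is taken from the
af_packet.c change in this patch, the rest mirrors how later patches in
the series use the helpers):

    /* Bind a packet array to the user-visible ring; a NULL device
     * means copy mode, so no DMA mappings are set up.
     */
    rb->tp4a = tx_ring ? tp4a_tx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL)
                       : tp4a_rx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL);

    /* Frames staged in the array ([start, curr)) are later published
     * onto the user-visible tp4q in one go:
     */
    tp4a_flush(rb->tp4a);

    /* On teardown, outstanding frames are flushed and DMA mappings,
     * if any, are released:
     */
    tp4a_free(rb->tp4a);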

diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index fcf4c333c78d..44ba38034133 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -18,6 +18,12 @@
 #define TP4_UMEM_MIN_FRAME_SIZE 2048
 #define TP4_KERNEL_HEADROOM 256 /* Headroom for XDP */
 
+enum tp4_validation {
+	TP4_VALIDATION_NONE,	/* No validation is performed */
+	TP4_VALIDATION_IDX,	/* Only address to packet buffer is validated */
+	TP4_VALIDATION_DESC	/* Full descriptor is validated */
+};
+
 struct tp4_umem {
 	struct pid *pid;
 	struct page **pgs;
@@ -31,9 +37,95 @@ struct tp4_umem {
 	unsigned int data_headroom;
 };
 
+struct tp4_dma_info {
+	dma_addr_t dma;
+	struct page *page;
+};
+
+struct tp4_queue {
+	struct tpacket4_desc *ring;
+
+	unsigned int used_idx;
+	unsigned int last_avail_idx;
+	unsigned int ring_mask;
+	unsigned int num_free;
+
+	struct tp4_umem *umem;
+	struct tp4_dma_info *dma_info;
+	enum dma_data_direction direction;
+};
+
+/**
+ * struct tp4_packet_array - An array of packets/frames
+ *
+ * @tp4q: the tp4q associated with this packet array. Flushes and
+ *	  populates will operate on this.
+ * @dev: pointer to the netdevice the queue should be associated with
+ * @direction: the direction of the DMA channel that is set up.
+ * @validation: type of validation performed on populate
+ * @start: the first packet that has not been processed
+ * @curr: the packet that is currently being processed
+ * @end: the last packet in the array
+ * @mask: convenience variable for internal operations on the array
+ * @items: the actual descriptors to frames/packets that are in the array
+ **/
+struct tp4_packet_array {
+	struct tp4_queue *tp4q;
+	struct device *dev;
+	enum dma_data_direction direction;
+	enum tp4_validation validation;
+	u32 start;
+	u32 curr;
+	u32 end;
+	u32 mask;
+	struct tpacket4_desc items[0];
+};
+
+/**
+ * struct tp4_frame_set - A view of a packet array consisting of
+ *                        one or more frames
+ *
+ * @pkt_arr: the packet array this frame set is located in
+ * @start: the first frame that has not been processed
+ * @curr: the frame that is currently being processed
+ * @end: the last frame in the frame set
+ *
+ * This frame set can either be one or more frames or a single packet
+ * consisting of one or more frames. tp4f_ functions with packet in the
+ * name return a frame set representing a packet, while the other
+ * tp4f_ functions return one or more frames not taking into account if
+ * they constitute a packet or not.
+ **/
+struct tp4_frame_set {
+	struct tp4_packet_array *pkt_arr;
+	u32 start;
+	u32 curr;
+	u32 end;
+};
+
 /*************** V4 QUEUE OPERATIONS *******************************/
 
 /**
+ * tp4q_init - Initializes a tp4 queue
+ *
+ * @q: Pointer to the tp4 queue structure to be initialized
+ * @nentries: Number of descriptor entries in the queue
+ * @umem: Pointer to the umem / packet buffer associated with this queue
+ * @buffer: Pointer to the memory region where the descriptors will reside
+ **/
+static inline void tp4q_init(struct tp4_queue *q, unsigned int nentries,
+			     struct tp4_umem *umem,
+			     struct tpacket4_desc *buffer)
+{
+	q->ring = buffer;
+	q->used_idx = 0;
+	q->last_avail_idx = 0;
+	q->ring_mask = nentries - 1;
+	q->num_free = 0;
+	q->umem = umem;
+}
+
+/**
  * tp4q_umem_new - Creates a new umem (packet buffer)
  *
  * @addr: The address to the umem
@@ -98,4 +190,303 @@ static inline struct tp4_umem *tp4q_umem_new(unsigned long addr, size_t size,
 	return umem;
 }
 
+/**
+ * tp4q_enqueue_from_array - Enqueue entries from packet array to tp4 queue
+ *
+ * @a: Pointer to the packet array to enqueue from
+ * @dcnt: Max number of entries to enqueue
+ *
+ * Returns 0 for success or an errno at failure
+ **/
+static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
+					  u32 dcnt)
+{
+	struct tp4_queue *q = a->tp4q;
+	unsigned int used_idx = q->used_idx;
+	struct tpacket4_desc *d = a->items;
+	int i;
+
+	if (q->num_free < dcnt)
+		return -ENOSPC;
+
+	q->num_free -= dcnt;
+
+	for (i = 0; i < dcnt; i++) {
+		unsigned int idx = (used_idx++) & q->ring_mask;
+		unsigned int didx = (a->start + i) & a->mask;
+
+		q->ring[idx].idx = d[didx].idx;
+		q->ring[idx].len = d[didx].len;
+		q->ring[idx].offset = d[didx].offset;
+		q->ring[idx].error = d[didx].error;
+	}
+
+	/* Order flags and data */
+	smp_wmb();
+
+	for (i = dcnt - 1; i >= 0; i--) {
+		unsigned int idx = (q->used_idx + i) & q->ring_mask;
+		unsigned int didx = (a->start + i) & a->mask;
+
+		q->ring[idx].flags = d[didx].flags & ~TP4_DESC_KERNEL;
+	}
+	q->used_idx += dcnt;
+
+	return 0;
+}
+
+/**
+ * tp4q_disable - Disable a tp4 queue
+ *
+ * @dev: Pointer to the netdevice the queue is connected to
+ * @q: Pointer to the tp4 queue to disable
+ **/
+static inline void tp4q_disable(struct device *dev,
+				struct tp4_queue *q)
+{
+	int i;
+
+	if (q->dma_info) {
+		/* Unmap DMA */
+		for (i = 0; i < q->umem->npgs; i++)
+			dma_unmap_page(dev, q->dma_info[i].dma, PAGE_SIZE,
+				       q->direction);
+
+		kfree(q->dma_info);
+		q->dma_info = NULL;
+	}
+}
+
+/**
+ * tp4q_enable - Enable a tp4 queue
+ *
+ * @dev: Pointer to the netdevice the queue should be associated with
+ * @q: Pointer to the tp4 queue to enable
+ * @direction: The direction of the DMA channel that is set up.
+ *
+ * Returns 0 for success or a negative errno for failure
+ **/
+static inline int tp4q_enable(struct device *dev,
+			      struct tp4_queue *q,
+			      enum dma_data_direction direction)
+{
+	int i, j;
+
+	/* DMA map all the buffers in bufs up front, and sync prior
+	 * kicking userspace. Is this sane? Strictly user land owns
+	 * the buffer until they show up on the avail queue. However,
+	 * mapping should be ok.
+	 */
+	if (direction != DMA_NONE) {
+		q->dma_info = kcalloc(q->umem->npgs, sizeof(*q->dma_info),
+				      GFP_KERNEL);
+		if (!q->dma_info)
+			return -ENOMEM;
+
+		for (i = 0; i < q->umem->npgs; i++) {
+			dma_addr_t dma;
+
+			dma = dma_map_page(dev, q->umem->pgs[i], 0,
+					   PAGE_SIZE, direction);
+			if (dma_mapping_error(dev, dma)) {
+				for (j = 0; j < i; j++)
+					dma_unmap_page(dev,
+						       q->dma_info[j].dma,
+						       PAGE_SIZE, direction);
+				kfree(q->dma_info);
+				q->dma_info = NULL;
+				return -EBUSY;
+			}
+
+			q->dma_info[i].page = q->umem->pgs[i];
+			q->dma_info[i].dma = dma;
+		}
+	} else {
+		q->dma_info = NULL;
+	}
+
+	q->direction = direction;
+	return 0;
+}
+
+/*************** FRAME OPERATIONS *******************************/
+/* A frame is always just one frame of size frame_size.
+ * A frame set is one or more frames.
+ **/
+
+/**
+ * tp4f_next_frame - Go to next frame in frame set
+ * @p: pointer to frame set
+ *
+ * Returns true if there is another frame in the frame set.
+ * Advances curr pointer.
+ **/
+static inline bool tp4f_next_frame(struct tp4_frame_set *p)
+{
+	if (p->curr + 1 == p->end)
+		return false;
+
+	p->curr++;
+	return true;
+}
+
+/**
+ * tp4f_set_frame - Sets the properties of a frame
+ * @p: pointer to frame
+ * @len: the length in bytes of the data in the frame
+ * @offset: offset to start of data in frame
+ * @is_eop: Set if this is the last frame of the packet
+ **/
+static inline void tp4f_set_frame(struct tp4_frame_set *p, u32 len, u16 offset,
+				  bool is_eop)
+{
+	struct tpacket4_desc *d =
+		&p->pkt_arr->items[p->curr & p->pkt_arr->mask];
+
+	d->len = len;
+	d->offset = offset;
+	if (!is_eop)
+		d->flags |= TP4_PKT_CONT;
+}
+
+/**************** PACKET_ARRAY FUNCTIONS ********************************/
+
+static inline struct tp4_packet_array *__tp4a_new(
+	struct tp4_queue *tp4q,
+	struct device *dev,
+	enum dma_data_direction direction,
+	enum tp4_validation validation,
+	size_t elems)
+{
+	struct tp4_packet_array *arr;
+	int err;
+
+	if (!is_power_of_2(elems))
+		return NULL;
+
+	arr = kzalloc(sizeof(*arr) + elems * sizeof(struct tpacket4_desc),
+		      GFP_KERNEL);
+	if (!arr)
+		return NULL;
+
+	err = tp4q_enable(dev, tp4q, direction);
+	if (err) {
+		kfree(arr);
+		return NULL;
+	}
+
+	arr->tp4q = tp4q;
+	arr->dev = dev;
+	arr->direction = direction;
+	arr->validation = validation;
+	arr->mask = elems - 1;
+	return arr;
+}
+
+/**
+ * tp4a_rx_new - Create new packet array for ingress
+ * @rx_opaque: opaque from tp4_netdev_params
+ * @elems: number of elements in the packet array
+ * @dev: device or NULL
+ *
+ * Returns a reference to the new packet array or NULL for failure
+ **/
+static inline struct tp4_packet_array *tp4a_rx_new(void *rx_opaque,
+						   size_t elems,
+						   struct device *dev)
+{
+	enum dma_data_direction direction = dev ? DMA_FROM_DEVICE : DMA_NONE;
+
+	return __tp4a_new(rx_opaque, dev, direction, TP4_VALIDATION_IDX,
+			  elems);
+}
+
+/**
+ * tp4a_tx_new - Create new packet array for egress
+ * @tx_opaque: opaque from tp4_netdev_params
+ * @elems: number of elements in the packet array
+ * @dev: device or NULL
+ *
+ * Returns a reference to the new packet array or NULL for failure
+ **/
+static inline struct tp4_packet_array *tp4a_tx_new(void *tx_opaque,
+						   size_t elems,
+						   struct device *dev)
+{
+	enum dma_data_direction direction = dev ? DMA_TO_DEVICE : DMA_NONE;
+
+	return __tp4a_new(tx_opaque, dev, direction, TP4_VALIDATION_DESC,
+			  elems);
+}
+
+/**
+ * tp4a_get_flushable_frame_set - Create a frame set of the flushable region
+ * @a: pointer to packet array
+ * @p: frame set
+ *
+ * Returns true for success and false for failure
+ **/
+static inline bool tp4a_get_flushable_frame_set(struct tp4_packet_array *a,
+						struct tp4_frame_set *p)
+{
+	u32 avail = a->curr - a->start;
+
+	if (avail == 0)
+		return false; /* empty */
+
+	p->pkt_arr = a;
+	p->start = a->start;
+	p->curr = a->start;
+	p->end = a->curr;
+
+	return true;
+}
+
+/**
+ * tp4a_flush - Flush processed packets to associated tp4q
+ * @a: pointer to packet array
+ *
+ * Returns 0 for success and -1 for failure
+ **/
+static inline int tp4a_flush(struct tp4_packet_array *a)
+{
+	u32 avail = a->curr - a->start;
+	int ret;
+
+	if (avail == 0)
+		return 0; /* nothing to flush */
+
+	ret = tp4q_enqueue_from_array(a, avail);
+	if (ret < 0)
+		return -1;
+
+	a->start = a->curr;
+
+	return 0;
+}
+
+/**
+ * tp4a_free - Destroy packet array
+ * @a: pointer to packet array
+ **/
+static inline void tp4a_free(struct tp4_packet_array *a)
+{
+	struct tp4_frame_set f;
+
+	if (a) {
+		/* Flush all outstanding requests. */
+		if (tp4a_get_flushable_frame_set(a, &f)) {
+			do {
+				tp4f_set_frame(&f, 0, 0, true);
+			} while (tp4f_next_frame(&f));
+		}
+
+		WARN_ON_ONCE(tp4a_flush(a));
+
+		tp4q_disable(a->dev, a->tp4q);
+	}
+
+	kfree(a);
+}
+
 #endif /* _LINUX_TPACKET4_H */
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index b39be424ec0e..190598eb3461 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -189,6 +189,9 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 #define BLOCK_O2PRIV(x)	((x)->offset_to_priv)
 #define BLOCK_PRIV(x)		((void *)((char *)(x) + BLOCK_O2PRIV(x)))
 
+#define RX_RING 0
+#define TX_RING 1
+
 struct packet_sock;
 static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		       struct packet_type *pt, struct net_device *orig_dev);
@@ -244,6 +247,9 @@ struct packet_skb_cb {
 
 static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
 static void __fanout_link(struct sock *sk, struct packet_sock *po);
+static void packet_v4_ring_free(struct sock *sk, int tx_ring);
+static int packet_v4_ring_new(struct sock *sk, struct tpacket_req4 *req,
+			      int tx_ring);
 
 static int packet_direct_xmit(struct sk_buff *skb)
 {
@@ -2206,6 +2212,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	sk = pt->af_packet_priv;
 	po = pkt_sk(sk);
 
+	if (po->tp_version == TPACKET_V4)
+		goto drop;
+
 	if (!net_eq(dev_net(dev), sock_net(sk)))
 		goto drop;
 
@@ -2973,10 +2982,14 @@ static int packet_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	struct sock *sk = sock->sk;
 	struct packet_sock *po = pkt_sk(sk);
 
-	if (po->tx_ring.pg_vec)
+	if (po->tx_ring.pg_vec) {
+		if (po->tp_version == TPACKET_V4)
+			return -EINVAL;
+
 		return tpacket_snd(po, msg);
-	else
-		return packet_snd(sock, msg, len);
+	}
+
+	return packet_snd(sock, msg, len);
 }
 
 static void
@@ -3105,6 +3118,25 @@ packet_umem_new(unsigned long addr, size_t size, unsigned int frame_size,
 	return ret < 0 ? ERR_PTR(ret) : umem;
 }
 
+static void packet_clear_ring(struct sock *sk, int tx_ring)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct packet_ring_buffer *rb;
+	union tpacket_req_u req_u;
+
+	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
+	if (!rb->pg_vec)
+		return;
+
+	if (po->tp_version == TPACKET_V4) {
+		packet_v4_ring_free(sk, tx_ring);
+		return;
+	}
+
+	memset(&req_u, 0, sizeof(req_u));
+	packet_set_ring(sk, &req_u, 1, tx_ring);
+}
+
 /*
  *	Close a PACKET socket. This is fairly simple. We immediately go
  *	to 'closed' state and remove our protocol entry in the device list.
@@ -3116,7 +3148,6 @@ static int packet_release(struct socket *sock)
 	struct packet_sock *po;
 	struct packet_fanout *f;
 	struct net *net;
-	union tpacket_req_u req_u;
 
 	if (!sk)
 		return 0;
@@ -3144,15 +3175,8 @@ static int packet_release(struct socket *sock)
 
 	packet_flush_mclist(sk);
 
-	if (po->rx_ring.pg_vec) {
-		memset(&req_u, 0, sizeof(req_u));
-		packet_set_ring(sk, &req_u, 1, 0);
-	}
-
-	if (po->tx_ring.pg_vec) {
-		memset(&req_u, 0, sizeof(req_u));
-		packet_set_ring(sk, &req_u, 1, 1);
-	}
+	packet_clear_ring(sk, TX_RING);
+	packet_clear_ring(sk, RX_RING);
 
 	if (po->umem) {
 		packet_umem_free(po->umem);
@@ -3786,16 +3810,24 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			len = sizeof(req_u.req);
 			break;
 		case TPACKET_V3:
-		default:
 			len = sizeof(req_u.req3);
 			break;
+		case TPACKET_V4:
+		default:
+			len = sizeof(req_u.req4);
+			break;
 		}
 		if (optlen < len)
 			return -EINVAL;
 		if (copy_from_user(&req_u.req, optval, len))
 			return -EFAULT;
-		return packet_set_ring(sk, &req_u, 0,
-			optname == PACKET_TX_RING);
+
+		if (po->tp_version == TPACKET_V4)
+			return packet_v4_ring_new(sk, &req_u.req4,
+						  optname == PACKET_TX_RING);
+		else
+			return packet_set_ring(sk, &req_u, 0,
+					       optname == PACKET_TX_RING);
 	}
 	case PACKET_COPY_THRESH:
 	{
@@ -3821,6 +3853,7 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		case TPACKET_V1:
 		case TPACKET_V2:
 		case TPACKET_V3:
+		case TPACKET_V4:
 			break;
 		default:
 			return -EINVAL;
@@ -4061,6 +4094,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 		case TPACKET_V3:
 			val = sizeof(struct tpacket3_hdr);
 			break;
+		case TPACKET_V4:
+			val = 0;
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -4247,6 +4283,9 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
 	struct packet_sock *po = pkt_sk(sk);
 	unsigned int mask = datagram_poll(file, sock, wait);
 
+	if (po->tp_version == TPACKET_V4)
+		return mask;
+
 	spin_lock_bh(&sk->sk_receive_queue.lock);
 	if (po->rx_ring.pg_vec) {
 		if (!packet_previous_rx_frame(po, &po->rx_ring,
@@ -4363,6 +4402,197 @@ static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
 	goto out;
 }
 
+static struct socket *
+packet_v4_umem_sock_get(int fd)
+{
+	struct {
+		struct sockaddr_ll sa;
+		char  buf[MAX_ADDR_LEN];
+	} uaddr;
+	int uaddr_len = sizeof(uaddr), r;
+	struct socket *sock = sockfd_lookup(fd, &r);
+
+	if (!sock)
+		return ERR_PTR(-ENOTSOCK);
+
+	/* Parameter checking */
+	if (sock->sk->sk_type != SOCK_RAW) {
+		r = -ESOCKTNOSUPPORT;
+		goto err;
+	}
+
+	r = sock->ops->getname(sock, (struct sockaddr *)&uaddr.sa,
+			       &uaddr_len, 0);
+	if (r)
+		goto err;
+
+	if (uaddr.sa.sll_family != AF_PACKET) {
+		r = -EPFNOSUPPORT;
+		goto err;
+	}
+
+	if (!pkt_sk(sock->sk)->umem) {
+		r = -ESOCKTNOSUPPORT;
+		goto err;
+	}
+
+	return sock;
+err:
+	sockfd_put(sock);
+	return ERR_PTR(r);
+}
+
+#define TP4_ARRAY_SIZE 32
+
+static int
+packet_v4_ring_new(struct sock *sk, struct tpacket_req4 *req, int tx_ring)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct packet_ring_buffer *rb;
+	struct sk_buff_head *rb_queue;
+	int was_running, order = 0;
+	struct socket *mrsock;
+	struct tpacket_req r;
+	struct pgv *pg_vec;
+	size_t rb_size;
+	__be16 num;
+	int err;
+
+	if (req->desc_nr == 0)
+		return -EINVAL;
+
+	lock_sock(sk);
+
+	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
+	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
+
+	err = -EBUSY;
+	if (atomic_read(&po->mapped))
+		goto out;
+	if (packet_read_pending(rb))
+		goto out;
+	if (unlikely(rb->pg_vec))
+		goto out;
+
+	err = -EINVAL;
+	if (po->tp_version != TPACKET_V4)
+		goto out;
+
+	po->tp_hdrlen = 0;
+
+	rb_size = req->desc_nr * sizeof(struct tpacket4_desc);
+	if (unlikely(!rb_size))
+		goto out;
+
+	err = -ENOMEM;
+	order = get_order(rb_size);
+
+	r.tp_block_nr = 1;
+	pg_vec = alloc_pg_vec(&r, order);
+	if (unlikely(!pg_vec))
+		goto out;
+
+	mrsock = packet_v4_umem_sock_get(req->mr_fd);
+	if (IS_ERR(mrsock)) {
+		err = PTR_ERR(mrsock);
+		free_pg_vec(pg_vec, order, 1);
+		goto out;
+	}
+
+	/* Check if umem is from this socket, if so don't make
+	 * circular references.
+	 */
+	if (sk->sk_socket == mrsock)
+		sockfd_put(mrsock);
+
+	spin_lock(&po->bind_lock);
+	was_running = po->running;
+	num = po->num;
+	if (was_running) {
+		po->num = 0;
+		__unregister_prot_hook(sk, false);
+	}
+	spin_unlock(&po->bind_lock);
+
+	synchronize_net();
+
+	mutex_lock(&po->pg_vec_lock);
+	spin_lock_bh(&rb_queue->lock);
+
+	rb->pg_vec = pg_vec;
+	rb->head = 0;
+	rb->frame_max = req->desc_nr - 1;
+	rb->mrsock = mrsock;
+	tp4q_init(&rb->tp4q, req->desc_nr, pkt_sk(mrsock->sk)->umem,
+		  (struct tpacket4_desc *)rb->pg_vec->buffer);
+	spin_unlock_bh(&rb_queue->lock);
+
+	rb->tp4a = tx_ring ? tp4a_tx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL)
+		   : tp4a_rx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL);
+
+	if (!rb->tp4a) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	rb->pg_vec_order = order;
+	rb->pg_vec_len = 1;
+	rb->pg_vec_pages = PAGE_ALIGN(rb_size) / PAGE_SIZE;
+
+	po->prot_hook.func = po->rx_ring.pg_vec ? tpacket_rcv : packet_rcv;
+	skb_queue_purge(rb_queue);
+
+	mutex_unlock(&po->pg_vec_lock);
+
+	spin_lock(&po->bind_lock);
+	if (was_running && po->prot_hook.dev) {
+		/* V4 requires a bound socket, so only rebind if
+		 * ifindex > 0 / !dev
+		 */
+		po->num = num;
+		register_prot_hook(sk);
+	}
+	spin_unlock(&po->bind_lock);
+
+	err = 0;
+out:
+	release_sock(sk);
+	return err;
+}
+
+static void
+packet_v4_ring_free(struct sock *sk, int tx_ring)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct packet_ring_buffer *rb;
+	struct sk_buff_head *rb_queue;
+
+	lock_sock(sk);
+
+	rb = tx_ring ? &po->tx_ring : &po->rx_ring;
+	rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue;
+
+	spin_lock(&po->bind_lock);
+	unregister_prot_hook(sk, true);
+	spin_unlock(&po->bind_lock);
+
+	mutex_lock(&po->pg_vec_lock);
+	spin_lock_bh(&rb_queue->lock);
+
+	if (rb->pg_vec) {
+		free_pg_vec(rb->pg_vec, rb->pg_vec_order, rb->pg_vec_len);
+		rb->pg_vec = NULL;
+	}
+	if (rb->mrsock && sk->sk_socket != rb->mrsock)
+		sockfd_put(rb->mrsock);
+	tp4a_free(rb->tp4a);
+
+	spin_unlock_bh(&rb_queue->lock);
+	skb_queue_purge(rb_queue);
+	mutex_unlock(&po->pg_vec_lock);
+	release_sock(sk);
+}
+
 static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		int closing, int tx_ring)
 {
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 9c07cfe1b8a3..3eedab29e4d7 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -71,6 +71,10 @@ struct packet_ring_buffer {
 	unsigned int __percpu	*pending_refcnt;
 
 	struct tpacket_kbdq_core	prb_bdqc;
+
+	struct tp4_packet_array	*tp4a;
+	struct tp4_queue	tp4q;
+	struct socket		*mrsock;
 };
 
 extern struct mutex fanout_mutex;
-- 
2.11.0


* [RFC PATCH 04/14] packet: enable Rx for AF_PACKET V4
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (2 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 05/14] packet: enable Tx support " Björn Töpel
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

In this commit, ingress support is implemented.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/tpacket4.h | 361 +++++++++++++++++++++++++++++++++++++++++++++++
 net/packet/af_packet.c   |  83 +++++++----
 2 files changed, 419 insertions(+), 25 deletions(-)
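
With this patch, poll() reports POLLIN when the V4 RX ring has
something to read, so a user-space receive loop can be sketched as
follows (fd, rx_queue, umem and frame_size as set up earlier;
rx_drain() is the hypothetical helper sketched under patch 1):

    #include <poll.h>

    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (;;) {
            if (poll(&pfd, 1, -1) <= 0)
                    continue;
            rx_drain(&rx_queue, umem, frame_size);
    }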

diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index 44ba38034133..1d4c13d472e5 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -191,6 +191,172 @@ static inline struct tp4_umem *tp4q_umem_new(unsigned long addr, size_t size,
 }
 
 /**
+ * tp4q_set_error - Sets an errno on the descriptor
+ *
+ * @desc: Pointer to the descriptor to be manipulated
+ * @errno: The errno number to write to the descriptor
+ **/
+static inline void tp4q_set_error(struct tpacket4_desc *desc,
+				  int errno)
+{
+	desc->error = errno;
+}
+
+/**
+ * tp4q_set_offset - Sets the data offset for the descriptor
+ *
+ * @desc: Pointer to the descriptor to be manipulated
+ * @offset: The data offset to write to the descriptor
+ **/
+static inline void tp4q_set_offset(struct tpacket4_desc *desc,
+				   u16 offset)
+{
+	desc->offset = offset;
+}
+
+/**
+ * tp4q_is_free - Is there a free entry on the queue?
+ *
+ * @q: Pointer to the tp4 queue to examine
+ *
+ * Returns true if there is a free entry, otherwise false
+ **/
+static inline int tp4q_is_free(struct tp4_queue *q)
+{
+	unsigned int idx = q->used_idx & q->ring_mask;
+	unsigned int prev_idx;
+
+	if (!idx)
+		prev_idx = q->ring_mask;
+	else
+		prev_idx = idx - 1;
+
+	/* previous frame is already consumed by userspace
+	 * meaning ring is free
+	 */
+	if (q->ring[prev_idx].flags & TP4_DESC_KERNEL)
+		return 1;
+
+	/* there is some data that userspace can read immediately */
+	return 0;
+}
+
+/**
+ * tp4q_get_data_headroom - How much data headroom does the queue have
+ *
+ * @q: Pointer to the tp4 queue to examine
+ *
+ * Returns the amount of data headroom that has been configured for the
+ * queue
+ **/
+static inline unsigned int tp4q_get_data_headroom(struct tp4_queue *q)
+{
+	return q->umem->data_headroom + TP4_KERNEL_HEADROOM;
+}
+
+/**
+ * tp4q_is_valid_entry - Is the entry valid?
+ *
+ * @q: Pointer to the tp4 queue the descriptor resides in
+ * @desc: Pointer to the descriptor to examine
+ * @validation: The type of validation to perform
+ *
+ * Returns true if the entry is valid, otherwise false
+ **/
+static inline bool tp4q_is_valid_entry(struct tp4_queue *q,
+				       struct tpacket4_desc *d,
+				       enum tp4_validation validation)
+{
+	if (validation == TP4_VALIDATION_NONE)
+		return true;
+
+	if (unlikely(d->idx >= q->umem->nframes)) {
+		tp4q_set_error(d, EBADF);
+		return false;
+	}
+	if (validation == TP4_VALIDATION_IDX) {
+		tp4q_set_offset(d, tp4q_get_data_headroom(q));
+		return true;
+	}
+
+	/* TP4_VALIDATION_DESC */
+	if (unlikely(d->len > q->umem->frame_size ||
+		     d->len == 0 ||
+		     d->offset > q->umem->frame_size ||
+		     d->offset + d->len > q->umem->frame_size)) {
+		tp4q_set_error(d, EBADF);
+		return false;
+	}
+
+	return true;
+}
+
+/**
+ * tp4q_nb_avail - Returns the number of available entries
+ *
+ * @q: Pointer to the tp4 queue to examine
+ * @dcnt: Max number of entries to check
+ *
+ * Returns the number of entries available in the queue up to dcnt
+ **/
+static inline int tp4q_nb_avail(struct tp4_queue *q, int dcnt)
+{
+	unsigned int idx, last_avail_idx = q->last_avail_idx;
+	int i, entries = 0;
+
+	for (i = 0; i < dcnt; i++) {
+		idx = (last_avail_idx++) & q->ring_mask;
+		if (!(q->ring[idx].flags & TP4_DESC_KERNEL))
+			break;
+		entries++;
+	}
+
+	return entries;
+}
+
+/**
+ * tp4q_enqueue - Enqueue entries to a tp4 queue
+ *
+ * @q: Pointer to the tp4 queue the descriptor resides in
+ * @d: Pointer to the descriptor to examine
+ * @dcnt: Max number of entries to dequeue
+ *
+ * Returns 0 for success or an errno at failure
+ **/
+static inline int tp4q_enqueue(struct tp4_queue *q,
+			       const struct tpacket4_desc *d, int dcnt)
+{
+	unsigned int used_idx = q->used_idx;
+	int i;
+
+	if (q->num_free < dcnt)
+		return -ENOSPC;
+
+	q->num_free -= dcnt;
+
+	for (i = 0; i < dcnt; i++) {
+		unsigned int idx = (used_idx++) & q->ring_mask;
+
+		q->ring[idx].idx = d[i].idx;
+		q->ring[idx].len = d[i].len;
+		q->ring[idx].offset = d[i].offset;
+		q->ring[idx].error = d[i].error;
+	}
+
+	/* Order flags and data */
+	smp_wmb();
+
+	for (i = dcnt - 1; i >= 0; i--) {
+		unsigned int idx = (q->used_idx + i) & q->ring_mask;
+
+		q->ring[idx].flags = d[i].flags & ~TP4_DESC_KERNEL;
+	}
+	q->used_idx += dcnt;
+
+	return 0;
+}
+
+/**
  * tp4q_enqueue_from_array - Enqueue entries from packet array to tp4 queue
  *
  * @a: Pointer to the packet array to enqueue from
@@ -236,6 +402,45 @@ static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
 }
 
 /**
+ * tp4q_dequeue_to_array - Dequeue entries from tp4 queue to packet array
+ *
+ * @a: Pointer to the packet array to dequeue from
+ * @dcnt: Max number of entries to dequeue
+ *
+ * Returns the number of entries dequeued. Non valid entries will be
+ * discarded.
+ **/
+static inline int tp4q_dequeue_to_array(struct tp4_packet_array *a, u32 dcnt)
+{
+	struct tpacket4_desc *d = a->items;
+	int i, entries, valid_entries = 0;
+	struct tp4_queue *q = a->tp4q;
+	u32 start = a->end;
+
+	entries = tp4q_nb_avail(q, dcnt);
+	q->num_free += entries;
+
+	/* Order flags and data */
+	smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		unsigned int d_idx = start & a->mask;
+		unsigned int idx;
+
+		idx = (q->last_avail_idx++) & q->ring_mask;
+		d[d_idx] = q->ring[idx];
+		if (!tp4q_is_valid_entry(q, &d[d_idx], a->validation)) {
+			WARN_ON_ONCE(tp4q_enqueue(a->tp4q, &d[d_idx], 1));
+			continue;
+		}
+
+		start++;
+		valid_entries++;
+	}
+	return valid_entries;
+}
+
+/**
  * tp4q_disable - Disable a tp4 queue
  *
  * @dev: Pointer to the netdevice the queue is connected to
@@ -309,6 +514,67 @@ static inline int tp4q_enable(struct device *dev,
 	return 0;
 }
 
+/**
+ * tp4q_get_page_offset - Get offset into page frame resides at
+ *
+ * @q: Pointer to the tp4 queue that this frame resides in
+ * @addr: Index of this frame in the packet buffer / umem
+ * @pg: Returns a pointer to the page of this frame
+ * @off: Returns the offset to the page of this frame
+ **/
+static inline void tp4q_get_page_offset(struct tp4_queue *q, u64 addr,
+				       u64 *pg, u64 *off)
+{
+	*pg = addr >> q->umem->nfpplog2;
+	*off = (addr - (*pg << q->umem->nfpplog2))
+	       << q->umem->frame_size_log2;
+}
+
+/**
+ * tp4q_max_data_size - Get the max packet size supported by a queue
+ *
+ * @q: Pointer to the tp4 queue to examine
+ *
+ * Returns the max packet size supported by the queue
+ **/
+static inline unsigned int tp4q_max_data_size(struct tp4_queue *q)
+{
+	return q->umem->frame_size - q->umem->data_headroom -
+		TP4_KERNEL_HEADROOM;
+}
+
+/**
+ * tp4q_get_data - Gets a pointer to the start of the packet
+ *
+ * @q: Pointer to the tp4 queue to examine
+ * @desc: Pointer to descriptor of the packet
+ *
+ * Returns a pointer to the start of the packet the descriptor is pointing
+ * to
+ **/
+static inline void *tp4q_get_data(struct tp4_queue *q,
+				  struct tpacket4_desc *desc)
+{
+	u64 pg, off;
+	u8 *pkt;
+
+	tp4q_get_page_offset(q, desc->idx, &pg, &off);
+	pkt = page_address(q->umem->pgs[pg]);
+	return (u8 *)(pkt + off) + desc->offset;
+}
+
+/**
+ * tp4q_get_desc - Get descriptor associated with frame
+ *
+ * @p: Pointer to the packet to examine
+ *
+ * Returns the descriptor of the current frame of packet p
+ **/
+static inline struct tpacket4_desc *tp4q_get_desc(struct tp4_frame_set *p)
+{
+	return &p->pkt_arr->items[p->curr & p->pkt_arr->mask];
+}
+
 /*************** FRAME OPERATIONS *******************************/
 /* A frame is always just one frame of size frame_size.
  * A frame set is one or more frames.
@@ -331,6 +597,18 @@ static inline bool tp4f_next_frame(struct tp4_frame_set *p)
 }
 
 /**
+ * tp4f_get_data - Gets a pointer to the frame the frame set is on
+ * @p: pointer to the frame set
+ *
+ * Returns a pointer to the data of the frame that the frame set is
+ * pointing to. Note that there might be configured headroom before this
+ **/
+static inline void *tp4f_get_data(struct tp4_frame_set *p)
+{
+	return tp4q_get_data(p->pkt_arr->tp4q, tp4q_get_desc(p));
+}
+
+/**
  * tp4f_set_frame - Sets the properties of a frame
  * @p: pointer to frame
  * @len: the length in bytes of the data in the frame
@@ -443,6 +721,29 @@ static inline bool tp4a_get_flushable_frame_set(struct tp4_packet_array *a,
 }
 
 /**
+ * tp4a_next_frame - Get next frame in array and advance curr pointer
+ * @a: pointer to packet array
+ * @p: supplied pointer to packet structure that is filled in by function
+ *
+ * Returns true if there is a frame, false otherwise. Frame returned in *p.
+ **/
+static inline bool tp4a_next_frame(struct tp4_packet_array *a,
+				   struct tp4_frame_set *p)
+{
+	u32 avail = a->end - a->curr;
+
+	if (avail == 0)
+		return false; /* empty */
+
+	p->pkt_arr = a;
+	p->start = a->curr;
+	p->curr = a->curr;
+	p->end = ++a->curr;
+
+	return true;
+}
+
+/**
  * tp4a_flush - Flush processed packets to associated tp4q
  * @a: pointer to packet array
  *
@@ -489,4 +790,64 @@ static inline void tp4a_free(struct tp4_packet_array *a)
 	kfree(a);
 }
 
+/**
+ * tp4a_get_data_headroom - Returns the data headroom configured for the array
+ * @a: pointer to packet array
+ *
+ * Returns the data headroom configured for the array
+ **/
+static inline unsigned int tp4a_get_data_headroom(struct tp4_packet_array *a)
+{
+	return tp4q_get_data_headroom(a->tp4q);
+}
+
+/**
+ * tp4a_max_data_size - Get the max packet size supported for the array
+ * @a: pointer to packet array
+ *
+ * Returns the maximum size of data that can be put in a frame when headroom
+ * has been accounted for.
+ **/
+static inline unsigned int tp4a_max_data_size(struct tp4_packet_array *a)
+{
+	return tp4q_max_data_size(a->tp4q);
+
+}
+
+/**
+ * tp4a_populate - Populate an array with packets from associated tp4q
+ * @a: pointer to packet array
+ **/
+static inline void tp4a_populate(struct tp4_packet_array *a)
+{
+	u32 cnt, free = a->mask + 1 - (a->end - a->start);
+
+	if (free == 0)
+		return; /* no space! */
+
+	cnt = tp4q_dequeue_to_array(a, free);
+	a->end += cnt;
+}
+
+/**
+ * tp4a_next_frame_populate - Get next frame and populate array if empty
+ * @a: pointer to packet array
+ * @p: supplied pointer to packet structure that is filled in by function
+ *
+ * Returns true if there is a frame, false otherwise. Frame returned in *p.
+ **/
+static inline bool tp4a_next_frame_populate(struct tp4_packet_array *a,
+					    struct tp4_frame_set *p)
+{
+	bool more_frames;
+
+	more_frames = tp4a_next_frame(a, p);
+	if (!more_frames) {
+		tp4a_populate(a);
+		more_frames = tp4a_next_frame(a, p);
+	}
+
+	return more_frames;
+}
+
 #endif /* _LINUX_TPACKET4_H */
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 190598eb3461..830d97ff4358 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2192,7 +2192,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	int skb_len = skb->len;
 	unsigned int snaplen, res;
 	unsigned long status = TP_STATUS_USER;
-	unsigned short macoff, netoff, hdrlen;
+	unsigned short macoff = 0, netoff = 0, hdrlen;
 	struct sk_buff *copy_skb = NULL;
 	struct timespec ts;
 	__u32 ts_status;
@@ -2212,9 +2212,6 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	sk = pt->af_packet_priv;
 	po = pkt_sk(sk);
 
-	if (po->tp_version == TPACKET_V4)
-		goto drop;
-
 	if (!net_eq(dev_net(dev), sock_net(sk)))
 		goto drop;
 
@@ -2246,7 +2243,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (sk->sk_type == SOCK_DGRAM) {
 		macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
 				  po->tp_reserve;
-	} else {
+	} else if (po->tp_version != TPACKET_V4) {
 		unsigned int maclen = skb_network_offset(skb);
 		netoff = TPACKET_ALIGN(po->tp_hdrlen +
 				       (maclen < 16 ? 16 : maclen)) +
@@ -2276,6 +2273,12 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 				do_vnet = false;
 			}
 		}
+	} else if (po->tp_version == TPACKET_V4) {
+		if (snaplen > tp4a_max_data_size(po->rx_ring.tp4a)) {
+			pr_err_once("%s: packet too big, %u, dropping.",
+				    __func__, snaplen);
+			goto drop_n_restore;
+		}
 	} else if (unlikely(macoff + snaplen >
 			    GET_PBDQC_FROM_RB(&po->rx_ring)->max_frame_len)) {
 		u32 nval;
@@ -2291,8 +2294,22 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		}
 	}
 	spin_lock(&sk->sk_receive_queue.lock);
-	h.raw = packet_current_rx_frame(po, skb,
-					TP_STATUS_KERNEL, (macoff+snaplen));
+	if (po->tp_version != TPACKET_V4) {
+		h.raw = packet_current_rx_frame(po, skb,
+						TP_STATUS_KERNEL,
+						(macoff + snaplen));
+	} else {
+		struct tp4_frame_set p;
+
+		if (tp4a_next_frame_populate(po->rx_ring.tp4a, &p)) {
+			u16 offset = tp4a_get_data_headroom(po->rx_ring.tp4a);
+
+			tp4f_set_frame(&p, snaplen, offset, true);
+			h.raw = tp4f_get_data(&p);
+		} else {
+			h.raw = NULL;
+		}
+	}
 	if (!h.raw)
 		goto drop_n_account;
 	if (po->tp_version <= TPACKET_V2) {
@@ -2371,20 +2388,25 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 		memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding));
 		hdrlen = sizeof(*h.h3);
 		break;
+	case TPACKET_V4:
+		hdrlen = 0;
+		break;
 	default:
 		BUG();
 	}
 
-	sll = h.raw + TPACKET_ALIGN(hdrlen);
-	sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
-	sll->sll_family = AF_PACKET;
-	sll->sll_hatype = dev->type;
-	sll->sll_protocol = skb->protocol;
-	sll->sll_pkttype = skb->pkt_type;
-	if (unlikely(po->origdev))
-		sll->sll_ifindex = orig_dev->ifindex;
-	else
-		sll->sll_ifindex = dev->ifindex;
+	if (po->tp_version != TPACKET_V4) {
+		sll = h.raw + TPACKET_ALIGN(hdrlen);
+		sll->sll_halen = dev_parse_header(skb, sll->sll_addr);
+		sll->sll_family = AF_PACKET;
+		sll->sll_hatype = dev->type;
+		sll->sll_protocol = skb->protocol;
+		sll->sll_pkttype = skb->pkt_type;
+		if (unlikely(po->origdev))
+			sll->sll_ifindex = orig_dev->ifindex;
+		else
+			sll->sll_ifindex = dev->ifindex;
+	}
 
 	smp_mb();
 
@@ -2401,11 +2423,21 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	smp_wmb();
 #endif
 
-	if (po->tp_version <= TPACKET_V2) {
+	switch (po->tp_version) {
+	case TPACKET_V1:
+	case TPACKET_V2:
 		__packet_set_status(po, h.raw, status);
 		sk->sk_data_ready(sk);
-	} else {
+		break;
+	case TPACKET_V3:
 		prb_clear_blk_fill_status(&po->rx_ring);
+		break;
+	case TPACKET_V4:
+		spin_lock(&sk->sk_receive_queue.lock);
+		WARN_ON_ONCE(tp4a_flush(po->rx_ring.tp4a));
+		spin_unlock(&sk->sk_receive_queue.lock);
+		sk->sk_data_ready(sk);
+		break;
 	}
 
 drop_n_restore:
@@ -4283,20 +4315,21 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
 	struct packet_sock *po = pkt_sk(sk);
 	unsigned int mask = datagram_poll(file, sock, wait);
 
-	if (po->tp_version == TPACKET_V4)
-		return mask;
-
 	spin_lock_bh(&sk->sk_receive_queue.lock);
 	if (po->rx_ring.pg_vec) {
-		if (!packet_previous_rx_frame(po, &po->rx_ring,
-			TP_STATUS_KERNEL))
+		if (po->tp_version == TPACKET_V4) {
+			if (!tp4q_is_free(&po->rx_ring.tp4q))
+				mask |= POLLIN | POLLRDNORM;
+		} else if (!packet_previous_rx_frame(po, &po->rx_ring,
+					TP_STATUS_KERNEL)) {
 			mask |= POLLIN | POLLRDNORM;
+		}
 	}
 	if (po->pressure && __packet_rcv_has_room(po, NULL) == ROOM_NORMAL)
 		po->pressure = 0;
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
 	spin_lock_bh(&sk->sk_write_queue.lock);
-	if (po->tx_ring.pg_vec) {
+	if (po->tx_ring.pg_vec && po->tp_version != TPACKET_V4) {
 		if (packet_current_frame(po, &po->tx_ring, TP_STATUS_AVAILABLE))
 			mask |= POLLOUT | POLLWRNORM;
 	}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 05/14] packet: enable Tx support for AF_PACKET V4
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (3 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 04/14] packet: enable Rx for AF_PACKET V4 Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 06/14] netdevice: add AF_PACKET V4 zerocopy ops Björn Töpel
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

This commit adds egress (Tx) support for AF_PACKET V4.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/tpacket4.h | 192 +++++++++++++++++++++++++++++++++++++++++++++++
 net/packet/af_packet.c   | 169 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 350 insertions(+), 11 deletions(-)

diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index 1d4c13d472e5..ac6c721294e8 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -18,6 +18,8 @@
 #define TP4_UMEM_MIN_FRAME_SIZE 2048
 #define TP4_KERNEL_HEADROOM 256 /* Headrom for XDP */
 
+#define TP4A_FRAME_COMPLETED TP4_DESC_KERNEL
+
 enum tp4_validation {
 	TP4_VALIDATION_NONE,	/* No validation is performed */
 	TP4_VALIDATION_IDX,	/* Only address to packet buffer is validated */
@@ -402,6 +404,60 @@ static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
 }
 
 /**
+ * tp4q_enqueue_completed_from_array - Enqueue only completed entries
+ *				       from packet array
+ *
+ * @a: Pointer to the packet array to enqueue from
+ * @dcnt: Max number of entries to enqueue
+ *
+ * Returns the number of entries successfully enqueued or a negative errno
+ * at failure.
+ **/
+static inline int tp4q_enqueue_completed_from_array(struct tp4_packet_array *a,
+						    u32 dcnt)
+{
+	struct tp4_queue *q = a->tp4q;
+	unsigned int used_idx = q->used_idx;
+	struct tpacket4_desc *d = a->items;
+	int i, j;
+
+	if (q->num_free < dcnt)
+		return -ENOSPC;
+
+	for (i = 0; i < dcnt; i++) {
+		unsigned int didx = (a->start + i) & a->mask;
+
+		if (d[didx].flags & TP4A_FRAME_COMPLETED) {
+			unsigned int idx = (used_idx++) & q->ring_mask;
+
+			q->ring[idx].idx = d[didx].idx;
+			q->ring[idx].len = d[didx].len;
+			q->ring[idx].offset = d[didx].offset;
+			q->ring[idx].error = d[didx].error;
+		} else {
+			break;
+		}
+	}
+
+	if (i == 0)
+		return 0;
+
+	/* Order flags and data */
+	smp_wmb();
+
+	for (j = i - 1; j >= 0; j--) {
+		unsigned int idx = (q->used_idx + j) & q->ring_mask;
+		unsigned int didx = (a->start + j) & a->mask;
+
+		q->ring[idx].flags = d[didx].flags & ~TP4_DESC_KERNEL;
+	}
+	q->num_free -= i;
+	q->used_idx += i;
+
+	return i;
+}
+
+/**
  * tp4q_dequeue_to_array - Dequeue entries from tp4 queue to packet array
  *
  * @a: Pointer to the packet array to dequeue from
@@ -581,6 +637,15 @@ static inline struct tpacket4_desc *tp4q_get_desc(struct tp4_frame_set *p)
  **/
 
 /**
+ * tp4f_reset - Start to traverse the frames in the set from the beginning
+ * @p: pointer to frame set
+ **/
+static inline void tp4f_reset(struct tp4_frame_set *p)
+{
+	p->curr = p->start;
+}
+
+/**
  * tp4f_next_frame - Go to next frame in frame set
  * @p: pointer to frame set
  *
@@ -597,6 +662,38 @@ static inline bool tp4f_next_frame(struct tp4_frame_set *p)
 }
 
 /**
+ * tp4f_get_frame_id - Get packet buffer id of frame
+ * @p: pointer to frame set
+ *
+ * Returns the id of the packet buffer of the current frame
+ **/
+static inline u64 tp4f_get_frame_id(struct tp4_frame_set *p)
+{
+	return p->pkt_arr->items[p->curr & p->pkt_arr->mask].idx;
+}
+
+/**
+ * tp4f_get_frame_len - Get length of data in current frame
+ * @p: pointer to frame set
+ *
+ * Returns the length of data in the packet buffer of the current frame
+ **/
+static inline u32 tp4f_get_frame_len(struct tp4_frame_set *p)
+{
+	return p->pkt_arr->items[p->curr & p->pkt_arr->mask].len;
+}
+
+/**
+ * tp4f_set_error - Set an error on the current frame
+ * @p: pointer to frame set
+ * @errno: the errno to be assigned
+ **/
+static inline void tp4f_set_error(struct tp4_frame_set *p, int errno)
+{
+	p->pkt_arr->items[p->curr & p->pkt_arr->mask].error = errno;
+}
+
+/**
  * tp4f_get_data - Gets a pointer to the frame the frame set is on
  * @p: pointer to the frame set
  *
@@ -627,6 +724,48 @@ static inline void tp4f_set_frame(struct tp4_frame_set *p, u32 len, u16 offset,
 		d->flags |= TP4_PKT_CONT;
 }
 
+/*************** PACKET OPERATIONS *******************************/
+/* A packet consists of one or more frames. Both frames and packets
+ * are represented by a tp4_frame_set. The only difference is that
+ * packet functions look at the EOP flag.
+ **/
+
+/**
+ * tp4f_get_packet_len - Length of packet
+ * @p: pointer to packet
+ *
+ * Returns the length of the packet in bytes.
+ * Resets curr pointer of packet.
+ **/
+static inline u32 tp4f_get_packet_len(struct tp4_frame_set *p)
+{
+	u32 len = 0;
+
+	tp4f_reset(p);
+
+	do {
+		len += tp4f_get_frame_len(p);
+	} while (tp4f_next_frame(p));
+
+	return len;
+}
+
+/**
+ * tp4f_packet_completed - Mark packet as completed
+ * @p: pointer to packet
+ *
+ * Resets curr pointer of packet.
+ **/
+static inline void tp4f_packet_completed(struct tp4_frame_set *p)
+{
+	tp4f_reset(p);
+
+	do {
+		p->pkt_arr->items[p->curr & p->pkt_arr->mask].flags |=
+			TP4A_FRAME_COMPLETED;
+	} while (tp4f_next_frame(p));
+}
+
 /**************** PACKET_ARRAY FUNCTIONS ********************************/
 
 static inline struct tp4_packet_array *__tp4a_new(
@@ -815,6 +954,59 @@ static inline unsigned int tp4a_max_data_size(struct tp4_packet_array *a)
 }
 
 /**
+ * tp4a_next_packet - Get next packet in array and advance curr pointer
+ * @a: pointer to packet array
+ * @p: supplied pointer to packet structure that is filled in by function
+ *
+ * Returns true if there is a packet, false otherwise. Packet returned in *p.
+ **/
+static inline bool tp4a_next_packet(struct tp4_packet_array *a,
+				    struct tp4_frame_set *p)
+{
+	u32 avail = a->end - a->curr;
+
+	if (avail == 0)
+		return false; /* empty */
+
+	p->pkt_arr = a;
+	p->start = a->curr;
+	p->curr = a->curr;
+	p->end = a->curr;
+
+	/* XXX Sanity check for too-many-frames packets? */
+	while (a->items[p->end++ & a->mask].flags & TP4_PKT_CONT) {
+		avail--;
+		if (avail == 0)
+			return false;
+	}
+
+	a->curr += (p->end - p->start);
+	return true;
+}
+
+/**
+ * tp4a_flush_completed - Flushes only frames marked as completed
+ * @a: pointer to packet array
+ *
+ * Returns 0 for success and -1 for failure
+ **/
+static inline int tp4a_flush_completed(struct tp4_packet_array *a)
+{
+	u32 avail = a->curr - a->start;
+	int ret;
+
+	if (avail == 0)
+		return 0; /* nothing to flush */
+
+	ret = tp4q_enqueue_completed_from_array(a, avail);
+	if (ret < 0)
+		return -1;
+
+	a->start += ret;
+	return 0;
+}
+
+/**
  * tp4a_populate - Populate an array with packets from associated tp4q
  * @a: pointer to packet array
  **/
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 830d97ff4358..444eb4834362 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2462,6 +2462,28 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	goto drop_n_restore;
 }
 
+static void packet_v4_destruct_skb(struct sk_buff *skb)
+{
+	struct packet_sock *po = pkt_sk(skb->sk);
+
+	if (likely(po->tx_ring.pg_vec)) {
+		u64 idx = (u64)skb_shinfo(skb)->destructor_arg;
+		struct tp4_frame_set p = {.start = idx,
+					  .curr = idx,
+					  .end = idx + 1,
+					  .pkt_arr = po->tx_ring.tp4a};
+
+		spin_lock(&po->sk.sk_write_queue.lock);
+		tp4f_packet_completed(&p);
+		WARN_ON_ONCE(tp4a_flush_completed(po->tx_ring.tp4a));
+		spin_unlock(&po->sk.sk_write_queue.lock);
+
+		packet_dec_pending(&po->tx_ring);
+	}
+
+	sock_wfree(skb);
+}
+
 static void tpacket_destruct_skb(struct sk_buff *skb)
 {
 	struct packet_sock *po = pkt_sk(skb->sk);
@@ -2519,24 +2541,24 @@ static int packet_snd_vnet_parse(struct msghdr *msg, size_t *len,
 }
 
 static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
-		void *frame, struct net_device *dev, void *data, int tp_len,
+		void *dtor_arg, struct net_device *dev, void *data, int tp_len,
 		__be16 proto, unsigned char *addr, int hlen, int copylen,
 		const struct sockcm_cookie *sockc)
 {
-	union tpacket_uhdr ph;
 	int to_write, offset, len, nr_frags, len_max;
 	struct socket *sock = po->sk.sk_socket;
 	struct page *page;
 	int err;
 
-	ph.raw = frame;
-
 	skb->protocol = proto;
 	skb->dev = dev;
 	skb->priority = po->sk.sk_priority;
 	skb->mark = po->sk.sk_mark;
-	sock_tx_timestamp(&po->sk, sockc->tsflags, &skb_shinfo(skb)->tx_flags);
-	skb_shinfo(skb)->destructor_arg = ph.raw;
+	if (sockc) {
+		sock_tx_timestamp(&po->sk, sockc->tsflags,
+				  &skb_shinfo(skb)->tx_flags);
+	}
+	skb_shinfo(skb)->destructor_arg = dtor_arg;
 
 	skb_reserve(skb, hlen);
 	skb_reset_network_header(skb);
@@ -2840,6 +2862,126 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	return err;
 }
 
+static int packet_v4_snd(struct packet_sock *po, struct msghdr *msg)
+{
+	DECLARE_SOCKADDR(struct sockaddr_ll *, saddr, msg->msg_name);
+	bool need_wait = !(msg->msg_flags & MSG_DONTWAIT);
+	struct packet_ring_buffer *rb = &po->tx_ring;
+	int err = 0, dlen, size_max, hlen, tlen;
+	struct tp4_frame_set p;
+	struct net_device *dev;
+	struct sk_buff *skb;
+	unsigned char *addr;
+	bool has_packet;
+	__be16 proto;
+	void *data;
+
+	mutex_lock(&po->pg_vec_lock);
+
+	if (likely(!saddr)) {
+		dev = packet_cached_dev_get(po);
+		proto = po->num;
+		addr = NULL;
+	} else {
+		pr_warn("packet v4 not implemented!\n");
+		return -EINVAL;
+	}
+
+	err = -ENXIO;
+	if (unlikely(!dev))
+		goto out;
+	err = -ENETDOWN;
+	if (unlikely(!(dev->flags & IFF_UP)))
+		goto out_put;
+
+	size_max = tp4a_max_data_size(rb->tp4a);
+
+	if (size_max > dev->mtu + dev->hard_header_len + VLAN_HLEN)
+		size_max = dev->mtu + dev->hard_header_len + VLAN_HLEN;
+
+	spin_lock_bh(&po->sk.sk_write_queue.lock);
+	tp4a_populate(rb->tp4a);
+	spin_unlock_bh(&po->sk.sk_write_queue.lock);
+
+	do {
+		spin_lock_bh(&po->sk.sk_write_queue.lock);
+		has_packet = tp4a_next_packet(rb->tp4a, &p);
+		spin_unlock_bh(&po->sk.sk_write_queue.lock);
+
+		if (!has_packet) {
+			if (need_wait && need_resched()) {
+				schedule();
+				continue;
+			}
+			break;
+		}
+
+		dlen = tp4f_get_packet_len(&p);
+		data = tp4f_get_data(&p);
+		hlen = LL_RESERVED_SPACE(dev);
+		tlen = dev->needed_tailroom;
+		skb = sock_alloc_send_skb(&po->sk,
+					  hlen + tlen +
+					  sizeof(struct sockaddr_ll),
+					  !need_wait, &err);
+
+		if (unlikely(!skb)) {
+			err = -EAGAIN;
+			goto out_err;
+		}
+
+		dlen = tpacket_fill_skb(po, skb,
+					(void *)(long)tp4f_get_frame_id(&p),
+					dev,
+					data, dlen, proto, addr, hlen,
+					dev->hard_header_len, NULL);
+		if (likely(dlen >= 0) &&
+		    dlen > dev->mtu + dev->hard_header_len &&
+		    !packet_extra_vlan_len_allowed(dev, skb)) {
+			dlen = -EMSGSIZE;
+		}
+
+		if (unlikely(dlen < 0)) {
+			err = dlen;
+			goto out_err;
+		}
+
+		skb->destructor = packet_v4_destruct_skb;
+		packet_inc_pending(&po->tx_ring);
+
+		err = po->xmit(skb);
+		/* Ignore NET_XMIT_CN as packet might have been sent */
+		if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
+			err = -EAGAIN;
+			packet_dec_pending(&po->tx_ring);
+			skb = NULL;
+			goto out_err;
+		}
+	} while (!err ||
+		/* Note: packet_read_pending() might be slow if we have
+		 * to call it as it's per_cpu variable, but in fast-path
+		 * we already short-circuit the loop with the first
+		 * condition, and luckily don't have to go that path
+		 * anyway.
+		 */
+		 (need_wait && packet_read_pending(&po->tx_ring)));
+
+	goto out_put;
+
+out_err:
+	spin_lock_bh(&po->sk.sk_write_queue.lock);
+	tp4f_set_error(&p, -err);
+	tp4f_packet_completed(&p);
+	WARN_ON_ONCE(tp4a_flush_completed(rb->tp4a));
+	spin_unlock_bh(&po->sk.sk_write_queue.lock);
+	kfree_skb(skb);
+out_put:
+	dev_put(dev);
+out:
+	mutex_unlock(&po->pg_vec_lock);
+	return 0;
+}
+
 static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
 				        size_t reserve, size_t len,
 				        size_t linear, int noblock,
@@ -3015,10 +3157,10 @@ static int packet_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	struct packet_sock *po = pkt_sk(sk);
 
 	if (po->tx_ring.pg_vec) {
-		if (po->tp_version == TPACKET_V4)
-			return -EINVAL;
+		if (po->tp_version != TPACKET_V4)
+			return tpacket_snd(po, msg);
 
-		return tpacket_snd(po, msg);
+		return packet_v4_snd(po, msg);
 	}
 
 	return packet_snd(sock, msg, len);
@@ -4329,9 +4471,14 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
 		po->pressure = 0;
 	spin_unlock_bh(&sk->sk_receive_queue.lock);
 	spin_lock_bh(&sk->sk_write_queue.lock);
-	if (po->tx_ring.pg_vec && po->tp_version != TPACKET_V4) {
-		if (packet_current_frame(po, &po->tx_ring, TP_STATUS_AVAILABLE))
+	if (po->tx_ring.pg_vec) {
+		if (po->tp_version == TPACKET_V4) {
+			if (tp4q_nb_avail(&po->tx_ring.tp4q, 1))
+				mask |= POLLOUT | POLLWRNORM;
+		} else if (packet_current_frame(po, &po->tx_ring,
+					 TP_STATUS_AVAILABLE)) {
 			mask |= POLLOUT | POLLWRNORM;
+		}
 	}
 	spin_unlock_bh(&sk->sk_write_queue.lock);
 	return mask;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 06/14] netdevice: add AF_PACKET V4 zerocopy ops
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (4 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 05/14] packet: enable Tx support " Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4 Björn Töpel
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Two new ndo ops are added: one for enabling/disabling AF_PACKET V4
zerocopy, and one for kicking the egress ring.
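
For orientation, here is a rough sketch (not part of this patch) of how
a driver could wire the new ops up. Only the ndo signatures and the
tp4_netdev_parms/tp4_netdev_command names come from this series; the
foo_* names and helper bodies are hypothetical:

static int foo_tp4_zerocopy(struct net_device *dev,
			    struct tp4_netdev_parms *parms)
{
	switch (parms->command) {
	case TP4_ENABLE:
		/* Quiesce the queue pair, attach parms->rx_opaque and
		 * parms->tx_opaque to the driver's rings, then re-enable
		 * the queue pair in zerocopy mode.
		 */
		return foo_enable_zc(dev, parms);
	case TP4_DISABLE:
		/* Restore the queue pair to normal (copy) operation. */
		return foo_disable_zc(dev, parms->queue_pair);
	}
	return -EINVAL;
}

static int foo_tp4_xmit(struct net_device *dev, int queue_pair)
{
	/* Kick the egress ring; called without rtnl held, must not sleep. */
	return foo_kick_tx(dev, queue_pair);
}

static const struct net_device_ops foo_netdev_ops = {
	/* ...existing ops... */
	.ndo_tp4_zerocopy	= foo_tp4_zerocopy,
	.ndo_tp4_xmit		= foo_tp4_xmit,
};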

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5e02f79b2110..1421206bf243 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -833,6 +833,8 @@ struct dev_ifalias {
 	char ifalias[];
 };
 
+struct tp4_netdev_parms;
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1133,6 +1135,15 @@ struct dev_ifalias {
  * void (*ndo_xdp_flush)(struct net_device *dev);
  *	This function is used to inform the driver to flush a particular
  *	xdp tx queue. Must be called on same CPU as xdp_xmit.
+ * int (*ndo_tp4_zerocopy)(struct net_device *dev,
+ *			   struct tp4_netdev_parms *parms);
+ *	This function is used to enable and disable the AF_PACKET V4
+ *	PACKET_ZEROCOPY support. See definition of enum tp4_netdev_command
+ *	in tpacket4.h for details.
+ * int (*ndo_tp4_xmit)(struct net_device *dev, int queue_pair);
+ *	This function is used to send packets when the PACKET_ZEROCOPY
+ *	option is set. The rtnl lock is not held when entering this
+ *	function.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1320,6 +1331,11 @@ struct net_device_ops {
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
 						struct xdp_buff *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
+	int                     (*ndo_tp4_zerocopy)(
+					struct net_device *dev,
+					struct tp4_netdev_parms *parms);
+	int                     (*ndo_tp4_xmit)(struct net_device *dev,
+						int queue_pair);
 };
 
 /**
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (5 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 06/14] netdevice: add AF_PACKET V4 zerocopy ops Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-11-03  3:17   ` Willem de Bruijn
  2017-10-31 12:41 ` [RFC PATCH 08/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support Björn Töpel
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit adds support for zerocopy mode. Note that zerocopy mode
requires that the network interface has been bound to the socket using
the bind syscall, and that the corresponding netdev implements the
AF_PACKET V4 ndos.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/tpacket4.h |  38 +++++
 net/packet/af_packet.c   | 399 +++++++++++++++++++++++++++++++++++++++++++----
 net/packet/internal.h    |   1 +
 3 files changed, 404 insertions(+), 34 deletions(-)

diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index ac6c721294e8..839485108b2d 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -105,6 +105,44 @@ struct tp4_frame_set {
 	u32 end;
 };
 
+enum tp4_netdev_command {
+	/* Enable the AF_PACKET V4 zerocopy support. When this is enabled,
+	 * packets will arrive to the socket without being copied resulting
+	 * in better performance. Note that this also means that no packets
+	 * are sent to the kernel stack after this feature has been enabled.
+	 */
+	TP4_ENABLE,
+	/* Disables the PACKET_ZEROCOPY support. */
+	TP4_DISABLE,
+};
+
+/**
+ * struct tp4_netdev_parms - TP4 netdev parameters for configuration
+ *
+ * @command: netdev command, currently enable or disable
+ * @rx_opaque: an opaque pointer to the rx queue
+ * @tx_opaque: an opaque pointer to the tx queue
+ * @data_ready: function to be called when data is ready in poll mode
+ * @data_ready_opaque: opaque parameter returned with data_ready
+ * @write_space: called when data needs to be transmitted in poll mode
+ * @write_space_opaque: opaque parameter returned with write_space
+ * @error_report: called when there is an error
+ * @error_report_opaque: opaque parameter returned in error_report
+ * @queue_pair: the queue_pair associated with this zero-copy operation
+ **/
+struct tp4_netdev_parms {
+	enum tp4_netdev_command command;
+	void *rx_opaque;
+	void *tx_opaque;
+	void (*data_ready)(void *);
+	void *data_ready_opaque;
+	void (*write_space)(void *);
+	void *write_space_opaque;
+	void (*error_report)(void *, int);
+	void *error_report_opaque;
+	int queue_pair;
+};
+
 /*************** V4 QUEUE OPERATIONS *******************************/
 
 /**
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 444eb4834362..fbfada773463 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3151,16 +3151,218 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	return err;
 }
 
+static void packet_v4_data_ready_callback(void *data_ready_opaque)
+{
+	struct sock *sk = (struct sock *)data_ready_opaque;
+
+	sk->sk_data_ready(sk);
+}
+
+static void packet_v4_write_space_callback(void *write_space_opaque)
+{
+	struct sock *sk = (struct sock *)write_space_opaque;
+
+	sk->sk_write_space(sk);
+}
+
+static void packet_v4_disable_zerocopy(struct net_device *dev,
+				       struct tp4_netdev_parms *zc)
+{
+	struct tp4_netdev_parms params;
+
+	params = *zc;
+	params.command  = TP4_DISABLE;
+
+	(void)dev->netdev_ops->ndo_tp4_zerocopy(dev, &params);
+}
+
+static int packet_v4_enable_zerocopy(struct net_device *dev,
+				     struct tp4_netdev_parms *zc)
+{
+	return dev->netdev_ops->ndo_tp4_zerocopy(dev, zc);
+}
+
+static void packet_v4_error_report_callback(void *error_report_opaque,
+					    int errno)
+{
+	struct packet_sock *po = error_report_opaque;
+	struct tp4_netdev_parms *zc;
+	struct net_device *dev;
+
+	zc = rtnl_dereference(po->zc);
+	dev = packet_cached_dev_get(po);
+	if (zc && dev) {
+		packet_v4_disable_zerocopy(dev, zc);
+
+		pr_warn("packet v4 zerocopy queue pair %d no longer available! errno=%d\n",
+			zc->queue_pair, errno);
+		dev_put(dev);
+	}
+}
+
+static int packet_v4_get_zerocopy_qp(struct packet_sock *po)
+{
+	struct tp4_netdev_parms *zc;
+	int qp;
+
+	rcu_read_lock();
+	zc = rcu_dereference(po->zc);
+	qp = zc ? zc->queue_pair : -1;
+	rcu_read_unlock();
+
+	return qp;
+}
+
+static int packet_v4_zerocopy(struct sock *sk, int qp)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct socket *sock = sk->sk_socket;
+	struct tp4_netdev_parms *zc = NULL;
+	struct net_device *dev;
+	bool if_up;
+	int ret = 0;
+
+	/* Currently, only RAW sockets are supported.*/
+	if (sock->type != SOCK_RAW)
+		return -EINVAL;
+
+	rtnl_lock();
+	dev = packet_cached_dev_get(po);
+
+	/* Socket needs to be bound to an interface. */
+	if (!dev) {
+		rtnl_unlock();
+		return -EISCONN;
+	}
+
+	/* The device needs to have both the NDOs implemented. */
+	if (!(dev->netdev_ops->ndo_tp4_zerocopy &&
+	      dev->netdev_ops->ndo_tp4_xmit)) {
+		ret = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
+	if (!(po->rx_ring.pg_vec && po->tx_ring.pg_vec)) {
+		ret = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
+	if_up = dev->flags & IFF_UP;
+	zc = rtnl_dereference(po->zc);
+
+	/* Disable */
+	if (qp <= 0) {
+		if (!zc)
+			goto out_unlock;
+
+		packet_v4_disable_zerocopy(dev, zc);
+		rcu_assign_pointer(po->zc, NULL);
+
+		if (if_up) {
+			spin_lock(&po->bind_lock);
+			register_prot_hook(sk);
+			spin_unlock(&po->bind_lock);
+		}
+
+		goto out_unlock;
+	}
+
+	/* Enable */
+	if (!zc) {
+		zc = kzalloc(sizeof(*zc), GFP_KERNEL);
+		if (!zc) {
+			ret = -ENOMEM;
+			goto out_unlock;
+		}
+	}
+
+	if (zc->queue_pair >= 0)
+		packet_v4_disable_zerocopy(dev, zc);
+
+	zc->command = TP4_ENABLE;
+	if (po->rx_ring.tp4q.umem)
+		zc->rx_opaque = &po->rx_ring.tp4q;
+	else
+		zc->rx_opaque = NULL;
+	if (po->tx_ring.tp4q.umem)
+		zc->tx_opaque = &po->tx_ring.tp4q;
+	else
+		zc->tx_opaque = NULL;
+	zc->data_ready = packet_v4_data_ready_callback;
+	zc->write_space = packet_v4_write_space_callback;
+	zc->error_report = packet_v4_error_report_callback;
+	zc->data_ready_opaque = (void *)sk;
+	zc->write_space_opaque = (void *)sk;
+	zc->error_report_opaque = po;
+	zc->queue_pair = qp - 1;
+
+	spin_lock(&po->bind_lock);
+	unregister_prot_hook(sk, true);
+	spin_unlock(&po->bind_lock);
+
+	if (if_up) {
+		ret = packet_v4_enable_zerocopy(dev, zc);
+		if (ret) {
+			spin_lock(&po->bind_lock);
+			register_prot_hook(sk);
+			spin_unlock(&po->bind_lock);
+
+			kfree(po->zc);
+			po->zc = NULL;
+			goto out_unlock;
+		}
+	} else {
+		sk->sk_err = ENETDOWN;
+		if (!sock_flag(sk, SOCK_DEAD))
+			sk->sk_error_report(sk);
+	}
+
+	rcu_assign_pointer(po->zc, zc);
+	zc = NULL;
+
+out_unlock:
+	if (dev)
+		dev_put(dev);
+	rtnl_unlock();
+	if (zc) {
+		synchronize_rcu();
+		kfree(zc);
+	}
+	return ret;
+}
+
+static int packet_v4_zc_snd(struct packet_sock *po, int qp)
+{
+	struct net_device *dev;
+	int ret = -1;
+
+	/* NOTE: It's a bit unorthodox having an ndo without the RTNL
+	 * lock taken during the call. The ndo_tp4_xmit cannot sleep.
+	 */
+	dev = packet_cached_dev_get(po);
+	if (dev) {
+		ret = dev->netdev_ops->ndo_tp4_xmit(dev, qp);
+		dev_put(dev);
+	}
+
+	return ret;
+}
+
 static int packet_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 {
 	struct sock *sk = sock->sk;
 	struct packet_sock *po = pkt_sk(sk);
+	int zc_qp;
 
 	if (po->tx_ring.pg_vec) {
 		if (po->tp_version != TPACKET_V4)
 			return tpacket_snd(po, msg);
 
-		return packet_v4_snd(po, msg);
+		zc_qp = packet_v4_get_zerocopy_qp(po);
+		if (zc_qp < 0)
+			return packet_v4_snd(po, msg);
+
+		return packet_v4_zc_snd(po, zc_qp);
 	}
 
 	return packet_snd(sock, msg, len);
@@ -3318,7 +3520,9 @@ static void packet_clear_ring(struct sock *sk, int tx_ring)
 
 static int packet_release(struct socket *sock)
 {
+	struct tp4_netdev_parms *zc;
 	struct sock *sk = sock->sk;
+	struct net_device *dev;
 	struct packet_sock *po;
 	struct packet_fanout *f;
 	struct net *net;
@@ -3337,6 +3541,20 @@ static int packet_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	preempt_enable();
 
+	rtnl_lock();
+	zc = rtnl_dereference(po->zc);
+	dev = packet_cached_dev_get(po);
+	if (zc && dev)
+		packet_v4_disable_zerocopy(dev, zc);
+	if (dev)
+		dev_put(dev);
+	rtnl_unlock();
+
+	if (zc) {
+		synchronize_rcu();
+		kfree(zc);
+	}
+
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, false);
 	packet_cached_dev_reset(po);
@@ -3381,6 +3599,54 @@ static int packet_release(struct socket *sock)
 	return 0;
 }
 
+static int packet_v4_rehook_zerocopy(struct sock *sk,
+				     struct net_device *dev_prev,
+				     struct net_device *dev)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct tp4_netdev_parms *zc;
+	bool dev_up;
+	int ret = 0;
+
+	rtnl_lock();
+	dev_up = (dev && (dev->flags & IFF_UP));
+	zc = rtnl_dereference(po->zc);
+	/* Recheck */
+	if (!zc) {
+		if (dev_up) {
+			spin_lock(&po->bind_lock);
+			register_prot_hook(sk);
+			spin_unlock(&po->bind_lock);
+			rtnl_unlock();
+
+			return 0;
+		}
+
+		sk->sk_err = ENETDOWN; /* XXX something else? */
+		if (!sock_flag(sk, SOCK_DEAD))
+			sk->sk_error_report(sk);
+
+		goto out;
+	}
+
+	if (dev_prev)
+		packet_v4_disable_zerocopy(dev_prev, zc);
+	if (dev_up) {
+		ret = packet_v4_enable_zerocopy(dev, zc);
+		if (ret) {
+			/* XXX re-enable hook? */
+			sk->sk_err = ENETDOWN; /* XXX something else? */
+			if (!sock_flag(sk, SOCK_DEAD))
+				sk->sk_error_report(sk);
+		}
+	}
+
+out:
+	rtnl_unlock();
+
+	return ret;
+}
+
 /*
  *	Attach a packet hook.
  */
@@ -3388,11 +3654,10 @@ static int packet_release(struct socket *sock)
 static int packet_do_bind(struct sock *sk, const char *name, int ifindex,
 			  __be16 proto)
 {
+	struct net_device *dev_curr = NULL, *dev = NULL;
 	struct packet_sock *po = pkt_sk(sk);
-	struct net_device *dev_curr;
 	__be16 proto_curr;
 	bool need_rehook;
-	struct net_device *dev = NULL;
 	int ret = 0;
 	bool unlisted = false;
 
@@ -3443,6 +3708,7 @@ static int packet_do_bind(struct sock *sk, const char *name, int ifindex,
 
 		if (unlikely(unlisted)) {
 			dev_put(dev);
+			dev = NULL;
 			po->prot_hook.dev = NULL;
 			po->ifindex = -1;
 			packet_cached_dev_reset(po);
@@ -3452,14 +3718,13 @@ static int packet_do_bind(struct sock *sk, const char *name, int ifindex,
 			packet_cached_dev_assign(po, dev);
 		}
 	}
-	if (dev_curr)
-		dev_put(dev_curr);
 
 	if (proto == 0 || !need_rehook)
 		goto out_unlock;
 
 	if (!unlisted && (!dev || (dev->flags & IFF_UP))) {
-		register_prot_hook(sk);
+		if (!rcu_dereference(po->zc))
+			register_prot_hook(sk);
 	} else {
 		sk->sk_err = ENETDOWN;
 		if (!sock_flag(sk, SOCK_DEAD))
@@ -3470,6 +3735,12 @@ static int packet_do_bind(struct sock *sk, const char *name, int ifindex,
 	rcu_read_unlock();
 	spin_unlock(&po->bind_lock);
 	release_sock(sk);
+
+	if (!ret && need_rehook)
+		ret = packet_v4_rehook_zerocopy(sk, dev_curr, dev);
+	if (dev_curr)
+		dev_put(dev_curr);
+
 	return ret;
 }
 
@@ -4003,6 +4274,19 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 			return packet_set_ring(sk, &req_u, 0,
 					       optname == PACKET_TX_RING);
 	}
+	case PACKET_ZEROCOPY:
+	{
+		int qp; /* <=0 disable, 1..n is queue pair index */
+
+		if (optlen != sizeof(qp))
+			return -EINVAL;
+		if (copy_from_user(&qp, optval, sizeof(qp)))
+			return -EFAULT;
+
+		if (po->tp_version == TPACKET_V4)
+			return packet_v4_zerocopy(sk, qp);
+		return -EOPNOTSUPP;
+	}
 	case PACKET_COPY_THRESH:
 	{
 		int val;
@@ -4311,6 +4595,12 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_QDISC_BYPASS:
 		val = packet_use_direct_xmit(po);
 		break;
+	case PACKET_ZEROCOPY:
+		if (po->tp_version == TPACKET_V4) {
+			val = packet_v4_get_zerocopy_qp(po) + 1;
+			break;
+		}
+		return -ENOPROTOOPT;
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -4346,6 +4636,71 @@ static int compat_packet_setsockopt(struct socket *sock, int level, int optname,
 }
 #endif
 
+static void packet_notifier_down(struct sock *sk, struct net_device *dev,
+				 bool unregister)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct tp4_netdev_parms *zc;
+	bool report = false;
+
+	if (unregister && po->mclist)
+		packet_dev_mclist_delete(dev, &po->mclist);
+
+	if (dev->ifindex == po->ifindex) {
+		spin_lock(&po->bind_lock);
+		if (po->running) {
+			__unregister_prot_hook(sk, false);
+			report = true;
+		}
+
+		zc = rtnl_dereference(po->zc);
+		if (zc) {
+			packet_v4_disable_zerocopy(dev, zc);
+			report = true;
+		}
+
+		if (report) {
+			sk->sk_err = ENETDOWN;
+			if (!sock_flag(sk, SOCK_DEAD))
+				sk->sk_error_report(sk);
+		}
+
+		if (unregister) {
+			packet_cached_dev_reset(po);
+			po->ifindex = -1;
+			if (po->prot_hook.dev)
+				dev_put(po->prot_hook.dev);
+			po->prot_hook.dev = NULL;
+		}
+		spin_unlock(&po->bind_lock);
+	}
+}
+
+static void packet_notifier_up(struct sock *sk, struct net_device *dev)
+{
+	struct packet_sock *po = pkt_sk(sk);
+	struct tp4_netdev_parms *zc;
+	int ret;
+
+	if (dev->ifindex == po->ifindex) {
+		spin_lock(&po->bind_lock);
+		if (po->num) {
+			zc = rtnl_dereference(po->zc);
+			if (zc) {
+				ret = packet_v4_enable_zerocopy(dev, zc);
+				if (ret) {
+					sk->sk_err = ENETDOWN;
+					if (!sock_flag(sk, SOCK_DEAD))
+						sk->sk_error_report(sk);
+				}
+			} else {
+				register_prot_hook(sk);
+			}
+		}
+		spin_unlock(&po->bind_lock);
+	}
+}
+
 static int packet_notifier(struct notifier_block *this,
 			   unsigned long msg, void *ptr)
 {
@@ -4355,44 +4710,20 @@ static int packet_notifier(struct notifier_block *this,
 
 	rcu_read_lock();
 	sk_for_each_rcu(sk, &net->packet.sklist) {
-		struct packet_sock *po = pkt_sk(sk);
-
 		switch (msg) {
 		case NETDEV_UNREGISTER:
-			if (po->mclist)
-				packet_dev_mclist_delete(dev, &po->mclist);
 			/* fallthrough */
-
 		case NETDEV_DOWN:
-			if (dev->ifindex == po->ifindex) {
-				spin_lock(&po->bind_lock);
-				if (po->running) {
-					__unregister_prot_hook(sk, false);
-					sk->sk_err = ENETDOWN;
-					if (!sock_flag(sk, SOCK_DEAD))
-						sk->sk_error_report(sk);
-				}
-				if (msg == NETDEV_UNREGISTER) {
-					packet_cached_dev_reset(po);
-					po->ifindex = -1;
-					if (po->prot_hook.dev)
-						dev_put(po->prot_hook.dev);
-					po->prot_hook.dev = NULL;
-				}
-				spin_unlock(&po->bind_lock);
-			}
+			packet_notifier_down(sk, dev,
+					     msg == NETDEV_UNREGISTER);
 			break;
 		case NETDEV_UP:
-			if (dev->ifindex == po->ifindex) {
-				spin_lock(&po->bind_lock);
-				if (po->num)
-					register_prot_hook(sk);
-				spin_unlock(&po->bind_lock);
-			}
+			packet_notifier_up(sk, dev);
 			break;
 		}
 	}
 	rcu_read_unlock();
+
 	return NOTIFY_DONE;
 }
 
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 3eedab29e4d7..1551cbe7b47b 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -116,6 +116,7 @@ struct packet_sock {
 	struct packet_ring_buffer	tx_ring;
 
 	struct tp4_umem			*umem;
+	struct tp4_netdev_parms __rcu	*zc;
 
 	int			copy_thresh;
 	spinlock_t		bind_lock;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 08/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (6 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4 Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 09/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support Björn Töpel
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This commit adds an implementation for ndo_tp4_zerocopy.

When an AF_PACKET V4 socket enables zerocopy, it will trigger the
ndo_tp4_zerocopy implementation. The selected queue pair is disabled,
TP4 mode is enabled and the queue pair is re-enabled.

Instead of allocating buffers from the page allocator, buffers from
the userland TP4 socket are used. The i40e_alloc_rx_buffers_tp4
function does the allocation.

Pulling buffers from the hardware descriptor queue, validating them
and passing descriptors to userland are all done in
i40e_clean_rx_tp4_irq.

Common code for updating stats in i40e_clean_rx_irq and
i40e_clean_rx_tp4_irq has been refactored out into a function.

As Rx allocation, descriptor configuration and hardware descriptor
ring clean up now have multiple implementations, a couple of new
members have been introduced into the struct i40e_ring: two function
pointers, one for Rx buffer allocation and one for Rx clean up. The
i40e_ring also contains some Rx descriptor configuration parameters
(rx_buf_len and rx_max_frame), since each Rx ring can potentially
have a different configuration. This also opens the door to future
16B descriptor usage for TP4 rings.
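
In rough terms (a condensed sketch; the exact clean_irq signature is
an assumption based on the existing i40e_clean_rx_irq, and the full
changes are in i40e_txrx.c/h below), the new members turn the Rx path
into a per-ring dispatch:

	/* ring setup selects the implementation per ring */
	ring->rx_alloc_fn = i40e_alloc_rx_buffers;   /* or i40e_alloc_rx_buffers_tp4 */
	ring->clean_irq   = i40e_clean_rx_irq;       /* or i40e_clean_rx_tp4_irq */

	/* the hot paths then call through the pointers */
	ring->rx_alloc_fn(ring, I40E_DESC_UNUSED(ring));
	cleaned = ring->clean_irq(ring, budget);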

The TP4 implementation does not use the struct i40e_rx_buffer to track
hardware descriptor metadata, but instead uses the packet array
directly from tpacket4.h.

All TP4 state is kept in the struct i40e_ring. However, to allow a
zerocopy context to survive a soft reset, e.g. when changing the
number of queue pairs via ethtool, functionality for storing the TP4
context in the vsi is required. When a soft reset is done, we store
the TP4 state in the vsi. The vsi rings are torn down, and when the
rings are set up again, the TP4 state is restored from the vsi.
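
A condensed sketch of that save/restore flow (the restore call site is
not shown in this hunk, so its placement here is an assumption):

	/* before the soft reset: stash each TP4-enabled ring's context */
	i40e_pf_save_tp4_ctx_all_vsi(pf);	/* called from i40e_prep_for_reset() */

	/* after the rings have been rebuilt: put the contexts back */
	i40e_vsi_restore_tp4_ctxs(vsi);		/* presumably from the VSI rebuild path */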

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h         |   3 +
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   9 +
 drivers/net/ethernet/intel/i40e/i40e_main.c    | 751 ++++++++++++++++++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 196 ++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |  34 ++
 include/linux/tpacket4.h                       |  85 +++
 6 files changed, 1033 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index eb017763646d..56dff7d314c4 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -744,6 +744,9 @@ struct i40e_vsi {
 
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
+
+	struct i40e_tp4_ctx **tp4_ctxs; /* Rx context */
+	u16 num_tp4_ctxs;
 } ____cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 9eb618799a30..da64776108c6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1515,6 +1515,15 @@ static int i40e_set_ringparam(struct net_device *netdev,
 		goto done;
 	}
 
+	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		if (ring_uses_tp4(vsi->rx_rings[i])) {
+			netdev_warn(netdev,
+				    "FIXME TP4 zerocopy does not support changing descriptors. Take down the interface first\n");
+			err = -ENOTSUPP;
+			goto done;
+		}
+	}
+
 	/* We can't just free everything and then setup again,
 	 * because the ISRs in MSI-X mode get passed pointers
 	 * to the Tx and Rx ring structs.
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 54ff34faca37..5456ef6cce1b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3187,8 +3187,6 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	/* clear the context structure first */
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-	ring->rx_buf_len = vsi->rx_buf_len;
-
 	rx_ctx.dbuff = DIV_ROUND_UP(ring->rx_buf_len,
 				    BIT_ULL(I40E_RXQ_CTX_DBUFF_SHIFT));
 
@@ -3203,7 +3201,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	 */
 	rx_ctx.hsplit_0 = 0;
 
-	rx_ctx.rxmax = min_t(u16, vsi->max_frame, chain_len * ring->rx_buf_len);
+	rx_ctx.rxmax = min_t(u16, ring->rx_max_frame,
+			     chain_len * ring->rx_buf_len);
 	if (hw->revision_id == 0)
 		rx_ctx.lrxqthresh = 0;
 	else
@@ -3243,7 +3242,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q);
 	writel(0, ring->tail);
 
-	i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring));
+	ring->rx_alloc_fn(ring, I40E_DESC_UNUSED(ring));
 
 	return 0;
 }
@@ -3282,21 +3281,6 @@ static int i40e_vsi_configure_rx(struct i40e_vsi *vsi)
 	int err = 0;
 	u16 i;
 
-	if (!vsi->netdev || (vsi->back->flags & I40E_FLAG_LEGACY_RX)) {
-		vsi->max_frame = I40E_MAX_RXBUFFER;
-		vsi->rx_buf_len = I40E_RXBUFFER_2048;
-#if (PAGE_SIZE < 8192)
-	} else if (!I40E_2K_TOO_SMALL_WITH_PADDING &&
-		   (vsi->netdev->mtu <= ETH_DATA_LEN)) {
-		vsi->max_frame = I40E_RXBUFFER_1536 - NET_IP_ALIGN;
-		vsi->rx_buf_len = I40E_RXBUFFER_1536 - NET_IP_ALIGN;
-#endif
-	} else {
-		vsi->max_frame = I40E_MAX_RXBUFFER;
-		vsi->rx_buf_len = (PAGE_SIZE < 8192) ? I40E_RXBUFFER_3072 :
-						       I40E_RXBUFFER_2048;
-	}
-
 	/* set up individual rings */
 	for (i = 0; i < vsi->num_queue_pairs && !err; i++)
 		err = i40e_configure_rx_ring(vsi->rx_rings[i]);
@@ -4778,6 +4762,193 @@ static void i40e_pf_unquiesce_all_vsi(struct i40e_pf *pf)
 }
 
 /**
+ * i40e_vsi_free_tp4_ctxs - Free TP4 contexts
+ * @vsi: vsi
+ */
+static void i40e_vsi_free_tp4_ctxs(struct i40e_vsi *vsi)
+{
+	int i;
+
+	if (!vsi->tp4_ctxs)
+		return;
+
+	for (i = 0; i < vsi->num_tp4_ctxs; i++)
+		kfree(vsi->tp4_ctxs[i]);
+
+	kfree(vsi->tp4_ctxs);
+	vsi->tp4_ctxs = NULL;
+}
+
+/**
+ * i40e_qp_error_report_tp4 - Trigger the TP4 error handler
+ * @vsi: vsi
+ * @queue_pair: queue_pair to report
+ * @errno: the error code
+ **/
+static void i40e_qp_error_report_tp4(struct i40e_vsi *vsi, int queue_pair,
+				     int errno)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+
+	rxr->tp4.err_handler(rxr->tp4.err_opaque, errno);
+}
+
+/**
+ * i40e_qp_uses_tp4 - Check for TP4 usage
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns true if TP4 is enabled, else false.
+ **/
+static bool i40e_qp_uses_tp4(struct i40e_vsi *vsi, int queue_pair)
+{
+	return ring_uses_tp4(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_vsi_save_tp4_ctxs - Save TP4 context to a vsi
+ * @vsi: vsi
+ */
+static void i40e_vsi_save_tp4_ctxs(struct i40e_vsi *vsi)
+{
+	int i = 0;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return;
+
+	kfree(vsi->tp4_ctxs); /* Let's be cautious */
+
+	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		if (i40e_qp_uses_tp4(vsi, i)) {
+			if (!vsi->tp4_ctxs) {
+				vsi->tp4_ctxs = kcalloc(vsi->num_queue_pairs,
+							sizeof(*vsi->tp4_ctxs),
+							GFP_KERNEL);
+				if (!vsi->tp4_ctxs)
+					goto out;
+
+				vsi->num_tp4_ctxs = vsi->num_queue_pairs;
+			}
+
+			vsi->tp4_ctxs[i] = kzalloc(sizeof(struct i40e_tp4_ctx),
+						   GFP_KERNEL);
+			if (!vsi->tp4_ctxs[i])
+				goto out_elmn;
+
+			*vsi->tp4_ctxs[i] = vsi->rx_rings[i]->tp4;
+		}
+	}
+
+	return;
+
+out_elmn:
+	i40e_vsi_free_tp4_ctxs(vsi);
+out:
+	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		if (i40e_qp_uses_tp4(vsi, i))
+			i40e_qp_error_report_tp4(vsi, i, ENOMEM);
+	}
+}
+
+/**
+ * i40e_tp4_set_rx_handler - Sets the Rx clean_irq function for TP4
+ * @rxr: ingress ring
+ **/
+static void i40e_tp4_set_rx_handler(struct i40e_ring *rxr)
+{
+	unsigned int buf_len;
+
+	buf_len = min_t(unsigned int,
+			tp4a_max_data_size(rxr->tp4.arr),
+			I40E_MAX_RXBUFFER) &
+		  ~(BIT(I40E_RXQ_CTX_DBUFF_SHIFT) - 1);
+
+	/* Currently we don't allow packets spanning multiple
+	 * buffers.
+	 */
+	rxr->rx_buf_len = buf_len;
+	rxr->rx_max_frame = buf_len;
+	rxr->rx_alloc_fn = i40e_alloc_rx_buffers_tp4;
+	rxr->clean_irq = i40e_clean_rx_tp4_irq;
+}
+
+/**
+ * i40e_tp4_flush_all - Flush all outstanding descriptors to userland
+ * @a: pointer to the packet array
+ **/
+static void i40e_tp4_flush_all(struct tp4_packet_array *a)
+{
+	struct tp4_frame_set f;
+
+	/* Flush all outstanding requests. */
+	if (tp4a_get_flushable_frame_set(a, &f)) {
+		do {
+			tp4f_set_frame(&f, 0, 0, true);
+		} while (tp4f_next_frame(&f));
+	}
+
+	WARN_ON(tp4a_flush(a));
+}
+
+/**
+ * i40e_tp4_restore - Restores to a previous TP4 state
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @rx_ctx: the Rx TP4 context
+ **/
+static void i40e_tp4_restore(struct i40e_vsi *vsi, int queue_pair,
+			     struct i40e_tp4_ctx *rx_ctx)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+
+	rxr->tp4 = *rx_ctx;
+	i40e_tp4_flush_all(rxr->tp4.arr);
+	i40e_tp4_set_rx_handler(rxr);
+
+	set_ring_tp4(rxr);
+}
+
+/**
+ * i40e_vsi_restore_tp4_ctxs - Restores all contexts
+ * @vsi: vsi
+ **/
+static void i40e_vsi_restore_tp4_ctxs(struct i40e_vsi *vsi)
+{
+	u16 i, elms;
+
+	if (!vsi->tp4_ctxs)
+		return;
+
+	elms = min(vsi->num_queue_pairs, vsi->num_tp4_ctxs);
+	for (i = 0; i < elms; i++) {
+		if (!vsi->tp4_ctxs[i])
+			continue;
+		i40e_tp4_restore(vsi, i, vsi->tp4_ctxs[i]);
+	}
+
+	i40e_vsi_free_tp4_ctxs(vsi);
+}
+
+/**
+ * i40e_pf_save_tp4_ctx_all_vsi - Saves all TP4 contexts
+ * @pf: pf
+ */
+static void i40e_pf_save_tp4_ctx_all_vsi(struct i40e_pf *pf)
+{
+	struct i40e_vsi *vsi;
+	int v;
+
+	/* The rings are about to be removed at reset; Saving the TP4
+	 * context in the vsi temporarily
+	 */
+	for (v = 0; v < pf->num_alloc_vsi; v++) {
+		vsi = pf->vsi[v];
+		if (vsi && vsi->netdev)
+			i40e_vsi_save_tp4_ctxs(vsi);
+	}
+}
+
+/**
  * i40e_vsi_wait_queues_disabled - Wait for VSI's queues to be disabled
  * @vsi: the VSI being configured
  *
@@ -6511,6 +6682,8 @@ int i40e_up(struct i40e_vsi *vsi)
 	return err;
 }
 
+static void __i40e_tp4_disable(struct i40e_vsi *vsi, int queue_pair);
+
 /**
  * i40e_down - Shutdown the connection processing
  * @vsi: the VSI being stopped
@@ -6531,6 +6704,7 @@ void i40e_down(struct i40e_vsi *vsi)
 	i40e_napi_disable_all(vsi);
 
 	for (i = 0; i < vsi->num_queue_pairs; i++) {
+		__i40e_tp4_disable(vsi, i);
 		i40e_clean_tx_ring(vsi->tx_rings[i]);
 		if (i40e_enabled_xdp_vsi(vsi))
 			i40e_clean_tx_ring(vsi->xdp_rings[i]);
@@ -8224,6 +8398,7 @@ static void i40e_prep_for_reset(struct i40e_pf *pf, bool lock_acquired)
 	/* pf_quiesce_all_vsi modifies netdev structures -rtnl_lock needed */
 	if (!lock_acquired)
 		rtnl_lock();
+	i40e_pf_save_tp4_ctx_all_vsi(pf);
 	i40e_pf_quiesce_all_vsi(pf);
 	if (!lock_acquired)
 		rtnl_unlock();
@@ -9082,7 +9257,7 @@ static int i40e_vsi_clear(struct i40e_vsi *vsi)
 
 	i40e_vsi_free_arrays(vsi, true);
 	i40e_clear_rss_config_user(vsi);
-
+	i40e_vsi_free_tp4_ctxs(vsi);
 	pf->vsi[vsi->idx] = NULL;
 	if (vsi->idx < pf->next_vsi)
 		pf->next_vsi = vsi->idx;
@@ -9115,6 +9290,28 @@ static void i40e_vsi_clear_rings(struct i40e_vsi *vsi)
 }
 
 /**
+ * i40e_vsi_setup_rx_size - Setup Rx buffer sizes
+ * @vsi: vsi
+ **/
+static void i40e_vsi_setup_rx_size(struct i40e_vsi *vsi)
+{
+	if (!vsi->netdev || (vsi->back->flags & I40E_FLAG_LEGACY_RX)) {
+		vsi->max_frame = I40E_MAX_RXBUFFER;
+		vsi->rx_buf_len = I40E_RXBUFFER_2048;
+#if (PAGE_SIZE < 8192)
+	} else if (!I40E_2K_TOO_SMALL_WITH_PADDING &&
+		   (vsi->netdev->mtu <= ETH_DATA_LEN)) {
+		vsi->max_frame = I40E_RXBUFFER_1536 - NET_IP_ALIGN;
+		vsi->rx_buf_len = I40E_RXBUFFER_1536 - NET_IP_ALIGN;
+#endif
+	} else {
+		vsi->max_frame = I40E_MAX_RXBUFFER;
+		vsi->rx_buf_len = (PAGE_SIZE < 8192) ? I40E_RXBUFFER_3072 :
+				  I40E_RXBUFFER_2048;
+	}
+}
+
+/**
  * i40e_alloc_rings - Allocates the Rx and Tx rings for the provided VSI
  * @vsi: the VSI being configured
  **/
@@ -9124,6 +9321,8 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 	struct i40e_pf *pf = vsi->back;
 	struct i40e_ring *ring;
 
+	i40e_vsi_setup_rx_size(vsi);
+
 	/* Set basic values in the rings to be used later during open() */
 	for (i = 0; i < vsi->alloc_queue_pairs; i++) {
 		/* allocate space for both Tx and Rx in one shot */
@@ -9171,6 +9370,10 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		ring->netdev = vsi->netdev;
 		ring->dev = &pf->pdev->dev;
 		ring->count = vsi->num_desc;
+		ring->rx_buf_len = vsi->rx_buf_len;
+		ring->rx_max_frame = vsi->max_frame;
+		ring->rx_alloc_fn = i40e_alloc_rx_buffers;
+		ring->clean_irq = i40e_clean_rx_irq;
 		ring->size = 0;
 		ring->dcb_tc = 0;
 		ring->rx_itr_setting = pf->rx_itr_default;
@@ -9909,7 +10112,7 @@ static int i40e_pf_config_rss(struct i40e_pf *pf)
 int i40e_reconfig_rss_queues(struct i40e_pf *pf, int queue_count)
 {
 	struct i40e_vsi *vsi = pf->vsi[pf->lan_vsi];
-	int new_rss_size;
+	int i, new_rss_size;
 
 	if (!(pf->flags & I40E_FLAG_RSS_ENABLED))
 		return 0;
@@ -9919,6 +10122,11 @@ int i40e_reconfig_rss_queues(struct i40e_pf *pf, int queue_count)
 	if (queue_count != vsi->num_queue_pairs) {
 		u16 qcount;
 
+		for (i = queue_count; i < vsi->num_queue_pairs; i++) {
+			if (i40e_qp_uses_tp4(vsi, i))
+				i40e_qp_error_report_tp4(vsi, i, ENOENT);
+		}
+
 		vsi->req_queue_pairs = queue_count;
 		i40e_prep_for_reset(pf, true);
 
@@ -10762,6 +10970,505 @@ static int i40e_xdp(struct net_device *dev,
 	}
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+	int timeout = 50;
+
+	while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+		timeout--;
+		if (!timeout)
+			return -EBUSY;
+		usleep_range(1000, 2000);
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+	struct i40e_pf *pf = vsi->back;
+
+	clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_qp_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_qp_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+	memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	       sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+	memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	       sizeof(vsi->tx_rings[queue_pair]->stats));
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		       sizeof(vsi->xdp_rings[queue_pair]->stats));
+	}
+}
+
+/**
+ * i40e_qp_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_qp_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+	i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+	if (i40e_enabled_xdp_vsi(vsi))
+		i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+	i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_qp_control_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_qp_control_napi(struct i40e_vsi *vsi, int queue_pair,
+				 bool enable)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_q_vector *q_vector = rxr->q_vector;
+
+	if (!vsi->netdev)
+		return;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (q_vector->rx.ring || q_vector->tx.ring) {
+		if (enable)
+			napi_enable(&q_vector->napi);
+		else
+			napi_disable(&q_vector->napi);
+	}
+}
+
+/**
+ * i40e_qp_control_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_qp_control_rings(struct i40e_vsi *vsi, int queue_pair,
+				 bool enable)
+{
+	struct i40e_pf *pf = vsi->back;
+	int pf_q, ret = 0;
+
+	pf_q = vsi->base_queue + queue_pair;
+	ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+				     false /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	i40e_control_rx_q(pf, pf_q, enable);
+	ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d Rx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+		return ret;
+	}
+
+	/* Due to HW errata, on Rx disable only, the register can
+	 * indicate done before it really is. Needs 50ms to be sure
+	 */
+	if (!enable)
+		mdelay(50);
+
+	if (!i40e_enabled_xdp_vsi(vsi))
+		return ret;
+
+	ret = i40e_control_wait_tx_q(vsi->seid, pf,
+				     pf_q + vsi->alloc_queue_pairs,
+				     true /*is xdp*/, enable);
+	if (ret) {
+		dev_info(&pf->pdev->dev,
+			 "VSI seid %d XDP Tx ring %d %sable timeout\n",
+			 vsi->seid, pf_q, (enable ? "en" : "dis"));
+	}
+
+	return ret;
+}
+
+/**
+ * i40e_qp_enable_irq - Enables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_qp_enable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* All rings in a qp belong to the same qvector. */
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED)
+		i40e_irq_dynamic_enable(vsi, rxr->q_vector->v_idx);
+	else
+		i40e_irq_dynamic_enable_icr0(pf);
+
+	i40e_flush(hw);
+}
+
+/**
+ * i40e_qp_disable_irq - Disables interrupts for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue_pair
+ **/
+static void i40e_qp_disable_irq(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_pf *pf = vsi->back;
+	struct i40e_hw *hw = &pf->hw;
+
+	/* For simplicity, instead of removing the qp interrupt causes
+	 * from the interrupt linked list, we simply disable the interrupt, and
+	 * leave the list intact.
+	 *
+	 * All rings in a qp belong to the same qvector.
+	 */
+
+	if (pf->flags & I40E_FLAG_MSIX_ENABLED) {
+		u32 intpf = vsi->base_vector + rxr->q_vector->v_idx;
+
+		wr32(hw, I40E_PFINT_DYN_CTLN(intpf - 1), 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->msix_entries[intpf].vector);
+	} else {
+		/* Legacy and MSI mode - this stops all interrupt handling */
+		wr32(hw, I40E_PFINT_ICR0_ENA, 0);
+		wr32(hw, I40E_PFINT_DYN_CTL0, 0);
+		i40e_flush(hw);
+		synchronize_irq(pf->pdev->irq);
+	}
+}
+
+/**
+ * i40e_qp_disable - Disables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_qp_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_enter_busy_conf(vsi);
+	if (err)
+		return err;
+
+	i40e_qp_disable_irq(vsi, queue_pair);
+	err = i40e_qp_control_rings(vsi, queue_pair, false /* disable */);
+	i40e_qp_control_napi(vsi, queue_pair, false /* disable */);
+	i40e_qp_clean_rings(vsi, queue_pair);
+	i40e_qp_reset_stats(vsi, queue_pair);
+
+	return err;
+}
+
+/**
+ * i40e_qp_enable - Enables a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_qp_enable(struct i40e_vsi *vsi, int queue_pair)
+{
+	int err;
+
+	err = i40e_configure_tx_ring(vsi->tx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	if (i40e_enabled_xdp_vsi(vsi)) {
+		err = i40e_configure_tx_ring(vsi->xdp_rings[queue_pair]);
+		if (err)
+			return err;
+	}
+
+	err = i40e_configure_rx_ring(vsi->rx_rings[queue_pair]);
+	if (err)
+		return err;
+
+	err = i40e_qp_control_rings(vsi, queue_pair, true /* enable */);
+	i40e_qp_control_napi(vsi, queue_pair, true /* enable */);
+	i40e_qp_enable_irq(vsi, queue_pair);
+
+	i40e_exit_busy_conf(vsi);
+
+	return err;
+}
+
+/**
+ * i40e_qp_kick_napi - Schedules a NAPI run
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_qp_kick_napi(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+
+	napi_schedule(&rxr->q_vector->napi);
+}
+
+/**
+ * i40e_vsi_get_tp4_rx_ctx - Retrieves the Rx TP4 context, if any.
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns NULL if there's no context available.
+ **/
+static struct i40e_tp4_ctx *i40e_vsi_get_tp4_rx_ctx(struct i40e_vsi *vsi,
+						    int queue_pair)
+{
+	if (!vsi->tp4_ctxs)
+		return NULL;
+
+	return vsi->tp4_ctxs[queue_pair];
+}
+
+/**
+ * i40e_tp4_disable_rx - Disables TP4 Rx mode
+ * @rxr: ingress ring
+ **/
+static void i40e_tp4_disable_rx(struct i40e_ring *rxr)
+{
+	/* Don't free, if the context is saved! */
+	if (i40e_vsi_get_tp4_rx_ctx(rxr->vsi, rxr->queue_index))
+		rxr->tp4.arr = NULL;
+	else
+		tp4a_free(rxr->tp4.arr);
+
+	memset(&rxr->tp4, 0, sizeof(rxr->tp4));
+	clear_ring_tp4(rxr);
+
+	rxr->rx_buf_len = rxr->vsi->rx_buf_len;
+	rxr->rx_max_frame = rxr->vsi->max_frame;
+	rxr->rx_alloc_fn = i40e_alloc_rx_buffers;
+	rxr->clean_irq = i40e_clean_rx_irq;
+}
+
+/**
+ * __i40e_tp4_disable - Disables TP4 for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void __i40e_tp4_disable(struct i40e_vsi *vsi, int queue_pair)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+
+	if (!i40e_qp_uses_tp4(vsi, queue_pair))
+		return;
+
+	i40e_tp4_disable_rx(rxr);
+}
+
+/**
+ * i40e_tp4_disable - Disables zerocopy
+ * @netdev: netdevice
+ * @params: tp4 params
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_tp4_disable(struct net_device *netdev,
+			    struct tp4_netdev_parms *params)
+{
+	struct i40e_netdev_priv *np = netdev_priv(netdev);
+	struct i40e_vsi *vsi = np->vsi;
+	int err;
+
+	if (params->queue_pair < 0 ||
+	    params->queue_pair >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (!i40e_qp_uses_tp4(vsi, params->queue_pair))
+		return 0;
+
+	netdev_info(
+		netdev,
+		"disabling TP4 zerocopy qp=%d, failed Rx allocations: %llu\n",
+		params->queue_pair,
+		vsi->rx_rings[params->queue_pair]->rx_stats.alloc_page_failed);
+
+	err =  i40e_qp_disable(vsi, params->queue_pair);
+	if (err) {
+		netdev_warn(
+			netdev,
+			"could not disable qp=%d err=%d, failed disabling TP4 zerocopy\n",
+			params->queue_pair,
+			err);
+		return err;
+	}
+
+	__i40e_tp4_disable(vsi, params->queue_pair);
+
+	err =  i40e_qp_enable(vsi, params->queue_pair);
+	if (err) {
+		netdev_warn(
+			netdev,
+			"could not re-enable qp=%d err=%d, failed disabling TP4 zerocopy\n",
+			params->queue_pair,
+			err);
+		return err;
+	}
+
+	return 0;
+}
+
+/**
+ * i40e_tp4_enable_rx - Enables TP4 Rx
+ * @rxr: ingress ring
+ * @params: tp4 params
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_tp4_enable_rx(struct i40e_ring *rxr,
+			      struct tp4_netdev_parms *params)
+{
+	size_t elems = __roundup_pow_of_two(rxr->count * 8);
+	struct tp4_packet_array *arr;
+
+	arr = tp4a_rx_new(params->rx_opaque, elems, rxr->dev);
+	if (!arr)
+		return -ENOMEM;
+
+	rxr->tp4.arr = arr;
+	rxr->tp4.ev_handler = params->data_ready;
+	rxr->tp4.ev_opaque = params->data_ready_opaque;
+	rxr->tp4.err_handler = params->error_report;
+	rxr->tp4.err_opaque = params->error_report_opaque;
+
+	i40e_tp4_set_rx_handler(rxr);
+
+	set_ring_tp4(rxr);
+
+	return 0;
+}
+
+/**
+ * __i40e_tp4_enable - Enables TP4
+ * @vsi: vsi
+ * @params: tp4 params
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int __i40e_tp4_enable(struct i40e_vsi *vsi,
+			     struct tp4_netdev_parms *params)
+{
+	struct i40e_ring *rxr = vsi->rx_rings[params->queue_pair];
+	int err;
+
+	err = i40e_tp4_enable_rx(rxr, params);
+	if (err)
+		return err;
+
+	return 0;
+}
+
+/**
+ * i40e_tp4_enable - Enables zerocopy
+ * @netdev: netdevice
+ * @params: tp4 params
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_tp4_enable(struct net_device *netdev,
+			   struct tp4_netdev_parms *params)
+{
+	struct i40e_netdev_priv *np = netdev_priv(netdev);
+	struct i40e_vsi *vsi = np->vsi;
+	int err;
+
+	if (vsi->type != I40E_VSI_MAIN)
+		return -EINVAL;
+
+	if (params->queue_pair < 0 ||
+	    params->queue_pair >= vsi->num_queue_pairs)
+		return -EINVAL;
+
+	if (!netif_running(netdev))
+		return -ENETDOWN;
+
+	if (i40e_qp_uses_tp4(vsi, params->queue_pair))
+		return -EBUSY;
+
+	if (!params->rx_opaque)
+		return -EINVAL;
+
+	err =  i40e_qp_disable(vsi, params->queue_pair);
+	if (err) {
+		netdev_warn(netdev, "could not disable qp=%d err=%d, failed enabling TP4 zerocopy\n",
+			    params->queue_pair, err);
+		return err;
+	}
+
+	err =  __i40e_tp4_enable(vsi, params);
+	if (err) {
+		netdev_warn(netdev, "__i40e_tp4_enable qp=%d err=%d, failed enabling TP4 zerocopy\n",
+			    params->queue_pair, err);
+		return err;
+	}
+
+	err = i40e_qp_enable(vsi, params->queue_pair);
+	if (err) {
+		netdev_warn(netdev, "could not re-enable qp=%d err=%d, failed enabling TP4 zerocopy\n",
+			    params->queue_pair, err);
+		return err;
+	}
+
+	/* Kick NAPI to make sure that allocation from userland
+	 * actually worked.
+	 */
+	i40e_qp_kick_napi(vsi, params->queue_pair);
+
+	netdev_info(netdev, "enabled TP4 zerocopy\n");
+	return 0;
+}
+
+/**
+ * i40e_tp4_zerocopy - enables/disables zerocopy
+ * @netdev: netdevice
+ * @params: tp4 params
+ *
+ * Returns zero on success
+ **/
+static int i40e_tp4_zerocopy(struct net_device *netdev,
+			     struct tp4_netdev_parms *params)
+{
+	switch (params->command) {
+	case TP4_ENABLE:
+		return i40e_tp4_enable(netdev, params);
+
+	case TP4_DISABLE:
+		return i40e_tp4_disable(netdev, params);
+
+	default:
+		return -ENOTSUPP;
+	}
+}
+
 static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_open		= i40e_open,
 	.ndo_stop		= i40e_close,
@@ -10795,6 +11502,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bridge_getlink	= i40e_ndo_bridge_getlink,
 	.ndo_bridge_setlink	= i40e_ndo_bridge_setlink,
 	.ndo_xdp		= i40e_xdp,
+	.ndo_tp4_zerocopy	= i40e_tp4_zerocopy,
 };
 
 /**
@@ -11439,6 +12147,7 @@ static struct i40e_vsi *i40e_vsi_reinit_setup(struct i40e_vsi *vsi)
 	ret = i40e_alloc_rings(vsi);
 	if (ret)
 		goto err_rings;
+	i40e_vsi_restore_tp4_ctxs(vsi);
 
 	/* map all of the rings to the q_vectors */
 	i40e_vsi_map_rings_to_vectors(vsi);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c5cd233c8fee..54c5b7975066 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1083,6 +1083,21 @@ static inline bool i40e_rx_is_programming_status(u64 qw)
 }
 
 /**
+ * i40e_inc_rx_next_to_clean - Bumps the next to clean
+ * @ring: ingress ring
+ */
+static inline void i40e_inc_rx_next_to_clean(struct i40e_ring *ring)
+{
+	u32 ntc;
+
+	ntc = ring->next_to_clean + 1;
+	ntc = (ntc < ring->count) ? ntc : 0;
+	ring->next_to_clean = ntc;
+
+	prefetch(I40E_RX_DESC(ring, ntc));
+}
+
+/**
  * i40e_clean_programming_status - clean the programming status descriptor
  * @rx_ring: the rx ring that has this descriptor
  * @rx_desc: the rx descriptor written back by HW
@@ -1098,15 +1113,10 @@ static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
 					  u64 qw)
 {
 	struct i40e_rx_buffer *rx_buffer;
-	u32 ntc = rx_ring->next_to_clean;
 	u8 id;
 
-	/* fetch, update, and store next to clean */
-	rx_buffer = &rx_ring->rx_bi[ntc++];
-	ntc = (ntc < rx_ring->count) ? ntc : 0;
-	rx_ring->next_to_clean = ntc;
-
-	prefetch(I40E_RX_DESC(rx_ring, ntc));
+	rx_buffer = &rx_ring->rx_bi[rx_ring->next_to_clean];
+	i40e_inc_rx_next_to_clean(rx_ring);
 
 	/* place unused page back on the ring */
 	i40e_reuse_rx_page(rx_ring, rx_buffer);
@@ -1958,6 +1968,18 @@ static void i40e_put_rx_buffer(struct i40e_ring *rx_ring,
 }
 
 /**
+ * i40e_is_rx_desc_eof - Checks if Rx descriptor is end of frame
+ * @rx_desc: rx_desc
+ *
+ * Returns true if EOF, false otherwise.
+ **/
+static inline bool i40e_is_rx_desc_eof(union i40e_rx_desc *rx_desc)
+{
+#define I40E_RXD_EOF BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)
+	return i40e_test_staterr(rx_desc, I40E_RXD_EOF);
+}
+
+/**
  * i40e_is_non_eop - process handling of non-EOP buffers
  * @rx_ring: Rx ring being processed
  * @rx_desc: Rx descriptor for current buffer
@@ -1972,17 +1994,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 			    union i40e_rx_desc *rx_desc,
 			    struct sk_buff *skb)
 {
-	u32 ntc = rx_ring->next_to_clean + 1;
-
-	/* fetch, update, and store next to clean */
-	ntc = (ntc < rx_ring->count) ? ntc : 0;
-	rx_ring->next_to_clean = ntc;
-
-	prefetch(I40E_RX_DESC(rx_ring, ntc));
+	i40e_inc_rx_next_to_clean(rx_ring);
 
 	/* if we are the last buffer then there is nothing else to do */
-#define I40E_RXD_EOF BIT(I40E_RX_DESC_STATUS_EOF_SHIFT)
-	if (likely(i40e_test_staterr(rx_desc, I40E_RXD_EOF)))
+	if (likely(i40e_is_rx_desc_eof(rx_desc)))
 		return false;
 
 	rx_ring->rx_stats.non_eop_descs++;
@@ -2060,6 +2075,24 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 }
 
 /**
+ * i40e_update_rx_stats - Updates the Rx statistics
+ * @rxr: ingress ring
+ * @rx_bytes: number of bytes
+ * @rx_packets: number of packets
+ **/
+static inline void i40e_update_rx_stats(struct i40e_ring *rxr,
+					unsigned int rx_bytes,
+					unsigned int rx_packets)
+{
+	u64_stats_update_begin(&rxr->syncp);
+	rxr->stats.packets += rx_packets;
+	rxr->stats.bytes += rx_bytes;
+	u64_stats_update_end(&rxr->syncp);
+	rxr->q_vector->rx.total_packets += rx_packets;
+	rxr->q_vector->rx.total_bytes += rx_bytes;
+}
+
+/**
  * i40e_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: rx descriptor ring to transact packets on
  * @budget: Total limit on number of packets to process
@@ -2071,7 +2104,7 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
  *
  * Returns amount of work completed
  **/
-static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
+int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	struct sk_buff *skb = rx_ring->skb;
@@ -2205,17 +2238,84 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 
 	rx_ring->skb = skb;
 
-	u64_stats_update_begin(&rx_ring->syncp);
-	rx_ring->stats.packets += total_rx_packets;
-	rx_ring->stats.bytes += total_rx_bytes;
-	u64_stats_update_end(&rx_ring->syncp);
-	rx_ring->q_vector->rx.total_packets += total_rx_packets;
-	rx_ring->q_vector->rx.total_bytes += total_rx_bytes;
+	i40e_update_rx_stats(rx_ring, total_rx_bytes, total_rx_packets);
 
 	/* guarantee a trip back through this routine if there was a failure */
 	return failure ? budget : (int)total_rx_packets;
 }
 
+/**
+ * i40e_get_rx_desc_size - Returns the size of a received frame
+ * @rxd: rx descriptor
+ *
+ * Returns number of bytes received.
+ **/
+static inline unsigned int i40e_get_rx_desc_size(union i40e_rx_desc *rxd)
+{
+	u64 qword = le64_to_cpu(rxd->wb.qword1.status_error_len);
+	unsigned int size;
+
+	size = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
+	       I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
+
+	return size;
+}
+
+/**
+ * i40e_clean_rx_tp4_irq - Pulls received packets off the descriptor ring
+ * @rxr: ingress ring
+ * @budget: NAPI budget
+ *
+ * Returns number of received packets.
+ **/
+int i40e_clean_rx_tp4_irq(struct i40e_ring *rxr, int budget)
+{
+	int total_rx_bytes = 0, total_rx_packets = 0;
+	u16 cleaned_count = I40E_DESC_UNUSED(rxr);
+	struct tp4_frame_set frame_set;
+	bool failure;
+
+	if (!tp4a_get_flushable_frame_set(rxr->tp4.arr, &frame_set))
+		goto out;
+
+	while (total_rx_packets < budget) {
+		union i40e_rx_desc *rxd = I40E_RX_DESC(rxr, rxr->next_to_clean);
+		unsigned int size = i40e_get_rx_desc_size(rxd);
+
+		if (!size)
+			break;
+
+		/* This memory barrier is needed to keep us from
+		 * reading any other fields out of the rxd until we
+		 * have verified the descriptor has been written back.
+		 */
+		dma_rmb();
+
+		tp4f_set_frame_no_offset(&frame_set, size,
+					 i40e_is_rx_desc_eof(rxd));
+
+		total_rx_bytes += size;
+		total_rx_packets++;
+
+		i40e_inc_rx_next_to_clean(rxr);
+
+		WARN_ON(!tp4f_next_frame(&frame_set));
+	}
+
+	WARN_ON(tp4a_flush_n(rxr->tp4.arr, total_rx_packets));
+
+	rxr->tp4.ev_handler(rxr->tp4.ev_opaque);
+
+	i40e_update_rx_stats(rxr, total_rx_bytes, total_rx_packets);
+
+	cleaned_count += total_rx_packets;
+out:
+	failure = (cleaned_count >= I40E_RX_BUFFER_WRITE) ?
+		  i40e_alloc_rx_buffers_tp4(rxr, cleaned_count) : false;
+
+	return failure ? budget : total_rx_packets;
+}
+
 static u32 i40e_buildreg_itr(const int type, const u16 itr)
 {
 	u32 val;
@@ -2372,7 +2472,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	budget_per_ring = max(budget/q_vector->num_ringpairs, 1);
 
 	i40e_for_each_ring(ring, q_vector->rx) {
-		int cleaned = i40e_clean_rx_irq(ring, budget_per_ring);
+		int cleaned = ring->clean_irq(ring, budget_per_ring);
 
 		work_done += cleaned;
 		/* if we clean as many as budgeted, we must not be done */
@@ -3434,3 +3534,51 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 
 	return i40e_xmit_frame_ring(skb, tx_ring);
 }
+
+/**
+ * i40e_alloc_rx_buffers_tp4 - Allocate buffers from the TP4 userland ring
+ * @rxr: ingress ring
+ * @cleaned_count: number of buffers to allocate
+ *
+ * Returns true on failure, false on success.
+ **/
+bool i40e_alloc_rx_buffers_tp4(struct i40e_ring *rxr, u16 cleaned_count)
+{
+	u16 i, ntu = rxr->next_to_use;
+	union i40e_rx_desc *rx_desc;
+	struct tp4_frame_set frame;
+	bool ret = false;
+	dma_addr_t dma;
+
+	rx_desc = I40E_RX_DESC(rxr, ntu);
+
+	for (i = 0; i < cleaned_count; i++) {
+		if (unlikely(!tp4a_next_frame_populate(rxr->tp4.arr, &frame))) {
+			rxr->rx_stats.alloc_page_failed++;
+			ret = true;
+			break;
+		}
+
+		dma = tp4f_get_dma(&frame);
+		dma_sync_single_for_device(rxr->dev, dma, rxr->rx_buf_len,
+					   DMA_FROM_DEVICE);
+
+		rx_desc->read.pkt_addr = cpu_to_le64(dma);
+
+		rx_desc++;
+		ntu++;
+		if (unlikely(ntu == rxr->count)) {
+			rx_desc = I40E_RX_DESC(rxr, 0);
+			ntu = 0;
+		}
+
+		/* clear the status bits for the next_to_use descriptor */
+		rx_desc->wb.qword1.status_error_len = 0;
+	}
+
+	if (rxr->next_to_use != ntu)
+		i40e_release_rx_desc(rxr, ntu);
+
+	return ret;
+}
+
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fbae1182e2ea..602dcd111938 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -27,6 +27,8 @@
 #ifndef _I40E_TXRX_H_
 #define _I40E_TXRX_H_
 
+#include <linux/tpacket4.h>
+
 /* Interrupt Throttling and Rate Limiting Goodies */
 
 #define I40E_MAX_ITR               0x0FF0  /* reg uses 2 usec resolution */
@@ -347,6 +349,14 @@ enum i40e_ring_state_t {
 	__I40E_RING_STATE_NBITS /* must be last */
 };
 
+struct i40e_tp4_ctx {
+	struct tp4_packet_array *arr;
+	void (*ev_handler)(void *);
+	void *ev_opaque;
+	void (*err_handler)(void *, int);
+	void *err_opaque;
+};
+
 /* some useful defines for virtchannel interface, which
  * is the only remaining user of header split
  */
@@ -385,6 +395,7 @@ struct i40e_ring {
 	u16 count;			/* Number of descriptors */
 	u16 reg_idx;			/* HW register index of the ring */
 	u16 rx_buf_len;
+	u16 rx_max_frame;
 
 	/* used in interrupt processing */
 	u16 next_to_use;
@@ -401,6 +412,7 @@ struct i40e_ring {
 #define I40E_TXR_FLAGS_WB_ON_ITR		BIT(0)
 #define I40E_RXR_FLAGS_BUILD_SKB_ENABLED	BIT(1)
 #define I40E_TXR_FLAGS_XDP			BIT(2)
+#define I40E_R_FLAGS_TP4			BIT(3)
 
 	/* stats structs */
 	struct i40e_queue_stats	stats;
@@ -428,6 +440,10 @@ struct i40e_ring {
 					 */
 
 	struct i40e_channel *ch;
+
+	bool (*rx_alloc_fn)(struct i40e_ring *rxr, u16 cleaned_count);
+	int (*clean_irq)(struct i40e_ring *ring, int budget);
+	struct i40e_tp4_ctx tp4;
 } ____cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)
@@ -455,6 +471,21 @@ static inline void set_ring_xdp(struct i40e_ring *ring)
 	ring->flags |= I40E_TXR_FLAGS_XDP;
 }
 
+static inline bool ring_uses_tp4(struct i40e_ring *ring)
+{
+	return !!(ring->flags & I40E_R_FLAGS_TP4);
+}
+
+static inline void set_ring_tp4(struct i40e_ring *ring)
+{
+	ring->flags |= I40E_R_FLAGS_TP4;
+}
+
+static inline void clear_ring_tp4(struct i40e_ring *ring)
+{
+	ring->flags &= ~I40E_R_FLAGS_TP4;
+}
+
 enum i40e_latency_range {
 	I40E_LOWEST_LATENCY = 0,
 	I40E_LOW_LATENCY = 1,
@@ -488,6 +519,9 @@ static inline unsigned int i40e_rx_pg_order(struct i40e_ring *ring)
 #define i40e_rx_pg_size(_ring) (PAGE_SIZE << i40e_rx_pg_order(_ring))
 
 bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
+int i40e_clean_rx_irq(struct i40e_ring *rxr, int budget);
+bool i40e_alloc_rx_buffers_tp4(struct i40e_ring *rxr, u16 cleaned_count);
+int i40e_clean_rx_tp4_irq(struct i40e_ring *rxr, int budget);
 netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
 void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
 void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index 839485108b2d..80bc20543599 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -658,6 +658,19 @@ static inline void *tp4q_get_data(struct tp4_queue *q,
 }
 
 /**
+ * tp4q_get_dma_addr - Get kernel dma address of page
+ *
+ * @q: Pointer to the tp4 queue that this frame resides in
+ * @pg: Pointer to the page of this frame
+ *
+ * Returns the dma address associated with the page
+ **/
+static inline dma_addr_t tp4q_get_dma_addr(struct tp4_queue *q, u64 pg)
+{
+	return q->dma_info[pg].dma;
+}
+
+/**
  * tp4q_get_desc - Get descriptor associated with frame
  *
  * @p: Pointer to the packet to examine
@@ -722,6 +735,18 @@ static inline u32 tp4f_get_frame_len(struct tp4_frame_set *p)
 }
 
 /**
+ * tp4f_get_data_offset - Get offset of packet data in packet buffer
+ * @p: pointer to frame set
+ *
+ * Returns the offset to the data in the packet buffer of the current
+ * frame
+ **/
+static inline u32 tp4f_get_data_offset(struct tp4_frame_set *p)
+{
+	return p->pkt_arr->items[p->curr & p->pkt_arr->mask].offset;
+}
+
+/**
  * tp4f_set_error - Set an error on the current frame
  * @p: pointer to frame set
  * @errno: the errno to be assigned
@@ -762,6 +787,41 @@ static inline void tp4f_set_frame(struct tp4_frame_set *p, u32 len, u16 offset,
 		d->flags |= TP4_PKT_CONT;
 }
 
+/**
+ * tp4f_set_frame_no_offset - Sets the properties of a frame
+ * @p: pointer to frame
+ * @len: the length in bytes of the data in the frame
+ * @is_eop: Set if this is the last frame of the packet
+ **/
+static inline void tp4f_set_frame_no_offset(struct tp4_frame_set *p,
+					    u32 len, bool is_eop)
+{
+	struct tpacket4_desc *d =
+		&p->pkt_arr->items[p->curr & p->pkt_arr->mask];
+
+	d->len = len;
+	if (!is_eop)
+		d->flags |= TP4_PKT_CONT;
+}
+
+/**
+ * tp4f_get_dma - Returns DMA address of the frame
+ * @f: pointer to frame
+ *
+ * Returns the DMA address of the frame
+ **/
+static inline dma_addr_t tp4f_get_dma(struct tp4_frame_set *f)
+{
+	struct tp4_queue *tp4q = f->pkt_arr->tp4q;
+	dma_addr_t dma;
+	u64 pg, off;
+
+	tp4q_get_page_offset(tp4q, tp4f_get_frame_id(f), &pg, &off);
+	dma = tp4q_get_dma_addr(tp4q, pg);
+
+	return dma + off + tp4f_get_data_offset(f);
+}
+
 /*************** PACKET OPERATIONS *******************************/
 /* A packet consists of one or more frames. Both frames and packets
  * are represented by a tp4_frame_set. The only difference is that
@@ -1023,6 +1083,31 @@ static inline bool tp4a_next_packet(struct tp4_packet_array *a,
 }
 
 /**
+ * tp4a_flush_n - Flush n processed packets to associated tp4q
+ * @a: pointer to packet array
+ * @n: number of items to flush
+ *
+ * Returns 0 for success and -1 for failure
+ **/
+static inline int tp4a_flush_n(struct tp4_packet_array *a, unsigned int n)
+{
+	u32 avail = a->curr - a->start;
+	int ret;
+
+	if (avail == 0 || n == 0)
+		return 0; /* nothing to flush */
+
+	avail = (n > avail) ? avail : n; /* XXX trust user? remove? */
+
+	ret = tp4q_enqueue_from_array(a, avail);
+	if (ret < 0)
+		return -1;
+
+	a->start += avail;
+	return 0;
+}
+
+/**
  * tp4a_flush_completed - Flushes only frames marked as completed
  * @a: pointer to packet array
  *
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 09/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (7 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 08/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 10/14] samples/tpacket4: added tpbench Björn Töpel
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

Here, egress support for TP4 is added by implementing
ndo_tp4_xmit. The ndo_tp4_xmit callback simply kicks the NAPI context.

In the NAPI poll, egress frames are pulled from userland and posted to
the hardware descriptor queue, and completed frames are cleared from
the egress hardware descriptor ring.

The clean_irq i40e_ring member is extended to include the Tx ring
clean up as well, resulting in some function signature changes for
i40e_clean_tx_irq.

As in the Rx case, we're not using i40e_tx_buffer for storing
metadata.
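
To summarize the flow implemented below (a condensed sketch using the
new function names, not the exact code):

  /* process context, called via ndo_tp4_xmit */
  i40e_tp4_xmit(netdev, queue_pair)
      set txr->tp4_xmit and, unless NAPI is already scheduled,
      kick it via i40e_force_wb()

  /* NAPI context, via the ring's clean_irq callback */
  i40e_clean_tx_tp4_irq(txr, budget)
      walk the ring up to the head write-back, count completed frames,
      flush them back to the userland array (tp4a_flush_n) and signal
      write space through tp4.ev_handler
      i40e_tp4_xmit_irq(txr)
          pull new packets from the userland array, post one Tx
          descriptor per frame and bump the tail register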

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e.h      |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c |  98 +++++++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 266 +++++++++++++++++++++++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |   4 +
 include/linux/tpacket4.h                    |  34 ++++
 5 files changed, 373 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 56dff7d314c4..b33b64b87725 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -745,7 +745,7 @@ struct i40e_vsi {
 	/* VSI specific handlers */
 	irqreturn_t (*irq_handler)(int irq, void *data);
 
-	struct i40e_tp4_ctx **tp4_ctxs; /* Rx context */
+	struct i40e_tp4_ctx **tp4_ctxs; /* Rx, Tx context */
 	u16 num_tp4_ctxs;
 } ____cacheline_internodealigned_in_smp;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5456ef6cce1b..ff6d44dae8d0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4830,12 +4830,14 @@ static void i40e_vsi_save_tp4_ctxs(struct i40e_vsi *vsi)
 				vsi->num_tp4_ctxs = vsi->num_queue_pairs;
 			}
 
-			vsi->tp4_ctxs[i] = kzalloc(sizeof(struct i40e_tp4_ctx),
+			vsi->tp4_ctxs[i] = kcalloc(2, /* rx, tx */
+						   sizeof(struct i40e_tp4_ctx),
 						   GFP_KERNEL);
 			if (!vsi->tp4_ctxs[i])
 				goto out_elmn;
 
-			*vsi->tp4_ctxs[i] = vsi->rx_rings[i]->tp4;
+			vsi->tp4_ctxs[i][0] = vsi->rx_rings[i]->tp4;
+			vsi->tp4_ctxs[i][1] = vsi->tx_rings[i]->tp4;
 		}
 	}
 
@@ -4897,15 +4899,22 @@ static void i40e_tp4_flush_all(struct tp4_packet_array *a)
  * @rx_ctx: the Rx TP4 context
  **/
 static void i40e_tp4_restore(struct i40e_vsi *vsi, int queue_pair,
-			     struct i40e_tp4_ctx *rx_ctx)
+			     struct i40e_tp4_ctx *rx_ctx,
+			     struct i40e_tp4_ctx *tx_ctx)
 {
 	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_ring *txr = vsi->tx_rings[queue_pair];
 
 	rxr->tp4 = *rx_ctx;
 	i40e_tp4_flush_all(rxr->tp4.arr);
 	i40e_tp4_set_rx_handler(rxr);
 
+	txr->tp4 = *tx_ctx;
+	i40e_tp4_flush_all(txr->tp4.arr);
+	txr->clean_irq = i40e_clean_tx_tp4_irq;
+
 	set_ring_tp4(rxr);
+	set_ring_tp4(txr);
 }
 
 /**
@@ -4923,7 +4932,8 @@ static void i40e_vsi_restore_tp4_ctxs(struct i40e_vsi *vsi)
 	for (i = 0; i < elms; i++) {
 		if (!vsi->tp4_ctxs[i])
 			continue;
-		i40e_tp4_restore(vsi, i, vsi->tp4_ctxs[i]);
+		i40e_tp4_restore(vsi, i, &vsi->tp4_ctxs[i][0],
+				 &vsi->tp4_ctxs[i][1]);
 	}
 
 	i40e_vsi_free_tp4_ctxs(vsi);
@@ -9337,6 +9347,7 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		ring->netdev = vsi->netdev;
 		ring->dev = &pf->pdev->dev;
 		ring->count = vsi->num_desc;
+		ring->clean_irq = i40e_clean_tx_irq;
 		ring->size = 0;
 		ring->dcb_tc = 0;
 		if (vsi->back->hw_features & I40E_HW_WB_ON_ITR_CAPABLE)
@@ -9354,6 +9365,7 @@ static int i40e_alloc_rings(struct i40e_vsi *vsi)
 		ring->netdev = NULL;
 		ring->dev = &pf->pdev->dev;
 		ring->count = vsi->num_desc;
+		ring->clean_irq = i40e_clean_tx_irq;
 		ring->size = 0;
 		ring->dcb_tc = 0;
 		if (vsi->back->hw_features & I40E_HW_WB_ON_ITR_CAPABLE)
@@ -11246,7 +11258,23 @@ static struct i40e_tp4_ctx *i40e_vsi_get_tp4_rx_ctx(struct i40e_vsi *vsi,
 	if (!vsi->tp4_ctxs)
 		return NULL;
 
-	return vsi->tp4_ctxs[queue_pair];
+	return &vsi->tp4_ctxs[queue_pair][0];
+}
+
+/**
+ * i40e_vsi_get_tp4_tx_ctx - Retrieves the Tx TP4 context, if any.
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ *
+ * Returns NULL if there's no context available.
+ **/
+static struct i40e_tp4_ctx *i40e_vsi_get_tp4_tx_ctx(struct i40e_vsi *vsi,
+						    int queue_pair)
+{
+	if (!vsi->tp4_ctxs)
+		return NULL;
+
+	return &vsi->tp4_ctxs[queue_pair][1];
 }
 
 /**
@@ -11271,6 +11299,24 @@ static void i40e_tp4_disable_rx(struct i40e_ring *rxr)
 }
 
 /**
+ * i40e_tp4_disable_tx - Disables TP4 Tx mode
+ * @txr: egress ring
+ **/
+static void i40e_tp4_disable_tx(struct i40e_ring *txr)
+{
+	/* Don't free, if the context is saved! */
+	if (i40e_vsi_get_tp4_tx_ctx(txr->vsi, txr->queue_index))
+		txr->tp4.arr = NULL;
+	else
+		tp4a_free(txr->tp4.arr);
+
+	memset(&txr->tp4, 0, sizeof(txr->tp4));
+	clear_ring_tp4(txr);
+
+	txr->clean_irq = i40e_clean_tx_irq;
+}
+
+/**
  * __i40e_tp4_disable - Disables TP4 for a queue pair
  * @vsi: vsi
  * @queue_pair: queue pair
@@ -11278,11 +11324,13 @@ static void i40e_tp4_disable_rx(struct i40e_ring *rxr)
 static void __i40e_tp4_disable(struct i40e_vsi *vsi, int queue_pair)
 {
 	struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+	struct i40e_ring *txr = vsi->tx_rings[queue_pair];
 
 	if (!i40e_qp_uses_tp4(vsi, queue_pair))
 		return;
 
 	i40e_tp4_disable_rx(rxr);
+	i40e_tp4_disable_tx(txr);
 }
 
 /**
@@ -11368,6 +11416,36 @@ static int i40e_tp4_enable_rx(struct i40e_ring *rxr,
 }
 
 /**
+ * i40e_tp4_enable_tx - Enables TP4 Tx
+ * @txr: egress ring
+ * @params: tp4 params
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_tp4_enable_tx(struct i40e_ring *txr,
+			      struct tp4_netdev_parms *params)
+{
+	size_t elems = __roundup_pow_of_two(txr->count * 8);
+	struct tp4_packet_array *arr;
+
+	arr = tp4a_tx_new(params->tx_opaque, elems, txr->dev);
+	if (!arr)
+		return -ENOMEM;
+
+	txr->tp4.arr = arr;
+	txr->tp4.ev_handler = params->write_space;
+	txr->tp4.ev_opaque = params->write_space_opaque;
+	txr->tp4.err_handler = params->error_report;
+	txr->tp4.err_opaque = params->error_report_opaque;
+
+	txr->clean_irq = i40e_clean_tx_tp4_irq;
+
+	set_ring_tp4(txr);
+
+	return 0;
+}
+
+/**
  * __i40e_tp4_enable - Enables TP4
  * @vsi: vsi
  * @params: tp4 params
@@ -11378,12 +11456,19 @@ static int __i40e_tp4_enable(struct i40e_vsi *vsi,
 			     struct tp4_netdev_parms *params)
 {
 	struct i40e_ring *rxr = vsi->rx_rings[params->queue_pair];
+	struct i40e_ring *txr = vsi->tx_rings[params->queue_pair];
 	int err;
 
 	err = i40e_tp4_enable_rx(rxr, params);
 	if (err)
 		return err;
 
+	err = i40e_tp4_enable_tx(txr, params);
+	if (err) {
+		i40e_tp4_disable_rx(rxr);
+		return err;
+	}
+
 	return 0;
 }
 
@@ -11414,7 +11499,7 @@ static int i40e_tp4_enable(struct net_device *netdev,
 	if (i40e_qp_uses_tp4(vsi, params->queue_pair))
 		return -EBUSY;
 
-	if (!params->rx_opaque)
+	if (!params->rx_opaque || !params->tx_opaque)
 		return -EINVAL;
 
 	err =  i40e_qp_disable(vsi, params->queue_pair);
@@ -11503,6 +11588,7 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_bridge_setlink	= i40e_ndo_bridge_setlink,
 	.ndo_xdp		= i40e_xdp,
 	.ndo_tp4_zerocopy	= i40e_tp4_zerocopy,
+	.ndo_tp4_xmit		= i40e_tp4_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 54c5b7975066..712e10e14aec 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -728,16 +728,50 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring)
 #define WB_STRIDE 4
 
 /**
+ * i40e_update_tx_stats_and_arm_wb - Update Tx stats and possibly arm writeback
+ * @txr: egress ring
+ * @tx_bytes: number of bytes sent
+ * @tx_packets: number of packets sent
+ * @done: true if writeback should be armed
+ **/
+static inline void i40e_update_tx_stats_and_arm_wb(struct i40e_ring *txr,
+						   unsigned int tx_bytes,
+						   unsigned int tx_packets,
+						   bool done)
+{
+	u64_stats_update_begin(&txr->syncp);
+	txr->stats.bytes += tx_bytes;
+	txr->stats.packets += tx_packets;
+	u64_stats_update_end(&txr->syncp);
+	txr->q_vector->tx.total_bytes += tx_bytes;
+	txr->q_vector->tx.total_packets += tx_packets;
+
+	if (txr->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+		/* check to see if there are < 4 descriptors
+		 * waiting to be written back, then kick the hardware to force
+		 * them to be written back in case we stay in NAPI.
+		 * In this mode on X722 we do not enable interrupts.
+		 */
+		unsigned int j = i40e_get_tx_pending(txr);
+
+		if (done &&
+		    ((j / WB_STRIDE) == 0) && j > 0 &&
+		    !test_bit(__I40E_VSI_DOWN, txr->vsi->state) &&
+		    (I40E_DESC_UNUSED(txr) != txr->count))
+			txr->arm_wb = true;
+	}
+}
+
+/**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
- * @vsi: the VSI we care about
  * @tx_ring: Tx ring to clean
  * @napi_budget: Used to determine if we are in netpoll
  *
  * Returns true if there's any budget left (e.g. the clean is finished)
  **/
-static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
-			      struct i40e_ring *tx_ring, int napi_budget)
+int i40e_clean_tx_irq(struct i40e_ring *tx_ring, int napi_budget)
 {
+	struct i40e_vsi *vsi = tx_ring->vsi;
 	u16 i = tx_ring->next_to_clean;
 	struct i40e_tx_buffer *tx_buf;
 	struct i40e_tx_desc *tx_head;
@@ -831,27 +865,9 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 	i += tx_ring->count;
 	tx_ring->next_to_clean = i;
-	u64_stats_update_begin(&tx_ring->syncp);
-	tx_ring->stats.bytes += total_bytes;
-	tx_ring->stats.packets += total_packets;
-	u64_stats_update_end(&tx_ring->syncp);
-	tx_ring->q_vector->tx.total_bytes += total_bytes;
-	tx_ring->q_vector->tx.total_packets += total_packets;
-
-	if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-		/* check to see if there are < 4 descriptors
-		 * waiting to be written back, then kick the hardware to force
-		 * them to be written back in case we stay in NAPI.
-		 * In this mode on X722 we do not enable Interrupt.
-		 */
-		unsigned int j = i40e_get_tx_pending(tx_ring);
 
-		if (budget &&
-		    ((j / WB_STRIDE) == 0) && (j > 0) &&
-		    !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-		    (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-			tx_ring->arm_wb = true;
-	}
+	i40e_update_tx_stats_and_arm_wb(tx_ring, total_bytes, total_packets,
+					budget);
 
 	if (ring_is_xdp(tx_ring))
 		return !!budget;
@@ -2454,10 +2470,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 	 * budget and be more aggressive about cleaning up the Tx descriptors.
 	 */
 	i40e_for_each_ring(ring, q_vector->tx) {
-		if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+		if (!ring->clean_irq(ring, budget)) {
 			clean_complete = false;
 			continue;
 		}
+
 		arm_wb |= ring->arm_wb;
 		ring->arm_wb = false;
 	}
@@ -3524,6 +3541,7 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_pf *pf = vsi->back;
 	struct i40e_ring *tx_ring = vsi->tx_rings[skb->queue_mapping];
 
 	/* hardware can't handle really short frames, hardware padding works
@@ -3532,6 +3550,18 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	if (skb_put_padto(skb, I40E_MIN_TX_LEN))
 		return NETDEV_TX_OK;
 
+	if (unlikely(ring_uses_tp4(tx_ring) ||
+		     test_bit(__I40E_CONFIG_BUSY, pf->state))) {
+		/* XXX ndo_select_queue is being deprecated, so we
+		 * need another method for routing stack originated
+		 * packets away from the TP4 ring.
+		 *
+		 * For now, silently drop the skbuff.
+		 */
+		kfree_skb(skb);
+		return NETDEV_TX_OK;
+	}
+
 	return i40e_xmit_frame_ring(skb, tx_ring);
 }
 
@@ -3582,3 +3612,191 @@ bool i40e_alloc_rx_buffers_tp4(struct i40e_ring *rxr, u16 cleaned_count)
 	return ret;
 }
 
+/**
+ * i40e_napi_is_scheduled - If NAPI is scheduled, set NAPIF_STATE_MISSED
+ * @n: napi context
+ *
+ * Returns true if NAPI is scheduled.
+ **/
+static bool i40e_napi_is_scheduled(struct napi_struct *n)
+{
+	unsigned long val, new;
+
+	do {
+		val = READ_ONCE(n->state);
+		if (val & NAPIF_STATE_DISABLE)
+			return true;
+
+		if (!(val & NAPIF_STATE_SCHED))
+			return false;
+
+		new = val | NAPIF_STATE_MISSED;
+	} while (cmpxchg(&n->state, val, new) != val);
+
+	return true;
+}
+
+/**
+ * i40e_tp4_xmit - ndo_tp4_xmit implementation
+ * @netdev: netdev
+ * @queue_pair: queue_pair
+ *
+ * Returns >=0 on success, <0 on failure.
+ **/
+int i40e_tp4_xmit(struct net_device *netdev, int queue_pair)
+{
+	struct i40e_netdev_priv *np = netdev_priv(netdev);
+	struct i40e_vsi *vsi = np->vsi;
+	struct i40e_ring *txr;
+
+	if (test_bit(__I40E_VSI_DOWN, vsi->state))
+		return -EAGAIN;
+
+	txr = vsi->tx_rings[queue_pair];
+	if (!ring_uses_tp4(txr))
+		return -EINVAL;
+
+	WRITE_ONCE(txr->tp4_xmit, 1);
+	if (!i40e_napi_is_scheduled(&txr->q_vector->napi))
+		i40e_force_wb(vsi, txr->q_vector);
+
+	return 0;
+}
+
+/**
+ * i40e_tp4_xmit_irq - Pull packets from userland, post them to the HW ring
+ * @txr: ingress ring
+ *
+ * Returns true if there no more work to be done.
+ **/
+static bool i40e_tp4_xmit_irq(struct i40e_ring *txr)
+{
+	struct i40e_tx_desc *txd;
+	struct tp4_frame_set pkt;
+	u32 size, td_cmd;
+	bool done = true;
+	int cleaned = 0;
+	dma_addr_t dma;
+	u16 unused;
+
+	if (READ_ONCE(txr->tp4_xmit)) {
+		tp4a_populate(txr->tp4.arr);
+		WRITE_ONCE(txr->tp4_xmit, 0);
+	}
+
+	for (;;) {
+		if (!tp4a_next_packet(txr->tp4.arr, &pkt)) {
+			if (cleaned == 0)
+				return true;
+			break;
+		}
+
+		unused = I40E_DESC_UNUSED(txr);
+		if (unused < tp4f_num_frames(&pkt)) {
+			tp4a_return_packet(txr->tp4.arr, &pkt);
+			done = false;
+			break;
+		}
+
+		do {
+			dma = tp4f_get_dma(&pkt);
+			size = tp4f_get_frame_len(&pkt);
+			dma_sync_single_for_device(txr->dev, dma, size,
+						   DMA_TO_DEVICE);
+
+			txd = I40E_TX_DESC(txr, txr->next_to_use);
+			txd->buffer_addr = cpu_to_le64(dma);
+
+			td_cmd = I40E_TX_DESC_CMD_ICRC | I40E_TX_DESC_CMD_RS;
+			if (tp4f_is_last_frame(&pkt))
+				td_cmd |= I40E_TX_DESC_CMD_EOP;
+
+			txd->cmd_type_offset_bsz = build_ctob(td_cmd, 0,
+							      size, 0);
+
+			cleaned++;
+			txr->next_to_use++;
+			if (txr->next_to_use == txr->count)
+				txr->next_to_use = 0;
+
+		} while (tp4f_next_frame(&pkt));
+	}
+
+	/* Force memory writes to complete before letting h/w know
+	 * there are new descriptors to fetch.
+	 */
+	wmb();
+	writel(txr->next_to_use, txr->tail);
+
+	return done;
+}
+
+/**
+ * i40e_inc_tx_next_to_clean - Bumps the next to clean
+ * @ring: egress ring
+ **/
+static inline void i40e_inc_tx_next_to_clean(struct i40e_ring *ring)
+{
+	u32 ntc;
+
+	ntc = ring->next_to_clean + 1;
+	ntc = (ntc < ring->count) ? ntc : 0;
+	ring->next_to_clean = ntc;
+
+	prefetch(I40E_TX_DESC(ring, ntc));
+}
+
+/**
+ * i40e_clean_tx_tp4_irq - Cleans the egress ring for completed packets
+ * @txr: egress ring
+ * @budget: napi budget
+ *
+ * Returns >0 if there's no more work to be done.
+ **/
+int i40e_clean_tx_tp4_irq(struct i40e_ring *txr, int budget)
+{
+	int total_tx_bytes = 0, total_tx_packets = 0;
+	struct i40e_tx_desc *txd, *txdh;
+	struct tp4_frame_set frame_set;
+	bool clean_done, xmit_done;
+
+	budget = txr->vsi->work_limit;
+
+	if (!tp4a_get_flushable_frame_set(txr->tp4.arr, &frame_set)) {
+		clean_done = true;
+		goto xmit;
+	}
+
+	txdh = I40E_TX_DESC(txr, i40e_get_head(txr));
+
+	while (total_tx_packets < budget) {
+		txd = I40E_TX_DESC(txr, txr->next_to_clean);
+		if (txdh == txd)
+			break;
+
+		txd->buffer_addr = 0;
+		txd->cmd_type_offset_bsz = 0;
+
+		total_tx_packets++;
+		total_tx_bytes += tp4f_get_frame_len(&frame_set);
+
+		i40e_inc_tx_next_to_clean(txr);
+
+		if (!tp4f_next_frame(&frame_set))
+			break;
+	}
+
+	WARN_ON(tp4a_flush_n(txr->tp4.arr, total_tx_packets));
+	clean_done = (total_tx_packets < budget);
+
+	txr->tp4.ev_handler(txr->tp4.ev_opaque);
+
+	i40e_update_tx_stats_and_arm_wb(txr,
+					total_tx_bytes,
+					total_tx_packets,
+					clean_done);
+xmit:
+	xmit_done = i40e_tp4_xmit_irq(txr);
+
+	return clean_done && xmit_done;
+}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 602dcd111938..b50215ddabd1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -430,6 +430,7 @@ struct i40e_ring {
 
 	struct rcu_head rcu;		/* to avoid race on free */
 	u16 next_to_alloc;
+	int tp4_xmit;
 	struct sk_buff *skb;		/* When i40e_clean_rx_ring_irq() must
 					 * return before it sees the EOP for
 					 * the current packet, we save that skb
@@ -520,9 +521,12 @@ static inline unsigned int i40e_rx_pg_order(struct i40e_ring *ring)
 
 bool i40e_alloc_rx_buffers(struct i40e_ring *rxr, u16 cleaned_count);
 int i40e_clean_rx_irq(struct i40e_ring *rxr, int budget);
+int i40e_clean_tx_irq(struct i40e_ring *tx_ring, int napi_budget);
 bool i40e_alloc_rx_buffers_tp4(struct i40e_ring *rxr, u16 cleaned_count);
 int i40e_clean_rx_tp4_irq(struct i40e_ring *rxr, int budget);
+int i40e_clean_tx_tp4_irq(struct i40e_ring *txr, int napi_budget);
 netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+int i40e_tp4_xmit(struct net_device *dev, int queue_pair);
 void i40e_clean_tx_ring(struct i40e_ring *tx_ring);
 void i40e_clean_rx_ring(struct i40e_ring *rx_ring);
 int i40e_setup_tx_descriptors(struct i40e_ring *tx_ring);
diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index 80bc20543599..beaf23f713eb 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -757,6 +757,28 @@ static inline void tp4f_set_error(struct tp4_frame_set *p, int errno)
 }
 
 /**
+ * tp4f_is_last_frame - Is this the last frame of the frame set
+ * @p: pointer to frame set
+ *
+ * Returns true if this is the last frame of the frame set, otherwise false
+ **/
+static inline bool tp4f_is_last_frame(struct tp4_frame_set *p)
+{
+	return p->curr + 1 == p->end;
+}
+
+/**
+ * tp4f_num_frames - Number of frames in a frame set
+ * @p: pointer to frame set
+ *
+ * Returns the number of frames this frame set consists of
+ **/
+static inline u32 tp4f_num_frames(struct tp4_frame_set *p)
+{
+	return p->end - p->start;
+}
+
+/**
  * tp4f_get_data - Gets a pointer to the frame the frame set is on
  * @p: pointer to the frame set
  *
@@ -1165,4 +1187,16 @@ static inline bool tp4a_next_frame_populate(struct tp4_packet_array *a,
 	return more_frames;
 }
 
+/**
+ * tp4a_return_packet - Return packet to the packet array
+ *
+ * @a: pointer to packet array
+ * @p: pointer to the packet to return
+ **/
+static inline void tp4a_return_packet(struct tp4_packet_array *a,
+				      struct tp4_frame_set *p)
+{
+	a->curr = p->start;
+}
+
 #endif /* _LINUX_TPACKET4_H */
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 10/14] samples/tpacket4: added tpbench
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (8 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 09/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 11/14] veth: added support for PACKET_ZEROCOPY Björn Töpel
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

The tpbench program benchmarks TPACKET_V2 through
TPACKET_V4. There's a bench_all.sh script that makes testing all
versions easier.
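
For instance, running the full sweep over V2/V3/V4 in both copy and
zero-copy mode (interface, core, duration and zero-copy queue are set
at the top of the script) is just:

  ./bench_all.sh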

Note that zero-copy means binding the TPACKET_V4 socket to a specific
NIC hardware queue, so you'll need to steer your traffic to that
queue. Say that you'd like your UDP traffic from port 4242 to end up
in queue 16. Here, we use ethtool for this:

  ethtool -N p3p2 rx-flow-hash udp4 fn
  ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
      action 16

Running the benchmark in zero-copy mode can then be done using:

  taskset -c 16 ./tpbench -i p3p2 --rxdrop --zerocopy 17

Note that the queue number passed to --zerocopy is one-based and not
zero-based, which is why ethtool's "action 16" maps to "--zerocopy 17"
above.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 samples/tpacket4/Makefile     |   12 +
 samples/tpacket4/bench_all.sh |   28 +
 samples/tpacket4/tpbench.c    | 1253 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1293 insertions(+)
 create mode 100644 samples/tpacket4/Makefile
 create mode 100755 samples/tpacket4/bench_all.sh
 create mode 100644 samples/tpacket4/tpbench.c

diff --git a/samples/tpacket4/Makefile b/samples/tpacket4/Makefile
new file mode 100644
index 000000000000..1dd731ffe3e9
--- /dev/null
+++ b/samples/tpacket4/Makefile
@@ -0,0 +1,12 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := tpbench
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_tpbench.o += -I$(objtree)/usr/include
+
+all: tpbench
diff --git a/samples/tpacket4/bench_all.sh b/samples/tpacket4/bench_all.sh
new file mode 100755
index 000000000000..8d7ee17e1682
--- /dev/null
+++ b/samples/tpacket4/bench_all.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+
+DIR=`dirname "${BASH_SOURCE[0]}"`
+
+IF=p3p2
+DURATION=60
+CORE=14
+ZC=17
+
+echo "You might want to change the parameters in ${BASH_SOURCE[0]}"
+echo "${IF} cpu${CORE} duration ${DURATION}s zc ${ZC}"
+
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=2 --rxdrop
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=3 --rxdrop
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=4 --rxdrop
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=4 --rxdrop --zerocopy ${ZC}
+
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=2 --txonly
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=3 --txonly
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=4 --txonly
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=4 --txonly --zerocopy ${ZC}
+
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=2 --l2fwd
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=3 --l2fwd
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=4 --l2fwd
+sudo taskset -c ${CORE} timeout -s int ${DURATION} ${DIR}/tpbench -i ${IF} --version=4 --l2fwd --zerocopy ${ZC}
+
+
diff --git a/samples/tpacket4/tpbench.c b/samples/tpacket4/tpbench.c
new file mode 100644
index 000000000000..46fb83009e06
--- /dev/null
+++ b/samples/tpacket4/tpbench.c
@@ -0,0 +1,1253 @@
+/*
+ *  tpbench
+ *  Copyright(c) 2017 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <getopt.h>
+#include <limits.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/ether.h>
+#include <netinet/ip.h>
+#include <netinet/udp.h>
+#include <poll.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/shm.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+#define BATCH_SIZE 64 /* process pace */
+
+#define NUM_BUFFERS 131072
+#define FRAME_SIZE 2048
+
+#define BLOCK_SIZE (1 << 22) /* V2/V3 */
+#define NUM_DESCS 4096 /* V4 */
+
+static unsigned long rx_npkts;
+static unsigned long tx_npkts;
+static unsigned long start_time;
+
+/* cli options */
+enum tpacket_version {
+	PV2 = 0,
+	PV3 = 1,
+	PV4 = 2,
+};
+
+enum benchmark_type {
+	BENCH_RXDROP = 0,
+	BENCH_TXONLY = 1,
+	BENCH_L2FWD = 2,
+};
+
+static enum tpacket_version opt_tpver = PV4;
+static enum benchmark_type opt_bench = BENCH_RXDROP;
+static const char *opt_if = "";
+static int opt_zerocopy;
+
+struct tpacket2_queue {
+	void *ring;
+
+	unsigned int last_used_idx;
+	unsigned int ring_size;
+	unsigned int frame_size_log2;
+};
+
+struct tp2_queue_pair {
+	struct tpacket2_queue rx;
+	struct tpacket2_queue tx;
+	int sfd;
+	const char *interface_name;
+};
+
+struct tpacket3_rx_queue {
+	void *ring;
+	struct tpacket3_hdr *frames[BATCH_SIZE];
+
+	unsigned int last_used_idx;
+	unsigned int ring_size; /* NB! blocks, not frames */
+	unsigned int block_size_log2;
+
+	struct tpacket3_hdr *last_frame;
+	unsigned int npkts; /* >0 in block */
+};
+
+struct tp3_queue_pair {
+	struct tpacket3_rx_queue rx;
+	struct tpacket2_queue tx;
+	int sfd;
+	const char *interface_name;
+};
+
+struct tp4_umem {
+	char *buffer;
+	size_t size;
+	unsigned int frame_size;
+	unsigned int frame_size_log2;
+	unsigned int nframes;
+	int mr_fd;
+	unsigned long free_stack[NUM_BUFFERS];
+	unsigned int free_stack_idx;
+};
+
+struct tp4_queue_pair {
+	struct tpacket4_queue rx;
+	struct tpacket4_queue tx;
+	int sfd;
+	const char *interface_name;
+	struct tp4_umem *umem;
+};
+
+struct benchmark {
+	void *		(*configure)(const char *interface_name);
+	void		(*rx)(void *queue_pair, unsigned int *start,
+			      unsigned int *end);
+	void *		(*get_data)(void *queue_pair, unsigned int idx,
+				    unsigned int *len);
+	unsigned long	(*get_data_desc)(void *queue_pair, unsigned int idx,
+					 unsigned int *len,
+					 unsigned short *offset);
+	void		(*set_data_desc)(void *queue_pair, unsigned int idx,
+					 unsigned long didx);
+	void		(*process)(void *queue_pair, unsigned int start,
+				   unsigned int end);
+	void		(*rx_release)(void *queue_pair, unsigned int start,
+				      unsigned int end);
+	void		(*tx)(void *queue_pair, unsigned int start,
+			      unsigned int end);
+};
+
+static char tx_frame[1024];
+static unsigned int tx_frame_len;
+static struct benchmark benchmark;
+
+#define lassert(expr)							\
+	do {								\
+		if (!(expr)) {						\
+			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
+				#expr ": errno: %d/\"%s\"\n",		\
+				__FILE__, __func__, __LINE__,		\
+				errno, strerror(errno));		\
+			exit(EXIT_FAILURE);				\
+		}							\
+	} while (0)
+
+#define barrier() __asm__ __volatile__("" : : : "memory")
+#define u_smp_rmb() barrier()
+#define u_smp_wmb() barrier()
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+#define log2(x)							\
+	((unsigned int)(8 * sizeof(unsigned long long) -	\
+			__builtin_clzll((x)) - 1))
+
+#if 0
+static void hex_dump(void *pkt, size_t length, const char *prefix)
+{
+	int i = 0;
+	const unsigned char *address = (unsigned char *)pkt;
+	const unsigned char *line = address;
+	size_t line_size = 32;
+	unsigned char c;
+
+	printf("%s | ", prefix);
+	while (length-- > 0) {
+		printf("%02X ", *address++);
+		if (!(++i % line_size) || (length == 0 && i % line_size)) {
+			if (length == 0) {
+				while (i++ % line_size)
+					printf("__ ");
+			}
+			printf(" | ");	/* right close */
+			while (line < address) {
+				c = *line++;
+				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
+			}
+			printf("\n");
+			if (length > 0)
+				printf("%s | ", prefix);
+		}
+	}
+	printf("\n");
+}
+#endif
+
+static size_t gen_eth_frame(char *frame, int data)
+{
+	static const char d[] =
+		"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
+		"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
+		"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
+		"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
+
+	(void)data;
+	memcpy(frame, d, sizeof(d) - 1);
+	return sizeof(d) - 1;
+
+#if 0
+	/* XXX This generates "multicast packets" */
+	struct ether_header *eh = (struct ether_header *)frame;
+	size_t len = sizeof(struct ether_header);
+	int i;
+
+	for (i = 0; i < 6; i++) {
+		eh->ether_shost[i] = i + 0x01;
+		eh->ether_dhost[i] = i + 0x11;
+	}
+	eh->ether_type = htons(ETH_P_IP);
+
+	for (i = 0; i < 46; i++)
+		frame[len++] = data;
+
+	return len;
+#endif
+}
+
+static void setup_tx_frame(void)
+{
+	tx_frame_len = gen_eth_frame(tx_frame, 42);
+}
+
+static void swap_mac_addresses(void *data)
+{
+	struct ether_header *eth = (struct ether_header *)data;
+	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
+	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
+	struct ether_addr tmp;
+
+	tmp = *src_addr;
+	*src_addr = *dst_addr;
+	*dst_addr = tmp;
+}
+
+static void rx_dummy(void *queue_pair, unsigned int *start, unsigned int *end)
+{
+	(void)queue_pair;
+	*start = 0;
+	*end = BATCH_SIZE;
+}
+
+static void rx_release_dummy(void *queue_pair, unsigned int start,
+			     unsigned int end)
+{
+	(void)queue_pair;
+	(void)start;
+	(void)end;
+}
+
+static void *get_data_dummy(void *queue_pair, unsigned int idx,
+			    unsigned int *len)
+{
+	(void)queue_pair;
+	(void)idx;
+
+	*len = tx_frame_len;
+
+	return tx_frame;
+}
+
+#if 0
+static void process_hexdump(void *queue_pair, unsigned int start,
+			    unsigned int end)
+{
+	unsigned int len;
+	void *data;
+
+	while (start != end) {
+		data = benchmark.get_data(queue_pair, start, &len);
+		hex_dump(data, len, "Rx:");
+		start++;
+	}
+}
+#endif
+
+static void process_swap_mac(void *queue_pair, unsigned int start,
+			     unsigned int end)
+{
+	unsigned int len;
+	void *data;
+
+	while (start != end) {
+		data = benchmark.get_data(queue_pair, start, &len);
+		swap_mac_addresses(data);
+		start++;
+	}
+}
+
+static void run_benchmark(const char *interface_name)
+{
+	unsigned int start, end;
+	struct tp2_queue_pair *qp;
+
+	qp = benchmark.configure(interface_name);
+
+	for (;;) {
+		for (;;) {
+			benchmark.rx(qp, &start, &end);
+			if ((end - start) > 0)
+				break;
+			// XXX
+			//if (poll)
+			//	poll();
+		}
+
+		if (benchmark.process)
+			benchmark.process(qp, start, end);
+
+		benchmark.tx(qp, start, end);
+	}
+}
+
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+
+static void *tp2_configure(const char *interface_name)
+{
+	int sfd, noqdisc, ret, ver = TPACKET_V2;
+	struct tp2_queue_pair *tqp;
+	struct tpacket_req req = {};
+	struct sockaddr_ll ll;
+	void *rxring;
+
+	/* create PF_PACKET socket */
+	sfd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	lassert(sfd >= 0);
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
+	lassert(ret == 0);
+
+	tqp = calloc(1, sizeof(*tqp));
+	lassert(tqp);
+
+	tqp->sfd = sfd;
+	tqp->interface_name = interface_name;
+
+	req.tp_block_size = BLOCK_SIZE;
+	req.tp_frame_size = FRAME_SIZE;
+	req.tp_block_nr = NUM_BUFFERS * FRAME_SIZE / BLOCK_SIZE;
+	req.tp_frame_nr = req.tp_block_nr * BLOCK_SIZE / FRAME_SIZE;
+
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+
+	rxring = mmap(0, 2 * req.tp_block_size * req.tp_block_nr,
+		      PROT_READ | PROT_WRITE,
+		      MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sfd, 0);
+	lassert(rxring != MAP_FAILED);
+
+	tqp->rx.ring = rxring;
+	tqp->rx.ring_size = NUM_BUFFERS;
+	tqp->rx.frame_size_log2 = log2(req.tp_frame_size);
+
+	tqp->tx.ring = rxring + req.tp_block_size * req.tp_block_nr;
+	tqp->tx.ring_size = NUM_BUFFERS;
+	tqp->tx.frame_size_log2 = log2(req.tp_frame_size);
+
+	ll.sll_family = PF_PACKET;
+	ll.sll_protocol = htons(ETH_P_ALL);
+	ll.sll_ifindex = if_nametoindex(interface_name);
+	ll.sll_hatype = 0;
+	ll.sll_pkttype = 0;
+	ll.sll_halen = 0;
+
+	noqdisc = 1;
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_QDISC_BYPASS,
+			 &noqdisc, sizeof(noqdisc));
+	lassert(ret == 0);
+
+	ret = bind(sfd, (struct sockaddr *)&ll, sizeof(ll));
+	lassert(ret == 0);
+
+	setup_tx_frame();
+
+	return tqp;
+}
+
+static void tp2_rx(void *queue_pair, unsigned int *start, unsigned int *end)
+{
+	struct tpacket2_queue *rxq = &((struct tp2_queue_pair *)queue_pair)->rx;
+	unsigned int batch = 0;
+
+	*start = rxq->last_used_idx;
+	*end = rxq->last_used_idx;
+
+	for (;;) {
+		unsigned int idx = *end & (rxq->ring_size - 1);
+		struct tpacket2_hdr *hdr;
+
+		hdr = (struct tpacket2_hdr *)(rxq->ring +
+					      (idx << rxq->frame_size_log2));
+		if ((hdr->tp_status & TP_STATUS_USER) != TP_STATUS_USER)
+			break;
+
+		(*end)++;
+		if (++batch == BATCH_SIZE)
+			break;
+	}
+
+	rxq->last_used_idx = *end;
+	rx_npkts += (*end - *start);
+
+	/* status before data */
+	u_smp_rmb();
+}
+
+static void tp2_rx_release(void *queue_pair, unsigned int start,
+			   unsigned int end)
+{
+	struct tpacket2_queue *rxq = &((struct tp2_queue_pair *)queue_pair)->rx;
+	struct tpacket2_hdr *hdr;
+
+	while (start != end) {
+		hdr = (struct tpacket2_hdr *)(rxq->ring +
+					      ((start & (rxq->ring_size - 1))
+					       << rxq->frame_size_log2));
+
+		hdr->tp_status = TP_STATUS_KERNEL;
+		start++;
+	}
+}
+
+static void *tp2_get_data(void *queue_pair, unsigned int idx, unsigned int *len)
+{
+	struct tpacket2_queue *rxq = &((struct tp2_queue_pair *)queue_pair)->rx;
+	struct tpacket2_hdr *hdr;
+
+	hdr = (struct tpacket2_hdr *)(rxq->ring + ((idx & (rxq->ring_size - 1))
+						   << rxq->frame_size_log2));
+	*len = hdr->tp_snaplen;
+
+	return (char *)hdr + hdr->tp_mac;
+}
+
+static void tp2_tx(void *queue_pair, unsigned int start, unsigned int end)
+{
+	struct tp2_queue_pair *qp = queue_pair;
+	struct tpacket2_queue *txq = &qp->tx;
+	unsigned int len, curr = start;
+	void *data;
+	int ret;
+
+	while (curr != end) {
+		unsigned int idx = txq->last_used_idx & (txq->ring_size - 1);
+		struct tpacket2_hdr *hdr;
+
+		hdr = (struct tpacket2_hdr *)(txq->ring +
+					      (idx << txq->frame_size_log2));
+		if (hdr->tp_status &
+		    (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) {
+			break;
+		}
+
+		data = benchmark.get_data(queue_pair, curr, &len);
+
+		hdr->tp_snaplen = len;
+		hdr->tp_len = len;
+		memcpy((char *)hdr + TPACKET2_HDRLEN -
+		       sizeof(struct sockaddr_ll), data, len);
+
+		u_smp_wmb();
+
+		hdr->tp_status = TP_STATUS_SEND_REQUEST;
+
+		txq->last_used_idx++;
+		curr++;
+	}
+
+	ret = sendto(qp->sfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!(ret >= 0 || errno == EAGAIN || errno == ENOBUFS))
+		lassert(0);
+
+	benchmark.rx_release(queue_pair, start, end);
+
+	tx_npkts += (curr - start);
+}
+
+static void *tp3_configure(const char *interface_name)
+{
+	int sfd, noqdisc, ret, ver = TPACKET_V3;
+	struct tp3_queue_pair *tqp;
+	struct tpacket_req3 req = {};
+	struct sockaddr_ll ll;
+	void *rxring;
+
+	unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+	unsigned int blocknum = 64;
+
+	/* create PF_PACKET socket */
+	sfd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	lassert(sfd >= 0);
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
+	lassert(ret == 0);
+
+	tqp = calloc(1, sizeof(*tqp));
+	lassert(tqp);
+
+	tqp->sfd = sfd;
+	tqp->interface_name = interface_name;
+
+	/* XXX is it unfair to have 2 frames per block in V3? */
+	req.tp_block_size = BLOCK_SIZE;
+	req.tp_frame_size = FRAME_SIZE;
+	req.tp_block_nr = NUM_BUFFERS * FRAME_SIZE / BLOCK_SIZE;
+	req.tp_frame_nr = req.tp_block_nr * BLOCK_SIZE / FRAME_SIZE;
+	req.tp_retire_blk_tov = 0;
+	req.tp_sizeof_priv = 0;
+	req.tp_feature_req_word = 0;
+
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+
+	rxring = mmap(0, 2 * req.tp_block_size * req.tp_block_nr,
+		      PROT_READ | PROT_WRITE,
+		      MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sfd, 0);
+	lassert(rxring != MAP_FAILED);
+
+	tqp->rx.ring = rxring;
+	tqp->rx.ring_size = blocknum;
+	tqp->rx.block_size_log2 = log2(blocksiz);
+
+	tqp->tx.ring = rxring + req.tp_block_size * req.tp_block_nr;
+	tqp->tx.ring_size = (blocksiz * blocknum) / framesiz;
+	tqp->tx.frame_size_log2 = log2(req.tp_frame_size);
+
+	ll.sll_family = PF_PACKET;
+	ll.sll_protocol = htons(ETH_P_ALL);
+	ll.sll_ifindex = if_nametoindex(interface_name);
+	ll.sll_hatype = 0;
+	ll.sll_pkttype = 0;
+	ll.sll_halen = 0;
+
+	noqdisc = 1;
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_QDISC_BYPASS,
+			 &noqdisc, sizeof(noqdisc));
+	lassert(ret == 0);
+
+	ret = bind(sfd, (struct sockaddr *)&ll, sizeof(ll));
+	lassert(ret == 0);
+
+	setup_tx_frame();
+
+	return tqp;
+}
+
+static void tp3_rx(void *queue_pair, unsigned int *start, unsigned int *end)
+{
+	struct tpacket3_rx_queue *rxq =
+		&((struct tp3_queue_pair *)queue_pair)->rx;
+	unsigned int i, npkts = BATCH_SIZE;
+	struct tpacket_block_desc *bd;
+	bool no_more_frames = false;
+
+	*start = 0;
+	*end = 0;
+
+	if (rxq->last_frame) {
+		if (rxq->npkts <= BATCH_SIZE) {
+			no_more_frames = true;
+			npkts = rxq->npkts;
+		}
+
+		for (i = 0; i < npkts; i++) {
+			rxq->last_frame = (struct tpacket3_hdr *)
+					  ((char *)rxq->last_frame +
+					   rxq->last_frame->tp_next_offset);
+			rxq->frames[i] = rxq->last_frame;
+		}
+
+		if (no_more_frames)
+			rxq->last_frame = NULL;
+
+		rxq->npkts -= npkts;
+		*end = npkts;
+		rx_npkts += npkts;
+
+		return;
+	}
+
+	bd = (struct tpacket_block_desc *)
+	     (rxq->ring + ((rxq->last_used_idx & (rxq->ring_size - 1))
+			   << rxq->block_size_log2));
+	if ((bd->hdr.bh1.block_status & TP_STATUS_USER) != TP_STATUS_USER)
+		return;
+
+	u_smp_rmb();
+
+	rxq->npkts = bd->hdr.bh1.num_pkts;
+	if (rxq->npkts <= BATCH_SIZE) {
+		no_more_frames = true;
+		npkts = rxq->npkts;
+	}
+
+	rxq->last_frame = (struct tpacket3_hdr *)
+			  ((char *)bd + bd->hdr.bh1.offset_to_first_pkt);
+	rxq->frames[0] = rxq->last_frame;
+	for (i = 1; i < npkts; i++) {
+		rxq->last_frame = (struct tpacket3_hdr *)
+				  ((char *)rxq->last_frame +
+				   rxq->last_frame->tp_next_offset);
+		rxq->frames[i] = rxq->last_frame;
+	}
+
+	if (no_more_frames)
+		rxq->last_frame = NULL;
+
+	*end = npkts;
+	rx_npkts += npkts;
+}
+
+static void tp3_rx_release(void *queue_pair, unsigned int start,
+			   unsigned int end)
+{
+	struct tpacket3_rx_queue *rxq =
+		&((struct tp3_queue_pair *)queue_pair)->rx;
+	struct tpacket_block_desc *bd;
+
+	(void)start;
+	(void)end;
+
+	if (rxq->last_frame)
+		return;
+
+	bd = (struct tpacket_block_desc *)
+	     (rxq->ring + ((rxq->last_used_idx & (rxq->ring_size - 1))
+			   << rxq->block_size_log2));
+
+	bd->hdr.bh1.block_status = TP_STATUS_KERNEL;
+	rxq->last_used_idx++;
+}
+
+static void *tp3_get_data(void *queue_pair, unsigned int idx, unsigned int *len)
+{
+	struct tpacket3_rx_queue *rxq =
+		&((struct tp3_queue_pair *)queue_pair)->rx;
+	struct tpacket3_hdr *hdr = rxq->frames[idx];
+
+	*len = hdr->tp_snaplen;
+
+	return (char *)hdr + hdr->tp_mac;
+}
+
+static void tp3_tx(void *queue_pair, unsigned int start, unsigned int end)
+{
+	struct tp3_queue_pair *qp = queue_pair;
+	struct tpacket2_queue *txq = &qp->tx;
+	unsigned int len, curr = start;
+	void *data;
+	int ret;
+
+	while (curr != end) {
+		unsigned int idx = txq->last_used_idx & (txq->ring_size - 1);
+		struct tpacket3_hdr *hdr;
+
+		hdr = (struct tpacket3_hdr *)(txq->ring +
+					      (idx << txq->frame_size_log2));
+		if (hdr->tp_status &
+		    (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) {
+			break;
+		}
+
+		data = benchmark.get_data(queue_pair, curr, &len);
+
+		hdr->tp_snaplen = len;
+		hdr->tp_len = len;
+		memcpy((char *)hdr + TPACKET3_HDRLEN -
+		       sizeof(struct sockaddr_ll), data, len);
+
+		u_smp_wmb();
+
+		hdr->tp_status = TP_STATUS_SEND_REQUEST;
+
+		txq->last_used_idx++;
+		curr++;
+	}
+
+	ret = sendto(qp->sfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!(ret >= 0 || errno == EAGAIN || errno == ENOBUFS))
+		lassert(0);
+
+	benchmark.rx_release(queue_pair, start, end);
+
+	tx_npkts += (curr - start);
+}
+
+static inline void push_free_stack(struct tp4_umem *umem, unsigned long idx)
+{
+	umem->free_stack[--umem->free_stack_idx] = idx;
+}
+
+static inline unsigned long pop_free_stack(struct tp4_umem *umem)
+{
+	return	umem->free_stack[umem->free_stack_idx++];
+}
+
+static struct tp4_umem *alloc_and_register_buffers(size_t nbuffers)
+{
+	struct tpacket_memreg_req req = { .frame_size = FRAME_SIZE };
+	struct tp4_umem *umem;
+	size_t i;
+	int fd, ret;
+	void *bufs;
+
+	ret = posix_memalign((void **)&bufs, getpagesize(),
+			     nbuffers * req.frame_size);
+	lassert(ret == 0);
+
+	umem = calloc(1, sizeof(*umem));
+	lassert(umem);
+	fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	lassert(fd > 0);
+	req.addr = (unsigned long)bufs;
+	req.len = nbuffers * req.frame_size;
+	ret = setsockopt(fd, SOL_PACKET, PACKET_MEMREG, &req, sizeof(req));
+	lassert(ret == 0);
+
+	umem->frame_size = FRAME_SIZE;
+	umem->frame_size_log2 = log2(FRAME_SIZE);
+	umem->buffer = bufs;
+	umem->size = nbuffers * req.frame_size;
+	umem->nframes = nbuffers;
+	umem->mr_fd = fd;
+
+	for (i = 0; i < nbuffers; i++)
+		umem->free_stack[i] = i;
+
+	for (i = 0; i < nbuffers; i++) {
+		tx_frame_len = gen_eth_frame(bufs, 42);
+		bufs += FRAME_SIZE;
+	}
+
+	return umem;
+}
+
+static inline int tp4q_enqueue(struct tpacket4_queue *q,
+			       const struct tpacket4_desc *d,
+			       unsigned int dcnt)
+{
+	unsigned int avail_idx = q->avail_idx;
+	unsigned int i;
+	int j;
+
+	if (q->num_free < dcnt)
+		return -ENOSPC;
+
+	q->num_free -= dcnt;
+
+	for (i = 0; i < dcnt; i++) {
+		unsigned int idx = (avail_idx++) & q->ring_mask;
+
+		q->ring[idx].idx = d[i].idx;
+		q->ring[idx].len = d[i].len;
+		q->ring[idx].offset = d[i].offset;
+		q->ring[idx].error = 0;
+	}
+	u_smp_wmb();
+
+	for (j = dcnt - 1; j >= 0; j--) {
+		unsigned int idx = (q->avail_idx + j) & q->ring_mask;
+
+		q->ring[idx].flags = d[j].flags | TP4_DESC_KERNEL;
+	}
+	q->avail_idx += dcnt;
+
+	return 0;
+}
+
+static void *tp4_configure(const char *interface_name)
+{
+	int sfd, noqdisc, ret, ver = TPACKET_V4;
+	struct tpacket_req4 req = {};
+	struct tp4_queue_pair *tqp;
+	struct sockaddr_ll ll;
+	unsigned int i;
+	void *rxring;
+
+	/* create PF_PACKET socket */
+	sfd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+	lassert(sfd >= 0);
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
+	lassert(ret == 0);
+
+	tqp = calloc(1, sizeof(*tqp));
+	lassert(tqp);
+
+	tqp->sfd = sfd;
+	tqp->interface_name = interface_name;
+
+	tqp->umem = alloc_and_register_buffers(NUM_BUFFERS);
+	lassert(tqp->umem);
+
+	req.mr_fd = tqp->umem->mr_fd;
+	req.desc_nr = NUM_DESCS;
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));
+	lassert(ret == 0);
+
+	rxring = mmap(0, 2 * req.desc_nr * sizeof(struct tpacket4_desc),
+		      PROT_READ | PROT_WRITE,
+		      MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sfd, 0);
+	lassert(rxring != MAP_FAILED);
+
+	tqp->rx.ring = rxring;
+	tqp->rx.num_free = req.desc_nr;
+	tqp->rx.ring_mask = req.desc_nr - 1;
+
+	tqp->tx.ring = &tqp->rx.ring[req.desc_nr];
+	tqp->tx.num_free = req.desc_nr;
+	tqp->tx.ring_mask = req.desc_nr - 1;
+
+	ll.sll_family = PF_PACKET;
+	ll.sll_protocol = htons(ETH_P_ALL);
+	ll.sll_ifindex = if_nametoindex(interface_name);
+	ll.sll_hatype = 0;
+	ll.sll_pkttype = 0;
+	ll.sll_halen = 0;
+
+	noqdisc = 1;
+	ret = setsockopt(sfd, SOL_PACKET, PACKET_QDISC_BYPASS,
+			 &noqdisc, sizeof(noqdisc));
+	lassert(ret == 0);
+
+	ret = bind(sfd, (struct sockaddr *)&ll, sizeof(ll));
+	lassert(ret == 0);
+
+	if (opt_zerocopy > 0) {
+		ret = setsockopt(sfd, SOL_PACKET, PACKET_ZEROCOPY,
+				 &opt_zerocopy, sizeof(opt_zerocopy));
+		lassert(ret == 0);
+	}
+
+	for (i = 0; i < (tqp->rx.ring_mask + 1)/4; i++) {
+		struct tpacket4_desc desc = {};
+
+		desc.idx = i;
+		ret = tp4q_enqueue(&tqp->rx, &desc, 1);
+		lassert(ret == 0);
+	}
+
+	return tqp;
+}
+
+static void tp4_rx(void *queue_pair, unsigned int *start, unsigned int *end)
+{
+	struct tpacket4_queue *q = &((struct tp4_queue_pair *)queue_pair)->rx;
+	unsigned int idx, recv_size, last_used = q->last_used_idx;
+	unsigned int uncleared = (q->avail_idx - last_used);
+
+	*start = last_used;
+	*end = last_used;
+	recv_size = (uncleared < BATCH_SIZE) ? uncleared : BATCH_SIZE;
+
+	idx = (last_used + recv_size - 1) & q->ring_mask;
+	if (q->ring[idx].flags & TP4_DESC_KERNEL)
+		return;
+
+	*end += recv_size;
+	rx_npkts += recv_size;
+	q->num_free = recv_size;
+
+	u_smp_rmb();
+}
+
+static inline void tp4_rx_release(void *queue_pair, unsigned int start,
+				  unsigned int end)
+{
+	struct tp4_queue_pair *qp = queue_pair;
+	struct tpacket4_queue *q = &qp->rx;
+	struct tpacket4_desc *src, *dst;
+	unsigned int nitems = end - start;
+
+	while (nitems--) {
+		dst = &q->ring[(q->avail_idx++) & q->ring_mask];
+		src = &q->ring[start++ & q->ring_mask];
+		*dst = *src;
+
+		u_smp_wmb();
+
+		dst->flags = TP4_DESC_KERNEL;
+	}
+
+	q->last_used_idx += q->num_free;
+	q->num_free = 0;
+}
+
+static inline void *tp4_get_data(void *queue_pair, unsigned int idx,
+				 unsigned int *len)
+{
+	struct tp4_queue_pair *qp = (struct tp4_queue_pair *)queue_pair;
+	struct tp4_umem *umem = qp->umem;
+	struct tpacket4_desc *d;
+
+	d = &qp->rx.ring[idx & qp->rx.ring_mask];
+	*len = d->len;
+
+	return (char *)umem->buffer + (d->idx << umem->frame_size_log2)
+		+ d->offset;
+}
+
+
+static inline unsigned long tp4_get_data_desc(void *queue_pair,
+					      unsigned int idx,
+					      unsigned int *len,
+					      unsigned short *offset)
+{
+	struct tp4_queue_pair *qp = queue_pair;
+	struct tpacket4_queue *q = &qp->rx;
+	struct tpacket4_desc *d;
+
+	d = &q->ring[idx & q->ring_mask];
+	*len = d->len;
+	*offset = d->offset;
+
+	return d->idx;
+}
+
+static inline unsigned long tp4_get_data_desc_dummy(void *queue_pair,
+						    unsigned int idx,
+						    unsigned int *len,
+						    unsigned short *offset)
+{
+	struct tp4_queue_pair *qp = queue_pair;
+
+	(void)idx;
+
+	*len = tx_frame_len;
+	*offset = 0;
+
+	return pop_free_stack(qp->umem);
+}
+
+static inline void tp4_set_data_desc(void *queue_pair, unsigned int idx,
+				     unsigned long didx)
+{
+	struct tp4_queue_pair *qp = queue_pair;
+	struct tpacket4_queue *q = &qp->rx;
+	struct tpacket4_desc *d;
+
+	d = &q->ring[idx & q->ring_mask];
+	d->idx = didx;
+}
+
+static inline void tp4_set_data_desc_dummy(void *queue_pair, unsigned int idx,
+					   unsigned long didx)
+{
+	struct tp4_queue_pair *qp = queue_pair;
+
+	(void)idx;
+
+	push_free_stack(qp->umem, didx);
+}
+
+static void tp4_tx(void *queue_pair, unsigned int start, unsigned int end)
+{
+	struct tp4_queue_pair *qp = (struct tp4_queue_pair *)queue_pair;
+	struct tpacket4_queue *q = &qp->tx;
+	unsigned int i, aidx, uidx, send_size, s, entries, ncleared = 0;
+	unsigned long cleared[BATCH_SIZE];
+	int ret;
+
+	entries = end - start;
+
+	if (q->num_free != NUM_DESCS) {
+		for (i = 0; i < entries; i++) {
+			uidx = q->last_used_idx & q->ring_mask;
+			if (q->ring[uidx].flags & TP4_DESC_KERNEL)
+				break;
+
+			q->last_used_idx++;
+			cleared[i] = q->ring[uidx].idx;
+			q->num_free++;
+			ncleared++;
+		}
+	}
+
+	tx_npkts += ncleared;
+
+	send_size = (q->num_free < entries) ? q->num_free : entries;
+	i = 0;
+	s = start;
+	q->num_free -= send_size;
+
+	while (send_size--) {
+		aidx = q->avail_idx++ & q->ring_mask;
+
+		q->ring[aidx].idx = benchmark.get_data_desc(
+			qp, s, &q->ring[aidx].len,
+			&q->ring[aidx].offset);
+		if (i < ncleared)
+			benchmark.set_data_desc(qp, s++, cleared[i++]);
+
+		u_smp_wmb();
+
+		q->ring[aidx].flags = TP4_DESC_KERNEL;
+	}
+
+	benchmark.rx_release(queue_pair, start, start + ncleared);
+
+	ret = sendto(qp->sfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!(ret >= 0 || errno == EAGAIN || errno == ENOBUFS))
+		lassert(0);
+}
+
+static struct benchmark benchmarks[3][3] = {
+	{ /* V2 */
+		{ .configure = tp2_configure,
+		  .rx = tp2_rx,
+		  .get_data = NULL,
+		  .get_data_desc = NULL,
+		  .set_data_desc = NULL,
+		  .process = NULL,
+		  .rx_release = NULL,
+		  .tx = tp2_rx_release,
+		},
+		{ .configure = tp2_configure,
+		  .rx = rx_dummy,
+		  .get_data = get_data_dummy,
+		  .get_data_desc = NULL,
+		  .set_data_desc = NULL,
+		  .process = NULL,
+		  .rx_release = rx_release_dummy,
+		  .tx = tp2_tx,
+		},
+		{ .configure = tp2_configure,
+		  .rx = tp2_rx,
+		  .get_data = tp2_get_data,
+		  .get_data_desc = NULL,
+		  .set_data_desc = NULL,
+		  .process = process_swap_mac,
+		  .rx_release = tp2_rx_release,
+		  .tx = tp2_tx,
+		}
+	},
+	{ /* V3 */
+		{ .configure = tp3_configure,
+		  .rx = tp3_rx,
+		  .get_data = NULL,
+		  .get_data_desc = NULL,
+		  .set_data_desc = NULL,
+		  .process = NULL,
+		  .rx_release = NULL,
+		  .tx = tp3_rx_release,
+		},
+		{ .configure = tp3_configure,
+		  .rx = rx_dummy,
+		  .get_data = get_data_dummy,
+		  .get_data_desc = NULL,
+		  .set_data_desc = NULL,
+		  .process = NULL,
+		  .rx_release = rx_release_dummy,
+		  .tx = tp3_tx,
+		},
+		{ .configure = tp3_configure,
+		  .rx = tp3_rx,
+		  .get_data = tp3_get_data,
+		  .set_data_desc = NULL,
+		  .get_data_desc = NULL,
+		  .process = process_swap_mac,
+		  .rx_release = tp3_rx_release,
+		  .tx = tp3_tx,
+		}
+	},
+	{ /* V4 */
+		{ .configure = tp4_configure,
+		  .rx = tp4_rx,
+		  .get_data = NULL,
+		  .get_data_desc = NULL,
+		  .set_data_desc = NULL,
+		  .process = NULL,
+		  .rx_release = NULL,
+		  .tx = tp4_rx_release,
+		},
+		{ .configure = tp4_configure,
+		  .rx = rx_dummy,
+		  .get_data = NULL,
+		  .get_data_desc = tp4_get_data_desc_dummy,
+		  .set_data_desc = tp4_set_data_desc_dummy,
+		  .process = NULL,
+		  .rx_release = rx_release_dummy,
+		  .tx = tp4_tx,
+		},
+		{ .configure = tp4_configure,
+		  .rx = tp4_rx,
+		  .get_data = tp4_get_data,
+		  .get_data_desc = tp4_get_data_desc,
+		  .set_data_desc = tp4_set_data_desc,
+		  .process = process_swap_mac,
+		  .rx_release = tp4_rx_release,
+		  .tx = tp4_tx,
+		}
+	}
+};
+
+static struct benchmark *get_benchmark(enum tpacket_version ver,
+				       enum benchmark_type type)
+{
+	return &benchmarks[ver][type];
+}
+
+
+
+
+static struct option long_options[] = {
+	{"version", required_argument, 0, 'v'},
+	{"rxdrop", no_argument, 0, 'r'},
+	{"txonly", no_argument, 0, 't'},
+	{"l2fwd", no_argument, 0, 'l'},
+	{"zerocopy", required_argument, 0, 'z'},
+	{"interface", required_argument, 0, 'i'},
+	{0, 0, 0, 0}
+};
+
+static void usage(void)
+{
+	const char *str =
+		"  Usage: tpbench [OPTIONS]\n"
+		"  Options:\n"
+		"  -v, --version=n	Use tpacket version n (default 4)\n"
+		"  -r, --rxdrop		Discard all incoming packets (default)\n"
+		"  -t, --txonly		Only send packets\n"
+		"  -l, --l2fwd		MAC swap L2 forwarding\n"
+		"  -z, --zerocopy=n	Enable zero-copy on queue n\n"
+		"  -i, --interface=n	Run on interface n\n"
+		"\n";
+	fprintf(stderr, "%s", str);
+	exit(EXIT_FAILURE);
+}
+
+static void parse_command_line(int argc, char **argv)
+{
+	int option_index, c, version, ret;
+
+	opterr = 0;
+
+	for (;;) {
+		c = getopt_long(argc, argv, "v:rtlz:i:", long_options,
+				&option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'v':
+			version = atoi(optarg);
+			if (version < 2 || version > 4) {
+				fprintf(stderr,
+					"ERROR: version has to be [2,4]\n");
+				usage();
+			}
+			opt_tpver = version - 2;
+			break;
+		case 'r':
+			opt_bench = BENCH_RXDROP;
+			break;
+		case 't':
+			opt_bench = BENCH_TXONLY;
+			break;
+		case 'l':
+			opt_bench = BENCH_L2FWD;
+			break;
+		case 'z':
+			opt_zerocopy = atoi(optarg);
+			break;
+		case 'i':
+			opt_if = optarg;
+			break;
+		default:
+			usage();
+		}
+	}
+
+	if (opt_zerocopy > 0 && opt_tpver != PV4) {
+		fprintf(stderr, "ERROR: version 4 required for zero-copy\n");
+		usage();
+	}
+
+	ret = if_nametoindex(opt_if);
+	if (!ret) {
+		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
+			opt_if);
+		usage();
+	}
+}
+
+static void print_benchmark(bool running)
+{
+	const char *bench_str = "INVALID";
+
+	if (opt_bench == BENCH_RXDROP)
+		bench_str = "rxdrop";
+	else if (opt_bench == BENCH_TXONLY)
+		bench_str = "txonly";
+	else if (opt_bench == BENCH_L2FWD)
+		bench_str = "l2fwd";
+
+	printf("%s v%d %s ", opt_if, opt_tpver + 2, bench_str);
+	if (opt_zerocopy > 0)
+		printf("zc ");
+	else
+		printf("   ");
+
+	if (running) {
+		printf("running...");
+		fflush(stdout);
+	}
+}
+
+static void sigdie(int sig)
+{
+	unsigned long stop_time = get_nsecs();
+	long dt = stop_time - start_time;
+	(void)sig;
+
+	double rx_pps = rx_npkts * 1000000000. / dt;
+	double tx_pps = tx_npkts * 1000000000. / dt;
+
+	printf("\r");
+	print_benchmark(false);
+	printf("duration %4.2fs rx: %16lupkts @ %16.2fpps tx: %16lupkts @ %16.2fpps.\n",
+	       dt / 1000000000., rx_npkts, rx_pps, tx_npkts, tx_pps);
+
+	exit(EXIT_SUCCESS);
+}
+
+int main(int argc, char **argv)
+{
+	signal(SIGINT, sigdie);
+	parse_command_line(argc, argv);
+	print_benchmark(true);
+	benchmark = *get_benchmark(opt_tpver, opt_bench);
+	start_time = get_nsecs();
+	run_benchmark(opt_if);
+
+	return 0;
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 11/14] veth: added support for PACKET_ZEROCOPY
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (9 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 10/14] samples/tpacket4: added tpbench Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 12/14] samples/tpacket4: added veth support Björn Töpel
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

Add AF_PACKET V4 zerocopy support for the veth driver.
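
For context, and not part of the patch itself: user space requests this
path per queue pair with the PACKET_ZEROCOPY setsockopt (added earlier
in this series) on an AF_PACKET V4 socket bound to one end of the veth
pair. A minimal sketch, assuming the socket is already configured and
bound:

#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper: request zero-copy on the given queue pair of an
 * already configured and bound AF_PACKET V4 socket. Returns 0 on
 * success; on failure the socket simply stays in copy mode.
 */
static int enable_zerocopy(int sfd, int queue_pair)
{
	return setsockopt(sfd, SOL_PACKET, PACKET_ZEROCOPY,
			  &queue_pair, sizeof(queue_pair));
}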

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/veth.c       | 172 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/tpacket4.h | 131 ++++++++++++++++++++++++++++++++++++
 2 files changed, 303 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5438d0978ca..3dfb5fb89460 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,6 +19,7 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/tpacket4.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
@@ -33,6 +34,10 @@ struct veth_priv {
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	struct tp4_packet_array *tp4a_rx;
+	struct tp4_packet_array *tp4a_tx;
+	struct napi_struct      *napi;
+	bool                    tp4_zerocopy;
 };
 
 /*
@@ -104,6 +109,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct net_device *rcv;
 	int length = skb->len;
 
+	/* Drop packets from stack if we are in zerocopy mode. */
+	if (unlikely(priv->tp4_zerocopy)) {
+		consume_skb(skb);
+		return NETDEV_TX_OK;
+	}
+
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
 	if (unlikely(!rcv)) {
@@ -126,6 +137,64 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
+static int veth_tp4_xmit(struct net_device *netdev, int queue_pair)
+{
+	struct veth_priv *priv = netdev_priv(netdev);
+
+	local_bh_disable();
+	napi_schedule(priv->napi);
+	local_bh_enable();
+
+	return NETDEV_TX_OK;
+}
+
+static int veth_napi_poll(struct napi_struct *napi, int budget)
+{
+	struct net_device *netdev = napi->dev;
+	struct pcpu_vstats *stats = this_cpu_ptr(netdev->vstats);
+	struct veth_priv *priv_rcv, *priv = netdev_priv(netdev);
+	struct tp4_packet_array *tp4a_tx = priv->tp4a_tx;
+	struct tp4_packet_array *tp4a_rx;
+	struct net_device *rcv;
+	int npackets = 0;
+	int length = 0;
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		goto exit;
+
+	priv_rcv = netdev_priv(rcv);
+	if (unlikely(!priv_rcv->tp4_zerocopy))
+		goto exit;
+
+	/* To make sure we do not read the tp4_queue pointers
+	 * before the other process has enabled zerocopy
+	 */
+	smp_rmb();
+
+	tp4a_rx = priv_rcv->tp4a_rx;
+
+	tp4a_populate(tp4a_tx);
+	tp4a_populate(tp4a_rx);
+
+	npackets = tp4a_copy(tp4a_rx, tp4a_tx, &length);
+
+	WARN_ON_ONCE(tp4a_flush(tp4a_tx));
+	WARN_ON_ONCE(tp4a_flush(tp4a_rx));
+
+	u64_stats_update_begin(&stats->syncp);
+	stats->bytes += length;
+	stats->packets += npackets;
+	u64_stats_update_end(&stats->syncp);
+
+exit:
+	rcu_read_unlock();
+	if (npackets < NAPI_POLL_WEIGHT)
+		napi_complete_done(priv->napi, 0);
+	return npackets;
+}
+
 /*
  * general routines
  */
@@ -276,6 +345,105 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static int veth_tp4_disable(struct net_device *netdev,
+			    struct tp4_netdev_parms *params)
+{
+	struct veth_priv *priv_rcv, *priv = netdev_priv(netdev);
+	struct net_device *rcv;
+
+	if (!priv->tp4_zerocopy)
+		return 0;
+	priv->tp4_zerocopy = false;
+
+	/* Make sure other process sees zero copy as off before starting
+	 * to turn things off
+	 */
+	smp_wmb();
+
+	napi_disable(priv->napi);
+	netif_napi_del(priv->napi);
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (!rcv) {
+		WARN_ON(!rcv);
+		goto exit;
+	}
+	priv_rcv = netdev_priv(rcv);
+
+	if (priv_rcv->tp4_zerocopy) {
+		/* Wait for other thread to complete
+		 * before removing tp4 queues
+		 */
+		napi_synchronize(priv_rcv->napi);
+	}
+exit:
+	rcu_read_unlock();
+
+	tp4a_free(priv->tp4a_rx);
+	tp4a_free(priv->tp4a_tx);
+	kfree(priv->napi);
+
+	return 0;
+}
+
+static int veth_tp4_enable(struct net_device *netdev,
+			   struct tp4_netdev_parms *params)
+{
+	struct veth_priv *priv = netdev_priv(netdev);
+	int err;
+
+	priv->napi = kzalloc(sizeof(*priv->napi), GFP_KERNEL);
+	if (!priv->napi)
+		return -ENOMEM;
+
+	netif_napi_add(netdev, priv->napi, veth_napi_poll,
+		       NAPI_POLL_WEIGHT);
+
+	priv->tp4a_rx = tp4a_rx_new(params->rx_opaque, NAPI_POLL_WEIGHT, NULL);
+	if (!priv->tp4a_rx) {
+		err = -ENOMEM;
+		goto rxa_err;
+	}
+
+	priv->tp4a_tx = tp4a_tx_new(params->tx_opaque, NAPI_POLL_WEIGHT, NULL);
+	if (!priv->tp4a_tx) {
+		err = -ENOMEM;
+		goto txa_err;
+	}
+
+	/* Make sure other process sees queues initialized before enabling
+	 * zerocopy mode
+	 */
+	smp_wmb();
+	priv->tp4_zerocopy = true;
+	napi_enable(priv->napi);
+
+	return 0;
+
+txa_err:
+	tp4a_free(priv->tp4a_rx);
+rxa_err:
+	netif_napi_del(priv->napi);
+	kfree(priv->napi);
+	return err;
+}
+
+static int veth_tp4_zerocopy(struct net_device *netdev,
+			     struct tp4_netdev_parms *params)
+{
+	switch (params->command) {
+	case TP4_ENABLE:
+		return veth_tp4_enable(netdev, params);
+
+	case TP4_DISABLE:
+		return veth_tp4_disable(netdev, params);
+
+	default:
+		return -ENOTSUPP;
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -290,6 +458,8 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_get_iflink		= veth_get_iflink,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
+	.ndo_tp4_zerocopy	= veth_tp4_zerocopy,
+	.ndo_tp4_xmit           = veth_tp4_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
@@ -449,9 +619,11 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 
 	priv = netdev_priv(dev);
 	rcu_assign_pointer(priv->peer, peer);
+	priv->tp4_zerocopy = false;
 
 	priv = netdev_priv(peer);
 	rcu_assign_pointer(priv->peer, dev);
+	priv->tp4_zerocopy = false;
 	return 0;
 
 err_register_dev:
diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index beaf23f713eb..360d80086104 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -1074,6 +1074,19 @@ static inline unsigned int tp4a_max_data_size(struct tp4_packet_array *a)
 }
 
 /**
+ * tp4a_has_same_umem - Checks if two packet arrays have the same umem
+ * @a1: pointer to packet array
+ * @a2: pointer to packet array
+ *
+ * Returns true if arrays have the same umem, false otherwise
+ **/
+static inline bool tp4a_has_same_umem(struct tp4_packet_array *a1,
+				      struct tp4_packet_array *a2)
+{
+	return a1->tp4q->umem == a2->tp4q->umem;
+}
+
+/**
  * tp4a_next_packet - Get next packet in array and advance curr pointer
  * @a: pointer to packet array
  * @p: supplied pointer to packet structure that is filled in by function
@@ -1188,6 +1201,124 @@ static inline bool tp4a_next_frame_populate(struct tp4_packet_array *a,
 }
 
 /**
+ * tp4a_add_packet - Adds a packet into a packet array without copying data
+ * @a: pointer to packet array to insert the packet into
+ * @p: pointer to packet to insert
+ * @len: returns the length in bytes of data added according to descriptor
+ *
+ * Note that this function does not copy the data. Instead it copies
+ * the address that points to the packet buffer.
+ *
+ * Returns 0 for success and -1 for failure
+ **/
+static inline int tp4a_add_packet(struct tp4_packet_array *a,
+				  struct tp4_frame_set *p, u32 *len)
+{
+	u32 free = a->end - a->curr;
+	u32 nframes = p->end - p->start;
+
+	if (nframes > free)
+		return -1;
+
+	tp4f_reset(p);
+	*len = 0;
+
+	do {
+		int frame_len = tp4f_get_frame_len(p);
+		int idx = a->curr & a->mask;
+
+		a->items[idx].idx = tp4f_get_frame_id(p);
+		a->items[idx].len = frame_len;
+		a->items[idx].offset = tp4f_get_data_offset(p);
+		a->items[idx].flags = tp4f_is_last_frame(p) ?
+						   0 : TP4_PKT_CONT;
+		a->items[idx].error = 0;
+
+		a->curr++;
+		*len += frame_len;
+	} while (tp4f_next_frame(p));
+
+	return 0;
+}
+
+/**
+ * tp4a_copy_packet - Copies a packet with data into a packet array
+ * @a: pointer to packet array to insert the packet into
+ * @p: pointer to packet to insert and copy
+ * @len: returns the length in bytes of data copied
+ *
+ * Puts the packet where curr is pointing
+ *
+ * Returns 0 for success and -1 for failure
+ **/
+static inline int tp4a_copy_packet(struct tp4_packet_array *a,
+				   struct tp4_frame_set *p, int *len)
+{
+	u32 free = a->end - a->curr;
+	u32 nframes = p->end - p->start;
+
+	if (nframes > free)
+		return -1;
+
+	tp4f_reset(p);
+	*len = 0;
+
+	do {
+		int frame_len = tp4f_get_frame_len(p);
+		int idx = a->curr & a->mask;
+
+		a->items[idx].len = frame_len;
+		a->items[idx].offset = tp4f_get_data_offset(p);
+		a->items[idx].flags = tp4f_is_last_frame(p) ?
+						   0 : TP4_PKT_CONT;
+		a->items[idx].error = 0;
+
+		memcpy(tp4q_get_data(a->tp4q, &a->items[idx]),
+		       tp4f_get_data(p), frame_len);
+		a->curr++;
+		*len += frame_len;
+	} while (tp4f_next_frame(p));
+
+	return 0;
+}
+
+/**
+ * tp4a_copy - Copy a packet array
+ * @dst: pointer to destination packet array
+ * @src: pointer to source packet array
+ * @len: returns the length in bytes of all packets copied
+ *
+ * Returns number of packets copied
+ **/
+static inline int tp4a_copy(struct tp4_packet_array *dst,
+			    struct tp4_packet_array *src, int *len)
+{
+	int npackets = 0;
+
+	*len = 0;
+	for (;;) {
+		struct tp4_frame_set src_pkt;
+		int pkt_len;
+
+		if (!tp4a_next_packet(src, &src_pkt))
+			break;
+
+		if (tp4a_has_same_umem(src, dst)) {
+			if (tp4a_add_packet(dst, &src_pkt, &pkt_len))
+				break;
+		} else {
+			if (tp4a_copy_packet(dst, &src_pkt, &pkt_len))
+				break;
+		}
+
+		npackets++;
+		*len += pkt_len;
+	}
+
+	return npackets;
+}
+
+/**
  * tp4a_return_packet - Return packet to the packet array
  *
  * @a: pointer to packet array
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 12/14] samples/tpacket4: added veth support
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (10 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 11/14] veth: added support for PACKET_ZEROCOPY Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 13/14] i40e: added XDP support for TP4 enabled queue pairs Björn Töpel
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

This commit adds support for running the benchmark using a veth pair.
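
The benchmark does not create the veth pair itself; it expects the
interfaces named vm1/vm2 to already exist when the veth mode is used. A
hypothetical setup helper, assuming iproute2 is installed and the
benchmark is run as root, could look like this:

#include <stdlib.h>

/* Not part of the patch: create the "vm1"/"vm2" veth pair that
 * tpbench --veth assumes is present.
 */
static int create_veth_pair(void)
{
	if (system("ip link add vm1 type veth peer name vm2"))
		return -1;
	if (system("ip link set vm1 up"))
		return -1;
	return system("ip link set vm2 up");
}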

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/tpacket4/tpbench.c | 189 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 163 insertions(+), 26 deletions(-)

diff --git a/samples/tpacket4/tpbench.c b/samples/tpacket4/tpbench.c
index 46fb83009e06..2479f182d1b8 100644
--- a/samples/tpacket4/tpbench.c
+++ b/samples/tpacket4/tpbench.c
@@ -65,8 +65,18 @@ enum benchmark_type {
 static enum tpacket_version opt_tpver = PV4;
 static enum benchmark_type opt_bench = BENCH_RXDROP;
 static const char *opt_if = "";
+static int opt_veth;
 static int opt_zerocopy;
 
+static const char *veth_if1 = "vm1";
+static const char *veth_if2 = "vm2";
+
+/* For process synchronization */
+static int shmid;
+volatile unsigned int *sync_var;
+#define SLEEP_STEP 10
+#define MAX_SLEEP (1000000 / (SLEEP_STEP))
+
 struct tpacket2_queue {
 	void *ring;
 
@@ -296,13 +306,53 @@ static void process_swap_mac(void *queue_pair, unsigned int start,
 	}
 }
 
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+
 static void run_benchmark(const char *interface_name)
 {
 	unsigned int start, end;
 	struct tp2_queue_pair *qp;
 
+	if (opt_veth) {
+		shmid = shmget(14082017, sizeof(unsigned int),
+			       IPC_CREAT | 0666);
+		sync_var = shmat(shmid, 0, 0);
+		if (sync_var == (unsigned int *)-1) {
+			printf("You are probably not running as root\n");
+			exit(EXIT_FAILURE);
+		}
+		*sync_var = 0;
+
+		if (fork() == 0) {
+			opt_if = veth_if2;
+			interface_name = veth_if2;
+		} else {
+			unsigned int i;
+
+			/* Wait for child */
+			for (i = 0; *sync_var == 0 && i < MAX_SLEEP; i++)
+				usleep(SLEEP_STEP);
+			if (i >= MAX_SLEEP) {
+				printf("Wait for vm2 timed out. Exiting.\n");
+				exit(EXIT_FAILURE);
+			}
+		}
+	}
+
 	qp = benchmark.configure(interface_name);
 
+	/* Notify parent that interface configuration completed */
+	if (opt_veth && !strcmp(interface_name, "vm2"))
+		*sync_var = 1;
+
+	start_time = get_nsecs();
+
 	for (;;) {
 		for (;;) {
 			benchmark.rx(qp, &start, &end);
@@ -320,14 +370,6 @@ static void run_benchmark(const char *interface_name)
 	}
 }
 
-static unsigned long get_nsecs(void)
-{
-	struct timespec ts;
-
-	clock_gettime(CLOCK_MONOTONIC, &ts);
-	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
-}
-
 static void *tp2_configure(const char *interface_name)
 {
 	int sfd, noqdisc, ret, ver = TPACKET_V2;
@@ -386,6 +428,36 @@ static void *tp2_configure(const char *interface_name)
 	ret = bind(sfd, (struct sockaddr *)&ll, sizeof(ll));
 	lassert(ret == 0);
 
+	if (opt_veth && !strcmp(interface_name, "vm1"))	{
+		struct tpacket2_queue *txq = &tqp->tx;
+		int i;
+
+		for (i = 0; i < opt_veth; i++) {
+			unsigned int idx = txq->last_used_idx &
+				(txq->ring_size - 1);
+			struct tpacket2_hdr *hdr;
+			unsigned int len;
+
+			hdr = (struct tpacket2_hdr *)(txq->ring +
+					     (idx << txq->frame_size_log2));
+			len = gen_eth_frame((char *)hdr + TPACKET2_HDRLEN -
+					    sizeof(struct sockaddr_ll), i + 1);
+			hdr->tp_snaplen = len;
+			hdr->tp_len = len;
+
+			u_smp_wmb();
+
+			hdr->tp_status = TP_STATUS_SEND_REQUEST;
+			txq->last_used_idx++;
+		}
+
+		ret = sendto(sfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+		if (!(ret >= 0 || errno == EAGAIN || errno == ENOBUFS))
+			lassert(0);
+
+		tx_npkts += opt_veth;
+	}
+
 	setup_tx_frame();
 
 	return tqp;
@@ -556,6 +628,36 @@ static void *tp3_configure(const char *interface_name)
 	ret = bind(sfd, (struct sockaddr *)&ll, sizeof(ll));
 	lassert(ret == 0);
 
+	if (opt_veth && !strcmp(interface_name, "vm1"))	{
+		struct tpacket2_queue *txq = &tqp->tx;
+		int i;
+
+		for (i = 0; i < opt_veth; i++) {
+			unsigned int idx = txq->last_used_idx &
+				(txq->ring_size - 1);
+			struct tpacket3_hdr *hdr;
+			unsigned int len;
+
+			hdr = (struct tpacket3_hdr *)(txq->ring +
+					     (idx << txq->frame_size_log2));
+			len = gen_eth_frame((char *)hdr + TPACKET3_HDRLEN -
+					    sizeof(struct sockaddr_ll), i + 1);
+			hdr->tp_snaplen = len;
+			hdr->tp_len = len;
+
+			u_smp_wmb();
+
+			hdr->tp_status = TP_STATUS_SEND_REQUEST;
+			txq->last_used_idx++;
+		}
+
+		ret = sendto(sfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+		if (!(ret >= 0 || errno == EAGAIN || errno == ENOBUFS))
+			lassert(0);
+
+		tx_npkts += opt_veth;
+	}
+
 	setup_tx_frame();
 
 	return tqp;
@@ -783,6 +885,28 @@ static inline int tp4q_enqueue(struct tpacket4_queue *q,
 	return 0;
 }
 
+static inline void *tp4_get_data(void *queue_pair, unsigned int idx,
+				 unsigned int *len)
+{
+	struct tp4_queue_pair *qp = (struct tp4_queue_pair *)queue_pair;
+	struct tp4_umem *umem = qp->umem;
+	struct tpacket4_desc *d;
+
+	d = &qp->rx.ring[idx & qp->rx.ring_mask];
+	*len = d->len;
+
+	return (char *)umem->buffer + (d->idx << umem->frame_size_log2)
+		+ d->offset;
+}
+
+static inline void *tp4_get_buffer(void *queue_pair, unsigned int idx)
+{
+	struct tp4_queue_pair *qp = (struct tp4_queue_pair *)queue_pair;
+	struct tp4_umem *umem = qp->umem;
+
+	return (char *)umem->buffer + (idx << umem->frame_size_log2);
+}
+
 static void *tp4_configure(const char *interface_name)
 {
 	int sfd, noqdisc, ret, ver = TPACKET_V4;
@@ -848,7 +972,27 @@ static void *tp4_configure(const char *interface_name)
 		lassert(ret == 0);
 	}
 
-	for (i = 0; i < (tqp->rx.ring_mask + 1)/4; i++) {
+	if (opt_veth >= (tqp->rx.ring_mask + 1)/4) {
+		printf("Veth batch size too large.\n");
+		exit(EXIT_FAILURE);
+	}
+
+	if (opt_veth && !strcmp(interface_name, "vm1"))	{
+		for (i = 0; i < opt_veth; i++) {
+			struct tpacket4_desc desc = {.idx = i};
+			unsigned int len;
+
+			len = gen_eth_frame(tp4_get_buffer(tqp, i), i + 1);
+
+			desc.len = len;
+			ret = tp4q_enqueue(&tqp->tx, &desc, 1);
+			lassert(ret == 0);
+		}
+		ret = sendto(sfd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+		lassert(ret != -1);
+	}
+
+	for (i = opt_veth; i < (tqp->rx.ring_mask + 1)/4; i++) {
 		struct tpacket4_desc desc = {};
 
 		desc.idx = i;
@@ -902,21 +1046,6 @@ static inline void tp4_rx_release(void *queue_pair, unsigned int start,
 	q->num_free = 0;
 }
 
-static inline void *tp4_get_data(void *queue_pair, unsigned int idx,
-				 unsigned int *len)
-{
-	struct tp4_queue_pair *qp = (struct tp4_queue_pair *)queue_pair;
-	struct tp4_umem *umem = qp->umem;
-	struct tpacket4_desc *d;
-
-	d = &qp->rx.ring[idx & qp->rx.ring_mask];
-	*len = d->len;
-
-	return (char *)umem->buffer + (d->idx << umem->frame_size_log2)
-		+ d->offset;
-}
-
-
 static inline unsigned long tp4_get_data_desc(void *queue_pair,
 					      unsigned int idx,
 					      unsigned int *len,
@@ -1126,6 +1255,7 @@ static struct option long_options[] = {
 	{"l2fwd", no_argument, 0, 'l'},
 	{"zerocopy", required_argument, 0, 'z'},
 	{"interface", required_argument, 0, 'i'},
+	{"veth", required_argument, 0, 'e'},
 	{0, 0, 0, 0}
 };
 
@@ -1152,7 +1282,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "v:rtlz:i:", long_options,
+		c = getopt_long(argc, argv, "v:rtlz:i:e:", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -1182,6 +1312,9 @@ static void parse_command_line(int argc, char **argv)
 		case 'i':
 			opt_if = optarg;
 			break;
+		case 'e':
+			opt_veth = atoi(optarg);
+			break;
 		default:
 			usage();
 		}
@@ -1192,6 +1325,11 @@ static void parse_command_line(int argc, char **argv)
 		usage();
 	}
 
+	if (opt_veth) {
+		opt_bench = BENCH_L2FWD;
+		opt_if = veth_if1;
+	}
+
 	ret = if_nametoindex(opt_if);
 	if (!ret) {
 		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
@@ -1246,7 +1384,6 @@ int main(int argc, char **argv)
 	parse_command_line(argc, argv);
 	print_benchmark(true);
 	benchmark = *get_benchmark(opt_tpver, opt_bench);
-	start_time = get_nsecs();
 	run_benchmark(opt_if);
 
 	return 0;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 13/14] i40e: added XDP support for TP4 enabled queue pairs
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (11 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 12/14] samples/tpacket4: added veth support Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-10-31 12:41 ` [RFC PATCH 14/14] xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use Björn Töpel
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

In this commit the packet array learns to execute XDP programs on
its flushable range. This means that before the kernel flushes
completed/filled Rx frames to user space, an XDP program is executed
on each frame and its verdict acted upon.

Currently, a packet array user still has to call the tp4a_run_xdp
function explicitly, prior to a tp4a_flush/tp4a_flush_n call, but this
will change in a future patch set.

The XDP_TX/XDP_REDIRECT actions do page allocation, so expect lousy
performance. The i40e XDP infrastructure needs to be aligned to handle
TP4 properly.
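
For illustration, a rough sketch of the call order a driver is expected
to follow (not part of the patch; the drv_* names are made-up stubs and
the HW descriptor handling that sets up each frame is elided):

/* Stub XDP Tx callbacks so the sketch is self-contained. A real driver
 * queues the xdp_buff on its XDP Tx ring, cf. i40e_tp4_xdp_tx_handler.
 */
static int drv_xdp_tx(void *ctx, struct xdp_buff *xdp)
{
	return TP4_XDP_CONSUMED;
}

static void drv_xdp_tx_flush(void *ctx)
{
}

/* Run XDP over the flushable range, then flush only the frames that
 * were not recycled by XDP_DROP/XDP_TX/XDP_REDIRECT.
 */
static void drv_clean_rx_tp4_sketch(struct tp4_packet_array *arr,
				    struct bpf_prog *xdp_prog)
{
	struct tp4_frame_set fs;
	int nflush = 0;

	if (!tp4a_get_flushable_frame_set(arr, &fs))
		return;

	do {
		bool recycled;

		tp4a_run_xdp(&fs, &recycled, xdp_prog,
			     drv_xdp_tx, NULL,
			     drv_xdp_tx_flush, NULL);
		if (!recycled)
			nflush++;
	} while (tp4f_next_frame(&fs));

	WARN_ON(tp4a_flush_n(arr, nflush));
}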

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   4 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |  70 +++++++++++-
 drivers/net/veth.c                          |   6 +-
 include/linux/tpacket4.h                    | 160 +++++++++++++++++++++++++++-
 net/packet/af_packet.c                      |   4 +-
 5 files changed, 233 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ff6d44dae8d0..b63cc4c8957f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11398,7 +11398,7 @@ static int i40e_tp4_enable_rx(struct i40e_ring *rxr,
 	size_t elems = __roundup_pow_of_two(rxr->count * 8);
 	struct tp4_packet_array *arr;
 
-	arr = tp4a_rx_new(params->rx_opaque, elems, rxr->dev);
+	arr = tp4a_rx_new(params->rx_opaque, elems, rxr->netdev, rxr->dev);
 	if (!arr)
 		return -ENOMEM;
 
@@ -11428,7 +11428,7 @@ static int i40e_tp4_enable_tx(struct i40e_ring *txr,
 	size_t elems = __roundup_pow_of_two(txr->count * 8);
 	struct tp4_packet_array *arr;
 
-	arr = tp4a_tx_new(params->tx_opaque, elems, txr->dev);
+	arr = tp4a_tx_new(params->tx_opaque, elems, txr->netdev, txr->dev);
 	if (!arr)
 		return -ENOMEM;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 712e10e14aec..730fe57ca8ee 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2277,6 +2277,9 @@ static inline unsigned int i40e_get_rx_desc_size(union i40e_rx_desc *rxd)
 	return size;
 }
 
+static void i40e_run_xdp_tp4(struct tp4_frame_set *f, bool *recycled,
+			     struct bpf_prog *xdp_prog, struct i40e_ring *xdpr);
+
 /**
  * i40e_clean_rx_tp4_irq - Pulls received packets of the descriptor ring
  * @rxr: ingress ring
@@ -2286,14 +2289,18 @@ static inline unsigned int i40e_get_rx_desc_size(union i40e_rx_desc *rxd)
  **/
 int i40e_clean_rx_tp4_irq(struct i40e_ring *rxr, int budget)
 {
-	int total_rx_bytes = 0, total_rx_packets = 0;
+	int total_rx_bytes = 0, total_rx_packets = 0, nflush = 0;
 	u16 cleaned_count = I40E_DESC_UNUSED(rxr);
 	struct tp4_frame_set frame_set;
+	struct bpf_prog *xdp_prog;
+	struct i40e_ring *xdpr;
 	bool failure;
 
 	if (!tp4a_get_flushable_frame_set(rxr->tp4.arr, &frame_set))
 		goto out;
 
+	rcu_read_lock();
+	xdp_prog = READ_ONCE(rxr->xdp_prog);
 	while (total_rx_packets < budget) {
 		union i40e_rx_desc *rxd = I40E_RX_DESC(rxr, rxr->next_to_clean);
 		unsigned int size = i40e_get_rx_desc_size(rxd);
@@ -2310,6 +2317,19 @@ int i40e_clean_rx_tp4_irq(struct i40e_ring *rxr, int budget)
 		tp4f_set_frame_no_offset(&frame_set, size,
 					 i40e_is_rx_desc_eof(rxd));
 
+		if (xdp_prog) {
+			bool recycled;
+
+			xdpr = rxr->vsi->xdp_rings[rxr->queue_index];
+			i40e_run_xdp_tp4(&frame_set, &recycled,
+					 xdp_prog, xdpr);
+
+			if (!recycled)
+				nflush++;
+		} else {
+			nflush++;
+		}
+
 		total_rx_bytes += size;
 		total_rx_packets++;
 
@@ -2317,8 +2337,9 @@ int i40e_clean_rx_tp4_irq(struct i40e_ring *rxr, int budget)
 
 		WARN_ON(!tp4f_next_frame(&frame_set));
 	}
+	rcu_read_unlock();
 
-	WARN_ON(tp4a_flush_n(rxr->tp4.arr, total_rx_packets));
+	WARN_ON(tp4a_flush_n(rxr->tp4.arr, nflush));
 
 	rxr->tp4.ev_handler(rxr->tp4.ev_opaque);
 
@@ -3800,3 +3821,48 @@ int i40e_clean_tx_tp4_irq(struct i40e_ring *txr, int budget)
 
 	return clean_done && xmit_done;
 }
+
+/**
+ * i40e_tp4_xdp_tx_handler - XDP xmit
+ * @ctx: context
+ * @xdp: XDP buff
+ *
+ * Returns >=0 on success, <0 on failure.
+ **/
+static int i40e_tp4_xdp_tx_handler(void *ctx, struct xdp_buff *xdp)
+{
+	struct i40e_ring *xdpr = ctx;
+
+	return i40e_xmit_xdp_ring(xdp, xdpr);
+}
+
+/**
+ * i40e_tp4_xdp_tx_flush_handler - XDP flush
+ * @ctx: context
+ **/
+static void i40e_tp4_xdp_tx_flush_handler(void *ctx)
+{
+	struct i40e_ring *xdpr = ctx;
+
+	/* Force memory writes to complete before letting h/w
+	 * know there are new descriptors to fetch.
+	 */
+	wmb();
+
+	writel(xdpr->next_to_use, xdpr->tail);
+}
+
+/**
+ * i40e_run_xdp_tp4 - Runs an XDP program on the flushable range of packets
+ * @f: pointer to frame set
+ * @recycled: true if element was removed from flushable range
+ * @xdp_prog: XDP program
+ * @xdpr: XDP Tx ring
+ **/
+static void i40e_run_xdp_tp4(struct tp4_frame_set *f, bool *recycled,
+			     struct bpf_prog *xdp_prog, struct i40e_ring *xdpr)
+{
+	tp4a_run_xdp(f, recycled, xdp_prog,
+		     i40e_tp4_xdp_tx_handler, xdpr,
+		     i40e_tp4_xdp_tx_flush_handler, xdpr);
+}
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 3dfb5fb89460..eea1eab00624 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -400,13 +400,15 @@ static int veth_tp4_enable(struct net_device *netdev,
 	netif_napi_add(netdev, priv->napi, veth_napi_poll,
 		       NAPI_POLL_WEIGHT);
 
-	priv->tp4a_rx = tp4a_rx_new(params->rx_opaque, NAPI_POLL_WEIGHT, NULL);
+	priv->tp4a_rx = tp4a_rx_new(params->rx_opaque, NAPI_POLL_WEIGHT, NULL,
+				    NULL);
 	if (!priv->tp4a_rx) {
 		err = -ENOMEM;
 		goto rxa_err;
 	}
 
-	priv->tp4a_tx = tp4a_tx_new(params->tx_opaque, NAPI_POLL_WEIGHT, NULL);
+	priv->tp4a_tx = tp4a_tx_new(params->tx_opaque, NAPI_POLL_WEIGHT, NULL,
+				    NULL);
 	if (!priv->tp4a_tx) {
 		err = -ENOMEM;
 		goto txa_err;
diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index 360d80086104..cade34e48a2d 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -15,6 +15,8 @@
 #ifndef _LINUX_TPACKET4_H
 #define _LINUX_TPACKET4_H
 
+#include <linux/bpf_trace.h>
+
 #define TP4_UMEM_MIN_FRAME_SIZE 2048
 #define TP4_KERNEL_HEADROOM 256 /* Headrom for XDP */
 
@@ -73,6 +75,7 @@ struct tp4_queue {
  **/
 struct tp4_packet_array {
 	struct tp4_queue *tp4q;
+	struct net_device *netdev;
 	struct device *dev;
 	enum dma_data_direction direction;
 	enum tp4_validation validation;
@@ -890,6 +893,7 @@ static inline void tp4f_packet_completed(struct tp4_frame_set *p)
 
 static inline struct tp4_packet_array *__tp4a_new(
 	struct tp4_queue *tp4q,
+	struct net_device *netdev,
 	struct device *dev,
 	enum dma_data_direction direction,
 	enum tp4_validation validation,
@@ -913,6 +917,7 @@ static inline struct tp4_packet_array *__tp4a_new(
 	}
 
 	arr->tp4q = tp4q;
+	arr->netdev = netdev;
 	arr->dev = dev;
 	arr->direction = direction;
 	arr->validation = validation;
@@ -930,11 +935,12 @@ static inline struct tp4_packet_array *__tp4a_new(
  **/
 static inline struct tp4_packet_array *tp4a_rx_new(void *rx_opaque,
 						   size_t elems,
+						   struct net_device *netdev,
 						   struct device *dev)
 {
 	enum dma_data_direction direction = dev ? DMA_FROM_DEVICE : DMA_NONE;
 
-	return __tp4a_new(rx_opaque, dev, direction, TP4_VALIDATION_IDX,
+	return __tp4a_new(rx_opaque, netdev, dev, direction, TP4_VALIDATION_IDX,
 			  elems);
 }
 
@@ -948,12 +954,13 @@ static inline struct tp4_packet_array *tp4a_rx_new(void *rx_opaque,
  **/
 static inline struct tp4_packet_array *tp4a_tx_new(void *tx_opaque,
 						   size_t elems,
+						   struct net_device *netdev,
 						   struct device *dev)
 {
 	enum dma_data_direction direction = dev ? DMA_TO_DEVICE : DMA_NONE;
 
-	return __tp4a_new(tx_opaque, dev, direction, TP4_VALIDATION_DESC,
-			  elems);
+	return __tp4a_new(tx_opaque, netdev, dev, direction,
+			  TP4_VALIDATION_DESC, elems);
 }
 
 /**
@@ -1330,4 +1337,151 @@ static inline void tp4a_return_packet(struct tp4_packet_array *a,
 	a->curr = p->start;
 }
 
+static inline struct tpacket4_desc __tp4a_swap_out(struct tp4_packet_array *a,
+						   u32 idx)
+{
+	struct tpacket4_desc tmp, *d;
+
+	/* NB! idx is already masked, so 0 <= idx < size holds! */
+	d = &a->items[a->start & a->mask];
+	tmp = *d;
+	*d = a->items[idx];
+	a->items[idx] = tmp;
+	a->start++;
+
+	return tmp;
+}
+
+static inline void  __tp4a_recycle(struct tp4_packet_array *a,
+				   struct tpacket4_desc *d)
+{
+	/* NB! No bound checking, assume paired with __tp4a_swap_out
+	 * to guarantee space.
+	 */
+	d->offset = tp4q_get_data_headroom(a->tp4q);
+	a->items[a->end++ & a->mask] = *d;
+}
+
+static inline void __tp4a_fill_xdp_buff(struct tp4_packet_array *a,
+					struct xdp_buff *xdp,
+					struct tpacket4_desc *d)
+{
+	xdp->data = tp4q_get_data(a->tp4q, d);
+	xdp->data_end = xdp->data + d->len;
+	xdp->data_meta = xdp->data;
+	xdp->data_hard_start = xdp->data - TP4_KERNEL_HEADROOM;
+}
+
+#define TP4_XDP_PASS 0
+#define TP4_XDP_CONSUMED 1
+#define TP4_XDP_TX 2
+
+/**
+ * tp4a_run_xdp - Execute an XDP program on the flushable range
+ * @f: pointer to frame set
+ * @recycled: the element was removed from flushable range
+ * @xdp_prog: XDP program
+ * @xdp_tx_handler: XDP xmit handler
+ * @xdp_tx_ctx: XDP xmit handler ctx
+ * @xdp_tx_flush_handler: XDP xmit flush handler
+ * @xdp_tx_flush_ctx: XDP xmit flush ctx
+ **/
+static inline void tp4a_run_xdp(struct tp4_frame_set *f,
+				bool *recycled,
+				struct bpf_prog *xdp_prog,
+				int (*xdp_tx_handler)(void *ctx,
+						      struct xdp_buff *xdp),
+				void *xdp_tx_ctx,
+				void (*xdp_tx_flush_handler)(void *ctx),
+				void *xdp_tx_flush_ctx)
+{
+	struct tp4_packet_array *a = f->pkt_arr;
+	struct tpacket4_desc *d, tmp;
+	bool xdp_xmit = false;
+	struct xdp_buff xdp;
+	ptrdiff_t diff, len;
+	struct page *page;
+	u32 act, idx;
+	void *data;
+	int err;
+
+	*recycled = false;
+
+	idx = f->curr & a->mask;
+	d = &a->items[idx];
+	__tp4a_fill_xdp_buff(a, &xdp, d);
+	data = xdp.data;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	switch (act) {
+	case XDP_PASS:
+		if (data != xdp.data) {
+			diff = data - xdp.data;
+			d->offset += diff;
+		}
+		break;
+	case XDP_TX:
+	case XDP_REDIRECT:
+		*recycled = true;
+		tmp = __tp4a_swap_out(a, idx);
+		__tp4a_recycle(a, &tmp);
+
+		/* Ick! ndo_xdp_xmit is missing a destructor,
+		 * meaning that we cannot do proper completion
+		 * to userland, so we need to resort to
+		 * copying. Also, we need to rethink XDP Tx to
+		 * unify it with the existing patch, so we'll
+		 * do a copy here as well. So much for
+		 * "fast-path"...
+		 */
+		page = dev_alloc_pages(0);
+		if (!page)
+			break;
+
+		len = xdp.data_end - xdp.data;
+		if (len > PAGE_SIZE) {
+			put_page(page);
+			break;
+		}
+		data = page_address(page);
+		memcpy(data, xdp.data, len);
+
+		xdp.data = data;
+		xdp.data_end = data + len;
+		xdp_set_data_meta_invalid(&xdp);
+		xdp.data_hard_start = xdp.data;
+		if (act == XDP_TX) {
+			err = xdp_tx_handler(xdp_tx_ctx, &xdp);
+			/* XXX Clean this return value ugliness up... */
+			if (err != TP4_XDP_TX) {
+				put_page(page);
+				break;
+			}
+		} else {
+			err = xdp_do_redirect(a->netdev, &xdp, xdp_prog);
+			if (err) {
+				put_page(page);
+				break;
+			}
+		}
+		xdp_xmit = true;
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		/* fallthrough */
+	case XDP_ABORTED:
+		trace_xdp_exception(a->netdev, xdp_prog, act);
+		/* fallthrough -- handle aborts by dropping packet */
+	case XDP_DROP:
+		*recycled = true;
+		tmp = __tp4a_swap_out(a, idx);
+		__tp4a_recycle(a, &tmp);
+	}
+
+	if (xdp_xmit) {
+		xdp_tx_flush_handler(xdp_tx_ctx);
+		xdp_do_flush_map();
+	}
+}
+
 #endif /* _LINUX_TPACKET4_H */
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index fbfada773463..105cdac13343 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -5038,8 +5038,8 @@ packet_v4_ring_new(struct sock *sk, struct tpacket_req4 *req, int tx_ring)
 		  (struct tpacket4_desc *)rb->pg_vec->buffer);
 	spin_unlock_bh(&rb_queue->lock);
 
-	rb->tp4a = tx_ring ? tp4a_tx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL)
-		   : tp4a_rx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL);
+	rb->tp4a = tx_ring ? tp4a_tx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL, NULL)
+		   : tp4a_rx_new(&rb->tp4q, TP4_ARRAY_SIZE, NULL, NULL);
 
 	if (!rb->tp4a) {
 		err = -ENOMEM;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH 14/14] xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (12 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 13/14] i40e: added XDP support for TP4 enabled queue pairs Björn Töpel
@ 2017-10-31 12:41 ` Björn Töpel
  2017-11-03  4:34 ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support Willem de Bruijn
  2017-11-13 13:07 ` Björn Töpel
  15 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-10-31 12:41 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev
  Cc: jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

From: Magnus Karlsson <magnus.karlsson@intel.com>

This patch introduces XDP_PASS_TO_KERNEL especially for use with
PACKET_ZEROCOPY (ZC) and AF_PACKET V4. When ZC is enabled, XDP_PASS
will send a packet to the V4 socket so that the application can
receive it. If the XDP program would like to send a packet
towards the kernel stack, then XDP_PASS_TO_KERNEL can be used. It will
copy the packet from the packet buffer into an skb and pass it on. When
PACKET_ZEROCOPY is not enabled, XDP_PASS_TO_KERNEL defaults to XDP_PASS.

Note that in ZC mode, user space will be able to see the packet that
XDP is running on, so this is only for trusted applications. For
untrusted applications, NIC HW steering support is a requirement to
make sure the untrusted applications can only see their own packets.
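
As an illustration, and not part of the patch: an XDP program attached
to a zero-copy queue could let the kernel stack handle ARP while the V4
socket keeps everything else. Program and section names below are made
up; it is restricted C built with clang -target bpf:

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <asm/byteorder.h>

__attribute__((section("xdp"), used))
int xdp_split_traffic(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;

	if ((void *)(eth + 1) > data_end)
		return XDP_DROP;

	/* ARP goes to the kernel stack, bulk traffic stays on the
	 * zero-copy V4 socket.
	 */
	if (eth->h_proto == __constant_htons(ETH_P_ARP))
		return XDP_PASS_TO_KERNEL;

	return XDP_PASS;
}

char _license[] __attribute__((section("license"), used)) = "GPL";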

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 62 +++++++++++++++++++++++++++--
 include/linux/tpacket4.h                    | 17 +++++++-
 include/uapi/linux/bpf.h                    |  1 +
 3 files changed, 75 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 730fe57ca8ee..bf2680ed2b05 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2050,6 +2050,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
+	case XDP_PASS_TO_KERNEL:
 		break;
 	case XDP_TX:
 		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
@@ -2278,7 +2279,8 @@ static inline unsigned int i40e_get_rx_desc_size(union i40e_rx_desc *rxd)
 }
 
 static void i40e_run_xdp_tp4(struct tp4_frame_set *f, bool *recycled,
-			     struct bpf_prog *xdp_prog, struct i40e_ring *xdpr);
+			     struct bpf_prog *xdp_prog, struct i40e_ring *xdpr,
+			     struct i40e_ring *rxr);
 
 /**
  * i40e_clean_rx_tp4_irq - Pulls received packets of the descriptor ring
@@ -2322,7 +2324,7 @@ int i40e_clean_rx_tp4_irq(struct i40e_ring *rxr, int budget)
 
 			xdpr = rxr->vsi->xdp_rings[rxr->queue_index];
 			i40e_run_xdp_tp4(&frame_set, &recycled,
-					 xdp_prog, xdpr);
+					 xdp_prog, xdpr, rxr);
 
 			if (!recycled)
 				nflush++;
@@ -3853,16 +3855,68 @@ static void i40e_tp4_xdp_tx_flush_handler(void *ctx)
 }
 
 /**
+ * i40e_tp4_xdp_to_kernel_handler - XDP pass to kernel callback
+ * @ctx: context. A pointer to the RX ring.
+ * @xdp: XDP buff
+ *
+ * Returns 0 for success and <0 on failure.
+ **/
+static int i40e_tp4_xdp_to_kernel_handler(void *ctx, struct xdp_buff *xdp)
+{
+	struct i40e_ring *rx_ring = ctx;
+	union i40e_rx_desc *rx_desc;
+	struct sk_buff *skb;
+	unsigned int len;
+	u16 vlan_tag;
+	u8 rx_ptype;
+	u64 qword;
+	int err;
+
+	len = xdp->data_end - xdp->data;
+	skb = __napi_alloc_skb(&rx_ring->q_vector->napi, len,
+			       GFP_ATOMIC | __GFP_NOWARN);
+	if (unlikely(!skb))
+		return -ENOMEM;
+
+	/* XXX Use fragments for the data here */
+	skb_put(skb, len);
+	err = skb_store_bits(skb, 0, xdp->data, len);
+	if (unlikely(err)) {
+		kfree_skb(skb);
+		return err;
+	}
+
+	rx_desc = I40E_RX_DESC(rx_ring, rx_ring->next_to_clean);
+	qword = le64_to_cpu(rx_desc->wb.qword1.status_error_len);
+	rx_ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >>
+		I40E_RXD_QW1_PTYPE_SHIFT;
+
+	/* populate checksum, VLAN, and protocol */
+	i40e_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+
+	vlan_tag = (qword & BIT(I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) ?
+		le16_to_cpu(rx_desc->wb.qword0.lo_dword.l2tag1) : 0;
+
+	i40e_trace(clean_rx_irq_rx, rx_ring, rx_desc, skb);
+	i40e_receive_skb(rx_ring, skb, vlan_tag);
+
+	return 0;
+}
+
+/**
  * i40e_run_xdp_tp4 - Runs an XDP program on the flushable range of packets
  * @f: pointer to frame set
  * @recycled: true if element was removed from flushable range
  * @xdp_prog: XDP program
  * @xdpr: XDP Tx ring
+ * @rxr: pointer to RX ring
  **/
 static void i40e_run_xdp_tp4(struct tp4_frame_set *f, bool *recycled,
-			     struct bpf_prog *xdp_prog, struct i40e_ring *xdpr)
+			     struct bpf_prog *xdp_prog, struct i40e_ring *xdpr,
+			     struct i40e_ring *rxr)
 {
 	tp4a_run_xdp(f, recycled, xdp_prog,
 		     i40e_tp4_xdp_tx_handler, xdpr,
-		     i40e_tp4_xdp_tx_flush_handler, xdpr);
+		     i40e_tp4_xdp_tx_flush_handler, xdpr,
+		     i40e_tp4_xdp_to_kernel_handler, rxr);
 }
diff --git a/include/linux/tpacket4.h b/include/linux/tpacket4.h
index cade34e48a2d..9cb879ea558e 100644
--- a/include/linux/tpacket4.h
+++ b/include/linux/tpacket4.h
@@ -1385,6 +1385,8 @@ static inline void __tp4a_fill_xdp_buff(struct tp4_packet_array *a,
  * @xdp_tx_ctx: XDP xmit handler ctx
  * @xdp_tx_flush_handler: XDP xmit flush handler
  * @xdp_tx_flush_ctx: XDP xmit flush ctx
+ * @xdp_to_kernel_handler: XDP pass to kernel handler
+ * @xdp_to_kernel_ctx: XDP pass to kernel ctx
  **/
 static inline void tp4a_run_xdp(struct tp4_frame_set *f,
 				bool *recycled,
@@ -1393,7 +1395,10 @@ static inline void tp4a_run_xdp(struct tp4_frame_set *f,
 						      struct xdp_buff *xdp),
 				void *xdp_tx_ctx,
 				void (*xdp_tx_flush_handler)(void *ctx),
-				void *xdp_tx_flush_ctx)
+				void *xdp_tx_flush_ctx,
+				int (*xdp_to_kernel_handler)(void *ctx,
+							 struct xdp_buff *xdp),
+				void *xdp_to_kernel_ctx)
 {
 	struct tp4_packet_array *a = f->pkt_arr;
 	struct tpacket4_desc *d, tmp;
@@ -1415,10 +1420,20 @@ static inline void tp4a_run_xdp(struct tp4_frame_set *f,
 	act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
+	case XDP_PASS_TO_KERNEL:
 		if (data != xdp.data) {
 			diff = data - xdp.data;
 			d->offset += diff;
 		}
+
+		if (act == XDP_PASS_TO_KERNEL) {
+			*recycled = true;
+			tmp = __tp4a_swap_out(a, idx);
+			__tp4a_recycle(a, &tmp);
+
+			err = xdp_to_kernel_handler(xdp_to_kernel_ctx, &xdp);
+		}
+
 		break;
 	case XDP_TX:
 	case XDP_REDIRECT:
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0b7b54d898bd..32d19f5727e2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -875,6 +875,7 @@ enum xdp_action {
 	XDP_PASS,
 	XDP_TX,
 	XDP_REDIRECT,
+	XDP_PASS_TO_KERNEL,
 };
 
 /* user accessible metadata for XDP packet hook
-- 
2.11.0
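
As a usage illustration for the new action above: a minimal XDP program
built against the modified uapi header could look like the sketch below.
This is made up for illustration only; the SEC() wrapper and the usual
clang/libbpf build flow are assumed and are not part of this patch.

/* Sketch: hand every frame on a zero-copy queue to the regular kernel
 * stack as a copy; the driver then recycles the V4 descriptor via
 * i40e_tp4_xdp_to_kernel_handler() above. */
#include <linux/bpf.h>

#define SEC(name) __attribute__((section(name), used))

SEC("xdp")
int xdp_copy_to_stack(struct xdp_md *ctx)
{
	return XDP_PASS_TO_KERNEL;
}

char _license[] SEC("license") = "GPL";

In practice a program would make this choice per packet, e.g. returning
XDP_PASS_TO_KERNEL only for control traffic and XDP_PASS for frames that
should stay on the V4 user-space path.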

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-10-31 12:41 ` [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API Björn Töpel
@ 2017-11-02  1:45   ` Willem de Bruijn
  2017-11-02 10:06     ` Björn Töpel
  2017-11-15 22:34   ` chet l
  1 sibling, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-02  1:45 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This patch adds the necessary AF_PACKET V4 structures for usage from
> userspace. AF_PACKET V4 is a new interface optimized for high
> performance packet processing.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  include/uapi/linux/if_packet.h | 65 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 64 insertions(+), 1 deletion(-)
>
> +struct tpacket4_queue {
> +       struct tpacket4_desc *ring;
> +
> +       unsigned int avail_idx;
> +       unsigned int last_used_idx;
> +       unsigned int num_free;
> +       unsigned int ring_mask;
> +};
>
>  struct packet_mreq {
> @@ -294,6 +335,28 @@ struct packet_mreq {
>         unsigned char   mr_address[8];
>  };
>
> +/*
> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
> + * to register user memory which should be used to store the packet
> + * data.
> + *
> + * There are some constraints for the memory being registered:
> + * - The memory area has to be memory page size aligned.
> + * - The frame size has to be a power of 2.
> + * - The frame size cannot be smaller than 2048B.
> + * - The frame size cannot be larger than the memory page size.
> + *
> + * Corollary: The number of frames that can be stored is
> + * len / frame_size.
> + *
> + */
> +struct tpacket_memreg_req {
> +       unsigned long   addr;           /* Start of packet data area */
> +       unsigned long   len;            /* Length of packet data area */
> +       unsigned int    frame_size;     /* Frame size */
> +       unsigned int    data_headroom;  /* Frame head room */
> +};

Existing packet sockets take a tpacket_req, allocate memory and let the
user process mmap this. I understand that TPACKET_V4 distinguishes
the descriptor from packet pools, but could both use the existing structs
and logic (packet_mmap)? That would avoid introducing a lot of new code
just for granting user pages to the kernel.

Also, use of unsigned long can cause problems on 32/64 bit compat
environments. Prefer fixed width types in uapi. Same for pointer in
tpacket4_queue.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-02  1:45   ` Willem de Bruijn
@ 2017-11-02 10:06     ` Björn Töpel
  2017-11-02 16:40       ` Tushar Dave
  2017-11-03  2:29       ` Willem de Bruijn
  0 siblings, 2 replies; 49+ messages in thread
From: Björn Töpel @ 2017-11-02 10:06 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

On 2017-11-02 02:45, Willem de Bruijn wrote:
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This patch adds the necessary AF_PACKET V4 structures for usage from
>> userspace. AF_PACKET V4 is a new interface optimized for high
>> performance packet processing.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>>   include/uapi/linux/if_packet.h | 65 +++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 64 insertions(+), 1 deletion(-)
>>
>> +struct tpacket4_queue {
>> +       struct tpacket4_desc *ring;
>> +
>> +       unsigned int avail_idx;
>> +       unsigned int last_used_idx;
>> +       unsigned int num_free;
>> +       unsigned int ring_mask;
>> +};
>>
>>   struct packet_mreq {
>> @@ -294,6 +335,28 @@ struct packet_mreq {
>>          unsigned char   mr_address[8];
>>   };
>>
>> +/*
>> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
>> + * to register user memory which should be used to store the packet
>> + * data.
>> + *
>> + * There are some constraints for the memory being registered:
>> + * - The memory area has to be memory page size aligned.
>> + * - The frame size has to be a power of 2.
>> + * - The frame size cannot be smaller than 2048B.
>> + * - The frame size cannot be larger than the memory page size.
>> + *
>> + * Corollary: The number of frames that can be stored is
>> + * len / frame_size.
>> + *
>> + */
>> +struct tpacket_memreg_req {
>> +       unsigned long   addr;           /* Start of packet data area */
>> +       unsigned long   len;            /* Length of packet data area */
>> +       unsigned int    frame_size;     /* Frame size */
>> +       unsigned int    data_headroom;  /* Frame head room */
>> +};
>
> Existing packet sockets take a tpacket_req, allocate memory and let the
> user process mmap this. I understand that TPACKET_V4 distinguishes
> the descriptor from packet pools, but could both use the existing structs
> and logic (packet_mmap)? That would avoid introducing a lot of new code
> just for granting user pages to the kernel.
>

We could certainly pass the "tpacket_memreg_req" fields as part of
descriptor ring setup ("tpacket_req4"), but we went with having the
memory registration as a new, separate setsockopt. Having it separate
makes it easier to compare regions on the kernel side of things: "Is
this the same umem as another one?" If we go the path of passing the
range at descriptor ring setup, we need to handle all kinds of
overlapping ranges to determine when a copy is needed or not, in those
cases where the packet buffer (i.e. umem) is shared between processes.
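
To make the intended flow concrete, a rough userspace sketch of the
separate registration step could look like this (PACKET_MEMREG and
struct tpacket_memreg_req only exist with this RFC applied; error
handling is omitted and the sizes are arbitrary):

/* Sketch: allocate a page-aligned packet buffer area and register it
 * with the proposed PACKET_MEMREG setsockopt. */
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

#define NUM_FRAMES	4096
#define FRAME_SIZE	2048	/* power of 2, >= 2048, <= page size */

int setup_umem(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	void *bufs = mmap(NULL, NUM_FRAMES * FRAME_SIZE,
			  PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); /* page aligned */
	struct tpacket_memreg_req req = {
		.addr		= (unsigned long)bufs,
		.len		= NUM_FRAMES * FRAME_SIZE,
		.frame_size	= FRAME_SIZE,
		.data_headroom	= 0,
	};

	if (setsockopt(fd, SOL_PACKET, PACKET_MEMREG, &req, sizeof(req)))
		return -1;
	return fd;
}

The descriptor rings are still set up separately; the point above is
that the kernel can compare registered regions using this struct alone.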

> Also, use of unsigned long can cause problems on 32/64 bit compat
> environments. Prefer fixed width types in uapi. Same for pointer in
> tpacket4_queue.

I agree; we'll change to fixed-width types in the next version. Do you
(and others on the list) prefer __u32/__u64 or unsigned int / unsigned
long long?
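
For concreteness, the __u32/__u64 variant would be the sketch below
(just one of the two options being asked about, not a settled layout):

struct tpacket_memreg_req {
	__u64	addr;		/* Start of packet data area */
	__u64	len;		/* Length of packet data area */
	__u32	frame_size;	/* Frame size */
	__u32	data_headroom;	/* Frame head room */
};

The ring pointer in tpacket4_queue would need similar treatment, e.g. a
__u64 carrying the user address, or an offset into the mapped area.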


Thanks,
Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-02 10:06     ` Björn Töpel
@ 2017-11-02 16:40       ` Tushar Dave
  2017-11-02 16:47         ` Björn Töpel
  2017-11-03  2:29       ` Willem de Bruijn
  1 sibling, 1 reply; 49+ messages in thread
From: Tushar Dave @ 2017-11-02 16:40 UTC (permalink / raw)
  To: Björn Töpel, Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang



On 11/02/2017 03:06 AM, Björn Töpel wrote:
> On 2017-11-02 02:45, Willem de Bruijn wrote:
>> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> This patch adds the necessary AF_PACKET V4 structures for usage from
>>> userspace. AF_PACKET V4 is a new interface optimized for high
>>> performance packet processing.
>>>
>>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>>> ---
>>>    include/uapi/linux/if_packet.h | 65 +++++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 64 insertions(+), 1 deletion(-)
>>>
>>> +struct tpacket4_queue {
>>> +       struct tpacket4_desc *ring;
>>> +
>>> +       unsigned int avail_idx;
>>> +       unsigned int last_used_idx;
>>> +       unsigned int num_free;
>>> +       unsigned int ring_mask;
>>> +};
>>>
>>>    struct packet_mreq {
>>> @@ -294,6 +335,28 @@ struct packet_mreq {
>>>           unsigned char   mr_address[8];
>>>    };
>>>
>>> +/*
>>> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
>>> + * to register user memory which should be used to store the packet
>>> + * data.
>>> + *
>>> + * There are some constraints for the memory being registered:
>>> + * - The memory area has to be memory page size aligned.
>>> + * - The frame size has to be a power of 2.
>>> + * - The frame size cannot be smaller than 2048B.
>>> + * - The frame size cannot be larger than the memory page size.
>>> + *
>>> + * Corollary: The number of frames that can be stored is
>>> + * len / frame_size.
>>> + *
>>> + */
>>> +struct tpacket_memreg_req {
>>> +       unsigned long   addr;           /* Start of packet data area */
>>> +       unsigned long   len;            /* Length of packet data area */
>>> +       unsigned int    frame_size;     /* Frame size */
>>> +       unsigned int    data_headroom;  /* Frame head room */
>>> +};
>>
>> Existing packet sockets take a tpacket_req, allocate memory and let the
>> user process mmap this. I understand that TPACKET_V4 distinguishes
>> the descriptor from packet pools, but could both use the existing structs
>> and logic (packet_mmap)? That would avoid introducing a lot of new code
>> just for granting user pages to the kernel.
>>
> 
> We could certainly pass the "tpacket_memreg_req" fields as part of
> descriptor ring setup ("tpacket_req4"), but we went with having the
> memory register as a new separate setsockopt. Having it separated,
> makes it easier to compare regions at the kernel side of things. "Is
> this the same umem as another one?" If we go the path of passing the
> range at descriptor ring setup, we need to handle all kind of
> overlapping ranges to determine when a copy is needed or not, in those
> cases where the packet buffer (i.e. umem) is shared between processes.

Is there a reason to use a separate packet socket for umem? It looks
like userspace has to create a separate packet socket for PACKET_MEMREG.


-Tushar>
>> Also, use of unsigned long can cause problems on 32/64 bit compat
>> environments. Prefer fixed width types in uapi. Same for pointer in
>> tpacket4_queue.
> 
> I agree; We'll change to a fixed width type in next version. Do you
> (and others on the list) prefer __u32/__u64 or unsigned int / unsigned
> long long?
> 
> 
> Thanks,
> Björn
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-02 16:40       ` Tushar Dave
@ 2017-11-02 16:47         ` Björn Töpel
  0 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-11-02 16:47 UTC (permalink / raw)
  To: Tushar Dave
  Cc: Willem de Bruijn, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, michael.lundkvist, ravineet.singh,
	Daniel Borkmann, Network Development, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

2017-11-02 17:40 GMT+01:00 Tushar Dave <tushar.n.dave@oracle.com>:
>
>
> On 11/02/2017 03:06 AM, Björn Töpel wrote:
>>
>> On 2017-11-02 02:45, Willem de Bruijn wrote:
>>>
>>> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com>
>>> wrote:
>>>>
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>>> This patch adds the necessary AF_PACKET V4 structures for usage from
>>>> userspace. AF_PACKET V4 is a new interface optimized for high
>>>> performance packet processing.
>>>>
>>>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>>>> ---
>>>>    include/uapi/linux/if_packet.h | 65
>>>> +++++++++++++++++++++++++++++++++++++++++-
>>>>    1 file changed, 64 insertions(+), 1 deletion(-)
>>>>
>>>> +struct tpacket4_queue {
>>>> +       struct tpacket4_desc *ring;
>>>> +
>>>> +       unsigned int avail_idx;
>>>> +       unsigned int last_used_idx;
>>>> +       unsigned int num_free;
>>>> +       unsigned int ring_mask;
>>>> +};
>>>>
>>>>    struct packet_mreq {
>>>> @@ -294,6 +335,28 @@ struct packet_mreq {
>>>>           unsigned char   mr_address[8];
>>>>    };
>>>>
>>>> +/*
>>>> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
>>>> + * to register user memory which should be used to store the packet
>>>> + * data.
>>>> + *
>>>> + * There are some constraints for the memory being registered:
>>>> + * - The memory area has to be memory page size aligned.
>>>> + * - The frame size has to be a power of 2.
>>>> + * - The frame size cannot be smaller than 2048B.
>>>> + * - The frame size cannot be larger than the memory page size.
>>>> + *
>>>> + * Corollary: The number of frames that can be stored is
>>>> + * len / frame_size.
>>>> + *
>>>> + */
>>>> +struct tpacket_memreg_req {
>>>> +       unsigned long   addr;           /* Start of packet data area */
>>>> +       unsigned long   len;            /* Length of packet data area */
>>>> +       unsigned int    frame_size;     /* Frame size */
>>>> +       unsigned int    data_headroom;  /* Frame head room */
>>>> +};
>>>
>>>
>>> Existing packet sockets take a tpacket_req, allocate memory and let the
>>> user process mmap this. I understand that TPACKET_V4 distinguishes
>>> the descriptor from packet pools, but could both use the existing structs
>>> and logic (packet_mmap)? That would avoid introducing a lot of new code
>>> just for granting user pages to the kernel.
>>>
>>
>> We could certainly pass the "tpacket_memreg_req" fields as part of
>> descriptor ring setup ("tpacket_req4"), but we went with having the
>> memory register as a new separate setsockopt. Having it separated,
>> makes it easier to compare regions at the kernel side of things. "Is
>> this the same umem as another one?" If we go the path of passing the
>> range at descriptor ring setup, we need to handle all kind of
>> overlapping ranges to determine when a copy is needed or not, in those
>> cases where the packet buffer (i.e. umem) is shared between processes.
>
>
> Is there a reason to use separate packet socket for umem? Looks like
> userspace has to create separate packet socket for PACKET_MEMREG.
>

Let me clarify; You *can* use a separate socket for umem, but
you can also use the same/existing AF_PACKET socket for that.


Björn

>
> -Tushar>
>
>>> Also, use of unsigned long can cause problems on 32/64 bit compat
>>> environments. Prefer fixed width types in uapi. Same for pointer in
>>> tpacket4_queue.
>>
>>
>> I agree; We'll change to a fixed width type in next version. Do you
>> (and others on the list) prefer __u32/__u64 or unsigned int / unsigned
>> long long?
>>
>>
>> Thanks,
>> Björn
>>
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-02 10:06     ` Björn Töpel
  2017-11-02 16:40       ` Tushar Dave
@ 2017-11-03  2:29       ` Willem de Bruijn
  2017-11-03  9:54         ` Björn Töpel
  1 sibling, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-03  2:29 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

>>> +/*
>>> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
>>> + * to register user memory which should be used to store the packet
>>> + * data.
>>> + *
>>> + * There are some constraints for the memory being registered:
>>> + * - The memory area has to be memory page size aligned.
>>> + * - The frame size has to be a power of 2.
>>> + * - The frame size cannot be smaller than 2048B.
>>> + * - The frame size cannot be larger than the memory page size.
>>> + *
>>> + * Corollary: The number of frames that can be stored is
>>> + * len / frame_size.
>>> + *
>>> + */
>>> +struct tpacket_memreg_req {
>>> +       unsigned long   addr;           /* Start of packet data area */
>>> +       unsigned long   len;            /* Length of packet data area */
>>> +       unsigned int    frame_size;     /* Frame size */
>>> +       unsigned int    data_headroom;  /* Frame head room */
>>> +};
>>
>> Existing packet sockets take a tpacket_req, allocate memory and let the
>> user process mmap this. I understand that TPACKET_V4 distinguishes
>> the descriptor from packet pools, but could both use the existing structs
>> and logic (packet_mmap)? That would avoid introducing a lot of new code
>> just for granting user pages to the kernel.
>>
>
> We could certainly pass the "tpacket_memreg_req" fields as part of
> descriptor ring setup ("tpacket_req4"), but we went with having the
> memory register as a new separate setsockopt. Having it separated,
> makes it easier to compare regions at the kernel side of things. "Is
> this the same umem as another one?" If we go the path of passing the
> range at descriptor ring setup, we need to handle all kind of
> overlapping ranges to determine when a copy is needed or not, in those
> cases where the packet buffer (i.e. umem) is shared between processes.

That's not what I meant. Both descriptor rings and packet pools are
memory regions. Packet sockets already have logic to allocate regions
and make them available to userspace with mmap(). Packet v4 reuses
that logic for its descriptor rings. Can it use the same for its packet
pool? Why does the kernel map user memory, instead? That is a lot of
non-trivial new logic.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt
  2017-10-31 12:41 ` [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt Björn Töpel
@ 2017-11-03  3:00   ` Willem de Bruijn
  2017-11-03  9:57     ` Björn Töpel
  0 siblings, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-03  3:00 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> Here, the PACKET_MEMREG setsockopt is implemented for the AF_PACKET
> protocol family. PACKET_MEMREG allows the user to register memory
> regions that can be used by AF_PACKET V4 as packet data buffers.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
> +/*************** V4 QUEUE OPERATIONS *******************************/
> +
> +/**
> + * tp4q_umem_new - Creates a new umem (packet buffer)
> + *
> + * @addr: The address to the umem
> + * @size: The size of the umem
> + * @frame_size: The size of each frame, between 2K and PAGE_SIZE
> + * @data_headroom: The desired data headroom before start of the packet
> + *
> + * Returns a pointer to the new umem or NULL for failure
> + **/
> +static inline struct tp4_umem *tp4q_umem_new(unsigned long addr, size_t size,
> +                                            unsigned int frame_size,
> +                                            unsigned int data_headroom)
> +{
> +       struct tp4_umem *umem;
> +       unsigned int nframes;
> +
> +       if (frame_size < TP4_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
> +               /* Strictly speaking we could support this, if:
> +                * - huge pages, or*
> +                * - using an IOMMU, or
> +                * - making sure the memory area is consecutive
> +                * but for now, we simply say "computer says no".
> +                */
> +               return ERR_PTR(-EINVAL);
> +       }
> +
> +       if (!is_power_of_2(frame_size))
> +               return ERR_PTR(-EINVAL);
> +
> +       if (!PAGE_ALIGNED(addr)) {
> +               /* Memory area has to be page size aligned. For
> +                * simplicity, this might change.
> +                */
> +               return ERR_PTR(-EINVAL);
> +       }
> +
> +       if ((addr + size) < addr)
> +               return ERR_PTR(-EINVAL);
> +
> +       nframes = size / frame_size;
> +       if (nframes == 0)
> +               return ERR_PTR(-EINVAL);
> +
> +       data_headroom = ALIGN(data_headroom, 64);
> +
> +       if (frame_size - data_headroom - TP4_KERNEL_HEADROOM < 0)
> +               return ERR_PTR(-EINVAL);

signed comparison on unsigned int
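
For example, something along these lines would avoid the unsigned
wrap-around (a sketch of the obvious rewrite, not taken from a later
revision):

	/* All three operands are unsigned, so the original
	 * "frame_size - data_headroom - TP4_KERNEL_HEADROOM < 0" can
	 * never be true; compare the other way around instead. */
	if (frame_size < data_headroom + TP4_KERNEL_HEADROOM)
		return ERR_PTR(-EINVAL);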

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4
  2017-10-31 12:41 ` [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4 Björn Töpel
@ 2017-11-03  3:17   ` Willem de Bruijn
  2017-11-03 10:47     ` Björn Töpel
  0 siblings, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-03  3:17 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This commit adds support for zerocopy mode. Note that zerocopy mode
> requires that the network interface has been bound to the socket using
> the bind syscall, and that the corresponding netdev implements the
> AF_PACKET V4 ndos.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
> +
> +static void packet_v4_disable_zerocopy(struct net_device *dev,
> +                                      struct tp4_netdev_parms *zc)
> +{
> +       struct tp4_netdev_parms params;
> +
> +       params = *zc;
> +       params.command  = TP4_DISABLE;
> +
> +       (void)dev->netdev_ops->ndo_tp4_zerocopy(dev, &params);

Don't ignore error return codes.

> +static int packet_v4_zerocopy(struct sock *sk, int qp)
> +{
> +       struct packet_sock *po = pkt_sk(sk);
> +       struct socket *sock = sk->sk_socket;
> +       struct tp4_netdev_parms *zc = NULL;
> +       struct net_device *dev;
> +       bool if_up;
> +       int ret = 0;
> +
> +       /* Currently, only RAW sockets are supported.*/
> +       if (sock->type != SOCK_RAW)
> +               return -EINVAL;
> +
> +       rtnl_lock();
> +       dev = packet_cached_dev_get(po);
> +
> +       /* Socket needs to be bound to an interface. */
> +       if (!dev) {
> +               rtnl_unlock();
> +               return -EISCONN;
> +       }
> +
> +       /* The device needs to have both the NDOs implemented. */
> +       if (!(dev->netdev_ops->ndo_tp4_zerocopy &&
> +             dev->netdev_ops->ndo_tp4_xmit)) {
> +               ret = -EOPNOTSUPP;
> +               goto out_unlock;
> +       }

Inconsistent error handling with above test.

> +
> +       if (!(po->rx_ring.pg_vec && po->tx_ring.pg_vec)) {
> +               ret = -EOPNOTSUPP;
> +               goto out_unlock;
> +       }

A ring can be unmapped later with packet_set_ring. Should that operation
fail if zerocopy is enabled? After that, it can also change version with
PACKET_VERSION.

> +
> +       if_up = dev->flags & IFF_UP;
> +       zc = rtnl_dereference(po->zc);
> +
> +       /* Disable */
> +       if (qp <= 0) {
> +               if (!zc)
> +                       goto out_unlock;
> +
> +               packet_v4_disable_zerocopy(dev, zc);
> +               rcu_assign_pointer(po->zc, NULL);
> +
> +               if (if_up) {
> +                       spin_lock(&po->bind_lock);
> +                       register_prot_hook(sk);
> +                       spin_unlock(&po->bind_lock);
> +               }

There have been a bunch of race conditions in this bind code. We need
to be very careful with adding more states to the locking, especially when
open coding in multiple locations, as this patch does. I counted at least
four bind locations. See for instance also
http://patchwork.ozlabs.org/patch/813945/


> +
> +               goto out_unlock;
> +       }
> +
> +       /* Enable */
> +       if (!zc) {
> +               zc = kzalloc(sizeof(*zc), GFP_KERNEL);
> +               if (!zc) {
> +                       ret = -ENOMEM;
> +                       goto out_unlock;
> +               }
> +       }
> +
> +       if (zc->queue_pair >= 0)
> +               packet_v4_disable_zerocopy(dev, zc);

This calls disable even if zc was freshly allocated.
Should be > 0?

>  static int packet_release(struct socket *sock)
>  {
> +       struct tp4_netdev_parms *zc;
>         struct sock *sk = sock->sk;
> +       struct net_device *dev;
>         struct packet_sock *po;
>         struct packet_fanout *f;
>         struct net *net;
> @@ -3337,6 +3541,20 @@ static int packet_release(struct socket *sock)
>         sock_prot_inuse_add(net, sk->sk_prot, -1);
>         preempt_enable();
>
> +       rtnl_lock();
> +       zc = rtnl_dereference(po->zc);
> +       dev = packet_cached_dev_get(po);
> +       if (zc && dev)
> +               packet_v4_disable_zerocopy(dev, zc);
> +       if (dev)
> +               dev_put(dev);
> +       rtnl_unlock();
> +
> +       if (zc) {
> +               synchronize_rcu();
> +               kfree(zc);
> +       }

Please use a helper function for anything this complex.
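
For example, a helper along these lines (the name is made up here)
would wrap the quoted sequence without changing its behaviour:

static void packet_v4_release_zerocopy(struct packet_sock *po)
{
	struct tp4_netdev_parms *zc;
	struct net_device *dev;

	rtnl_lock();
	zc = rtnl_dereference(po->zc);
	dev = packet_cached_dev_get(po);
	if (zc && dev)
		packet_v4_disable_zerocopy(dev, zc);
	if (dev)
		dev_put(dev);
	rtnl_unlock();

	if (zc) {
		synchronize_rcu();
		kfree(zc);
	}
}

packet_release() would then just call packet_v4_release_zerocopy(po)
at this point.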

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings
  2017-10-31 12:41 ` [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings Björn Töpel
@ 2017-11-03  4:16   ` Willem de Bruijn
  2017-11-03 10:02     ` Björn Töpel
  0 siblings, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-03  4:16 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

> +/**
> + * tp4q_enqueue_from_array - Enqueue entries from packet array to tp4 queue
> + *
> + * @a: Pointer to the packet array to enqueue from
> + * @dcnt: Max number of entries to enqueue
> + *
> + * Returns 0 for success or an errno at failure
> + **/
> +static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
> +                                         u32 dcnt)
> +{
> +       struct tp4_queue *q = a->tp4q;
> +       unsigned int used_idx = q->used_idx;
> +       struct tpacket4_desc *d = a->items;
> +       int i;
> +
> +       if (q->num_free < dcnt)
> +               return -ENOSPC;
> +
> +       q->num_free -= dcnt;

perhaps annotate with a lockdep_is_held to document which lock
ensures mutual exclusion on the ring. Different for tx and rx?
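
For instance (a sketch; "ring_lock" is a placeholder member, the patch
set as posted does not define a per-ring lock, which is exactly the
thing to document):

static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
					  u32 dcnt)
{
	struct tp4_queue *q = a->tp4q;

	/* Document, and have lockdep verify, which lock serializes
	 * producers on this ring. */
	lockdep_assert_held(&q->ring_lock);

	/* ... rest as in the patch ... */
}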

> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index b39be424ec0e..190598eb3461 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -189,6 +189,9 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
>  #define BLOCK_O2PRIV(x)        ((x)->offset_to_priv)
>  #define BLOCK_PRIV(x)          ((void *)((char *)(x) + BLOCK_O2PRIV(x)))
>
> +#define RX_RING 0
> +#define TX_RING 1
> +

Not needed if using bool for tx_ring below. The test effectively already
treats it as bool: does not explicitly test these constants.

> +static void packet_clear_ring(struct sock *sk, int tx_ring)
> +{
> +       struct packet_sock *po = pkt_sk(sk);
> +       struct packet_ring_buffer *rb;
> +       union tpacket_req_u req_u;
> +
> +       rb = tx_ring ? &po->tx_ring : &po->rx_ring;


I meant here.
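
i.e. something like this sketch:

static void packet_clear_ring(struct sock *sk, bool tx_ring)
{
	struct packet_sock *po = pkt_sk(sk);
	struct packet_ring_buffer *rb;
	union tpacket_req_u req_u;

	rb = tx_ring ? &po->tx_ring : &po->rx_ring;

	/* ... rest as in the patch, with the RX_RING/TX_RING defines
	 * dropped and callers passing true/false ... */
}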

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (13 preceding siblings ...)
  2017-10-31 12:41 ` [RFC PATCH 14/14] xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use Björn Töpel
@ 2017-11-03  4:34 ` Willem de Bruijn
  2017-11-03 10:13   ` Karlsson, Magnus
  2017-11-13 13:07 ` Björn Töpel
  15 siblings, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-03  4:34 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 40x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it.
>
> The main difference between V4 and V2/V3 is that TX and RX descriptors
> are separated from packet buffers.

Cool feature. I'm looking forward to the netdev talk. Aside from the
inline comments in the patches, a few architecture questions.

Is TX support needed? Existing PACKET_TX_RING already sends out
packets without copying directly from the tx_ring. Indirection through a
descriptor ring is not helpful on TX if all packets still have to come from
a pre-registered packet pool. The patch set adds a lot of tx-only code
and is complex enough without it.

Can you use the existing PACKET_V2 format for the packet pool? The
v4 format is nearly the same as V2. Using the same version might avoid
some code duplication and simplify upgrading existing legacy code.
Instead of continuing to add new versions whose behavior is implicit,
perhaps we can add an explicit PACKET_INDIRECT mode to PACKET_V2.

Finally, is it necessary to define a new descriptor ring format? Same for the
packet array and frame set. The kernel already has a few, such as virtio for
the first, skb_array/ptr_ring, even linux list for the second. These containers
add a lot of new boilerplate code. If new formats are absolutely necessary,
at least we should consider making them generic (like skb_array and
ptr_ring). But I'd like to understand first why, e.g., virtio cannot be used.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-03  2:29       ` Willem de Bruijn
@ 2017-11-03  9:54         ` Björn Töpel
  2017-11-15 22:21           ` chet l
  0 siblings, 1 reply; 49+ messages in thread
From: Björn Töpel @ 2017-11-03  9:54 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

2017-11-03 3:29 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
>>>> +/*
>>>> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
>>>> + * to register user memory which should be used to store the packet
>>>> + * data.
>>>> + *
>>>> + * There are some constraints for the memory being registered:
>>>> + * - The memory area has to be memory page size aligned.
>>>> + * - The frame size has to be a power of 2.
>>>> + * - The frame size cannot be smaller than 2048B.
>>>> + * - The frame size cannot be larger than the memory page size.
>>>> + *
>>>> + * Corollary: The number of frames that can be stored is
>>>> + * len / frame_size.
>>>> + *
>>>> + */
>>>> +struct tpacket_memreg_req {
>>>> +       unsigned long   addr;           /* Start of packet data area */
>>>> +       unsigned long   len;            /* Length of packet data area */
>>>> +       unsigned int    frame_size;     /* Frame size */
>>>> +       unsigned int    data_headroom;  /* Frame head room */
>>>> +};
>>>
>>> Existing packet sockets take a tpacket_req, allocate memory and let the
>>> user process mmap this. I understand that TPACKET_V4 distinguishes
>>> the descriptor from packet pools, but could both use the existing structs
>>> and logic (packet_mmap)? That would avoid introducing a lot of new code
>>> just for granting user pages to the kernel.
>>>
>>
>> We could certainly pass the "tpacket_memreg_req" fields as part of
>> descriptor ring setup ("tpacket_req4"), but we went with having the
>> memory register as a new separate setsockopt. Having it separated,
>> makes it easier to compare regions at the kernel side of things. "Is
>> this the same umem as another one?" If we go the path of passing the
>> range at descriptor ring setup, we need to handle all kind of
>> overlapping ranges to determine when a copy is needed or not, in those
>> cases where the packet buffer (i.e. umem) is shared between processes.
>
> That's not what I meant. Both descriptor rings and packet pools are
> memory regions. Packet sockets already have logic to allocate regions
> and make them available to userspace with mmap(). Packet v4 reuses
> that logic for its descriptor rings. Can it use the same for its packet
> pool? Why does the kernel map user memory, instead? That is a lot of
> non-trivial new logic.

Ah, got it. So, why do we register packet pool memory instead of
allocating it in the kernel and mapping *that* memory?

Actually, we started out with that approach, where the packet_mmap
call mapped Tx/Rx descriptor rings and the packet buffer region. We
later moved to this (register umem) approach, because it's more
flexible for user space, which does not have to use an AF_PACKET-specific
allocator (i.e. it can keep using regular mallocs, huge pages and so on).
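
For instance, either allocation below would satisfy the page-alignment
constraint from patch 01 and could then be handed to PACKET_MEMREG as
usual (a sketch of that flexibility, nothing more):

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define UMEM_SIZE	(4096 * 2048)	/* 8 MB, a multiple of a 2 MB huge page */

static void *alloc_umem(int use_hugepages)
{
	void *bufs = NULL;

	if (use_hugepages) {
		/* hugepage-backed buffer area */
		bufs = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (bufs == MAP_FAILED)
			bufs = NULL;
	} else {
		/* ordinary page-aligned allocation from the C library */
		if (posix_memalign(&bufs, sysconf(_SC_PAGESIZE), UMEM_SIZE))
			bufs = NULL;
	}
	return bufs;
}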

I agree that the memory registration code is adding a lot of new logic,
but I believe it's worth the flexibility for user space. I'm looking
into whether I can share the memory registration logic from the
Infiniband/verbs subsystem (drivers/infiniband/core/umem.c).


Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt
  2017-11-03  3:00   ` Willem de Bruijn
@ 2017-11-03  9:57     ` Björn Töpel
  0 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-11-03  9:57 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

2017-11-03 4:00 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> Here, the PACKET_MEMREG setsockopt is implemented for the AF_PACKET
>> protocol family. PACKET_MEMREG allows the user to register memory
>> regions that can be used by AF_PACKET V4 as packet data buffers.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>> +/*************** V4 QUEUE OPERATIONS *******************************/
>> +
>> +/**
>> + * tp4q_umem_new - Creates a new umem (packet buffer)
>> + *
>> + * @addr: The address to the umem
>> + * @size: The size of the umem
>> + * @frame_size: The size of each frame, between 2K and PAGE_SIZE
>> + * @data_headroom: The desired data headroom before start of the packet
>> + *
>> + * Returns a pointer to the new umem or NULL for failure
>> + **/
>> +static inline struct tp4_umem *tp4q_umem_new(unsigned long addr, size_t size,
>> +                                            unsigned int frame_size,
>> +                                            unsigned int data_headroom)
>> +{
>> +       struct tp4_umem *umem;
>> +       unsigned int nframes;
>> +
>> +       if (frame_size < TP4_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +               /* Strictly speaking we could support this, if:
>> +                * - huge pages, or*
>> +                * - using an IOMMU, or
>> +                * - making sure the memory area is consecutive
>> +                * but for now, we simply say "computer says no".
>> +                */
>> +               return ERR_PTR(-EINVAL);
>> +       }
>> +
>> +       if (!is_power_of_2(frame_size))
>> +               return ERR_PTR(-EINVAL);
>> +
>> +       if (!PAGE_ALIGNED(addr)) {
>> +               /* Memory area has to be page size aligned. For
>> +                * simplicity, this might change.
>> +                */
>> +               return ERR_PTR(-EINVAL);
>> +       }
>> +
>> +       if ((addr + size) < addr)
>> +               return ERR_PTR(-EINVAL);
>> +
>> +       nframes = size / frame_size;
>> +       if (nframes == 0)
>> +               return ERR_PTR(-EINVAL);
>> +
>> +       data_headroom = ALIGN(data_headroom, 64);
>> +
>> +       if (frame_size - data_headroom - TP4_KERNEL_HEADROOM < 0)
>> +               return ERR_PTR(-EINVAL);
>
> signed comparison on unsigned int

Thanks, will address in next revision!

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings
  2017-11-03  4:16   ` Willem de Bruijn
@ 2017-11-03 10:02     ` Björn Töpel
  0 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-11-03 10:02 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

2017-11-03 5:16 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
>> +/**
>> + * tp4q_enqueue_from_array - Enqueue entries from packet array to tp4 queue
>> + *
>> + * @a: Pointer to the packet array to enqueue from
>> + * @dcnt: Max number of entries to enqueue
>> + *
>> + * Returns 0 for success or an errno at failure
>> + **/
>> +static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
>> +                                         u32 dcnt)
>> +{
>> +       struct tp4_queue *q = a->tp4q;
>> +       unsigned int used_idx = q->used_idx;
>> +       struct tpacket4_desc *d = a->items;
>> +       int i;
>> +
>> +       if (q->num_free < dcnt)
>> +               return -ENOSPC;
>> +
>> +       q->num_free -= dcnt;
>
> perhaps annotate with a lockdep_is_held to document which lock
> ensures mutual exclusion on the ring. Different for tx and rx?
>

Good idea. I'll give that a try!

>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index b39be424ec0e..190598eb3461 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -189,6 +189,9 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
>>  #define BLOCK_O2PRIV(x)        ((x)->offset_to_priv)
>>  #define BLOCK_PRIV(x)          ((void *)((char *)(x) + BLOCK_O2PRIV(x)))
>>
>> +#define RX_RING 0
>> +#define TX_RING 1
>> +
>
> Not needed if using bool for tx_ring below. The test effectively already
> treats it as bool: does not explicitly test these constants.
>
>> +static void packet_clear_ring(struct sock *sk, int tx_ring)
>> +{
>> +       struct packet_sock *po = pkt_sk(sk);
>> +       struct packet_ring_buffer *rb;
>> +       union tpacket_req_u req_u;
>> +
>> +       rb = tx_ring ? &po->tx_ring : &po->rx_ring;
>
>
> I meant here.

Yup, I'll remove/clean this up.


Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* RE: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-03  4:34 ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support Willem de Bruijn
@ 2017-11-03 10:13   ` Karlsson, Magnus
  2017-11-03 13:55     ` Willem de Bruijn
  0 siblings, 1 reply; 49+ messages in thread
From: Karlsson, Magnus @ 2017-11-03 10:13 UTC (permalink / raw)
  To: Willem de Bruijn, Björn Töpel
  Cc: Duyck, Alexander H, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, Jesper Dangaard Brouer, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, Network Development, Topel,
	Bjorn, Brandeburg, Jesse, Singhai, Anjali, Rosen, Rami, Shaw,
	Jeffrey B, Yigit, Ferruh, Zhang, Qi Z



> -----Original Message-----
> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
> Sent: Friday, November 3, 2017 5:35 AM
> To: Björn Töpel <bjorn.topel@gmail.com>
> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Duyck, Alexander H
> <alexander.h.duyck@intel.com>; Alexander Duyck
> <alexander.duyck@gmail.com>; John Fastabend
> <john.fastabend@gmail.com>; Alexei Starovoitov <ast@fb.com>; Jesper
> Dangaard Brouer <brouer@redhat.com>; michael.lundkvist@ericsson.com;
> ravineet.singh@ericsson.com; Daniel Borkmann <daniel@iogearbox.net>;
> Network Development <netdev@vger.kernel.org>; Topel, Bjorn
> <bjorn.topel@intel.com>; Brandeburg, Jesse
> <jesse.brandeburg@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>;
> Rosen, Rami <rami.rosen@intel.com>; Shaw, Jeffrey B
> <jeffrey.b.shaw@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; Zhang,
> Qi Z <qi.z.zhang@intel.com>
> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
> 
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com>
> wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> > optimized for high performance packet processing and zero-copy
> > semantics. Throughput improvements can be up to 40x compared to V2
> and
> > V3 for the micro benchmarks included. Would be great to get your
> > feedback on it.
> >
> > The main difference between V4 and V2/V3 is that TX and RX descriptors
> > are separated from packet buffers.
> 
> Cool feature. I'm looking forward to the netdev talk. Aside from the inline
> comments in the patches, a few architecture questions.

Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
and discuss your comments in further detail. Some initial thoughts
below.

> Is TX support needed? Existing PACKET_TX_RING already sends out packets
> without copying directly from the tx_ring. Indirection through a descriptor
> ring is not helpful on TX if all packets still have to come from a pre-registered
> packet pool. The patch set adds a lot of tx-only code and is complex enough
> without it.

That is correct, but what if the packet you are going to transmit came
in from the receive path and is already in the packet buffer? This
might happen if the application is examining/sniffing packets then
sending them out, or doing some modification to them. In that case we
avoid a copy in V4 since the packet is already in the packet
buffer. With V2 and V3, a copy from the RX ring to the TX ring would
be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
performance quite a lot.

> Can you use the existing PACKET_V2 format for the packet pool? The
> v4 format is nearly the same as V2. Using the same version might avoid some
> code duplication and simplify upgrading existing legacy code.
> Instead of continuing to add new versions whose behavior is implicit,
> perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.

Interesting idea that I think is worth thinking more about. One
problem though with the V2 ring format model, and the current V4
format too by the way, when applied to user-space allocated memory,
is that they are symmetric, i.e. user space and kernel space have
to produce and consume the same amount of entries (within the length
of the descriptor area). User space sends down a buffer entry that the
kernel fills in for RX for example. Symmetric queues do not work when
you have a shared packet buffer between two processes. (This is not a
requirement, but someone might do an mmap with MAP_SHARED for the
packet buffer and then fork off a child that inherits this packet
buffer.) One of the processes might just receive packets, while the
other one is transmitting. Or if you have a veth link pair between two
processes and they have been set up to share a packet buffer area. With
a symmetric queue you have to copy even if they share the same packet
buffer, but with an asymmetric queue, you do not and the driver only
needs to copy the packet buffer id from the TX desc ring of the
sender to the RX desc ring of the receiver, not the data. I think this
gives an indication that we need a new structure. Anyway, I like your
idea and I think it is worth thinking more about it. Let us have a
discussion about this at Netdev, if you are there.

> Finally, is it necessary to define a new descriptor ring format? Same for the
> packet array and frame set. The kernel already has a few, such as virtio for
> the first, skb_array/ptr_ring, even linux list for the second. These containers
> add a lot of new boilerplate code. If new formats are absolutely necessary, at
> least we should consider making them generic (like skb_array and ptr_ring).
> But I'd like to understand first why, e.g., virtio cannot be used.

Agree with you. Good if we can use something existing. The descriptor
format of V4 was based on one of the first Virtio 1.1 proposal by
Michael Tsirkin (tools/virtio/ringtest/ring.c). Then we have diverged
somewhat due to performance reasons and Virtio 1.1 has done the same
but in another direction. We should take a look at the latest Virtio
1.1 proposal again and see what it offers. The reason we did not go
with Virtio 0.9 was for performance. Too many indirections, something
that the people behind Virtio 1.1 had identified too. With ptr_ring,
how do we deal with the pointers in the structure as this now has to
go to user-space? In any way, we would like to have a ring structure
that is asymmetric for the reasons above. Other than that, we would
not mind using anything as long as it is fast. If it already exists,
perfect.

/Magnus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4
  2017-11-03  3:17   ` Willem de Bruijn
@ 2017-11-03 10:47     ` Björn Töpel
  0 siblings, 0 replies; 49+ messages in thread
From: Björn Töpel @ 2017-11-03 10:47 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Karlsson, Magnus, Alexander Duyck, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

2017-11-03 4:17 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com> wrote:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> This commit adds support for zerocopy mode. Note that zerocopy mode
>> requires that the network interface has been bound to the socket using
>> the bind syscall, and that the corresponding netdev implements the
>> AF_PACKET V4 ndos.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>> +
>> +static void packet_v4_disable_zerocopy(struct net_device *dev,
>> +                                      struct tp4_netdev_parms *zc)
>> +{
>> +       struct tp4_netdev_parms params;
>> +
>> +       params = *zc;
>> +       params.command  = TP4_DISABLE;
>> +
>> +       (void)dev->netdev_ops->ndo_tp4_zerocopy(dev, &params);
>
> Don't ignore error return codes.
>

Will fix!

>> +static int packet_v4_zerocopy(struct sock *sk, int qp)
>> +{
>> +       struct packet_sock *po = pkt_sk(sk);
>> +       struct socket *sock = sk->sk_socket;
>> +       struct tp4_netdev_parms *zc = NULL;
>> +       struct net_device *dev;
>> +       bool if_up;
>> +       int ret = 0;
>> +
>> +       /* Currently, only RAW sockets are supported.*/
>> +       if (sock->type != SOCK_RAW)
>> +               return -EINVAL;
>> +
>> +       rtnl_lock();
>> +       dev = packet_cached_dev_get(po);
>> +
>> +       /* Socket needs to be bound to an interface. */
>> +       if (!dev) {
>> +               rtnl_unlock();
>> +               return -EISCONN;
>> +       }
>> +
>> +       /* The device needs to have both the NDOs implemented. */
>> +       if (!(dev->netdev_ops->ndo_tp4_zerocopy &&
>> +             dev->netdev_ops->ndo_tp4_xmit)) {
>> +               ret = -EOPNOTSUPP;
>> +               goto out_unlock;
>> +       }
>
> Inconsistent error handling with above test.
>

Will fix.

>> +
>> +       if (!(po->rx_ring.pg_vec && po->tx_ring.pg_vec)) {
>> +               ret = -EOPNOTSUPP;
>> +               goto out_unlock;
>> +       }
>
> A ring can be unmapped later with packet_set_ring. Should that operation
> fail if zerocopy is enabled? After that, it can also change version with
> PACKET_VERSION.
>

You're correct, I've missed this. I need to revisit the scenario when
a ring is unmapped, and recreated. Thanks for pointing this out.

>> +
>> +       if_up = dev->flags & IFF_UP;
>> +       zc = rtnl_dereference(po->zc);
>> +
>> +       /* Disable */
>> +       if (qp <= 0) {
>> +               if (!zc)
>> +                       goto out_unlock;
>> +
>> +               packet_v4_disable_zerocopy(dev, zc);
>> +               rcu_assign_pointer(po->zc, NULL);
>> +
>> +               if (if_up) {
>> +                       spin_lock(&po->bind_lock);
>> +                       register_prot_hook(sk);
>> +                       spin_unlock(&po->bind_lock);
>> +               }
>
> There have been a bunch of race conditions in this bind code. We need
> to be very careful with adding more states to the locking, especially when
> open coding in multiple locations, as this patch does. I counted at least
> four bind locations. See for instance also
> http://patchwork.ozlabs.org/patch/813945/
>

Yeah, the locking scheme in AF_PACKET is pretty convoluted. I'll
document it and make the locking more explicit (and avoid open coding
it).

>
>> +
>> +               goto out_unlock;
>> +       }
>> +
>> +       /* Enable */
>> +       if (!zc) {
>> +               zc = kzalloc(sizeof(*zc), GFP_KERNEL);
>> +               if (!zc) {
>> +                       ret = -ENOMEM;
>> +                       goto out_unlock;
>> +               }
>> +       }
>> +
>> +       if (zc->queue_pair >= 0)
>> +               packet_v4_disable_zerocopy(dev, zc);
>
> This calls disable even if zc was freshly allocated.
> Should be > 0?
>

Good catch. It should be > 0.

>>  static int packet_release(struct socket *sock)
>>  {
>> +       struct tp4_netdev_parms *zc;
>>         struct sock *sk = sock->sk;
>> +       struct net_device *dev;
>>         struct packet_sock *po;
>>         struct packet_fanout *f;
>>         struct net *net;
>> @@ -3337,6 +3541,20 @@ static int packet_release(struct socket *sock)
>>         sock_prot_inuse_add(net, sk->sk_prot, -1);
>>         preempt_enable();
>>
>> +       rtnl_lock();
>> +       zc = rtnl_dereference(po->zc);
>> +       dev = packet_cached_dev_get(po);
>> +       if (zc && dev)
>> +               packet_v4_disable_zerocopy(dev, zc);
>> +       if (dev)
>> +               dev_put(dev);
>> +       rtnl_unlock();
>> +
>> +       if (zc) {
>> +               synchronize_rcu();
>> +               kfree(zc);
>> +       }
>
> Please use a helper function for anything this complex.

Will fix.


Thanks,
Björn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-03 10:13   ` Karlsson, Magnus
@ 2017-11-03 13:55     ` Willem de Bruijn
  0 siblings, 0 replies; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-03 13:55 UTC (permalink / raw)
  To: Karlsson, Magnus
  Cc: Björn Töpel, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann,
	Network Development, Topel, Bjorn, Brandeburg, Jesse, Singhai,
	Anjali, Rosen, Rami, Shaw, Jeffrey B, Yigit, Ferruh

On Fri, Nov 3, 2017 at 7:13 PM, Karlsson, Magnus
<magnus.karlsson@intel.com> wrote:
>
>
>> -----Original Message-----
>> From: Willem de Bruijn [mailto:willemdebruijn.kernel@gmail.com]
>> Sent: Friday, November 3, 2017 5:35 AM
>> To: Björn Töpel <bjorn.topel@gmail.com>
>> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Duyck, Alexander H
>> <alexander.h.duyck@intel.com>; Alexander Duyck
>> <alexander.duyck@gmail.com>; John Fastabend
>> <john.fastabend@gmail.com>; Alexei Starovoitov <ast@fb.com>; Jesper
>> Dangaard Brouer <brouer@redhat.com>; michael.lundkvist@ericsson.com;
>> ravineet.singh@ericsson.com; Daniel Borkmann <daniel@iogearbox.net>;
>> Network Development <netdev@vger.kernel.org>; Topel, Bjorn
>> <bjorn.topel@intel.com>; Brandeburg, Jesse
>> <jesse.brandeburg@intel.com>; Singhai, Anjali <anjali.singhai@intel.com>;
>> Rosen, Rami <rami.rosen@intel.com>; Shaw, Jeffrey B
>> <jeffrey.b.shaw@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; Zhang,
>> Qi Z <qi.z.zhang@intel.com>
>> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
>>
>> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.topel@gmail.com>
>> wrote:
>> > From: Björn Töpel <bjorn.topel@intel.com>
>> >
>> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
>> > optimized for high performance packet processing and zero-copy
>> > semantics. Throughput improvements can be up to 40x compared to V2
>> and
>> > V3 for the micro benchmarks included. Would be great to get your
>> > feedback on it.
>> >
>> > The main difference between V4 and V2/V3 is that TX and RX descriptors
>> > are separated from packet buffers.
>>
>> Cool feature. I'm looking forward to the netdev talk. Aside from the inline
>> comments in the patches, a few architecture questions.
>
> Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
> and discuss your comments in further detail. Some initial thoughts
> below.

Sounds great. I'll be there.

>> Is TX support needed? Existing PACKET_TX_RING already sends out packets
>> without copying directly from the tx_ring. Indirection through a descriptor
>> ring is not helpful on TX if all packets still have to come from a pre-registered
>> packet pool. The patch set adds a lot of tx-only code and is complex enough
>> without it.
>
> That is correct, but what if the packet you are going to transmit came
> in from the receive path and is already in the packet buffer?

Oh, yes, of course. That is a common use case. I should have
thought of that.

> This
> might happen if the application is examining/sniffing packets then
> sending them out, or doing some modification to them. In that case we
> avoid a copy in V4 since the packet is already in the packet
> buffer. With V2 and V3, a copy from the RX ring to the TX ring would
> be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
> performance quite a lot.
>
>> Can you use the existing PACKET_V2 format for the packet pool? The
>> v4 format is nearly the same as V2. Using the same version might avoid some
>> code duplication and simplify upgrading existing legacy code.
>> Instead of continuing to add new versions whose behavior is implicit,
>> perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.
>
> Interesting idea that I think is worth thinking more about. One
> problem though with the V2 ring format model, and the current V4
> format too by the way, when applied to user-space allocated memory,
> is that they are symmetric, i.e. user space and kernel space have
> to produce and consume the same number of entries (within the length
> of the descriptor area). User space sends down a buffer entry that the
> kernel fills in for RX, for example. Symmetric queues do not work when
> you have a shared packet buffer between two processes. (This is not a
> requirement, but someone might do an mmap with MAP_SHARED for the
> packet buffer and then fork off a child that then inherits this packet
> buffer.) One of the processes might just receive packets, while the
> other one is transmitting. Or you might have a veth link pair between
> two processes that has been set up to share the packet buffer area. With
> a symmetric queue you have to copy even if they share the same packet
> buffer, but with an asymmetric queue you do not; the driver only
> needs to copy the packet buffer id from the TX desc ring of the
> sender to the RX desc ring of the receiver, not the data. I think this
> gives an indication that we need a new structure. Anyway, I like your
> idea and I think it is worth thinking more about it. Let us have a
> discussion about this at Netdev, if you are there.

Okay. I don't quite understand the definition of symmetric here. At
least one problem that you describe, the veth pair, is solved by
introducing descriptor rings as a level of indirection, regardless of the
format of the frames in the packet ring (now, really, random access
packet pool).

>> Finally, is it necessary to define a new descriptor ring format? Same for the
>> packet array and frame set. The kernel already has a few, such as virtio for
>> the first, skb_array/ptr_ring, even linux list for the second. These containers
>> add a lot of new boilerplate code. If new formats are absolutely necessary, at
>> least we should consider making them generic (like skb_array and ptr_ring).
>> But I'd like to understand first why, e.g., virtio cannot be used.
>
> Agree with you. Good if we can use something existing. The descriptor
> format of V4 was based on one of the first Virtio 1.1 proposals by
> Michael Tsirkin (tools/virtio/ringtest/ring.c). Then we have diverged
> somewhat due to performance reasons, and Virtio 1.1 has done the same
> but in another direction. We should take a look at the latest Virtio
> 1.1 proposal again and see what it offers. The reason we did not go
> with Virtio 0.9 was performance: too many indirections, something
> that the people behind Virtio 1.1 had identified too. With ptr_ring,
> how do we deal with the pointers in the structure, as these now have
> to go to user space? In any case, we would like to have a ring structure
> that is asymmetric for the reasons above. Other than that, we would
> not mind using anything as long as it is fast. If it already exists,
> perfect.

Thanks for that context. I was not aware that this format branched off
the early virtio 1.1 draft. I'm not sure where that stands
and which workloads it is targeting. One issue is dealing with hw and
minimizing communication over the PCI bus. That is not immediately
relevant to this virtual descriptor model.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
                   ` (14 preceding siblings ...)
  2017-11-03  4:34 ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support Willem de Bruijn
@ 2017-11-13 13:07 ` Björn Töpel
  2017-11-13 14:34   ` John Fastabend
                     ` (2 more replies)
  15 siblings, 3 replies; 49+ messages in thread
From: Björn Töpel @ 2017-11-13 13:07 UTC (permalink / raw)
  To: Bjorn Topel, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, michael.lundkvist, ravineet.singh,
	Daniel Borkmann, Netdev, Willem de Bruijn, Tushar Dave,
	eric.dumazet
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, davem

2017-10-31 13:41 GMT+01:00 Björn Töpel <bjorn.topel@gmail.com>:
> From: Björn Töpel <bjorn.topel@intel.com>
>
[...]
>
> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
> Korea, and our paper with complete benchmarks will be released shortly
> on the NetDev 2.2 site.
>

We're back in the saddle after an excellent netdevconf week. Kudos to
the organizers; we had a blast! Thanks for all the constructive
feedback.

I'll summarize below the major points that we'll address in the next
RFC.

* Instead of extending AF_PACKET with yet another version, introduce a
  new address/packet family. As for naming, we had some suggestions:
  AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
  AF_ZEROCOPY, unless there are strong opinions against it.

* No explicit zerocopy enablement. Use the zerocopy path if
  supported; if not, fall back to the skb path for netdevs that
  don't support the required ndos. Further, we'll have the zerocopy
  behavior for the skb path as well, meaning that an AF_ZEROCOPY
  socket will consume the skb, and we'll honor skb->queue_mapping,
  so that we only consume the packets for the enabled queue.

* Limit the scope of the first patchset to Rx only, and introduce Tx
  in a separate patchset.

* Minimize the size of the i40e zerocopy patches by moving the
  driver-specific code to separate patches.

* Do not introduce a new XDP action XDP_PASS_TO_KERNEL; instead, use an
  XDP redirect map call with an ingress flag (see the rough BPF-side
  sketch after this list).

* Extend the XDP redirect to support explicit allocator/destructor
  functions. Right now, XDP redirect assumes that the page allocator
  was used, and the XDP redirect cleanup path decreases the page
  count of the XDP buffer. This assumption breaks for the zerocopy
  case.
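
To make the redirect-map point a bit more concrete, here's a rough
sketch of what the BPF side could look like. Nothing below is final
API: the map is a plain DEVMAP stand-in and the proposed
ingress/"pass to kernel" flag doesn't exist yet, so read it purely as
a sketch.

#include <linux/bpf.h>
#include "bpf_helpers.h"        /* SEC(), bpf_redirect_map() declaration */

/* Placeholder map; the real thing would be a new "user channel" map type. */
struct bpf_map_def SEC("maps") chan_map = {
        .type        = BPF_MAP_TYPE_DEVMAP,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u32),
        .max_entries = 4,
};

SEC("xdp")
int steer(struct xdp_md *ctx)
{
        __u32 key = 0;          /* which channel/queue to steer into */

        /* the last argument is where the proposed ingress flag would go */
        return bpf_redirect_map(&chan_map, key, 0);
}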


Björn


> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>
> Please focus your review on:
>
> * The V4 user space interface
> * PACKET_ZEROCOPY and its semantics
> * Packet array interface
>> * XDP semantics when executing in zero-copy mode (user space passed
>   buffers)
> * XDP_PASS_TO_KERNEL semantics
>
> To do:
>
> * Investigate the user-space ring structure’s performance problems
> * Continue the XDP integration into packet arrays
> * Optimize performance
> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
> * Packet buffer is unnecessarily pinned for virtual devices
> * Support shared packet buffers
> * Unify V4 and SKB receive path in I40E driver
> * Support for packets spanning multiple frames
> * Disassociate the packet array implementation from the V4 queue
>   structure
>
> We would really like to thank the reviewers of the limited
> distribution RFC for all their comments that have helped improve the
> interfaces and the code significantly: Alexei Starovoitov, Alexander
> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
> at Intel that has been helping out reviewing code, writing tests, and
> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
> Zhang, your participation has really helped.
>
> Thanks: Björn and Magnus
>
> [1] https://www.netdevconf.org/2.2/
>
> Björn Töpel (7):
>   packet: introduce AF_PACKET V4 userspace API
>   packet: implement PACKET_MEMREG setsockopt
>   packet: enable AF_PACKET V4 rings
>   packet: wire up zerocopy for AF_PACKET V4
>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>   samples/tpacket4: added tpbench
>
> Magnus Karlsson (7):
>   packet: enable Rx for AF_PACKET V4
>   packet: enable Tx support for AF_PACKET V4
>   netdevice: add AF_PACKET V4 zerocopy ops
>   veth: added support for PACKET_ZEROCOPY
>   samples/tpacket4: added veth support
>   i40e: added XDP support for TP4 enabled queue pairs
>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>
>  drivers/net/ethernet/intel/i40e/i40e.h         |    3 +
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    9 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  837 ++++++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  582 ++++++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   38 +
>  drivers/net/veth.c                             |  174 +++
>  include/linux/netdevice.h                      |   16 +
>  include/linux/tpacket4.h                       | 1502 ++++++++++++++++++++++++
>  include/uapi/linux/bpf.h                       |    1 +
>  include/uapi/linux/if_packet.h                 |   65 +-
>  net/packet/af_packet.c                         | 1252 +++++++++++++++++-

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-13 13:07 ` Björn Töpel
@ 2017-11-13 14:34   ` John Fastabend
  2017-11-13 23:50   ` Alexei Starovoitov
  2017-11-14 17:19   ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?) Jesper Dangaard Brouer
  2 siblings, 0 replies; 49+ messages in thread
From: John Fastabend @ 2017-11-13 14:34 UTC (permalink / raw)
  To: Björn Töpel, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, Alexei Starovoitov, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann, Netdev,
	Willem de Bruijn, Tushar Dave, eric.dumazet
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, davem,
	Andy Gospodarek

On 11/13/2017 05:07 AM, Björn Töpel wrote:
> 2017-10-31 13:41 GMT+01:00 Björn Töpel <bjorn.topel@gmail.com>:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
> [...]
>>
>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>> Korea, and our paper with complete benchmarks will be released shortly
>> on the NetDev 2.2 site.
>>
> 
> We're back in the saddle after an excellent netdevconf week. Kudos to
> the organizers; We had a blast! Thanks for all the constructive
> feedback.
> 
> I'll summarize the major points, that we'll address in the next RFC
> below.
> 
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming had some name suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there're no strong opinions against it.
> 

Works for me.

> * No explicit zerocopy enablement. Use the zeropcopy path if
>   supported, if not -- fallback to the skb path, for netdevs that
>   don't support the required ndos. Further, we'll have the zerocopy
>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>   socket will consume the skb and we'll honor skb->queue_mapping,
>   meaning that we only consume the packets for the enabled queue.
> 
> * Limit the scope of the first patchset to Rx only, and introduce Tx
>   in a separate patchset.
> 
> * Minimize the size of the i40e zerocopy patches, by moving the driver
>   specific code to separate patches.
> 
> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>   XDP redirect map call with ingress flag.
> 

Sounds good; we will need to add this as a separate patch series, though.

> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.
> 

Probably sync with Andy and Jesper on this. I think they are both
looking into something similar.

Thanks,
John

> 
> Björn
> 
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-13 13:07 ` Björn Töpel
  2017-11-13 14:34   ` John Fastabend
@ 2017-11-13 23:50   ` Alexei Starovoitov
  2017-11-14  5:33     ` Björn Töpel
  2017-11-14 17:19   ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?) Jesper Dangaard Brouer
  2 siblings, 1 reply; 49+ messages in thread
From: Alexei Starovoitov @ 2017-11-13 23:50 UTC (permalink / raw)
  To: Björn Töpel, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann, Netdev,
	Willem de Bruijn, Tushar Dave, eric.dumazet
  Cc: Björn Töpel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang, davem

On 11/13/17 9:07 PM, Björn Töpel wrote:
> 2017-10-31 13:41 GMT+01:00 Björn Töpel <bjorn.topel@gmail.com>:
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
> [...]
>>
>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>> Korea, and our paper with complete benchmarks will be released shortly
>> on the NetDev 2.2 site.
>>
>
> We're back in the saddle after an excellent netdevconf week. Kudos to
> the organizers; We had a blast! Thanks for all the constructive
> feedback.
>
> I'll summarize the major points, that we'll address in the next RFC
> below.
>
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming had some name suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there're no strong opinions against it.
>
> * No explicit zerocopy enablement. Use the zeropcopy path if
>   supported, if not -- fallback to the skb path, for netdevs that
>   don't support the required ndos. Further, we'll have the zerocopy
>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>   socket will consume the skb and we'll honor skb->queue_mapping,
>   meaning that we only consume the packets for the enabled queue.
>
> * Limit the scope of the first patchset to Rx only, and introduce Tx
>   in a separate patchset.

all sounds good to me except above bit.
I don't remember people suggesting to split it this way.
What's the value of it without tx?

> * Minimize the size of the i40e zerocopy patches, by moving the driver
>   specific code to separate patches.
>
> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>   XDP redirect map call with ingress flag.
>
> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.
>
>
> Björn
>
>
>> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>
>> Please focus your review on:
>>
>> * The V4 user space interface
>> * PACKET_ZEROCOPY and its semantics
>> * Packet array interface
>> * XDP semantics when executing in zero-copy mode (user space passed
>>   buffers)
>> * XDP_PASS_TO_KERNEL semantics
>>
>> To do:
>>
>> * Investigate the user-space ring structure’s performance problems
>> * Continue the XDP integration into packet arrays
>> * Optimize performance
>> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
>> * Packet buffer is unnecessarily pinned for virtual devices
>> * Support shared packet buffers
>> * Unify V4 and SKB receive path in I40E driver
>> * Support for packets spanning multiple frames
>> * Disassociate the packet array implementation from the V4 queue
>>   structure
>>
>> We would really like to thank the reviewers of the limited
>> distribution RFC for all their comments that have helped improve the
>> interfaces and the code significantly: Alexei Starovoitov, Alexander
>> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
>> at Intel that has been helping out reviewing code, writing tests, and
>> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
>> Zhang, your participation has really helped.
>>
>> Thanks: Björn and Magnus
>>
>> [1] https://www.netdevconf.org/2.2/
>>
>> Björn Töpel (7):
>>   packet: introduce AF_PACKET V4 userspace API
>>   packet: implement PACKET_MEMREG setsockopt
>>   packet: enable AF_PACKET V4 rings
>>   packet: wire up zerocopy for AF_PACKET V4
>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>>   samples/tpacket4: added tpbench
>>
>> Magnus Karlsson (7):
>>   packet: enable Rx for AF_PACKET V4
>>   packet: enable Tx support for AF_PACKET V4
>>   netdevice: add AF_PACKET V4 zerocopy ops
>>   veth: added support for PACKET_ZEROCOPY
>>   samples/tpacket4: added veth support
>>   i40e: added XDP support for TP4 enabled queue pairs
>>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>>
>>  drivers/net/ethernet/intel/i40e/i40e.h         |    3 +
>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    9 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  837 ++++++++++++-
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  582 ++++++++-
>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   38 +
>>  drivers/net/veth.c                             |  174 +++
>>  include/linux/netdevice.h                      |   16 +
>>  include/linux/tpacket4.h                       | 1502 ++++++++++++++++++++++++
>>  include/uapi/linux/bpf.h                       |    1 +
>>  include/uapi/linux/if_packet.h                 |   65 +-
>>  net/packet/af_packet.c                         | 1252 +++++++++++++++++---
>>  net/packet/internal.h                          |    9 +
>>  samples/tpacket4/Makefile                      |   12 +
>>  samples/tpacket4/bench_all.sh                  |   28 +
>>  samples/tpacket4/tpbench.c                     | 1390 ++++++++++++++++++++++
>>  15 files changed, 5674 insertions(+), 244 deletions(-)
>>  create mode 100644 include/linux/tpacket4.h
>>  create mode 100644 samples/tpacket4/Makefile
>>  create mode 100755 samples/tpacket4/bench_all.sh
>>  create mode 100644 samples/tpacket4/tpbench.c
>>
>> --
>> 2.11.0
>>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-13 23:50   ` Alexei Starovoitov
@ 2017-11-14  5:33     ` Björn Töpel
  2017-11-14  7:02       ` John Fastabend
  0 siblings, 1 reply; 49+ messages in thread
From: Björn Töpel @ 2017-11-14  5:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Jesper Dangaard Brouer, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, Netdev, Willem de Bruijn,
	Tushar Dave, eric.dumazet, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang, davem

2017-11-14 0:50 GMT+01:00 Alexei Starovoitov <ast@fb.com>:
> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>
>> 2017-10-31 13:41 GMT+01:00 Björn Töpel <bjorn.topel@gmail.com>:
>>>
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>> [...]
>>>
>>>
>>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>>> Korea, and our paper with complete benchmarks will be released shortly
>>> on the NetDev 2.2 site.
>>>
>>
>> We're back in the saddle after an excellent netdevconf week. Kudos to
>> the organizers; We had a blast! Thanks for all the constructive
>> feedback.
>>
>> I'll summarize the major points, that we'll address in the next RFC
>> below.
>>
>> * Instead of extending AF_PACKET with yet another version, introduce a
>>   new address/packet family. As for naming had some name suggestions:
>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>   AF_ZEROCOPY, unless there're no strong opinions against it.
>>
>> * No explicit zerocopy enablement. Use the zeropcopy path if
>>   supported, if not -- fallback to the skb path, for netdevs that
>>   don't support the required ndos. Further, we'll have the zerocopy
>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>   socket will consume the skb and we'll honor skb->queue_mapping,
>>   meaning that we only consume the packets for the enabled queue.
>>
>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>   in a separate patchset.
>
>
> all sounds good to me except above bit.
> I don't remember people suggesting to split it this way.
> What's the value of it without tx?
>

We definitely need Tx for our use-cases! Let me rephrase: the idea
was to make the initial patch set without Tx *driver*-specific code,
e.g. use ndo_xdp_xmit/flush at a later point.

So AF_ZEROCOPY, the socket parts, would have Tx support.

@John Did I recall that correctly?

>> * Minimize the size of the i40e zerocopy patches, by moving the driver
>>   specific code to separate patches.
>>
>> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>>   XDP redirect map call with ingress flag.
>>
>> * Extend the XDP redirect to support explicit allocator/destructor
>>   functions. Right now, XDP redirect assumes that the page allocator
>>   was used, and the XDP redirect cleanup path is decreasing the page
>>   count of the XDP buffer. This assumption breaks for the zerocopy
>>   case.
>>
>>
>> Björn
>>
>>
>>> We based this patch set on net-next commit e1ea2f9856b7 ("Merge
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net").
>>>
>>> Please focus your review on:
>>>
>>> * The V4 user space interface
>>> * PACKET_ZEROCOPY and its semantics
>>> * Packet array interface
>>> * XDP semantics when executing in zero-copy mode (user space passed
>>>   buffers)
>>> * XDP_PASS_TO_KERNEL semantics
>>>
>>> To do:
>>>
>>> * Investigate the user-space ring structure’s performance problems
>>> * Continue the XDP integration into packet arrays
>>> * Optimize performance
>>> * SKB <-> V4 conversions in tp4a_populate & tp4a_flush
>>> * Packet buffer is unnecessarily pinned for virtual devices
>>> * Support shared packet buffers
>>> * Unify V4 and SKB receive path in I40E driver
>>> * Support for packets spanning multiple frames
>>> * Disassociate the packet array implementation from the V4 queue
>>>   structure
>>>
>>> We would really like to thank the reviewers of the limited
>>> distribution RFC for all their comments that have helped improve the
>>> interfaces and the code significantly: Alexei Starovoitov, Alexander
>>> Duyck, Jesper Dangaard Brouer, and John Fastabend. The internal team
>>> at Intel that has been helping out reviewing code, writing tests, and
>>> sanity checking our ideas: Rami Rosen, Jeff Shaw, Ferruh Yigit, and Qi
>>> Zhang, your participation has really helped.
>>>
>>> Thanks: Björn and Magnus
>>>
>>> [1] https://www.netdevconf.org/2.2/
>>>
>>>
>>> Björn Töpel (7):
>>>   packet: introduce AF_PACKET V4 userspace API
>>>   packet: implement PACKET_MEMREG setsockopt
>>>   packet: enable AF_PACKET V4 rings
>>>   packet: wire up zerocopy for AF_PACKET V4
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support
>>>   i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support
>>>   samples/tpacket4: added tpbench
>>>
>>> Magnus Karlsson (7):
>>>   packet: enable Rx for AF_PACKET V4
>>>   packet: enable Tx support for AF_PACKET V4
>>>   netdevice: add AF_PACKET V4 zerocopy ops
>>>   veth: added support for PACKET_ZEROCOPY
>>>   samples/tpacket4: added veth support
>>>   i40e: added XDP support for TP4 enabled queue pairs
>>>   xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use
>>>
>>>  drivers/net/ethernet/intel/i40e/i40e.h         |    3 +
>>>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |    9 +
>>>  drivers/net/ethernet/intel/i40e/i40e_main.c    |  837 ++++++++++++-
>>>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  582 ++++++++-
>>>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   38 +
>>>  drivers/net/veth.c                             |  174 +++
>>>  include/linux/netdevice.h                      |   16 +
>>>  include/linux/tpacket4.h                       | 1502
>>> ++++++++++++++++++++++++
>>>  include/uapi/linux/bpf.h                       |    1 +
>>>  include/uapi/linux/if_packet.h                 |   65 +-
>>>  net/packet/af_packet.c                         | 1252
>>> +++++++++++++++++---
>>>  net/packet/internal.h                          |    9 +
>>>  samples/tpacket4/Makefile                      |   12 +
>>>  samples/tpacket4/bench_all.sh                  |   28 +
>>>  samples/tpacket4/tpbench.c                     | 1390
>>> ++++++++++++++++++++++
>>>  15 files changed, 5674 insertions(+), 244 deletions(-)
>>>  create mode 100644 include/linux/tpacket4.h
>>>  create mode 100644 samples/tpacket4/Makefile
>>>  create mode 100755 samples/tpacket4/bench_all.sh
>>>  create mode 100644 samples/tpacket4/tpbench.c
>>>
>>> --
>>> 2.11.0
>>>
>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-14  5:33     ` Björn Töpel
@ 2017-11-14  7:02       ` John Fastabend
  2017-11-14 12:20         ` Willem de Bruijn
  0 siblings, 1 reply; 49+ messages in thread
From: John Fastabend @ 2017-11-14  7:02 UTC (permalink / raw)
  To: Björn Töpel, Alexei Starovoitov
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	Jesper Dangaard Brouer, michael.lundkvist, ravineet.singh,
	Daniel Borkmann, Netdev, Willem de Bruijn, Tushar Dave,
	eric.dumazet, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang, davem

On 11/13/2017 09:33 PM, Björn Töpel wrote:
> 2017-11-14 0:50 GMT+01:00 Alexei Starovoitov <ast@fb.com>:
>> On 11/13/17 9:07 PM, Björn Töpel wrote:
>>>
>>> 2017-10-31 13:41 GMT+01:00 Björn Töpel <bjorn.topel@gmail.com>:
>>>>
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>> [...]
>>>>
>>>>
>>>> We'll do a presentation on AF_PACKET V4 in NetDev 2.2 [1] Seoul,
>>>> Korea, and our paper with complete benchmarks will be released shortly
>>>> on the NetDev 2.2 site.
>>>>
>>>
>>> We're back in the saddle after an excellent netdevconf week. Kudos to
>>> the organizers; We had a blast! Thanks for all the constructive
>>> feedback.
>>>
>>> I'll summarize the major points, that we'll address in the next RFC
>>> below.
>>>
>>> * Instead of extending AF_PACKET with yet another version, introduce a
>>>   new address/packet family. As for naming had some name suggestions:
>>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>>   AF_ZEROCOPY, unless there're no strong opinions against it.
>>>
>>> * No explicit zerocopy enablement. Use the zeropcopy path if
>>>   supported, if not -- fallback to the skb path, for netdevs that
>>>   don't support the required ndos. Further, we'll have the zerocopy
>>>   behavior for the skb path as well, meaning that an AF_ZEROCOPY
>>>   socket will consume the skb and we'll honor skb->queue_mapping,
>>>   meaning that we only consume the packets for the enabled queue.
>>>
>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>   in a separate patchset.
>>
>>
>> all sounds good to me except above bit.
>> I don't remember people suggesting to split it this way.
>> What's the value of it without tx?
>>
> 
> We definitely need Tx for our use-cases! I'll rephrase, so the
> idea was making the initial patch set without Tx *driver*
> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
> 
> So AF_ZEROCOPY, the socket parts, would have Tx support.
> 
> @John Did I recall that correctly?
> 

Yep, that is what I said. However, on second thought, without the
driver tx half I guess tx will be significantly slower. So in order
to get the driver API correct in the first go-around, let's implement
this in the first series as well.

Just try to minimize the TX driver work as much as possible.

Thanks,
John

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-14  7:02       ` John Fastabend
@ 2017-11-14 12:20         ` Willem de Bruijn
  2017-11-16  2:55           ` Alexei Starovoitov
  0 siblings, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-14 12:20 UTC (permalink / raw)
  To: John Fastabend
  Cc: Björn Töpel, Alexei Starovoitov, Karlsson, Magnus,
	Duyck, Alexander H, Alexander Duyck, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann, Netdev,
	Tushar Dave, Eric Dumazet, Björn Töpel, Brandeburg,
	Jesse, Singhai, Anjali, Rosen, Rami, Shaw, Jeffrey B, Yigit,
	Ferruh

>>>>
>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>   in a separate patchset.
>>>
>>>
>>> all sounds good to me except above bit.
>>> I don't remember people suggesting to split it this way.
>>> What's the value of it without tx?
>>>
>>
>> We definitely need Tx for our use-cases! I'll rephrase, so the
>> idea was making the initial patch set without Tx *driver*
>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>
>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>
>> @John Did I recall that correctly?
>>
>
> Yep, that is what I said. However, on second thought, without the
> driver tx half I guess tx will be significantly slower.

The idea was that existing packet rings already send without
copying, so the benefit from device driver changes is not obvious.

I would leave them out for now and evaluate before possibly
sending a separate patchset.
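
For reference, the existing copy-free send path looks roughly like the
sketch below (condensed from the packet_mmap TX_RING flow; error
handling and ring iteration are omitted, and ifindex/pkt/pkt_len are
placeholders):

#include <string.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <linux/if_packet.h>

static void tx_ring_send(int ifindex, const void *pkt, size_t pkt_len)
{
        int fd = socket(AF_PACKET, SOCK_RAW, 0);

        struct tpacket_req req = {
                .tp_block_size = 4096,
                .tp_frame_size = 2048,
                .tp_block_nr   = 64,
                .tp_frame_nr   = 128,   /* block_size / frame_size * block_nr */
        };
        setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req));

        struct sockaddr_ll ll = {
                .sll_family  = AF_PACKET,
                .sll_ifindex = ifindex,
        };
        bind(fd, (struct sockaddr *)&ll, sizeof(ll));

        void *ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* build the frame directly in the first ring slot */
        struct tpacket_hdr *hdr = ring;
        void *data = (char *)hdr + TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
        memcpy(data, pkt, pkt_len);
        hdr->tp_len    = pkt_len;
        hdr->tp_status = TP_STATUS_SEND_REQUEST;

        /* the kernel transmits straight out of the mapped ring pages */
        send(fd, NULL, 0, 0);
}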

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)
  2017-11-13 13:07 ` Björn Töpel
  2017-11-13 14:34   ` John Fastabend
  2017-11-13 23:50   ` Alexei Starovoitov
@ 2017-11-14 17:19   ` Jesper Dangaard Brouer
  2017-11-14 19:01     ` Björn Töpel
  2 siblings, 1 reply; 49+ messages in thread
From: Jesper Dangaard Brouer @ 2017-11-14 17:19 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, Netdev, Willem de Bruijn,
	Tushar Dave, eric.dumazet, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang, davem, brouer


On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel <bjorn.topel@gmail.com> wrote:

> I'll summarize the major points, that we'll address in the next RFC
> below.
> 
> * Instead of extending AF_PACKET with yet another version, introduce a
>   new address/packet family. As for naming had some name suggestions:
>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>   AF_ZEROCOPY, unless there're no strong opinions against it.

I mostly like AF_CHANNEL and AF_XDP. I do know XDP is (or has evolved
into) a kernel-side facility that moves XDP frames/packets _inside_ the
kernel.

*BUT* I've always imagined that we would create a "channel" to
userspace, using XDP_REDIRECT to choose which frames get redirected
into which userspace "channel" (a new channel-map type).  Userspace
pre-allocates and registers memory/pages exactly like in this patchset.

[Step-1]: (non-ZC) XDP_REDIRECT needs to copy frame-data into userspace
memory pages, and update your packet_array etc. (Use map-flush to get
RX bulking).

[Step 2]: (ZC) Userspace calls a driver NDO to register pages. The
XDP_REDIRECT action happens in the driver, and can have knowledge about
the RX-ring.  It can know if this RX-ring is Zero-Copy enabled and can
skip the copy-step.
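
A rough kernel-side sketch of what I mean by Step-1/Step-2; every
chan_* helper name below is made up for illustration, only xdp_buff is
the normal XDP metadata:

static int chan_enqueue(struct chan_queue *q, struct xdp_buff *xdp)
{
        u32 len = xdp->data_end - xdp->data;
        u32 idx;

        if (chan_reserve_frame(q, &idx))        /* free frame in the user area */
                return -ENOSPC;

        /* Step-1 (non-ZC): copy the frame into the user-registered pages.
         * In Step-2 (ZC) the driver already received into those pages,
         * so this copy is skipped. */
        memcpy(chan_frame_ptr(q, idx), xdp->data, len);

        chan_publish_desc(q, idx, len);         /* bulk-flushed at map-flush */
        return 0;
}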


> * No explicit zerocopy enablement. Use the zeropcopy path if
>   supported, if not -- fallback to the skb path, for netdevs that
>   don't support the required ndos.

When the driver does not support the NDO in the above model, I think
there will still be a significant performance boost for the non-ZC
variant, even though we need a copy-operation, because there are no
memory allocations.  Userspace has preallocated and registered pages
with the kernel (and mem-limits are implicit via the mem-size
registered by userspace).


> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>   XDP redirect map call with ingress flag.

In the above model, XDP_REDIRECT is used for filtering into a userspace
"channel".  If ZC gets enabled on an RX-ring queue, then XDP_PASS has
to do a copy (RX-ring knowledge is available), like you describe with
XDP_PASS_TO_KERNEL.


> * Extend the XDP redirect to support explicit allocator/destructor
>   functions. Right now, XDP redirect assumes that the page allocator
>   was used, and the XDP redirect cleanup path is decreasing the page
>   count of the XDP buffer. This assumption breaks for the zerocopy
>   case.

Yes, please.  If XDP_REDIRECT can call a destructor call-back, then we
can allow XDP_REDIRECT out via another net_device, even when ZC is
enabled on an RX-ring queue.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)
  2017-11-14 17:19   ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?) Jesper Dangaard Brouer
@ 2017-11-14 19:01     ` Björn Töpel
  2017-11-16  8:00       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 49+ messages in thread
From: Björn Töpel @ 2017-11-14 19:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, Netdev, Willem de Bruijn,
	Tushar Dave, eric.dumazet, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang, davem

2017-11-14 18:19 GMT+01:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel <bjorn.topel@gmail.com> wrote:
>
>> I'll summarize the major points, that we'll address in the next RFC
>> below.
>>
>> * Instead of extending AF_PACKET with yet another version, introduce a
>>   new address/packet family. As for naming had some name suggestions:
>>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
>>   AF_ZEROCOPY, unless there're no strong opinions against it.
>
> I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/have-evolved-into
> a kernel-side facility, that moves XDP-frames/packets _inside_ the
> kernel.
>
> *BUT* I've always imagined, that we would create a "channel" to
> userspace.  By using XDP_REDIRECT to choose what frames get redirected
> into which userspace "channel" (new channel-map type).  Userspace
> pre-allocate and register memory/pages exactly like this patchset.
>
> [Step-1]: (non-ZC) XDP_REDIRECT need to copy frame-data into userspace
> memory pages.  And update your packet_array etc. (Use map-flush to get
> RX bulking).
>
> [Step 2]: (ZC) Userspace call driver NDO to register pages. The
> XDP_REDIRECT action happens in driver, and can have knowledge about
> RX-ring.  It can know if this RX-ring is Zero-Copy enabled and can skip
> the copy-step.
>

Jesper, I *really* like this approach -- especially the fact that the
existing XDP path in the drivers can be reused. I'll spend some time
dissecting the details of your suggestion.

>> * No explicit zerocopy enablement. Use the zeropcopy path if
>>   supported, if not -- fallback to the skb path, for netdevs that
>>   don't support the required ndos.
>
> When driver does not support NDO in above model. I think, that there
> will still be a significant performance boost for the non-ZC variant.
> Even-though we need a copy-operation, because there are no memory
> allocations.  As userspace have preallocated and registered pages with
> the kernel (and mem-limits are implicit via mem-size reg by userspace).
>

Yup, and we're not paying for the whole skb creation, given that we
execute from XDP_DRV and not XDP_SKB.

>> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
>>   XDP redirect map call with ingress flag.
>
> In above model, XDP_REDIRECT is used for filtering into a userspace
> "channel".  If ZC gets enabled on a RX-ring queue, then XDP_PASS have
> to do a copy (RX-ring knowledge is avail), like you describe with
> XDP_PASS_TO_KERNEL.
>

Again, this fits nicely in.

>> * Extend the XDP redirect to support explicit allocator/destructor
>>   functions. Right now, XDP redirect assumes that the page allocator
>>   was used, and the XDP redirect cleanup path is decreasing the page
>>   count of the XDP buffer. This assumption breaks for the zerocopy
>>   case.
>
> Yes, please.  If XDP_REDIRECT get call a destructor call-back, then we
> can allow XDP_REDIRECT out another net_device, even-when ZC is enabled
> on a RX-ring queue.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-03  9:54         ` Björn Töpel
@ 2017-11-15 22:21           ` chet l
  2017-11-16 16:53             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 49+ messages in thread
From: chet l @ 2017-11-15 22:21 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Willem de Bruijn, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, michael.lundkvist, ravineet.singh,
	Daniel Borkmann, Network Development, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

>
> Actually, we started out with that approach, where the packet_mmap
> call mapped Tx/Rx descriptor rings and the packet buffer region. We
> later moved to this (register umem) approach, because it's more
> flexible for user space, not having to use an AF_PACKET-specific
> allocator (i.e. it can continue to use regular mallocs, huge pages and such).
>


One quick question:
Any thoughts on SVM support?
Is SVM support going to be so disruptive that we will need to churn a tp_v5?

If not, then to accommodate future SVM enablement, do you think it might
make sense to add/stuff a control-info union in the tp4_queue (or umem,
etc.)? Then, in the future, I think setmemreg (or something else)
would need to pass the PASID in addition to the malloc'd addr.
The assumption here is that the user-app will bind PID<->PASID before
invoking the AF_ZC setup.



> Björn

Chetan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-10-31 12:41 ` [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API Björn Töpel
  2017-11-02  1:45   ` Willem de Bruijn
@ 2017-11-15 22:34   ` chet l
  2017-11-16  1:44     ` David Miller
  1 sibling, 1 reply; 49+ messages in thread
From: chet l @ 2017-11-15 22:34 UTC (permalink / raw)
  To: Björn Töpel
  Cc: magnus.karlsson, alexander.h.duyck, Alexander Duyck,
	John Fastabend, ast, Jesper Dangaard Brouer, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, netdev, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

On Tue, Oct 31, 2017 at 5:41 AM, Björn Töpel <bjorn.topel@gmail.com> wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
>

> +/*
> + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
> + * to register user memory which should be used to store the packet
> + * data.
> + *
> + * There are some constraints for the memory being registered:
> + * - The memory area has to be memory page size aligned.
> + * - The frame size has to be a power of 2.
> + * - The frame size cannot be smaller than 2048B.
> + * - The frame size cannot be larger than the memory page size.
> + *
> + * Corollary: The number of frames that can be stored is
> + * len / frame_size.
> + *
> + */
> +struct tpacket_memreg_req {
> +       unsigned long   addr;           /* Start of packet data area */
> +       unsigned long   len;            /* Length of packet data area */
> +       unsigned int    frame_size;     /* Frame size */
> +       unsigned int    data_headroom;  /* Frame head room */
> +};
> +

I have not reviewed the entire patchset but I think if we could add a
version_hdr and then unionize the fields, it might be easier to add
SVM support without having to spin v5. I could be wrong though.
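
Just so we're talking about the same thing, my mental model of the
registration against the constraints quoted above is roughly the
sketch below (sizes are only examples; tpacket_memreg_req and
PACKET_MEMREG come from this RFC):

#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* register a page-aligned, user-allocated packet area with the socket */
static int register_umem(int fd)
{
        size_t len  = 64UL << 20;               /* 64 MB packet area */
        void  *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct tpacket_memreg_req mr = {
                .addr          = (unsigned long)area,
                .len           = len,
                .frame_size    = 2048,  /* power of two, >= 2048B, <= page size */
                .data_headroom = 0,
        };

        return setsockopt(fd, SOL_PACKET, PACKET_MEMREG, &mr, sizeof(mr));
}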


Chetan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-15 22:34   ` chet l
@ 2017-11-16  1:44     ` David Miller
  2017-11-16 19:32       ` chetan L
  0 siblings, 1 reply; 49+ messages in thread
From: David Miller @ 2017-11-16  1:44 UTC (permalink / raw)
  To: loke.chetan
  Cc: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, michael.lundkvist, ravineet.singh,
	daniel, netdev, bjorn.topel, jesse.brandeburg, anjali.singhai,
	rami.rosen, jeffrey.b.shaw, ferruh.yigit, qi.z.zhang

From: chet l <loke.chetan@gmail.com>
Date: Wed, 15 Nov 2017 14:34:32 -0800

> I have not reviewed the entire patchset but I think if we could add a
> version_hdr and then unionize the fields, it might be easier to add
> SVM support without having to spin v5. I could be wrong though.

Please, NO VERSION FIELDS!

Design things properly from the start rather than using a crutch of
being able to "adjust things later".

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-14 12:20         ` Willem de Bruijn
@ 2017-11-16  2:55           ` Alexei Starovoitov
  2017-11-16  3:35             ` Willem de Bruijn
  0 siblings, 1 reply; 49+ messages in thread
From: Alexei Starovoitov @ 2017-11-16  2:55 UTC (permalink / raw)
  To: Willem de Bruijn, John Fastabend
  Cc: Björn Töpel, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, Jesper Dangaard Brouer, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, Netdev, Tushar Dave,
	Eric Dumazet, Björn Töpel, Brandeburg, Jesse, Singhai,
	Anjali, Rosen, Rami, Shaw, Jeffrey B, Yigit, Ferruh

On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>>>
>>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>>   in a separate patchset.
>>>>
>>>>
>>>> all sounds good to me except above bit.
>>>> I don't remember people suggesting to split it this way.
>>>> What's the value of it without tx?
>>>>
>>>
>>> We definitely need Tx for our use-cases! I'll rephrase, so the
>>> idea was making the initial patch set without Tx *driver*
>>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>>
>>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>>
>>> @John Did I recall that correctly?
>>>
>>
>> Yep, that is what I said. However, on second thought, without the
>> driver tx half I guess tx will be significantly slower.
>
> The idea was that existing packet rings already send without
> copying, so the benefit from device driver changes is not obvious.
>
> I would leave them out for now and evaluate before possibly
> sending a separate patchset.

Are you suggesting to use the new af_zerocopy for rx and the old
af_packet for tx?  IMO that's too cumbersome to use.
A new interface has to be symmetrical from the start.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-16  2:55           ` Alexei Starovoitov
@ 2017-11-16  3:35             ` Willem de Bruijn
  2017-11-16  7:09               ` Björn Töpel
  0 siblings, 1 reply; 49+ messages in thread
From: Willem de Bruijn @ 2017-11-16  3:35 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, Björn Töpel, Karlsson, Magnus, Duyck,
	Alexander H, Alexander Duyck, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann, Netdev,
	Tushar Dave, Eric Dumazet, Björn Töpel, Brandeburg,
	Jesse, Singhai, Anjali, Rosen, Rami, Shaw, Jeffrey B

On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov <ast@fb.com> wrote:
> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>>>>
>>>>>>
>>>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>>>   in a separate patchset.
>>>>>
>>>>>
>>>>>
>>>>> all sounds good to me except above bit.
>>>>> I don't remember people suggesting to split it this way.
>>>>> What's the value of it without tx?
>>>>>
>>>>
>>>> We definitely need Tx for our use-cases! I'll rephrase, so the
>>>> idea was making the initial patch set without Tx *driver*
>>>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>>>
>>>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>>>
>>>> @John Did I recall that correctly?
>>>>
>>>
>>> Yep, that is what I said. However, on second thought, without the
>>> driver tx half I guess tx will be significantly slower.
>>
>>
>> The idea was that existing packet rings already send without
>> copying, so the benefit from device driver changes is not obvious.
>>
>> I would leave them out for now and evaluate before possibly
>> sending a separate patchset.
>
>
> are you suggesting to use new af_zerocopy for rx and old
> af_packet for tx ? imo that's too cumbersome to use.
> New interface has to be symmetrical from the start.

No, that tx can be implemented without device driver
changes. At least initially.

Unlike rx, tx does not need driver support to implement
copy avoidance, as pf_packet tx_ring already has this.

Having to go through ndo_start_xmit does introduce other
overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
better choice (but I'm not very familiar with that).

If some cost is inherent to a device-independent solution
and needs driver support to avoid it, then that can be added
in a follow-on patchset. But this one is large already without
the i40e tx patch.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-16  3:35             ` Willem de Bruijn
@ 2017-11-16  7:09               ` Björn Töpel
  2017-11-16  8:26                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 49+ messages in thread
From: Björn Töpel @ 2017-11-16  7:09 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Alexei Starovoitov, John Fastabend, Karlsson, Magnus, Duyck,
	Alexander H, Alexander Duyck, Jesper Dangaard Brouer,
	michael.lundkvist, ravineet.singh, Daniel Borkmann, Netdev,
	Tushar Dave, Eric Dumazet, Björn Töpel, Brandeburg,
	Jesse, Singhai, Anjali, Rosen, Rami, Shaw, Jeffrey B, Yigit,
	Ferruh

2017-11-16 4:35 GMT+01:00 Willem de Bruijn <willemdebruijn.kernel@gmail.com>:
> On Wed, Nov 15, 2017 at 9:55 PM, Alexei Starovoitov <ast@fb.com> wrote:
>> On 11/14/17 4:20 AM, Willem de Bruijn wrote:
>>>>>>>
>>>>>>>
>>>>>>> * Limit the scope of the first patchset to Rx only, and introduce Tx
>>>>>>>   in a separate patchset.
>>>>>>
>>>>>>
>>>>>>
>>>>>> all sounds good to me except above bit.
>>>>>> I don't remember people suggesting to split it this way.
>>>>>> What's the value of it without tx?
>>>>>>
>>>>>
>>>>> We definitely need Tx for our use-cases! I'll rephrase, so the
>>>>> idea was making the initial patch set without Tx *driver*
>>>>> specific code, e.g. use ndo_xdp_xmit/flush at a later point.
>>>>>
>>>>> So AF_ZEROCOPY, the socket parts, would have Tx support.
>>>>>
>>>>> @John Did I recall that correctly?
>>>>>
>>>>
>>>> Yep, that is what I said. However, on second thought, without the
>>>> driver tx half I guess tx will be significantly slower.
>>>
>>>
>>> The idea was that existing packet rings already send without
>>> copying, so the benefit from device driver changes is not obvious.
>>>
>>> I would leave them out for now and evaluate before possibly
>>> sending a separate patchset.
>>
>>
>> are you suggesting to use new af_zerocopy for rx and old
>> af_packet for tx ? imo that's too cumbersome to use.
>> New interface has to be symmetrical from the start.
>
> No, that tx can be implemented without device driver
> changes. At least initially.
>
> Unlike rx, tx does not need driver support to implement
> copy avoidance, as pf_packet tx_ring already has this.
>
> Having to go through ndo_start_xmit does introduce other
> overhead, notably skb alloc. Perhaps ndo_xdp_xmit is a
> better choice (but I'm not very familiar with that).
>
> If some cost is inherent to a device-independent solution
> and needs driver support to avoid it, then that can be added
> in a follow-on patchset. But this one is large already without
> the i40e tx patch.

Ideally, it would be best not having to introduce yet another xmit
ndo. I believe ndo_xdp_xmit/ndo_xdp_flush would be the best fit, but
we need to extend it with a destructor callback and potentially some
kind of DMA trait. Why DMA? For zerocopy, we know the working set of
packet buffers, so they are DMA mapped up front, whereas ndo_xdp_xmit
does yet another DMA mapping. Paying for the DMA mapping in the
fast-path is something we'd like to avoid.
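
To make that a bit more concrete, roughly the shape I have in mind --
purely a sketch, none of these names exist today:

/* Hypothetical: frames come out of the registered packet buffer, so
 * they can be DMA mapped once at registration time, and a destructor
 * lets the zerocopy layer recycle the buffer on completion. */
struct xdp_zc_frame {
        dma_addr_t      dma;    /* mapped when the packet buffer was registered */
        u32             len;
        void            *ctx;   /* cookie handed back to the destructor */
        void            (*destructor)(void *ctx);
};

int (*ndo_xdp_xmit_zc)(struct net_device *dev, struct xdp_zc_frame *frame);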

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?)
  2017-11-14 19:01     ` Björn Töpel
@ 2017-11-16  8:00       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 49+ messages in thread
From: Jesper Dangaard Brouer @ 2017-11-16  8:00 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Karlsson, Magnus, Duyck, Alexander H, Alexander Duyck,
	John Fastabend, Alexei Starovoitov, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, Netdev, Willem de Bruijn,
	Tushar Dave, eric.dumazet, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang, davem, brouer


On Tue, 14 Nov 2017 20:01:01 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:

> 2017-11-14 18:19 GMT+01:00 Jesper Dangaard Brouer <brouer@redhat.com>:
> >
> > On Mon, 13 Nov 2017 22:07:47 +0900 Björn Töpel <bjorn.topel@gmail.com> wrote:
> >  
> >> I'll summarize the major points, that we'll address in the next RFC
> >> below.
> >>
> >> * Instead of extending AF_PACKET with yet another version, introduce a
> >>   new address/packet family. As for naming had some name suggestions:
> >>   AF_CAPTURE, AF_CHANNEL, AF_XDP and AF_ZEROCOPY. We'll go for
> >>   AF_ZEROCOPY, unless there're no strong opinions against it.  
> >
> > I mostly like AF_CHANNEL and AF_XDP. I do know XDP is/have-evolved-into
> > a kernel-side facility, that moves XDP-frames/packets _inside_ the
> > kernel.
> >
> > *BUT* I've always imagined, that we would create a "channel" to
> > userspace.  By using XDP_REDIRECT to choose what frames get redirected
> > into which userspace "channel" (new channel-map type).  Userspace
> > pre-allocate and register memory/pages exactly like this patchset.
> >
> > [Step-1]: (non-ZC) XDP_REDIRECT need to copy frame-data into userspace
> > memory pages.  And update your packet_array etc. (Use map-flush to get
> > RX bulking).
> >
> > [Step 2]: (ZC) Userspace call driver NDO to register pages. The
> > XDP_REDIRECT action happens in driver, and can have knowledge about
> > RX-ring.  It can know if this RX-ring is Zero-Copy enabled and can skip
> > the copy-step.
> >  
> 
> Jesper, I *really* like this approach -- especially the fact that the
> existing XDP path in the drivers can be reused. I'll spend some time
> dissecting the details of your suggestion.

I'm very happy that you like this approach :-)

> >> * No explicit zerocopy enablement. Use the zeropcopy path if
> >>   supported, if not -- fallback to the skb path, for netdevs that
> >>   don't support the required ndos.  
> >
> > When driver does not support NDO in above model. I think, that there
> > will still be a significant performance boost for the non-ZC variant.
> > Even-though we need a copy-operation, because there are no memory
> > allocations.  As userspace have preallocated and registered pages with
> > the kernel (and mem-limits are implicit via mem-size reg by userspace).
> >  
> 
> Yup, and we're not paying for the whole skb creation, given that we
> execute from XDP_DRV and not XDP_SKB.

Yes, exactly. Avoiding the SKB allocation for non-ZC mode will be a
significant saving.  As your benchmarks showed, the AF_PACKET-V4
approach for non-ZC mode does not give you/us any real performance
improvement.  This approach would.


> >> * Do not introduce a new XDP action XDP_PASS_TO_KERNEL, instead use
> >>   XDP redirect map call with ingress flag.  
> >
> > In above model, XDP_REDIRECT is used for filtering into a userspace
> > "channel".  If ZC gets enabled on a RX-ring queue, then XDP_PASS have
> > to do a copy (RX-ring knowledge is avail), like you describe with
> > XDP_PASS_TO_KERNEL.
> >  
> 
> Again, this fits nicely in.
> 
> >> * Extend the XDP redirect to support explicit allocator/destructor
> >>   functions. Right now, XDP redirect assumes that the page allocator
> >>   was used, and the XDP redirect cleanup path is decreasing the page
> >>   count of the XDP buffer. This assumption breaks for the zerocopy
> >>   case.  
> >
> > Yes, please.  If XDP_REDIRECT get call a destructor call-back, then we
> > can allow XDP_REDIRECT out another net_device, even-when ZC is enabled
> > on a RX-ring queue.

I will (of course) be eager to test and benchmark this approach, as I
have high hopes for a performance boost even for non-ZC.  I know an
AF_XDP approach is a lot of work, but I would like to offer to help
out in any way I can.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
  2017-11-16  7:09               ` Björn Töpel
@ 2017-11-16  8:26                 ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 49+ messages in thread
From: Jesper Dangaard Brouer @ 2017-11-16  8:26 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Willem de Bruijn, Alexei Starovoitov, John Fastabend, Karlsson,
	Magnus, Duyck, Alexander H, Alexander Duyck, michael.lundkvist,
	ravineet.singh, Daniel Borkmann, Netdev, Tushar Dave,
	Eric Dumazet, Björn Töpel, Brandeburg, Jesse, Singhai,
	Anjali, Rosen, Rami, Shaw, Jeffrey B, Yigit, Ferruh


On Thu, 16 Nov 2017 08:09:04 +0100 Björn Töpel <bjorn.topel@gmail.com> wrote:

> Ideally, it would be best not having to introduce yet another xmit
> ndo. I believe ndo_xdp_xmit/ndo_xdp_flush would be the best fit, but
> we need to extend it with a destructor callback and potentially some
> kind of DMA trait. Why DMA? For zerocopy, we know the working set of
> packet buffers, so they are DMA mapped up front, whereas ndo_xdp_xmit
> does yet another DMA mapping. Paying for the DMA mapping in the
> fast-path is something we'd like to avoid.

I like your idea of reusing ndo_xdp_xmit/ndo_xdp_flush.  At NetConf I
think we agreed that the ndo_xdp_xmit API likely needs to change. See [1],
slide 11.  Andy Gospodarek and Michael Chan wanted to look into the
needed API changes (Cc'ed), so let's keep them in the loop.

I also appreciate that you are thinking about avoiding the DMA-mapping
at TX.  It would be a welcome optimization.

[1] http://people.netfilter.org/hawk/presentations/NetConf2017_Seoul/XDP_devel_update_NetConf2017_Seoul.pdf
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-15 22:21           ` chet l
@ 2017-11-16 16:53             ` Jesper Dangaard Brouer
  2017-11-17  3:32               ` chetan L
  0 siblings, 1 reply; 49+ messages in thread
From: Jesper Dangaard Brouer @ 2017-11-16 16:53 UTC (permalink / raw)
  To: chet l
  Cc: Björn Töpel, Willem de Bruijn, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, michael.lundkvist, ravineet.singh,
	Daniel Borkmann, Network Development, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang, brouer

On Wed, 15 Nov 2017 14:21:38 -0800
chet l <loke.chetan@gmail.com> wrote:

> One quick question:
> Any thoughts on SVM support?

What is SVM?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-16  1:44     ` David Miller
@ 2017-11-16 19:32       ` chetan L
  0 siblings, 0 replies; 49+ messages in thread
From: chetan L @ 2017-11-16 19:32 UTC (permalink / raw)
  To: David Miller
  Cc: Björn Töpel, Karlsson, Magnus, Alexander Duyck,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, michael.lundkvist, ravineet.singh,
	Daniel Borkmann, netdev, Björn Töpel, jesse.brandeburg,
	anjali.singhai, rami.rosen, jeffrey.b.shaw, ferruh.yigit,
	qi.z.zhang

On Wed, Nov 15, 2017 at 5:44 PM, David Miller <davem@davemloft.net> wrote:
> From: chet l <loke.chetan@gmail.com>
> Date: Wed, 15 Nov 2017 14:34:32 -0800
>
>> I have not reviewed the entire patchset but I think if we could add a
>> version_hdr and then unionize the fields, it might be easier to add
>> SVM support without having to spin v5. I could be wrong though.
>
> Please, NO VERSION FIELDS!
>
> Design things properly from the start rather than using a crutch of
> being able to "adjust things later".

Agreed. If this step in tpacket_v4 can follow what req1/2/3 did as
part of the setsockopt(..) API, then it should be OK. If it's a
different API, then it will be difficult for the follow-on version(s)
to make seamless changes.

Look at tpacket_req3, for example. Since there was no version header, I had
no option but to align its fields with tpacket_req/req2 during setup.
I won't have access to an SMMUv3-capable ARM platform anytime soon, so
I can't actually test or write anything as of now.
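
For context, a minimal userspace sketch of the existing req/req3 pattern
being referred to (sizes are arbitrary, error handling trimmed): the ring
version is chosen up front with PACKET_VERSION, and the struct passed to
PACKET_RX_RING keeps its leading fields layout-compatible across
tpacket_req and tpacket_req3, because the struct itself carries no version
field.

  #include <sys/socket.h>
  #include <linux/if_packet.h>

  static int setup_v3_rx_ring(int fd)
  {
          int ver = TPACKET_V3;
          struct tpacket_req3 req = {
                  /* shared prefix, identical layout in tpacket_req/req2 */
                  .tp_block_size = 1 << 22,
                  .tp_block_nr   = 64,
                  .tp_frame_size = 1 << 11,
                  .tp_frame_nr   = ((1 << 22) / (1 << 11)) * 64,
                  /* v3-only fields follow the shared prefix */
                  .tp_retire_blk_tov   = 60,
                  .tp_feature_req_word = TP_FT_REQ_FILL_RXHASH,
          };

          if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver)))
                  return -1;
          return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
  }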


Chetan

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API
  2017-11-16 16:53             ` Jesper Dangaard Brouer
@ 2017-11-17  3:32               ` chetan L
  0 siblings, 0 replies; 49+ messages in thread
From: chetan L @ 2017-11-17  3:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Willem de Bruijn, Karlsson, Magnus,
	Alexander Duyck, Alexander Duyck, John Fastabend,
	Alexei Starovoitov, michael.lundkvist, ravineet.singh,
	Daniel Borkmann, Network Development, Björn Töpel,
	jesse.brandeburg, anjali.singhai, rami.rosen, jeffrey.b.shaw,
	ferruh.yigit, qi.z.zhang

On Thu, Nov 16, 2017 at 8:53 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Wed, 15 Nov 2017 14:21:38 -0800
> chet l <loke.chetan@gmail.com> wrote:
>
>> One quick question:
>> Any thoughts on SVM support?
>
> What is SVM ?
>

Shared Virtual Memory (PCIe-based). So, going back to one of your
mapping examples: the protocol can be AF_CHANNEL.
Modes could be:
AF_ZC, AF_XDP_REDIRECT

Mapping types could be:
AF_NON_SVM (current setup - no PASID needed), AF_SVM (the onus is on the
user to pass the PASID as part of the setsockopt), AF_SVM++
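
Purely as a strawman (nothing like this exists in the patchset or the
kernel; every name below is hypothetical): with SVM the device shares the
process address space via a PASID, so instead of registering a pinned
packet buffer area, userspace might hand the PASID to the socket, e.g.:

  #include <linux/types.h>

  /* hypothetical setsockopt payload for an SVM-backed socket */
  struct packet_svm_req {
          __u32 pasid;    /* Process Address Space ID from the IOMMU/SMMUv3 */
          __u32 flags;    /* hypothetical mode: AF_NON_SVM, AF_SVM, ...     */
  };

  /* usage, with PACKET_SVM being a made-up option name:
   *   struct packet_svm_req req = { .pasid = my_pasid, .flags = 0 };
   *   setsockopt(fd, SOL_PACKET, PACKET_SVM, &req, sizeof(req));
   */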


Chetan

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2017-11-17  3:32 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31 12:41 [RFC PATCH 00/14] Introducing AF_PACKET V4 support Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API Björn Töpel
2017-11-02  1:45   ` Willem de Bruijn
2017-11-02 10:06     ` Björn Töpel
2017-11-02 16:40       ` Tushar Dave
2017-11-02 16:47         ` Björn Töpel
2017-11-03  2:29       ` Willem de Bruijn
2017-11-03  9:54         ` Björn Töpel
2017-11-15 22:21           ` chet l
2017-11-16 16:53             ` Jesper Dangaard Brouer
2017-11-17  3:32               ` chetan L
2017-11-15 22:34   ` chet l
2017-11-16  1:44     ` David Miller
2017-11-16 19:32       ` chetan L
2017-10-31 12:41 ` [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt Björn Töpel
2017-11-03  3:00   ` Willem de Bruijn
2017-11-03  9:57     ` Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings Björn Töpel
2017-11-03  4:16   ` Willem de Bruijn
2017-11-03 10:02     ` Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 04/14] packet: enable Rx for AF_PACKET V4 Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 05/14] packet: enable Tx support " Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 06/14] netdevice: add AF_PACKET V4 zerocopy ops Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4 Björn Töpel
2017-11-03  3:17   ` Willem de Bruijn
2017-11-03 10:47     ` Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 08/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Rx support Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 09/14] i40e: AF_PACKET V4 ndo_tp4_zerocopy Tx support Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 10/14] samples/tpacket4: added tpbench Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 11/14] veth: added support for PACKET_ZEROCOPY Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 12/14] samples/tpacket4: added veth support Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 13/14] i40e: added XDP support for TP4 enabled queue pairs Björn Töpel
2017-10-31 12:41 ` [RFC PATCH 14/14] xdp: introducing XDP_PASS_TO_KERNEL for PACKET_ZEROCOPY use Björn Töpel
2017-11-03  4:34 ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support Willem de Bruijn
2017-11-03 10:13   ` Karlsson, Magnus
2017-11-03 13:55     ` Willem de Bruijn
2017-11-13 13:07 ` Björn Töpel
2017-11-13 14:34   ` John Fastabend
2017-11-13 23:50   ` Alexei Starovoitov
2017-11-14  5:33     ` Björn Töpel
2017-11-14  7:02       ` John Fastabend
2017-11-14 12:20         ` Willem de Bruijn
2017-11-16  2:55           ` Alexei Starovoitov
2017-11-16  3:35             ` Willem de Bruijn
2017-11-16  7:09               ` Björn Töpel
2017-11-16  8:26                 ` Jesper Dangaard Brouer
2017-11-14 17:19   ` [RFC PATCH 00/14] Introducing AF_PACKET V4 support (AF_XDP or AF_CHANNEL?) Jesper Dangaard Brouer
2017-11-14 19:01     ` Björn Töpel
2017-11-16  8:00       ` Jesper Dangaard Brouer
