* [RFC PATCH 0/2] rx zero copy interface for af_packet
@ 2017-01-27 21:33 John Fastabend
  2017-01-27 21:33 ` [RFC PATCH 1/2] af_packet: direct dma for packet ineterface John Fastabend
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: John Fastabend @ 2017-01-27 21:33 UTC (permalink / raw)
  To: bjorn.topel, jasowang, ast, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev, john.fastabend

This is an experimental implementation of rx zero copy for af_packet.
It's a bit rough and likely has errors, but the plan is to clean it up
over the next few months.

And seeing as I said in another thread a few days back that I would post
it, here it is.

Comments welcome and use at your own risk.

Thanks,
John


---

John Fastabend (2):
      af_packet: direct dma for packet ineterface
      ixgbe: add af_packet direct copy support


 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  255 +++++++++++++++++++++++++
 include/linux/netdevice.h                     |    8 +
 include/net/af_packet.h                       |   64 ++++++
 include/uapi/linux/if_packet.h                |    1 
 net/packet/af_packet.c                        |   37 ++++
 net/packet/internal.h                         |   60 ------
 tools/testing/selftests/net/psock_tpacket.c   |   51 ++++-
 8 files changed, 410 insertions(+), 69 deletions(-)
 create mode 100644 include/net/af_packet.h


* [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-27 21:33 [RFC PATCH 0/2] rx zero copy interface for af_packet John Fastabend
@ 2017-01-27 21:33 ` John Fastabend
  2017-01-30 18:16   ` Jesper Dangaard Brouer
  2017-02-04  3:10   ` Jason Wang
  2017-01-27 21:34 ` [RFC PATCH 2/2] ixgbe: add af_packet direct copy support John Fastabend
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: John Fastabend @ 2017-01-27 21:33 UTC (permalink / raw)
  To: bjorn.topel, jasowang, ast, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev, john.fastabend

This adds ndo ops for upper layer objects to request direct DMA from
the network interface into memory "slots". The slots must be DMA'able
memory given by a page/offset/size vector in a packet_ring_buffer
structure.

The PF_PACKET socket interface can use these ndo_ops to do zerocopy
RX from the network device into memory mapped userspace memory. For
this to work drivers encode the correct descriptor blocks and headers
so that existing PF_PACKET applications work without any modification.
This only supports the V2 header format for now, and works by mapping
a ring of the network device to these slots. Originally I tried the V3
header format but it complicates the driver a bit.

V3 header formats added bulk polling via socket calls and timers
used in the polling interface to return every n milliseconds. Currently,
I don't see any way to support this in hardware because we can't
know if the hardware is in the middle of a DMA operation or not
on a slot. So when a timer fires I don't know how to advance the
descriptor ring leaving empty descriptors similar to how the software
ring works. The easiest (best?) route is to simply not support this.

It might be worth creating a new v4 header that is simple for drivers
to support direct DMA ops with. I can imagine using the xdp_buff
structure as a header for example. Thoughts?

The ndo operations and new socket option PACKET_RX_DIRECT work by
giving a queue_index to run the direct dma operations over. Once
setsockopt returns successfully the indicated queue is mapped
directly to the requesting application and can not be used for
other purposes. Also any kernel layers such as tc will be bypassed
and need to be implemented in the hardware via some other mechanism
such as tc offload or other offload interfaces.

Users steer traffic to the selected queue using flow director,
tc offload infrastructure or via macvlan offload.
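
For example, with ntuple/flow director filters a UDP flow can be pinned
to a queue with something like the following (device name, port and queue
index are only illustrative):

     ethtool -K eth3 ntuple on
     ethtool -N eth3 flow-type udp4 dst-port 5000 action 1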

The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
It takes a single unsigned int value specifying the queue index,

     setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
		&queue_index, sizeof(queue_index));

Implementing busy_poll support will allow userspace to kick the
driver's receive routine if needed. This work is TBD.

To test this I hacked a hardcoded test into the tool psock_tpacket
in the selftests kernel directory here:

     ./tools/testing/selftests/net/psock_tpacket.c

Running this tool opens a socket and listens for packets over
the PACKET_RX_DIRECT enabled socket. Obviously it needs to be
reworked to enable all the older tests and not hardcode my
interface before it actually gets released.
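
For reference, the setup the hacked test does boils down to roughly the
following. This is only a minimal sketch following the existing
TPACKET_V2 flow in psock_tpacket.c; the "eth3" interface name and queue
index 1 are placeholders, and includes/error checking are omitted:

     struct tpacket_req req = {
             .tp_block_size = getpagesize(),
             .tp_frame_size = 2048,
             .tp_block_nr   = 256,
             .tp_frame_nr   = (getpagesize() / 2048) * 256,
     };
     struct sockaddr_ll ll = {
             .sll_family   = PF_PACKET,
             .sll_protocol = htons(ETH_P_ALL),
             .sll_ifindex  = if_nametoindex("eth3"),
     };
     int ver = TPACKET_V2, queue_index = 1;
     int sock = socket(PF_PACKET, SOCK_RAW, 0);
     void *ring;

     setsockopt(sock, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
     setsockopt(sock, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));
     ring = mmap(NULL, (size_t)req.tp_block_size * req.tp_block_nr,
                 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, sock, 0);
     bind(sock, (struct sockaddr *)&ll, sizeof(ll));

     /* new in this series: map rx queue 1 of the device to this ring */
     setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
                &queue_index, sizeof(queue_index));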

In general this is a rough patch to explore the interface and
put something concrete up for debate. The patch does not handle
all the error cases correctly and needs to be cleaned up.

Known Limitations (TBD):

     (1) Users are required to match the number of rx ring
         slots with ethtool to the number requested by the
         setsockopt PF_PACKET layout (see the ethtool example
         after this list). In the future we could possibly do
         this automatically.

     (2) Users need to configure Flow director or setup_tc
         to steer traffic to the correct queues. I don't believe
         this needs to be changed; it seems to be a good mechanism
         for driving directed dma.

     (3) Not supporting timestamps or priv space yet, pushing
	 a v4 packet header would resolve this nicely.

     (4) Only RX supported so far. TX already supports a direct
         DMA interface but uses skbs, which is really not needed.
         In the TX_RING case we can optimize this path as well.
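
For (1), matching the rings today is a manual ethtool step; the hardware
rx ring has to end up with exactly tp_frame_nr descriptors. Something
like the following, with device name and count purely illustrative:

     ethtool -g eth3          # show the current rx ring size
     ethtool -G eth3 rx 512   # resize it to match tp_frame_nr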

To support the TX case we can do a similar "slots" mechanism and
kick operation. The kick could be a busy_poll like operation
but on the TX side. The flow would be: user space loads up
n slots with packets, kicks the tx busy poll bit, the
driver sends the packets, and finally when xmit is complete
clears the header bits to give the slots back. When we have qdisc
bypass set today we already bypass the entire stack, so there is no
particular reason to use skb's in this case. Using xdp_buff
as a v4 packet header would also allow us to consolidate
driver code.
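
For comparison, the existing TPACKET_V2 TX_RING flow that such a kick
would replace looks roughly like this (sketch only; frame/pkt/sock/pfd
come from the usual ring setup as in psock_tpacket.c, error handling
omitted, and the sendto() is today's "kick"):

     /* frame points at one slot of the mmap'ed TX ring */
     struct tpacket2_hdr *hdr = frame;
     void *data = (uint8_t *)frame + TPACKET2_HDRLEN -
                  sizeof(struct sockaddr_ll);

     memcpy(data, pkt, pkt_len);              /* load the slot          */
     hdr->tp_len = pkt_len;
     hdr->tp_snaplen = pkt_len;
     hdr->tp_status = TP_STATUS_SEND_REQUEST; /* hand slot to kernel    */

     sendto(sock, NULL, 0, MSG_DONTWAIT, NULL, 0);

     while (hdr->tp_status != TP_STATUS_AVAILABLE)
             poll(&pfd, 1, 1);                /* wait for xmit complete */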

To be done:

     (1) More testing and performance analysis
     (2) Busy polling sockets
     (3) Implement v4 xdp_buff headers for analysis
     (4) performance testing :/ hopefully it looks good.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/netdevice.h                   |    8 +++
 include/net/af_packet.h                     |   64 +++++++++++++++++++++++++++
 include/uapi/linux/if_packet.h              |    1 
 net/packet/af_packet.c                      |   37 ++++++++++++++++
 net/packet/internal.h                       |   60 -------------------------
 tools/testing/selftests/net/psock_tpacket.c |   51 +++++++++++++++++++---
 6 files changed, 154 insertions(+), 67 deletions(-)
 create mode 100644 include/net/af_packet.h

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9bde955..a64a333 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -54,6 +54,8 @@
 #include <uapi/linux/pkt_cls.h>
 #include <linux/hashtable.h>
 
+#include <net/af_packet.h>
+
 struct netpoll_info;
 struct device;
 struct phy_device;
@@ -1324,6 +1326,12 @@ struct net_device_ops {
 						       int needed_headroom);
 	int			(*ndo_xdp)(struct net_device *dev,
 					   struct netdev_xdp *xdp);
+	int			(*ndo_ddma_map)(struct net_device *dev,
+					unsigned int rindex,
+					struct sock *sk,
+					struct packet_ring_buffer *rb);
+	void			(*ndo_ddma_unmap)(struct net_device *dev,
+						  unsigned int rindex);
 };
 
 /**
diff --git a/include/net/af_packet.h b/include/net/af_packet.h
new file mode 100644
index 0000000..9e82ba1
--- /dev/null
+++ b/include/net/af_packet.h
@@ -0,0 +1,64 @@
+#include <linux/timer.h>
+
+struct pgv {
+	char *buffer;
+};
+
+/* kbdq - kernel block descriptor queue */
+struct tpacket_kbdq_core {
+	struct pgv	*pkbdq;
+	unsigned int	feature_req_word;
+	unsigned int	hdrlen;
+	unsigned char	reset_pending_on_curr_blk;
+	unsigned char   delete_blk_timer;
+	unsigned short	kactive_blk_num;
+	unsigned short	blk_sizeof_priv;
+
+
+	/* last_kactive_blk_num:
+	 * trick to see if user-space has caught up
+	 * in order to avoid refreshing timer when every single pkt arrives.
+	 */
+	unsigned short	last_kactive_blk_num;
+
+	char		*pkblk_start;
+	char		*pkblk_end;
+	int		kblk_size;
+	unsigned int	max_frame_len;
+	unsigned int	knum_blocks;
+	uint64_t	knxt_seq_num;
+	char		*prev;
+	char		*nxt_offset;
+	struct sk_buff	*skb;
+
+	atomic_t	blk_fill_in_prog;
+
+	/* Default is set to 8ms */
+#define DEFAULT_PRB_RETIRE_TOV	(8)
+
+	unsigned short  retire_blk_tov;
+	unsigned short  version;
+	unsigned long	tov_in_jiffies;
+
+	/* timer to retire an outstanding block */
+	struct timer_list retire_blk_timer;
+};
+
+struct packet_ring_buffer {
+	struct pgv		*pg_vec;
+
+	unsigned int		head;
+	unsigned int		frames_per_block;
+	unsigned int		frame_size;
+	unsigned int		frame_max;
+
+	unsigned int		pg_vec_order;
+	unsigned int		pg_vec_pages;
+	unsigned int		pg_vec_len;
+
+	unsigned int __percpu	*pending_refcnt;
+
+	bool			ddma;
+
+	struct tpacket_kbdq_core prb_bdqc;
+};
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 9e7edfd..04b069a 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -56,6 +56,7 @@ struct sockaddr_ll {
 #define PACKET_QDISC_BYPASS		20
 #define PACKET_ROLLOVER_STATS		21
 #define PACKET_FANOUT_DATA		22
+#define PACKET_RX_DIRECT		23
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3d555c7..180666f 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3731,6 +3731,34 @@ static void packet_flush_mclist(struct sock *sk)
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+	case PACKET_RX_DIRECT:
+	{
+		struct packet_ring_buffer *rb = &po->rx_ring;
+		struct net_device *dev;
+		unsigned int index;
+		int err;
+
+		if (optlen != sizeof(index))
+			return -EINVAL;
+		if (copy_from_user(&index, optval, sizeof(index)))
+			return -EFAULT;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_ddma_map)
+			return -EOPNOTSUPP;
+		if (!atomic_read(&po->mapped))
+			return -EINVAL;
+
+		err =  dev->netdev_ops->ndo_ddma_map(dev, index, sk, rb);
+		if (!err)
+			rb->ddma = true;
+		return err;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -4228,6 +4256,15 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
 		if (atomic_read(&po->mapped))
 			pr_err("packet_mmap: vma is busy: %d\n",
 			       atomic_read(&po->mapped));
+
+		if (rb->ddma) {
+			struct net_device *dev =
+				__dev_get_by_index(sock_net(sk), po->ifindex);
+
+			if (dev && dev->netdev_ops->ndo_ddma_map)
+				dev->netdev_ops->ndo_ddma_unmap(dev, 0);
+			rb->ddma = false;
+		}
 	}
 	mutex_unlock(&po->pg_vec_lock);
 
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 9ee4631..4eec79e 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -10,66 +10,6 @@ struct packet_mclist {
 	unsigned char		addr[MAX_ADDR_LEN];
 };
 
-/* kbdq - kernel block descriptor queue */
-struct tpacket_kbdq_core {
-	struct pgv	*pkbdq;
-	unsigned int	feature_req_word;
-	unsigned int	hdrlen;
-	unsigned char	reset_pending_on_curr_blk;
-	unsigned char   delete_blk_timer;
-	unsigned short	kactive_blk_num;
-	unsigned short	blk_sizeof_priv;
-
-	/* last_kactive_blk_num:
-	 * trick to see if user-space has caught up
-	 * in order to avoid refreshing timer when every single pkt arrives.
-	 */
-	unsigned short	last_kactive_blk_num;
-
-	char		*pkblk_start;
-	char		*pkblk_end;
-	int		kblk_size;
-	unsigned int	max_frame_len;
-	unsigned int	knum_blocks;
-	uint64_t	knxt_seq_num;
-	char		*prev;
-	char		*nxt_offset;
-	struct sk_buff	*skb;
-
-	atomic_t	blk_fill_in_prog;
-
-	/* Default is set to 8ms */
-#define DEFAULT_PRB_RETIRE_TOV	(8)
-
-	unsigned short  retire_blk_tov;
-	unsigned short  version;
-	unsigned long	tov_in_jiffies;
-
-	/* timer to retire an outstanding block */
-	struct timer_list retire_blk_timer;
-};
-
-struct pgv {
-	char *buffer;
-};
-
-struct packet_ring_buffer {
-	struct pgv		*pg_vec;
-
-	unsigned int		head;
-	unsigned int		frames_per_block;
-	unsigned int		frame_size;
-	unsigned int		frame_max;
-
-	unsigned int		pg_vec_order;
-	unsigned int		pg_vec_pages;
-	unsigned int		pg_vec_len;
-
-	unsigned int __percpu	*pending_refcnt;
-
-	struct tpacket_kbdq_core	prb_bdqc;
-};
-
 extern struct mutex fanout_mutex;
 #define PACKET_FANOUT_MAX	256
 
diff --git a/tools/testing/selftests/net/psock_tpacket.c b/tools/testing/selftests/net/psock_tpacket.c
index 24adf70..32514f3 100644
--- a/tools/testing/selftests/net/psock_tpacket.c
+++ b/tools/testing/selftests/net/psock_tpacket.c
@@ -133,6 +133,20 @@ static void status_bar_update(void)
 	}
 }
 
+static void print_payload(void *pay, size_t len)
+{
+	unsigned char *payload = pay;
+	int i;
+
+	printf("payload (bytes %lu): ", len);
+	for (i = 0; i < len; i++) {
+		if ((i % 32) == 0)
+			printf("\n");
+		printf("0x%02x ", payload[i]);
+	}
+	printf("\n");
+}
+
 static void test_payload(void *pay, size_t len)
 {
 	struct ethhdr *eth = pay;
@@ -148,6 +162,7 @@ static void test_payload(void *pay, size_t len)
 			"type: 0x%x!\n", ntohs(eth->h_proto));
 		exit(1);
 	}
+	print_payload(pay, len);
 }
 
 static void create_payload(void *pay, size_t *len)
@@ -232,21 +247,21 @@ static inline void __v1_v2_rx_user_ready(void *base, int version)
 static void walk_v1_v2_rx(int sock, struct ring *ring)
 {
 	struct pollfd pfd;
-	int udp_sock[2];
+	//int udp_sock[2];
 	union frame_map ppd;
 	unsigned int frame_num = 0;
 
 	bug_on(ring->type != PACKET_RX_RING);
 
-	pair_udp_open(udp_sock, PORT_BASE);
-	pair_udp_setfilter(sock);
+	//pair_udp_open(udp_sock, PORT_BASE);
+	//pair_udp_setfilter(sock);
 
 	memset(&pfd, 0, sizeof(pfd));
 	pfd.fd = sock;
 	pfd.events = POLLIN | POLLERR;
 	pfd.revents = 0;
 
-	pair_udp_send(udp_sock, NUM_PACKETS);
+	//pair_udp_send(udp_sock, NUM_PACKETS);
 
 	while (total_packets < NUM_PACKETS * 2) {
 		while (__v1_v2_rx_kernel_ready(ring->rd[frame_num].iov_base,
@@ -257,6 +272,9 @@ static void walk_v1_v2_rx(int sock, struct ring *ring)
 			case TPACKET_V1:
 				test_payload((uint8_t *) ppd.raw + ppd.v1->tp_h.tp_mac,
 					     ppd.v1->tp_h.tp_snaplen);
+				print_payload((uint8_t *) ppd.raw +
+						ppd.v2->tp_h.tp_mac,
+					      ppd.v2->tp_h.tp_snaplen);
 				total_bytes += ppd.v1->tp_h.tp_snaplen;
 				break;
 
@@ -278,7 +296,7 @@ static void walk_v1_v2_rx(int sock, struct ring *ring)
 		poll(&pfd, 1, 1);
 	}
 
-	pair_udp_close(udp_sock);
+	//pair_udp_close(udp_sock);
 
 	if (total_packets != 2 * NUM_PACKETS) {
 		fprintf(stderr, "walk_v%d_rx: received %u out of %u pkts\n",
@@ -372,7 +390,8 @@ static void walk_v1_v2_tx(int sock, struct ring *ring)
 
 	pair_udp_setfilter(rcv_sock);
 
-	ll.sll_ifindex = if_nametoindex("lo");
+	/* hacking my test up */
+	ll.sll_ifindex = if_nametoindex("eth3");
 	ret = bind(rcv_sock, (struct sockaddr *) &ll, sizeof(ll));
 	if (ret == -1) {
 		perror("bind");
@@ -687,7 +706,7 @@ static void bind_ring(int sock, struct ring *ring)
 
 	ring->ll.sll_family = PF_PACKET;
 	ring->ll.sll_protocol = htons(ETH_P_ALL);
-	ring->ll.sll_ifindex = if_nametoindex("lo");
+	ring->ll.sll_ifindex = if_nametoindex("eth3");
 	ring->ll.sll_hatype = 0;
 	ring->ll.sll_pkttype = 0;
 	ring->ll.sll_halen = 0;
@@ -755,6 +774,19 @@ static int test_user_bit_width(void)
 	[PACKET_TX_RING] = "PACKET_TX_RING",
 };
 
+void direct_dma_ring(int sock)
+{
+	int ret;
+	int index = 1;
+
+	ret = setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
+			 &index, sizeof(index));
+	if (ret == -10)
+	if (ret < 0)
+	else
+		printf("Configured a direct dma socket!\n");
+}
+
 static int test_tpacket(int version, int type)
 {
 	int sock;
@@ -777,6 +809,7 @@ static int test_tpacket(int version, int type)
 	setup_ring(sock, &ring, version, type);
 	mmap_ring(sock, &ring);
 	bind_ring(sock, &ring);
+	direct_dma_ring(sock);
 	walk_ring(sock, &ring);
 	unmap_ring(sock, &ring);
 	close(sock);
@@ -789,13 +822,17 @@ int main(void)
 {
 	int ret = 0;
 
+#if 0
 	ret |= test_tpacket(TPACKET_V1, PACKET_RX_RING);
 	ret |= test_tpacket(TPACKET_V1, PACKET_TX_RING);
+#endif
 
 	ret |= test_tpacket(TPACKET_V2, PACKET_RX_RING);
+#if 0
 	ret |= test_tpacket(TPACKET_V2, PACKET_TX_RING);
 
 	ret |= test_tpacket(TPACKET_V3, PACKET_RX_RING);
+#endif
 
 	if (ret)
 		return 1;


* [RFC PATCH 2/2] ixgbe: add af_packet direct copy support
  2017-01-27 21:33 [RFC PATCH 0/2] rx zero copy interface for af_packet John Fastabend
  2017-01-27 21:33 ` [RFC PATCH 1/2] af_packet: direct dma for packet ineterface John Fastabend
@ 2017-01-27 21:34 ` John Fastabend
  2017-01-31  2:53   ` Alexei Starovoitov
  2017-01-30 22:02 ` [RFC PATCH 0/2] rx zero copy interface for af_packet David Miller
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2017-01-27 21:34 UTC (permalink / raw)
  To: bjorn.topel, jasowang, ast, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev, john.fastabend

This implements the ndo ops for direct dma socket option. This is
to start looking at the interface and driver work needed to enable
it.

Note error paths are not handled and I'm aware of a few bugs.
For example, the interface must be up before attaching the socket or else
it will fail silently. TBD: fix all these things.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  255 +++++++++++++++++++++++++
 2 files changed, 256 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index ef81c3d..198f90c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -266,6 +266,7 @@ struct ixgbe_ring {
 		struct ixgbe_tx_buffer *tx_buffer_info;
 		struct ixgbe_rx_buffer *rx_buffer_info;
 	};
+	unsigned int buffer_size;
 	unsigned long state;
 	u8 __iomem *tail;
 	dma_addr_t dma;			/* phys. address of descriptor ring */
@@ -299,6 +300,8 @@ struct ixgbe_ring {
 		struct ixgbe_tx_queue_stats tx_stats;
 		struct ixgbe_rx_queue_stats rx_stats;
 	};
+	bool ddma;	      /* ring data buffers mapped to userspace */
+	struct sock *rx_kick; /* rx kick userspace */
 } ____cacheline_internodealigned_in_smp;
 
 enum ixgbe_ring_f_enum {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1e2f39e..c5ab44a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1608,8 +1608,19 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
 	i -= rx_ring->count;
 
 	do {
-		if (!ixgbe_alloc_mapped_page(rx_ring, bi))
-			break;
+		/* If direct dma has not released packet yet stop */
+		if (rx_ring->ddma) {
+			int align = TPACKET_ALIGN(TPACKET2_HDRLEN);
+			int hdrlen = ALIGN(align, L1_CACHE_BYTES);
+			struct tpacket2_hdr *hdr;
+
+			hdr = page_address(bi->page) + bi->page_offset - hdrlen;
+			if (unlikely(TP_STATUS_USER & hdr->tp_status))
+				break;
+		} else {
+			if (!ixgbe_alloc_mapped_page(rx_ring, bi))
+				break;
+		}
 
 		/*
 		 * Refresh the desc even if buffer_addrs didn't change
@@ -2005,6 +2016,97 @@ static bool ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring,
 	return true;
 }
 
+/* ixgbe_do_ddma - direct dma routine to populate PACKET_RX_RING mmap region
+ *
+ * The packet socket interface builds a shared memory region using mmap after
+ * it is specified by the PACKET_RX_RING socket option. This will create a
+ * circular ring of memory slots. Typical software usage case copies the skb
+ * into these pages via tpacket_rcv() routine.
+ *
+ * Here we do direct DMA from the hardware (82599 in this case) into the
+ * mmap regions and populate the uhdr (think user space descriptor). This
+ * requires the hardware to support Scatter Gather and HighDMA which should
+ * be standard on most (all?) 10/40 Gbps devices.
+ *
+ * The buffer mapping should have already been done so that rx_buffer pages
+ * are handed to the driver from the mmap setup done at the socket layer.
+ *
+ * See ./include/uapi/linux/if_packet.h for details on packet layout here
+ * we can only use tpacket2_hdr type. v3 of the header type introduced bulk
+ * polling modes which do not work correctly with hardware DMA engine. The
+ * primary issue is we can not stop a DMA transaction from occurring after it
+ * has been configured. What results is the software timer advances the
+ * ring ahead of the hardware and the ring state is lost. Maybe there is
+ * a clever way to resolve this but I haven't thought of it yet.
+ */
+static int ixgbe_do_ddma(struct ixgbe_ring *rx_ring,
+			 union ixgbe_adv_rx_desc *rx_desc)
+{
+	int hdrlen = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES);
+	struct ixgbe_adapter *adapter = netdev_priv(rx_ring->netdev);
+	struct ixgbe_rx_buffer *rx_buffer;
+	struct tpacket2_hdr *h2; /* userspace descriptor */
+	struct sockaddr_ll *sll;
+	struct ethhdr *eth;
+	int len = 0;
+	u64 ns = 0;
+	s32 rem;
+
+	rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
+	if (!rx_buffer->dma)
+		return -EBUSY;
+
+	prefetchw(rx_buffer->page);
+
+	/* test for any known error cases */
+	WARN_ON(ixgbe_test_staterr(rx_desc,
+				   IXGBE_RXDADV_ERR_FRAME_ERR_MASK) &&
+				   !(rx_ring->netdev->features & NETIF_F_RXALL));
+
+	dma_sync_single_range_for_cpu(rx_ring->dev,
+				      rx_buffer->dma,
+				      rx_buffer->page_offset,
+				      rx_ring->buffer_size,
+				      DMA_FROM_DEVICE);
+
+	/* Update the last_rx_timestamp timer in order to enable watchdog check
+	 * for error case of latched timestamp on a dropped packet.
+	 */
+	adapter->last_rx_timestamp = jiffies;
+	h2 = page_address(rx_buffer->page) + rx_buffer->page_offset - hdrlen;
+	eth = page_address(rx_buffer->page) + rx_buffer->page_offset,
+
+	/* This indicates a bug in ixgbe leaving for testing purposes */
+	WARN_ON(TP_STATUS_USER & h2->tp_status);
+	len = le16_to_cpu(rx_desc->wb.upper.length);
+	h2->tp_len = len;
+	h2->tp_snaplen = len;
+	h2->tp_mac = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES);
+	h2->tp_net = h2->tp_mac + ETH_HLEN;
+	h2->tp_sec = div_s64_rem(ns, NSEC_PER_SEC, &rem);
+	h2->tp_nsec = rem;
+
+	sll = (void *)h2 + TPACKET_ALIGN(sizeof(struct tpacket2_hdr));
+	sll->sll_halen = ETH_HLEN;
+	memcpy(sll->sll_addr, eth->h_source, ETH_ALEN);
+	sll->sll_family = AF_PACKET;
+	sll->sll_hatype = rx_ring->netdev->type;
+	sll->sll_protocol = eth->h_proto;
+	sll->sll_pkttype = PACKET_HOST;
+	sll->sll_ifindex = rx_ring->netdev->ifindex;
+
+	smp_mb();
+	h2->tp_status = TP_STATUS_USER;
+	rx_ring->rx_kick->sk_data_ready(rx_ring->rx_kick);
+
+	/* TBD handle non-EOP frames? - I think this is an invalid case
+	 * assuming ring slots are setup correctly.
+	 */
+	if (ixgbe_is_non_eop(rx_ring, rx_desc, NULL))
+		e_warn(drv, "Direct DMA received non-eop descriptor!");
+	return len;
+}
+
 static struct sk_buff *ixgbe_fetch_rx_buffer(struct ixgbe_ring *rx_ring,
 					     union ixgbe_adv_rx_desc *rx_desc)
 {
@@ -2134,6 +2236,21 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		 */
 		dma_rmb();
 
+		/* If we use direct DMA to shmem then we do not need SKBs
+		 * because user space descriptors are populated directly.
+		 */
+		if (rx_ring->ddma) {
+			int len = ixgbe_do_ddma(rx_ring, rx_desc);
+
+			if (len) {
+				total_rx_packets++;
+				total_rx_bytes += len;
+				cleaned_count++;
+				continue;
+			}
+			break;
+		}
+
 		/* retrieve a buffer from the ring */
 		skb = ixgbe_fetch_rx_buffer(rx_ring, rx_desc);
 
@@ -4887,6 +5004,12 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring)
 	if (!rx_ring->rx_buffer_info)
 		return;
 
+	/* Buffers are owned by direct dma endpoint so we should not free
+	 * them here and instead wait for endpoint to claim buffers.
+	 */
+	if (rx_ring->ddma)
+		return;
+
 	/* Free all the Rx ring sk_buffs */
 	for (i = 0; i < rx_ring->count; i++) {
 		struct ixgbe_rx_buffer *rx_buffer = &rx_ring->rx_buffer_info[i];
@@ -9264,6 +9387,132 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	return features;
 }
 
+static int ixgbe_ddma_map(struct net_device *dev, unsigned int ring,
+			  struct sock *sk, struct packet_ring_buffer *prb)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_ring *rx_ring = adapter->rx_ring[ring];
+	unsigned int frames_per_blk = prb->frames_per_block;
+	unsigned int blk_nr = prb->pg_vec_len;
+	struct ixgbe_rx_buffer *bi;
+	int i, j, err = 0;
+
+	/* Verify we are given a valid ring */
+	if (ring >= adapter->num_rx_queues)
+		return -EINVAL;
+	/* Verify we have enough descriptors to support user space ring */
+	if (!frames_per_blk || ((blk_nr * frames_per_blk) != rx_ring->count)) {
+		e_warn(drv, "ddma map requires %i ring slots\n",
+		       blk_nr * frames_per_blk);
+		return -EBUSY;
+	}
+
+	/* Bring the queue down to point buffers at user space */
+	ixgbe_disable_rx_queue(adapter, rx_ring);
+	usleep_range(10000, 20000);
+	ixgbe_irq_disable_queues(adapter, ((u64)1 << ring));
+	ixgbe_clean_rx_ring(rx_ring);
+
+	rx_ring->buffer_size = prb->frame_size;
+
+	/* In Direct DMA mode the descriptor block and tpacket headers are
+	 * held in fixed locations so we can pre-populate most of the fields.
+	 * Similarly the pages, offsets, and sizes for DMA are pre-calculated
+	 * to align with the user space ring and do not need to change.
+	 *
+	 * The layout is fixed to match the PF_PACKET layer in /net/packet/
+	 * which also invokes this routine via ndo_ddma_map().
+	 *
+	 * Couple design comments: Should we create v4 protocol rather than
+	 * trying to leverage v2 for compat. Perhaps. And similarly should
+	 * we generalize the packet_ring_buffer onto something that can be
+	 * generalized for other users?
+	 */
+	for (i = 0; i < blk_nr; i++) {
+		char *kaddr = prb->pg_vec[i].buffer;
+		unsigned int blk_size;
+		struct page *page;
+		size_t offset = 0;
+		dma_addr_t dma;
+
+		/* For DMA usage we can't use vmalloc */
+		if (is_vmalloc_addr(kaddr)) {
+			e_warn(drv, "vmalloc pages not supported in ddma\n");
+			err = -EINVAL;
+			goto unwind_map;
+		}
+
+		/* map page for use */
+		page = virt_to_page(kaddr);
+		blk_size = PAGE_SIZE << prb->pg_vec_order;
+		dma = dma_map_page(rx_ring->dev, page, 0,
+				   blk_size, DMA_FROM_DEVICE);
+		if (dma_mapping_error(rx_ring->dev, dma)) {
+			e_warn(drv, "ddma mapping error DMA_FROM_DEVICE\n");
+			rx_ring->rx_stats.alloc_rx_page_failed++;
+			err = -EBUSY;
+			goto unwind_map;
+		}
+
+		/* We may be able to push multiple frames per block in this case
+		 * set offset correctly to set pkt_addr correctly in descriptor.
+		 */
+		for (j = 0; j < frames_per_blk; j++) {
+			int hdrlen = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN),
+					   L1_CACHE_BYTES);
+			int b = ((i * frames_per_blk) + j);
+
+			bi = &rx_ring->rx_buffer_info[b];
+			if (!bi) // TBD abort here this just stops a panic
+				continue;
+			bi->page = page;
+			bi->dma = dma;
+
+			/* ignore priv for now */
+			bi->page_offset = offset + hdrlen;
+			offset += rx_ring->buffer_size;
+		}
+	}
+
+	rx_ring->ddma = true;
+	rx_ring->rx_kick = sk;
+unwind_map:
+	//tbd error path handling
+	ixgbe_configure_rx_ring(adapter, rx_ring);
+	return err;
+}
+
+static void ixgbe_ddma_unmap(struct net_device *dev, unsigned int ring)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_ring *rx_ring = adapter->rx_ring[ring];
+	int i;
+
+	rx_ring->ddma = false;
+	// tbd live traffic here(?) need to stop queue
+
+	/* Free all the Rx ring sk_buffs */
+	for (i = 0; i < rx_ring->count; i++) {
+		struct ixgbe_rx_buffer *rx_buffer;
+
+		rx_buffer = &rx_ring->rx_buffer_info[i];
+		if (!rx_buffer)
+			continue;
+		if (rx_buffer->dma)
+			dma_unmap_page(rx_ring->dev, rx_buffer->dma,
+				       rx_buffer->page_offset,
+				       DMA_FROM_DEVICE);
+
+		rx_buffer->dma = 0;
+		rx_buffer->page_offset = 0;
+		rx_buffer->page = NULL;
+	}
+
+	rtnl_lock();
+	ixgbe_setup_tc(dev, netdev_get_num_tc(dev));
+	rtnl_unlock();
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -9312,6 +9561,8 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	.ndo_udp_tunnel_add	= ixgbe_add_udp_tunnel_port,
 	.ndo_udp_tunnel_del	= ixgbe_del_udp_tunnel_port,
 	.ndo_features_check	= ixgbe_features_check,
+	.ndo_ddma_map		= ixgbe_ddma_map,
+	.ndo_ddma_unmap		= ixgbe_ddma_unmap,
 };
 
 /**


* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-27 21:33 ` [RFC PATCH 1/2] af_packet: direct dma for packet ineterface John Fastabend
@ 2017-01-30 18:16   ` Jesper Dangaard Brouer
  2017-01-30 21:51     ` John Fastabend
  2017-02-04  3:10   ` Jason Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-30 18:16 UTC (permalink / raw)
  To: John Fastabend
  Cc: bjorn.topel, jasowang, ast, alexander.duyck, john.r.fastabend,
	netdev, brouer

On Fri, 27 Jan 2017 13:33:44 -0800 John Fastabend <john.fastabend@gmail.com> wrote:

> This adds ndo ops for upper layer objects to request direct DMA from
> the network interface into memory "slots". The slots must be DMA'able
> memory given by a page/offset/size vector in a packet_ring_buffer
> structure.
> 
> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
> RX from the network device into memory mapped userspace memory. For
> this to work drivers encode the correct descriptor blocks and headers
> so that existing PF_PACKET applications work without any modification.
> This only supports the V2 header formats for now. And works by mapping
> a ring of the network device to these slots. Originally I used V2
> header formats but this does complicate the driver a bit.
> 
> V3 header formats added bulk polling via socket calls and timers
> used in the polling interface to return every n milliseconds. Currently,
> I don't see any way to support this in hardware because we can't
> know if the hardware is in the middle of a DMA operation or not
> on a slot. So when a timer fires I don't know how to advance the
> descriptor ring leaving empty descriptors similar to how the software
> ring works. The easiest (best?) route is to simply not support this.

From a performance pov bulking is essential. Systems like netmap that
also depend on transferring control between kernel and userspace
report[1] that they need a bulking size of at least 8 to amortize the
overhead.

[1] Figure 7, page 10, http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf


> It might be worth creating a new v4 header that is simple for drivers
> to support direct DMA ops with. I can imagine using the xdp_buff
> structure as a header for example. Thoughts?

Likely, but I would like us to take a measurement-based approach.  Let's
benchmark with this V2 header format, and see how far we are from
target, and see what lights up in perf report and if it is something we
can address.


> The ndo operations and new socket option PACKET_RX_DIRECT work by
> giving a queue_index to run the direct dma operations over. Once
> setsockopt returns successfully the indicated queue is mapped
> directly to the requesting application and can not be used for
> other purposes. Also any kernel layers such as tc will be bypassed
> and need to be implemented in the hardware via some other mechanism
> such as tc offload or other offload interfaces.

Will this need to bypass XDP too?

E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
packet with this solution (and place it on a TX queue and wait for DMA
TX completion).


> Users steer traffic to the selected queue using flow director,
> tc offload infrastructure or via macvlan offload.
> 
> The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
> It takes a single unsigned int value specifying the queue index,
> 
>      setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
> 		&queue_index, sizeof(queue_index));
> 
> Implementing busy_poll support will allow userspace to kick the
> drivers receive routine if needed. This work is TBD.
> 
> To test this I hacked a hardcoded test into  the tool psock_tpacket
> in the selftests kernel directory here:
> 
>      ./tools/testing/selftests/net/psock_tpacket.c
> 
> Running this tool opens a socket and listens for packets over
> the PACKET_RX_DIRECT enabled socket. Obviously it needs to be
> reworked to enable all the older tests and not hardcode my
> interface before it actually gets released.
> 
> In general this is a rough patch to explore the interface and
> put something concrete up for debate. The patch does not handle
> all the error cases correctly and needs to be cleaned up.
> 
> Known Limitations (TBD):
> 
>      (1) Users are required to match the number of rx ring
>          slots with ethtool to the number requested by the
>          setsockopt PF_PACKET layout. In the future we could
>          possibly do this automatically.
> 
>      (2) Users need to configure Flow director or setup_tc
>          to steer traffic to the correct queues. I don't believe
>          this needs to be changed it seems to be a good mechanism
>          for driving directed dma.
> 
>      (3) Not supporting timestamps or priv space yet, pushing
> 	 a v4 packet header would resolve this nicely.
> 
>      (5) Only RX supported so far. TX already supports direct DMA
>          interface but uses skbs which is really not needed. In
>          the TX_RING case we can optimize this path as well.
> 
> To support TX case we can do a similar "slots" mechanism and
> kick operation. The kick could be a busy_poll like operation
> but on the TX side. The flow would be user space loads up
> n number of slots with packets, kicks tx busy poll bit, the
> driver sends packets, and finally when xmit is complete
> clears header bits to give slots back. When we have qdisc
> bypass set today we already bypass the entire stack so no
> paticular reason to use skb's in this case. Using xdp_buff
> as a v4 packet header would also allow us to consolidate
> driver code.
> 
> To be done:
> 
>      (1) More testing and performance analysis
>      (2) Busy polling sockets
>      (3) Implement v4 xdp_buff headers for analysis
>      (4) performance testing :/ hopefully it looks good.

I guess I don't understand the details of the af_packet versions well
enough, but can you explain to me how userspace knows what slots it
can read/fetch, and how it marks when it is complete/finished so the
kernel knows it can reuse this slot?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-30 18:16   ` Jesper Dangaard Brouer
@ 2017-01-30 21:51     ` John Fastabend
  2017-01-31  1:31       ` Willem de Bruijn
  2017-01-31 12:20       ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 18+ messages in thread
From: John Fastabend @ 2017-01-30 21:51 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bjorn.topel, jasowang, ast, alexander.duyck, john.r.fastabend, netdev

On 17-01-30 10:16 AM, Jesper Dangaard Brouer wrote:
> On Fri, 27 Jan 2017 13:33:44 -0800 John Fastabend <john.fastabend@gmail.com> wrote:
> 
>> This adds ndo ops for upper layer objects to request direct DMA from
>> the network interface into memory "slots". The slots must be DMA'able
>> memory given by a page/offset/size vector in a packet_ring_buffer
>> structure.
>>
>> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
>> RX from the network device into memory mapped userspace memory. For
>> this to work drivers encode the correct descriptor blocks and headers
>> so that existing PF_PACKET applications work without any modification.
>> This only supports the V2 header formats for now. And works by mapping
>> a ring of the network device to these slots. Originally I used V2
>> header formats but this does complicate the driver a bit.
>>
>> V3 header formats added bulk polling via socket calls and timers
>> used in the polling interface to return every n milliseconds. Currently,
>> I don't see any way to support this in hardware because we can't
>> know if the hardware is in the middle of a DMA operation or not
>> on a slot. So when a timer fires I don't know how to advance the
>> descriptor ring leaving empty descriptors similar to how the software
>> ring works. The easiest (best?) route is to simply not support this.
> 
> From a performance pov bulking is essential. Systems like netmap that
> also depend on transferring control between kernel and userspace,
> report[1] that they need at least bulking size 8, to amortize the overhead.
> 

Bulking in general is not a problem; it can be done on top of the V2
implementation without issue. It's the poll timer that seemed a bit
clumsy to implement.
Looking at it again though I think we could do something if we cared to.
I'm not convinced we would gain much though.

Also v2 uses static buffer sizes so that every buffer is 2k or whatever
size the user configures. V3 allows the buffer size to be updated at
runtime which could be done in the drivers I suppose but would require
some ixgbe restructuring.

> [1] Figure 7, page 10, http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf
> 
> 
>> It might be worth creating a new v4 header that is simple for drivers
>> to support direct DMA ops with. I can imagine using the xdp_buff
>> structure as a header for example. Thoughts?
> 
> Likely, but I would like that we do a measurement based approach.  Lets
> benchmark with this V2 header format, and see how far we are from
> target, and see what lights-up in perf report and if it is something we
> can address.

Yep I'm hoping to get to this sometime this week.

> 
> 
>> The ndo operations and new socket option PACKET_RX_DIRECT work by
>> giving a queue_index to run the direct dma operations over. Once
>> setsockopt returns successfully the indicated queue is mapped
>> directly to the requesting application and can not be used for
>> other purposes. Also any kernel layers such as tc will be bypassed
>> and need to be implemented in the hardware via some other mechanism
>> such as tc offload or other offload interfaces.
> 
> Will this also need to bypass XDP too?

There is nothing stopping this from working with XDP but why would
you want to run XDP here?

Dropping a packet for example is not really useful because it's
already in memory user space can read. Modifying the packet seems
pointless; user space can modify it. Maybe pushing a copy of the packet
to the kernel stack is useful in some case? But I can't see where I would
want this.

> 
> E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
> packet with this solution (and place it on a TX queue and wait for DMA
> TX completion).
> 

This is something worth exploring. tpacket_v2 uses a fixed ring with
slots so all the pages are allocated and assigned to the ring at init
time. To xmit a packet in this case the user space application would
be required to leave the packet descriptor on the rx side pinned
until the tx side DMA has completed. Then it can unpin the rx side
and return it to the driver. This works if the TX/RX processing is
fast enough to keep up. For many things this is good enough.

For some work loads though this may not be sufficient. In which
case a tpacket_v4 would be useful that can push down a new set
of "slots" every n packets. Where n is sufficiently large to keep
the workload running. This is similar in many ways to virtio/vhost
interaction.

> 
>> Users steer traffic to the selected queue using flow director,
>> tc offload infrastructure or via macvlan offload.
>>
>> The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
>> It takes a single unsigned int value specifying the queue index,
>>
>>      setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
>> 		&queue_index, sizeof(queue_index));
>>
>> Implementing busy_poll support will allow userspace to kick the
>> drivers receive routine if needed. This work is TBD.
>>
>> To test this I hacked a hardcoded test into  the tool psock_tpacket
>> in the selftests kernel directory here:
>>
>>      ./tools/testing/selftests/net/psock_tpacket.c
>>
>> Running this tool opens a socket and listens for packets over
>> the PACKET_RX_DIRECT enabled socket. Obviously it needs to be
>> reworked to enable all the older tests and not hardcode my
>> interface before it actually gets released.
>>
>> In general this is a rough patch to explore the interface and
>> put something concrete up for debate. The patch does not handle
>> all the error cases correctly and needs to be cleaned up.
>>
>> Known Limitations (TBD):
>>
>>      (1) Users are required to match the number of rx ring
>>          slots with ethtool to the number requested by the
>>          setsockopt PF_PACKET layout. In the future we could
>>          possibly do this automatically.
>>
>>      (2) Users need to configure Flow director or setup_tc
>>          to steer traffic to the correct queues. I don't believe
>>          this needs to be changed it seems to be a good mechanism
>>          for driving directed dma.
>>
>>      (3) Not supporting timestamps or priv space yet, pushing
>> 	 a v4 packet header would resolve this nicely.
>>
>>      (5) Only RX supported so far. TX already supports direct DMA
>>          interface but uses skbs which is really not needed. In
>>          the TX_RING case we can optimize this path as well.
>>
>> To support TX case we can do a similar "slots" mechanism and
>> kick operation. The kick could be a busy_poll like operation
>> but on the TX side. The flow would be user space loads up
>> n number of slots with packets, kicks tx busy poll bit, the
>> driver sends packets, and finally when xmit is complete
>> clears header bits to give slots back. When we have qdisc
>> bypass set today we already bypass the entire stack so no
>> paticular reason to use skb's in this case. Using xdp_buff
>> as a v4 packet header would also allow us to consolidate
>> driver code.
>>
>> To be done:
>>
>>      (1) More testing and performance analysis
>>      (2) Busy polling sockets
>>      (3) Implement v4 xdp_buff headers for analysis
>>      (4) performance testing :/ hopefully it looks good.
> 
> Guess, I don't understand the details of the af_packet versions well
> enough, but can you explain to me, how userspace knows what slots it
> can read/fetch, and how it marks when it is complete/finished so the
> kernel knows it can reuse this slot?
> 

At init time user space allocates a ring of buffers. Each buffer has
space to hold the packet descriptor + packet payload. The API gives this
to the driver to initialize DMA engine and assign addresses. At init
time all buffers are "owned" by the driver, which is indicated by a status bit
in the descriptor header.

Userspace can spin on the status bit to know when the driver has handed
it to userspace. The driver will check the status bit before returning
the buffer to the hardware. Then a series of kicks are used to wake up
userspace (if it's not spinning) and to wake up the driver if it is overrun
and needs to return buffers into its pool (not implemented yet). The
kick to wake up the driver could in a future v4 be used to push new
buffers to the driver if needed.
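
(For completeness, the user space side of that handshake is the standard
TPACKET_V2 status-bit dance. A minimal sketch follows; names follow
psock_tpacket.c, process() is just a placeholder, and the ring/frame
bookkeeping is elided:)

     struct tpacket2_hdr *hdr = ring->rd[frame_num].iov_base;

     /* spin (or poll()) until the driver hands us the slot */
     while (!(hdr->tp_status & TP_STATUS_USER))
             poll(&pfd, 1, 1);

     /* packet data lives tp_mac bytes into the slot */
     process((uint8_t *)hdr + hdr->tp_mac, hdr->tp_snaplen);

     /* give the slot back so the driver can refill it */
     hdr->tp_status = TP_STATUS_KERNEL;
     __sync_synchronize();
     frame_num = (frame_num + 1) % ring->rd_num;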

It seems to work reasonably well but the perf numbers will let us know
more.

.John


* Re: [RFC PATCH 0/2] rx zero copy interface for af_packet
  2017-01-27 21:33 [RFC PATCH 0/2] rx zero copy interface for af_packet John Fastabend
  2017-01-27 21:33 ` [RFC PATCH 1/2] af_packet: direct dma for packet ineterface John Fastabend
  2017-01-27 21:34 ` [RFC PATCH 2/2] ixgbe: add af_packet direct copy support John Fastabend
@ 2017-01-30 22:02 ` David Miller
  2017-01-31 16:30 ` Sowmini Varadhan
  2017-01-31 19:39 ` tndave
  4 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2017-01-30 22:02 UTC (permalink / raw)
  To: john.fastabend
  Cc: bjorn.topel, jasowang, ast, alexander.duyck, brouer,
	john.r.fastabend, netdev

From: John Fastabend <john.fastabend@gmail.com>
Date: Fri, 27 Jan 2017 13:33:20 -0800

> This is an experimental implementation of rx zero copy for af_packet.
> Its a bit rough and likely has errors but the plan is to clean it up
> over the next few months.
> 
> And seeing I said I would post it in another thread a few days back
> here it is.
> 
> Comments welcome and use at your own risk.

The one thing I do like about this is the fact that the user cannot
access the DMA length and DMA address in the hardware descriptors.

So, at least in that sense, the kernel is still providing the
necessary protection between the user and the hardware device.


* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-30 21:51     ` John Fastabend
@ 2017-01-31  1:31       ` Willem de Bruijn
  2017-02-01  5:09         ` John Fastabend
  2017-01-31 12:20       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 18+ messages in thread
From: Willem de Bruijn @ 2017-01-31  1:31 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesper Dangaard Brouer, bjorn.topel, jasowang, ast,
	alexander.duyck, john.r.fastabend, Network Development

>>> V3 header formats added bulk polling via socket calls and timers
>>> used in the polling interface to return every n milliseconds. Currently,
>>> I don't see any way to support this in hardware because we can't
>>> know if the hardware is in the middle of a DMA operation or not
>>> on a slot. So when a timer fires I don't know how to advance the
>>> descriptor ring leaving empty descriptors similar to how the software
>>> ring works. The easiest (best?) route is to simply not support this.
>>
>> From a performance pov bulking is essential. Systems like netmap that
>> also depend on transferring control between kernel and userspace,
>> report[1] that they need at least bulking size 8, to amortize the overhead.

To introduce interrupt moderation, ixgbe_do_ddma only has to elide the
sk_data_ready, and schedule an hrtimer if one is not scheduled yet.

If I understand correctly, the difficulty lies in v3 requiring that the
timer "close" the block when the timer expires. That may not be worth
implementing, indeed.

Hardware interrupt moderation and napi may already give some
moderation, even with a sock_def_readable call for each packet. If
considering a v4 format, I'll again suggest virtio virtqueues. Those
have interrupt suppression built in with EVENT_IDX.

>> Likely, but I would like that we do a measurement based approach.  Lets
>> benchmark with this V2 header format, and see how far we are from
>> target, and see what lights-up in perf report and if it is something we
>> can address.
>
> Yep I'm hoping to get to this sometime this week.

Perhaps also without filling in the optional metadata fields
in tpacket and sockaddr_ll.

>> E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
>> packet with this solution (and place it on a TX queue and wait for DMA
>> TX completion).
>>
>
> This is something worth exploring. tpacket_v2 uses a fixed ring with
> slots so all the pages are allocated and assigned to the ring at init
> time. To xmit a packet in this case the user space application would
> be required to leave the packet descriptor on the rx side pinned
> until the tx side DMA has completed. Then it can unpin the rx side
> and return it to the driver. This works if the TX/RX processing is
> fast enough to keep up. For many things this is good enough.
>
> For some work loads though this may not be sufficient. In which
> case a tpacket_v4 would be useful that can push down a new set
> of "slots" every n packets. Where n is sufficiently large to keep
> the workload running.

Here, too, virtio rings may help.

The extra level of indirection allows out of order completions,
reducing the chance of running out of rx descriptors when redirecting
a subset of packets to a tx ring, as that does not block the entire ring.

And passing explicit descriptors from userspace enables pointing to
new memory regions. On the flipside, they now have to be checked for
safety against region bounds.

> This is similar in many ways to virtio/vhost interaction.

Ah, I only saw this after writing the above :)


* Re: [RFC PATCH 2/2] ixgbe: add af_packet direct copy support
  2017-01-27 21:34 ` [RFC PATCH 2/2] ixgbe: add af_packet direct copy support John Fastabend
@ 2017-01-31  2:53   ` Alexei Starovoitov
  2017-02-01  4:58     ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2017-01-31  2:53 UTC (permalink / raw)
  To: John Fastabend, bjorn.topel, jasowang, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev

On 1/27/17 1:34 PM, John Fastabend wrote:
> +	h2 = page_address(rx_buffer->page) + rx_buffer->page_offset - hdrlen;
> +	eth = page_address(rx_buffer->page) + rx_buffer->page_offset,

I don't think it compiles ;)

> +	/* This indicates a bug in ixgbe leaving for testing purposes */
> +	WARN_ON(TP_STATUS_USER & h2->tp_status);
> +	len = le16_to_cpu(rx_desc->wb.upper.length);
> +	h2->tp_len = len;
> +	h2->tp_snaplen = len;
> +	h2->tp_mac = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES);
> +	h2->tp_net = h2->tp_mac + ETH_HLEN;
> +	h2->tp_sec = div_s64_rem(ns, NSEC_PER_SEC, &rem);
> +	h2->tp_nsec = rem;
> +
> +	sll = (void *)h2 + TPACKET_ALIGN(sizeof(struct tpacket2_hdr));
> +	sll->sll_halen = ETH_HLEN;
> +	memcpy(sll->sll_addr, eth->h_source, ETH_ALEN);
> +	sll->sll_family = AF_PACKET;
> +	sll->sll_hatype = rx_ring->netdev->type;
> +	sll->sll_protocol = eth->h_proto;
> +	sll->sll_pkttype = PACKET_HOST;
> +	sll->sll_ifindex = rx_ring->netdev->ifindex;

Performance-wise it looks very expensive to do all these header copies
and the integer divide for every packet.
I think unless we move to a new dumb and simple header format the
performance of this approach is not going to be satisfactory.


* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-30 21:51     ` John Fastabend
  2017-01-31  1:31       ` Willem de Bruijn
@ 2017-01-31 12:20       ` Jesper Dangaard Brouer
  2017-02-01  5:01         ` John Fastabend
  1 sibling, 1 reply; 18+ messages in thread
From: Jesper Dangaard Brouer @ 2017-01-31 12:20 UTC (permalink / raw)
  To: John Fastabend
  Cc: bjorn.topel, jasowang, ast, alexander.duyck, john.r.fastabend,
	netdev, brouer


(next submission fix subj. ineterface -> interface)

On Mon, 30 Jan 2017 13:51:55 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> On 17-01-30 10:16 AM, Jesper Dangaard Brouer wrote:
> > On Fri, 27 Jan 2017 13:33:44 -0800 John Fastabend <john.fastabend@gmail.com> wrote:
> >   
> >> This adds ndo ops for upper layer objects to request direct DMA from
> >> the network interface into memory "slots". The slots must be DMA'able
> >> memory given by a page/offset/size vector in a packet_ring_buffer
> >> structure.
> >>
> >> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
> >> RX from the network device into memory mapped userspace memory. For
> >> this to work drivers encode the correct descriptor blocks and headers
> >> so that existing PF_PACKET applications work without any modification.
> >> This only supports the V2 header formats for now. And works by mapping
> >> a ring of the network device to these slots. Originally I used V2
> >> header formats but this does complicate the driver a bit.
> >>
> >> V3 header formats added bulk polling via socket calls and timers
> >> used in the polling interface to return every n milliseconds. Currently,
> >> I don't see any way to support this in hardware because we can't
> >> know if the hardware is in the middle of a DMA operation or not
> >> on a slot. So when a timer fires I don't know how to advance the
> >> descriptor ring leaving empty descriptors similar to how the software
> >> ring works. The easiest (best?) route is to simply not support this.  
> > 
> > From a performance pov bulking is essential. Systems like netmap that
> > also depend on transferring control between kernel and userspace,
> > report[1] that they need at least bulking size 8, to amortize the overhead.
> >   
> 
> Bulking in general is not a problem it can be done on top of V2 implementation
> without issue. 

Good.

Notice that the type of bulking I'm looking for can indirectly be
achieved via a queue, as long as there isn't a syscall per dequeue
involved. Looking at some af_packet example code, and your desc below,
it looks like af_packet is doing exactly what is needed, as it is
sync/spinning on a block_status bit.  (Lessons learned from ptr_ring
indicate this might actually be more efficient.)


> Its the poll timer that seemed a bit clumsy to implement.
> Looking at it again though I think we could do something if we cared to.
> I'm not convinced we would gain much though.

I actually think this would slow down performance.

> Also v2 uses static buffer sizes so that every buffer is 2k or whatever
> size the user configures. V3 allows the buffer size to be updated at
> runtime which could be done in the drivers I suppose but would require
> some ixgbe restructuring.

I think we should stay with the V2 fixed-size buffers, for performance
reasons.


> > [1] Figure 7, page 10, http://info.iet.unipi.it/~luigi/papers/20120503-netmap-atc12.pdf
> > 
> >   
> >> It might be worth creating a new v4 header that is simple for drivers
> >> to support direct DMA ops with. I can imagine using the xdp_buff
> >> structure as a header for example. Thoughts?  
> > 
> > Likely, but I would like that we do a measurement based approach.  Lets
> > benchmark with this V2 header format, and see how far we are from
> > target, and see what lights-up in perf report and if it is something we
> > can address.  
> 
> Yep I'm hoping to get to this sometime this week.
> 
> > 
> >   
> >> The ndo operations and new socket option PACKET_RX_DIRECT work by
> >> giving a queue_index to run the direct dma operations over. Once
> >> setsockopt returns successfully the indicated queue is mapped
> >> directly to the requesting application and can not be used for
> >> other purposes. Also any kernel layers such as tc will be bypassed
> >> and need to be implemented in the hardware via some other mechanism
> >> such as tc offload or other offload interfaces.  
> > 
> > Will this also need to bypass XDP too?  
> 
> There is nothing stopping this from working with XDP but why would
> you want to run XDP here?
> 
> Dropping a packet for example is not really useful because its
> already in memory user space can read. Modifying the packet seems
> pointless user space can modify it. 
>
> Maybe pushing a copy of the packet
> to kernel stack is useful in some case? But I can't see where I would
> want this.

Wouldn't it be useful to pass ARP packets to kernel stack?
(E.g. if your HW filter is based on MAC addr match)


> > E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
> > packet with this solution (and place it on a TX queue and wait for DMA
> > TX completion).
> >   
> 
> This is something worth exploring. tpacket_v2 uses a fixed ring with
> slots so all the pages are allocated and assigned to the ring at init
> time. To xmit a packet in this case the user space application would
> be required to leave the packet descriptor on the rx side pinned
> until the tx side DMA has completed. Then it can unpin the rx side
> and return it to the driver. This works if the TX/RX processing is
> fast enough to keep up. For many things this is good enough.

Sounds tricky.
 
> For some work loads though this may not be sufficient. In which
> case a tpacket_v4 would be useful that can push down a new set
> of "slots" every n packets. Where n is sufficiently large to keep
> the workload running. This is similar in many ways to virtio/vhost
> interaction.

This starts to sound like we need a page-pool-like facility with
pages premapped for DMA and premapped to userspace...
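
Loosely, the bookkeeping for such a facility might look like the sketch
below; this is entirely hypothetical, every name is made up, and nothing
like it exists in the patches:

#include <linux/dma-mapping.h>
#include <linux/mm_types.h>

/* Hypothetical pool where each page is both DMA-mapped for the device
 * and part of a userspace mmap(), so slots can be recycled between the
 * NIC and the application without remapping anything. */
struct zc_slot {
        struct page     *page;
        dma_addr_t      dma;            /* premapped for the device */
        unsigned long   user_off;       /* offset into the userspace mmap */
};

struct zc_pool {
        struct device   *dev;
        struct zc_slot  *slots;
        unsigned int    nr_slots;
        unsigned int    head;           /* next slot owned by the driver */
        unsigned int    tail;           /* next slot owned by userspace */
};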

> >   
[...]
> > 
> > Guess, I don't understand the details of the af_packet versions well
> > enough, but can you explain to me, how userspace knows what slots it
> > can read/fetch, and how it marks when it is complete/finished so the
> > kernel knows it can reuse this slot?
> >   
> 
> At init time user space allocates a ring of buffers. Each buffer has
> space to hold the packet descriptor + packet payload. The API gives this
> to the driver to initialize DMA engine and assign addresses. At init
> time all buffers are "owned" by the driver which is indicated by a status bit
> in the descriptor header.
> 
> Userspace can spin on the status bit to know when the driver has handed
> it to userspace. The driver will check the status bit before returning
> the buffer to the hardware. Then a series of kicks are used to wake up
> userspace (if it's not spinning) and to wake up the driver if it is overrun
> and needs to return buffers into its pool (not implemented yet). The
> kick to wake up the driver could in a future v4 be used to push new
> buffers to the driver if needed.

As I wrote above, this status bit spinning approach is good and actually
achieves a bulking effect indirectly.
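
Concretely, the userspace side of that handshake on a V2 ring is roughly
the loop below (a minimal sketch; process_frame() is a made-up
placeholder and the barriers are simplified):

#include <linux/if_packet.h>
#include <stddef.h>
#include <stdint.h>

void process_frame(const uint8_t *data, unsigned int len); /* placeholder */

/* Spin on tp_status until the kernel (or, with this patchset, the NIC via
 * DMA) hands the frame over, consume it, then flip the status back so the
 * slot can be refilled. */
static void rx_loop(char *ring, unsigned int frame_nr, unsigned int frame_sz)
{
        unsigned int i = 0;

        for (;;) {
                struct tpacket2_hdr *hdr =
                        (struct tpacket2_hdr *)(ring + (size_t)i * frame_sz);

                while (!(hdr->tp_status & TP_STATUS_USER))
                        ;               /* busy-wait; a real app may poll() here */
                __sync_synchronize();   /* read payload only after the status */

                process_frame((const uint8_t *)hdr + hdr->tp_mac, hdr->tp_snaplen);

                __sync_synchronize();   /* finish reads before releasing the slot */
                hdr->tp_status = TP_STATUS_KERNEL;
                i = (i + 1) % frame_nr;
        }
}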


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] rx zero copy interface for af_packet
  2017-01-27 21:33 [RFC PATCH 0/2] rx zero copy interface for af_packet John Fastabend
                   ` (2 preceding siblings ...)
  2017-01-30 22:02 ` [RFC PATCH 0/2] rx zero copy interface for af_packet David Miller
@ 2017-01-31 16:30 ` Sowmini Varadhan
  2017-02-01  4:23   ` John Fastabend
  2017-01-31 19:39 ` tndave
  4 siblings, 1 reply; 18+ messages in thread
From: Sowmini Varadhan @ 2017-01-31 16:30 UTC (permalink / raw)
  To: John Fastabend
  Cc: bjorn.topel, jasowang, ast, alexander.duyck, brouer,
	john.r.fastabend, netdev

On (01/27/17 13:33), John Fastabend wrote:
> 
> This is an experimental implementation of rx zero copy for af_packet.
> Its a bit rough and likely has errors but the plan is to clean it up
> over the next few months.
> 
> And seeing I said I would post it in another thread a few days back
> here it is.

One question/comment about this: sure, this saves us an skb copy
on the rx side, but at least for the Tx side, I think there may
be a trade-off between the overhead from the skb setup and the
ease of offloading checksum (and UFO where it is available) to
consider, even for PF_PACKET.

Using PF_PACKET works well for stateless datagram protocols like 
UDP, and for UDP sockets, we find that just switching to Jumbo
(to simulate a poor-man's-UFO) gives us significant improvement
in both throughput and latency for our RDBMS workloads - and 
having the sk_buff facilitates using existing driver-kernel interfaces
for offload easily, so while we may gain some perf improvement by shaving
off the sk_buff overhead, the trade-off needs to be considered.

--Sowmini

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] rx zero copy interface for af_packet
  2017-01-27 21:33 [RFC PATCH 0/2] rx zero copy interface for af_packet John Fastabend
                   ` (3 preceding siblings ...)
  2017-01-31 16:30 ` Sowmini Varadhan
@ 2017-01-31 19:39 ` tndave
  2017-02-01  5:09   ` John Fastabend
  4 siblings, 1 reply; 18+ messages in thread
From: tndave @ 2017-01-31 19:39 UTC (permalink / raw)
  To: John Fastabend, bjorn.topel, jasowang, ast, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev



On 01/27/2017 01:33 PM, John Fastabend wrote:
> This is an experimental implementation of rx zero copy for af_packet.
> Its a bit rough and likely has errors but the plan is to clean it up
> over the next few months.
>
> And seeing I said I would post it in another thread a few days back
> here it is.

This sounds good (believe me, I have been thinking along the same lines :)
From the driver Rx side, we always premap RX buffers, so it is best to
map them to shmem for PF_PACKET sockets.
Also, I like the idea that the user can put a selected queue (maybe
queues in the future?) into PF_PACKET mode while keeping the rest of the
queues as they are.
Zero copy and removing skb setup & processing overhead on RX certainly
make things faster and help latency. Zero copy is good on Tx as well;
however, without an skb we should figure out how to use the segmentation
and checksum offloading features of the HW. Can this be considered in a
tpacket V4 hdr?

-Tushar

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] rx zero copy interface for af_packet
  2017-01-31 16:30 ` Sowmini Varadhan
@ 2017-02-01  4:23   ` John Fastabend
  0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2017-02-01  4:23 UTC (permalink / raw)
  To: Sowmini Varadhan
  Cc: bjorn.topel, jasowang, ast, alexander.duyck, brouer,
	john.r.fastabend, netdev

On 17-01-31 08:30 AM, Sowmini Varadhan wrote:
> On (01/27/17 13:33), John Fastabend wrote:
>>
>> This is an experimental implementation of rx zero copy for af_packet.
>> Its a bit rough and likely has errors but the plan is to clean it up
>> over the next few months.
>>
>> And seeing I said I would post it in another thread a few days back
>> here it is.
> 
> One question/comment about this: sure, this saves us an skb copy
> on the rx side, but at least for the Tx side, I think there may
> be a trade-off between the overhead from the skb setup and the
> ease of offloading checksum (and UFO where it is available) to
> consider, even for PF_PACKET.
> 

Yes, although as Willem suggested (and I noted in a quick comment at the
end of the patch), virtio descriptors might be a better option for a v4
descriptor type because they already have mechanisms in place to handle
checksum and other offloads.
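
Just to sketch the idea (nothing like this exists yet; the struct and
field names below are invented), a v4 slot header could carry
virtio_net_hdr-style offload metadata alongside the usual tpacket fields:

#include <linux/types.h>
#include <linux/virtio_net.h>

/* Hypothetical v4 slot header: a fixed layout a NIC can DMA into
 * directly, with virtio-style offload metadata so checksum/GSO state
 * survives without an skb. Purely illustrative. */
struct tpacket4_hdr {
        __u32                   tp_status;      /* ownership: kernel vs user */
        __u32                   tp_len;         /* wire length */
        __u32                   tp_snaplen;     /* captured length */
        __u32                   tp_mac;         /* offset of L2 header in the slot */
        __u64                   tp_ns;          /* raw timestamp, no per-packet divide */
        struct virtio_net_hdr   tp_vnet;        /* csum_start/csum_offset, gso_* */
};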

> Using PF_PACKET works well for stateless datagram protocols like 
> UDP, and for UDP sockets, we find that just switching to Jumbo
> (to simulate a poor-man's-UFO) gives us significant improvement
> in both throughput and latency for our RDBMS workloads - and 
> having the sk_buff facilitates using existing driver-kernel interfaces
> for offload easily, so while we may gain some perf improvement by shaving
> off the sk_buff overhead, the trade-off needs to be considered.

Of course, but many workloads/environments cannot use jumbo frames, nor
would it help if your average packet size is 128B or something around
there.


> 
> --Sowmini
> 
> 
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 2/2] ixgbe: add af_packet direct copy support
  2017-01-31  2:53   ` Alexei Starovoitov
@ 2017-02-01  4:58     ` John Fastabend
  0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2017-02-01  4:58 UTC (permalink / raw)
  To: Alexei Starovoitov, bjorn.topel, jasowang, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev

On 17-01-30 06:53 PM, Alexei Starovoitov wrote:
> On 1/27/17 1:34 PM, John Fastabend wrote:
>> +    h2 = page_address(rx_buffer->page) + rx_buffer->page_offset - hdrlen;
>> +    eth = page_address(rx_buffer->page) + rx_buffer->page_offset,
> 
> I don't think it compiles ;)

Well, that is what I get for doing some last-minute checkpatch fixes
and not doing a build test before sending it out. Oh well, it is just an
RFC to get some general feedback.

> 
>> +    /* This indicates a bug in ixgbe leaving for testing purposes */
>> +    WARN_ON(TP_STATUS_USER & h2->tp_status);
>> +    len = le16_to_cpu(rx_desc->wb.upper.length);
>> +    h2->tp_len = len;
>> +    h2->tp_snaplen = len;
>> +    h2->tp_mac = ALIGN(TPACKET_ALIGN(TPACKET2_HDRLEN), L1_CACHE_BYTES);
>> +    h2->tp_net = h2->tp_mac + ETH_HLEN;
>> +    h2->tp_sec = div_s64_rem(ns, NSEC_PER_SEC, &rem);
>> +    h2->tp_nsec = rem;
>> +
>> +    sll = (void *)h2 + TPACKET_ALIGN(sizeof(struct tpacket2_hdr));
>> +    sll->sll_halen = ETH_HLEN;
>> +    memcpy(sll->sll_addr, eth->h_source, ETH_ALEN);
>> +    sll->sll_family = AF_PACKET;
>> +    sll->sll_hatype = rx_ring->netdev->type;
>> +    sll->sll_protocol = eth->h_proto;
>> +    sll->sll_pkttype = PACKET_HOST;
>> +    sll->sll_ifindex = rx_ring->netdev->ifindex;
> 
> performance wise it looks very expensive to do all these header copies
> and integer divide for every packet.
> I think unless we move to new dumb and simple header format
> performance of this approach is not going to be satisfactory.
> 

Sure I'm not opposed to moving to a v4 in fact I think it would help
in a lot of ways. I'll try to fire off some benchmarks and then move
to a v4 to see how that works out.

.John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-31 12:20       ` Jesper Dangaard Brouer
@ 2017-02-01  5:01         ` John Fastabend
  0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2017-02-01  5:01 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bjorn.topel, jasowang, ast, alexander.duyck, john.r.fastabend, netdev

[...]

>>>   
>>>> The ndo operations and new socket option PACKET_RX_DIRECT work by
>>>> giving a queue_index to run the direct dma operations over. Once
>>>> setsockopt returns successfully the indicated queue is mapped
>>>> directly to the requesting application and can not be used for
>>>> other purposes. Also any kernel layers such as tc will be bypassed
>>>> and need to be implemented in the hardware via some other mechanism
>>>> such as tc offload or other offload interfaces.  
>>>
>>> Will this also need to bypass XDP too?  
>>
>> There is nothing stopping this from working with XDP but why would
>> you want to run XDP here?
>>
>> Dropping a packet, for example, is not really useful because it's
>> already in memory user space can read. Modifying the packet seems
>> pointless; user space can modify it.
>>
>> Maybe pushing a copy of the packet to the kernel stack is useful in
>> some cases? But I can't see where I would want this.
> 
> Wouldn't it be useful to pass ARP packets to kernel stack?
> (E.g. if your HW filter is based on MAC addr match)
> 

The problem is we already zero-copied the packet into user space. I really
don't want to have a packet in user space and kernel space at the same
time. That seems like a big mess to me around security and isolation.

It is much better to have the hardware push ARP packets onto the correct
hardware queue so that they get sent into the stack. With the ARP example
it's easy enough to put a high-priority rule in hardware to match the
protocol.
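
As an aside, such a rule can be installed from userspace through the
ethtool rxnfc ioctl; a rough sketch (error handling omitted, and whether
a given NIC, ixgbe included, can actually classify on ethertype depends
on its filter hardware):

#include <arpa/inet.h>
#include <linux/ethtool.h>
#include <linux/if_ether.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Insert an ntuple rule matching ethertype 0x0806 (ARP) and steering it
 * to 'queue', which stays attached to the kernel stack. */
static int steer_arp_to_queue(const char *ifname, unsigned int queue)
{
        struct ethtool_rxnfc nfc = { .cmd = ETHTOOL_SRXCLSRLINS };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        nfc.fs.flow_type = ETHER_FLOW;
        nfc.fs.h_u.ether_spec.h_proto = htons(ETH_P_ARP);
        nfc.fs.m_u.ether_spec.h_proto = 0xffff; /* mask: match all proto bits */
        nfc.fs.ring_cookie = queue;
        nfc.fs.location = RX_CLS_LOC_ANY;       /* let the driver pick a slot */

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&nfc;

        return ioctl(fd, SIOCETHTOOL, &ifr);
}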

> 
>>> E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
>>> packet with this solution (and place it on a TX queue and wait for DMA
>>> TX completion).
>>>   
>>
>> This is something worth exploring. tpacket_v2 uses a fixed ring with
>> slots so all the pages are allocated and assigned to the ring at init
>> time. To xmit a packet in this case the user space application would
>> be required to leave the packet descriptor on the rx side pinned
>> until the tx side DMA has completed. Then it can unpin the rx side
>> and return it to the driver. This works if the TX/RX processing is
>> fast enough to keep up. For many things this is good enough.
> 
> Sounds tricky.
>  
>> For some work loads though this may not be sufficient. In which
>> case a tpacket_v4 would be useful that can push down a new set
>> of "slots" every n packets. Where n is sufficiently large to keep
>> the workload running. This is similar in many ways to virtio/vhost
>> interaction.
> 
> This starts to sound like we need a page-pool-like facility with
> pages premapped for DMA and premapped to userspace...
> 

I'm not sure what premapped to userspace means in this case. Here the
application uses mmap or some other mechanism to get a set of pages and
then pushes them down to the device. I think a mechanism such as that
used in virtio would solve this problem nicely. I'll take a look at it
and send another RFC out.

>>>   
> [...]
>>>
>>> Guess, I don't understand the details of the af_packet versions well
>>> enough, but can you explain to me, how userspace knows what slots it
>>> can read/fetch, and how it marks when it is complete/finished so the
>>> kernel knows it can reuse this slot?
>>>   
>>
>> At init time user space allocates a ring of buffers. Each buffer has
>> space to hold the packet descriptor + packet payload. The API gives this
>> to the driver to initialize DMA engine and assign addresses. At init
>> time all buffers are "owned" by the driver which is indicated by a status bit
>> in the descriptor header.
>>
>> Userspace can spin on the status bit to know when the driver has handed
>> it to userspace. The driver will check the status bit before returning
>> the buffer to the hardware. Then a series of kicks are used to wake up
>> userspace (if it's not spinning) and to wake up the driver if it is overrun
>> and needs to return buffers into its pool (not implemented yet). The
>> kick to wake up the driver could in a future v4 be used to push new
>> buffers to the driver if needed.
> 
> As I wrote above, this status bit spinning approach is good and actually
> achieves a bulking effect indirectly.
> 

Yep.

Thanks,
John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/2] rx zero copy interface for af_packet
  2017-01-31 19:39 ` tndave
@ 2017-02-01  5:09   ` John Fastabend
  0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2017-02-01  5:09 UTC (permalink / raw)
  To: tndave, bjorn.topel, jasowang, ast, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev

On 17-01-31 11:39 AM, tndave wrote:
> 
> 
> On 01/27/2017 01:33 PM, John Fastabend wrote:
>> This is an experimental implementation of rx zero copy for af_packet.
>> Its a bit rough and likely has errors but the plan is to clean it up
>> over the next few months.
>>
>> And seeing I said I would post it in another thread a few days back
>> here it is.
> 
> This sounds good (believe me, I have been thinking along the same lines :)
> From the driver Rx side, we always premap RX buffers, so it is best to
> map them to shmem for PF_PACKET sockets.
> Also, I like the idea that the user can put a selected queue (maybe
> queues in the future?) into PF_PACKET mode while keeping the rest of the
> queues as they are.
> Zero copy and removing skb setup & processing overhead on RX certainly
> make things faster and help latency. Zero copy is good on Tx as well;
> however, without an skb we should figure out how to use the segmentation
> and checksum offloading features of the HW. Can this be considered in a
> tpacket V4 hdr?
> 

Yes, I'll try to create another RFC in a week or two. Thanks.

> -Tushar

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-31  1:31       ` Willem de Bruijn
@ 2017-02-01  5:09         ` John Fastabend
  2017-03-06 21:28           ` chetan loke
  0 siblings, 1 reply; 18+ messages in thread
From: John Fastabend @ 2017-02-01  5:09 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Jesper Dangaard Brouer, bjorn.topel, jasowang, ast,
	alexander.duyck, john.r.fastabend, Network Development

On 17-01-30 05:31 PM, Willem de Bruijn wrote:
>>>> V3 header formats added bulk polling via socket calls and timers
>>>> used in the polling interface to return every n milliseconds. Currently,
>>>> I don't see any way to support this in hardware because we can't
>>>> know if the hardware is in the middle of a DMA operation or not
>>>> on a slot. So when a timer fires I don't know how to advance the
>>>> descriptor ring leaving empty descriptors similar to how the software
>>>> ring works. The easiest (best?) route is to simply not support this.
>>>
>>> From a performance pov bulking is essential. Systems like netmap that
>>> also depend on transferring control between kernel and userspace,
>>> report[1] that they need at least bulking size 8, to amortize the overhead.
> 
> To introduce interrupt moderation, ixgbe_do_ddma only has to elide the
> sk_data_ready, and schedule an hrtimer if one is not scheduled yet.
> 
> If I understand correctly, the difficulty lies in v3 requiring that the
> timer "close" the block when the timer expires. That may not be worth
> implementing, indeed.
> 

Yep that is where I just gave up and decided it wasn't worth it.

> Hardware interrupt moderation and napi may already give some
> moderation, even with a sock_def_readable call for each packet. If
> considering a v4 format, I'll again suggest virtio virtqueues. Those
> have interrupt suppression built in with EVENT_IDX.


Agreed. On paper now I'm considering moving to something like this after
getting some feedback here. Of course I'll need to play with the code a
bit to see what it looks like. I'll need a couple weeks probably to get
this sorted out.
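
For reference, the EVENT_IDX suppression Willem refers to boils down to
one unsigned comparison; roughly, following vring_need_event() from
include/uapi/linux/virtio_ring.h:

#include <linux/types.h>

/* The consumer publishes the index it wants to be woken at; the producer
 * only signals when its new index crosses that point, so a consumer that
 * keeps up sees one wakeup for a whole burst instead of one per packet. */
static inline int need_event(__u16 event_idx, __u16 new_idx, __u16 old_idx)
{
        return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old_idx);
}

Applied to a hypothetical v4 af_packet ring, the driver would only call
sk_data_ready() when a check like this says the application asked for it.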

> 
>>> Likely, but I would like that we do a measurement based approach.  Lets
>>> benchmark with this V2 header format, and see how far we are from
>>> target, and see what lights-up in perf report and if it is something we
>>> can address.
>>
>> Yep I'm hoping to get to this sometime this week.
> 
> Perhaps also without filling in the optional metadata fields
> in tpacket and sockaddr_ll.
> 
>>> E.g. how will you support XDP_TX?  AFAIK you cannot remove/detach a
>>> packet with this solution (and place it on a TX queue and wait for DMA
>>> TX completion).
>>>
>>
>> This is something worth exploring. tpacket_v2 uses a fixed ring with
>> slots so all the pages are allocated and assigned to the ring at init
>> time. To xmit a packet in this case the user space application would
>> be required to leave the packet descriptor on the rx side pinned
>> until the tx side DMA has completed. Then it can unpin the rx side
>> and return it to the driver. This works if the TX/RX processing is
>> fast enough to keep up. For many things this is good enough.
>>
>> For some work loads though this may not be sufficient. In which
>> case a tpacket_v4 would be useful that can push down a new set
>> of "slots" every n packets. Where n is sufficiently large to keep
>> the workload running.
> 
> Here, too, virtio rings may help.
> 
> The extra level of indirection allows out of order completions,
> reducing the chance of running out of rx descriptors when redirecting
> a subset of packets to a tx ring, as that does not block the entire ring.
> 
> And passing explicit descriptors from userspace enables pointing to
> new memory regions. On the flipside, they now have to be checked for
> safety against region bounds.
> 
>> This is similar in many ways to virtio/vhost interaction.
> 
> Ah, I only saw this after writing the above :)
> 

Yep, but glad to get some validation of this idea.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-01-27 21:33 ` [RFC PATCH 1/2] af_packet: direct dma for packet ineterface John Fastabend
  2017-01-30 18:16   ` Jesper Dangaard Brouer
@ 2017-02-04  3:10   ` Jason Wang
  1 sibling, 0 replies; 18+ messages in thread
From: Jason Wang @ 2017-02-04  3:10 UTC (permalink / raw)
  To: John Fastabend, bjorn.topel, ast, alexander.duyck, brouer
  Cc: john.r.fastabend, netdev



On 2017-01-28 05:33, John Fastabend wrote:
> This adds ndo ops for upper layer objects to request direct DMA from
> the network interface into memory "slots". The slots must be DMA'able
> memory given by a page/offset/size vector in a packet_ring_buffer
> structure.
>
> The PF_PACKET socket interface can use these ndo_ops to do zerocopy
> RX from the network device into memory mapped userspace memory. For
> this to work drivers encode the correct descriptor blocks and headers
> so that existing PF_PACKET applications work without any modification.
> This only supports the V2 header formats for now. And works by mapping
> a ring of the network device to these slots. Originally I used V2
> header formats but this does complicate the driver a bit.
>
> V3 header formats added bulk polling via socket calls and timers
> used in the polling interface to return every n milliseconds. Currently,
> I don't see any way to support this in hardware because we can't
> know if the hardware is in the middle of a DMA operation or not
> on a slot. So when a timer fires I don't know how to advance the
> descriptor ring leaving empty descriptors similar to how the software
> ring works. The easiest (best?) route is to simply not support this.
>
> It might be worth creating a new v4 header that is simple for drivers
> to support direct DMA ops with. I can imagine using the xdp_buff
> structure as a header for example. Thoughts?
>
> The ndo operations and new socket option PACKET_RX_DIRECT work by
> giving a queue_index to run the direct dma operations over. Once
> setsockopt returns successfully the indicated queue is mapped
> directly to the requesting application and can not be used for
> other purposes. Also any kernel layers such as tc will be bypassed
> and need to be implemented in the hardware via some other mechanism
> such as tc offload or other offload interfaces.
>
> Users steer traffic to the selected queue using flow director,
> tc offload infrastructure or via macvlan offload.
>
> The new socket option added to PF_PACKET is called PACKET_RX_DIRECT.
> It takes a single unsigned int value specifying the queue index,
>
>       setsockopt(sock, SOL_PACKET, PACKET_RX_DIRECT,
> 		&queue_index, sizeof(queue_index));
>
> Implementing busy_poll support will allow userspace to kick the
> drivers receive routine if needed. This work is TBD.
>
> To test this I hacked a hardcoded test into  the tool psock_tpacket
> in the selftests kernel directory here:
>
>       ./tools/testing/selftests/net/psock_tpacket.c
>
> Running this tool opens a socket and listens for packets over
> the PACKET_RX_DIRECT enabled socket. Obviously it needs to be
> reworked to enable all the older tests and not hardcode my
> interface before it actually gets released.
>
> In general this is a rough patch to explore the interface and
> put something concrete up for debate. The patch does not handle
> all the error cases correctly and needs to be cleaned up.
>
> Known Limitations (TBD):
>
>       (1) Users are required to match the number of rx ring
>           slots with ethtool to the number requested by the
>           setsockopt PF_PACKET layout. In the future we could
>           possibly do this automatically.
>
>       (2) Users need to configure Flow director or setup_tc
>           to steer traffic to the correct queues. I don't believe
>           this needs to be changed it seems to be a good mechanism
>           for driving directed dma.
>
>       (3) Not supporting timestamps or priv space yet, pushing
> 	 a v4 packet header would resolve this nicely.
>
>       (5) Only RX supported so far. TX already supports direct DMA
>           interface but uses skbs which is really not needed. In
>           the TX_RING case we can optimize this path as well.
>
> To support TX case we can do a similar "slots" mechanism and
> kick operation. The kick could be a busy_poll like operation
> but on the TX side. The flow would be user space loads up
> n number of slots with packets, kicks tx busy poll bit, the
> driver sends packets, and finally when xmit is complete
> clears header bits to give slots back. When we have qdisc
> bypass set today we already bypass the entire stack so no
> paticular reason to use skb's in this case. Using xdp_buff
> as a v4 packet header would also allow us to consolidate
> driver code.
>
> To be done:
>
>       (1) More testing and performance analysis
>       (2) Busy polling sockets
>       (3) Implement v4 xdp_buff headers for analysis

I like this idea, and we should generalize the API so that rx zerocopy is
not specific to the packet socket. Then we can use this for e.g. macvtap
(pass-through mode). But instead of the headers, the ndo_ops should
support refill from non-fixed memory locations in userspace (per packet
or per group of packets) to satisfy the requirements of virtqueues.

Thanks

>       (4) performance testing :/ hopefully it looks good.
>
> Signed-off-by: John Fastabend<john.r.fastabend@intel.com>

[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/2] af_packet: direct dma for packet ineterface
  2017-02-01  5:09         ` John Fastabend
@ 2017-03-06 21:28           ` chetan loke
  0 siblings, 0 replies; 18+ messages in thread
From: chetan loke @ 2017-03-06 21:28 UTC (permalink / raw)
  To: John Fastabend
  Cc: Willem de Bruijn, Jesper Dangaard Brouer, bjorn.topel, jasowang,
	ast, alexander.duyck, john.r.fastabend, Network Development

On Tue, Jan 31, 2017 at 9:09 PM, John Fastabend
<john.fastabend@gmail.com> wrote:

>> If I understand correctly, the difficulty lies in v3 requiring that the
>> timer "close" the block when the timer expires. That may not be worth
>> implementing, indeed.
>>
>
> Yep that is where I just gave up and decided it wasn't worth it.
>

Without support for a timeout, when a user-space app has to do its own
book-keeping or, let's say, shut down for maintenance/upgrade, how can
the app unblock from this operation? If the link is idle the DMA may
never happen because there are no frames on the wire. So is there a way
to handle this?

Chetan

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-03-06 21:38 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-27 21:33 [RFC PATCH 0/2] rx zero copy interface for af_packet John Fastabend
2017-01-27 21:33 ` [RFC PATCH 1/2] af_packet: direct dma for packet ineterface John Fastabend
2017-01-30 18:16   ` Jesper Dangaard Brouer
2017-01-30 21:51     ` John Fastabend
2017-01-31  1:31       ` Willem de Bruijn
2017-02-01  5:09         ` John Fastabend
2017-03-06 21:28           ` chetan loke
2017-01-31 12:20       ` Jesper Dangaard Brouer
2017-02-01  5:01         ` John Fastabend
2017-02-04  3:10   ` Jason Wang
2017-01-27 21:34 ` [RFC PATCH 2/2] ixgbe: add af_packet direct copy support John Fastabend
2017-01-31  2:53   ` Alexei Starovoitov
2017-02-01  4:58     ` John Fastabend
2017-01-30 22:02 ` [RFC PATCH 0/2] rx zero copy interface for af_packet David Miller
2017-01-31 16:30 ` Sowmini Varadhan
2017-02-01  4:23   ` John Fastabend
2017-01-31 19:39 ` tndave
2017-02-01  5:09   ` John Fastabend
