All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC 0/7] PMD driver for AF_XDP
@ 2018-02-27  9:32 Qi Zhang
  2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:32 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

The RFC patches add a new PMD driver for AF_XDP which is a proposed
faster version of AF_PACKET interface in Linux, see below link for 
detail AF_XDP introduction:
https://fosdem.org/2018/schedule/event/af_xdp/
https://lwn.net/Articles/745934/

This patchset is base on v18.02.
It also require a linux kernel that have below AF_XDP RFC patches be
applied.
https://patchwork.ozlabs.org/patch/867961/
https://patchwork.ozlabs.org/patch/867960/
https://patchwork.ozlabs.org/patch/867938/
https://patchwork.ozlabs.org/patch/867939/
https://patchwork.ozlabs.org/patch/867940/
https://patchwork.ozlabs.org/patch/867941/
https://patchwork.ozlabs.org/patch/867942/
https://patchwork.ozlabs.org/patch/867943/
https://patchwork.ozlabs.org/patch/867944/
https://patchwork.ozlabs.org/patch/867945/
https://patchwork.ozlabs.org/patch/867946/
https://patchwork.ozlabs.org/patch/867947/
https://patchwork.ozlabs.org/patch/867948/
https://patchwork.ozlabs.org/patch/867949/
https://patchwork.ozlabs.org/patch/867950/
https://patchwork.ozlabs.org/patch/867951/
https://patchwork.ozlabs.org/patch/867952/
https://patchwork.ozlabs.org/patch/867953/
https://patchwork.ozlabs.org/patch/867954/
https://patchwork.ozlabs.org/patch/867955/
https://patchwork.ozlabs.org/patch/867956/
https://patchwork.ozlabs.org/patch/867957/
https://patchwork.ozlabs.org/patch/867958/
https://patchwork.ozlabs.org/patch/867959/

There is no clean upstream target yet since kernel patch is still in
RFC stage, The purpose of the patchset is just for anyone that want to
eveluate af_xdp with DPDK application and get feedback for further
improvement.

To try with the new PMD
1. compile and install the kernel with above patches applied.
2. configure $LINUX_HEADER_DIR (dir of "make headers_install")
   and $TOOLS_DIR (dir at <kernel_src>/tools) at driver/net/af_xdp/Makefile
   before compile DPDK.
3. make sure libelf and libbpf is installed.

BTW, performance test shows our PMD can reach 94%~98% of the orignal benchmark
when share memory is enabled.

Qi Zhang (7):
  net/af_xdp: new PMD driver
  lib/mbuf: enable parse flags when create mempool
  lib/mempool: allow page size aligned mempool
  net/af_xdp: use mbuf mempool for buffer management
  net/af_xdp: enable share mempool
  net/af_xdp: load BPF file
  app/testpmd: enable parameter for mempool flags

 app/test-pmd/parameters.c                     |  12 +
 app/test-pmd/testpmd.c                        |  15 +-
 app/test-pmd/testpmd.h                        |   1 +
 config/common_base                            |   5 +
 config/common_linuxapp                        |   1 +
 drivers/net/Makefile                          |   1 +
 drivers/net/af_xdp/Makefile                   |  60 ++
 drivers/net/af_xdp/bpf_load.c                 | 798 +++++++++++++++++++++++
 drivers/net/af_xdp/bpf_load.h                 |  65 ++
 drivers/net/af_xdp/libbpf.h                   | 199 ++++++
 drivers/net/af_xdp/meson.build                |   7 +
 drivers/net/af_xdp/rte_eth_af_xdp.c           | 878 ++++++++++++++++++++++++++
 drivers/net/af_xdp/rte_pmd_af_xdp_version.map |   4 +
 drivers/net/af_xdp/xdpsock_queue.h            |  62 ++
 lib/librte_mbuf/rte_mbuf.c                    |  15 +-
 lib/librte_mbuf/rte_mbuf.h                    |   8 +-
 lib/librte_mempool/rte_mempool.c              |   2 +
 lib/librte_mempool/rte_mempool.h              |   1 +
 mk/rte.app.mk                                 |   1 +
 19 files changed, 2125 insertions(+), 10 deletions(-)
 create mode 100644 drivers/net/af_xdp/Makefile
 create mode 100644 drivers/net/af_xdp/bpf_load.c
 create mode 100644 drivers/net/af_xdp/bpf_load.h
 create mode 100644 drivers/net/af_xdp/libbpf.h
 create mode 100644 drivers/net/af_xdp/meson.build
 create mode 100644 drivers/net/af_xdp/rte_eth_af_xdp.c
 create mode 100644 drivers/net/af_xdp/rte_pmd_af_xdp_version.map
 create mode 100644 drivers/net/af_xdp/xdpsock_queue.h

-- 
2.13.6

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-28 23:40   ` Stephen Hemminger
                     ` (3 more replies)
  2018-02-27  9:33 ` [RFC 2/7] lib/mbuf: enable parse flags when create mempool Qi Zhang
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

This is the vanilla version.
Packet data will copy between af_xdp memory buffer and mbuf mempool.
indexes of memory buffer is simply managed by a fifo ring.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 config/common_base                            |   5 +
 config/common_linuxapp                        |   1 +
 drivers/net/Makefile                          |   1 +
 drivers/net/af_xdp/Makefile                   |  56 ++
 drivers/net/af_xdp/meson.build                |   7 +
 drivers/net/af_xdp/rte_eth_af_xdp.c           | 763 ++++++++++++++++++++++++++
 drivers/net/af_xdp/rte_pmd_af_xdp_version.map |   4 +
 drivers/net/af_xdp/xdpsock_queue.h            |  62 +++
 mk/rte.app.mk                                 |   1 +
 9 files changed, 900 insertions(+)
 create mode 100644 drivers/net/af_xdp/Makefile
 create mode 100644 drivers/net/af_xdp/meson.build
 create mode 100644 drivers/net/af_xdp/rte_eth_af_xdp.c
 create mode 100644 drivers/net/af_xdp/rte_pmd_af_xdp_version.map
 create mode 100644 drivers/net/af_xdp/xdpsock_queue.h

diff --git a/config/common_base b/config/common_base
index ad03cf433..84b7b3b7e 100644
--- a/config/common_base
+++ b/config/common_base
@@ -368,6 +368,11 @@ CONFIG_RTE_LIBRTE_VMXNET3_DEBUG_TX_FREE=n
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=n
 
 #
+# Compile software PMD backed by AF_XDP sockets (Linux only)
+#
+CONFIG_RTE_LIBRTE_PMD_AF_XDP=n
+
+#
 # Compile link bonding PMD library
 #
 CONFIG_RTE_LIBRTE_PMD_BOND=y
diff --git a/config/common_linuxapp b/config/common_linuxapp
index ff98f2355..3b10695b6 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -16,6 +16,7 @@ CONFIG_RTE_LIBRTE_VHOST=y
 CONFIG_RTE_LIBRTE_VHOST_NUMA=y
 CONFIG_RTE_LIBRTE_PMD_VHOST=y
 CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
+CONFIG_RTE_LIBRTE_PMD_AF_XDP=y
 CONFIG_RTE_LIBRTE_PMD_TAP=y
 CONFIG_RTE_LIBRTE_AVP_PMD=y
 CONFIG_RTE_LIBRTE_VDEV_NETVSC_PMD=y
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index e1127326b..409234ac3 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -9,6 +9,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_THUNDERX_NICVF_PMD),d)
 endif
 
 DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET) += af_packet
+DIRS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += af_xdp
 DIRS-$(CONFIG_RTE_LIBRTE_ARK_PMD) += ark
 DIRS-$(CONFIG_RTE_LIBRTE_AVF_PMD) += avf
 DIRS-$(CONFIG_RTE_LIBRTE_AVP_PMD) += avp
diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile
new file mode 100644
index 000000000..ac38e20bf
--- /dev/null
+++ b/drivers/net/af_xdp/Makefile
@@ -0,0 +1,56 @@
+#   BSD LICENSE
+#
+#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
+#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+#   Copyright(c) 2014 6WIND S.A.
+#   All rights reserved.
+#
+#   Redistribution and use in source and binary forms, with or without
+#   modification, are permitted provided that the following conditions
+#   are met:
+#
+#     * Redistributions of source code must retain the above copyright
+#       notice, this list of conditions and the following disclaimer.
+#     * Redistributions in binary form must reproduce the above copyright
+#       notice, this list of conditions and the following disclaimer in
+#       the documentation and/or other materials provided with the
+#       distribution.
+#     * Neither the name of Intel Corporation nor the names of its
+#       contributors may be used to endorse or promote products derived
+#       from this software without specific prior written permission.
+#
+#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_pmd_af_xdp.a
+
+EXPORT_MAP := rte_pmd_af_xdp_version.map
+
+LIBABIVER := 1
+
+CFLAGS += -O3 -I/opt/af_xdp/linux_headers/include
+CFLAGS += $(WERROR_FLAGS)
+LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
+LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs
+LDLIBS += -lrte_bus_vdev
+
+#
+# all source are stored in SRCS-y
+#
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += rte_eth_af_xdp.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/af_xdp/meson.build b/drivers/net/af_xdp/meson.build
new file mode 100644
index 000000000..4b5299c8e
--- /dev/null
+++ b/drivers/net/af_xdp/meson.build
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2017 Intel Corporation
+
+if host_machine.system() != 'linux'
+	build = false
+endif
+sources = files('rte_eth_af_xdp.c')
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
new file mode 100644
index 000000000..4eb8a2c28
--- /dev/null
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -0,0 +1,763 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2014 John W. Linville <linville@tuxdriver.com>
+ * Originally based upon librte_pmd_pcap code:
+ * Copyright(c) 2010-2015 Intel Corporation.
+ * Copyright(c) 2014 6WIND S.A.
+ * All rights reserved.
+ */
+
+#include <rte_mbuf.h>
+#include <rte_ethdev_driver.h>
+#include <rte_ethdev_vdev.h>
+#include <rte_malloc.h>
+#include <rte_kvargs.h>
+#include <rte_bus_vdev.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_xdp.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+#include "xdpsock_queue.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define ETH_AF_XDP_IFACE_ARG		"iface"
+#define ETH_AF_XDP_QUEUE_IDX_ARG	"queue"
+#define ETH_AF_XDP_RING_SIZE_ARG	"ringsz"
+
+#define ETH_AF_XDP_FRAME_SIZE		2048
+#define ETH_AF_XDP_NUM_BUFFERS		131072
+#define ETH_AF_XDP_DATA_HEADROOM	0
+#define ETH_AF_XDP_DFLT_RING_SIZE	1024
+#define ETH_AF_XDP_DFLT_QUEUE_IDX	0
+
+#define ETH_AF_XDP_RX_BATCH_SIZE	32
+#define ETH_AF_XDP_TX_BATCH_SIZE	32
+
+struct xdp_umem {
+	char *buffer;
+	size_t size;
+	unsigned int frame_size;
+	unsigned int frame_size_log2;
+	unsigned int nframes;
+	int mr_fd;
+};
+
+struct pmd_internals {
+	int sfd;
+	int if_index;
+	char if_name[0x100];
+	struct ether_addr eth_addr;
+	struct xdp_queue rx;
+	struct xdp_queue tx;
+	struct xdp_umem *umem;
+	struct rte_mempool *mb_pool;
+
+	unsigned long rx_pkts;
+	unsigned long rx_bytes;
+	unsigned long rx_dropped;
+
+	unsigned long tx_pkts;
+	unsigned long err_pkts;
+	unsigned long tx_bytes;
+
+	uint16_t port_id;
+	uint16_t queue_idx;
+	int ring_size;
+	struct rte_ring *buf_ring;
+};
+
+static const char * const valid_arguments[] = {
+	ETH_AF_XDP_IFACE_ARG,
+	ETH_AF_XDP_QUEUE_IDX_ARG,
+	ETH_AF_XDP_RING_SIZE_ARG,
+	NULL
+};
+
+static struct rte_eth_link pmd_link = {
+	.link_speed = ETH_SPEED_NUM_10G,
+	.link_duplex = ETH_LINK_FULL_DUPLEX,
+	.link_status = ETH_LINK_DOWN,
+	.link_autoneg = ETH_LINK_AUTONEG
+};
+
+static void *get_pkt_data(struct pmd_internals *internals,
+			  uint32_t index,
+			  uint32_t offset)
+{
+	return (uint8_t *)(internals->umem->buffer +
+			   (index << internals->umem->frame_size_log2) +
+			   offset);
+}
+
+static uint16_t
+eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct pmd_internals *internals = queue;
+	struct xdp_queue *rxq = &internals->rx;
+	struct rte_mbuf *mbuf;
+	unsigned long dropped = 0;
+	unsigned long rx_bytes = 0;
+	uint16_t count = 0;
+
+	nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ?
+		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
+
+	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
+	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
+	int rcvd, i;
+	/* fill rx ring */
+	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {
+		int n = rte_ring_dequeue_bulk(internals->buf_ring,
+					      indexes,
+					      ETH_AF_XDP_RX_BATCH_SIZE,
+					      NULL);
+		for (i = 0; i < n; i++)
+			descs[i].idx = (uint32_t)((long int)indexes[i]);
+		xq_enq(rxq, descs, n);
+	}
+
+	/* read data */
+	rcvd = xq_deq(rxq, descs, nb_pkts);
+	if (rcvd == 0)
+		return 0;
+
+	for (i = 0; i < rcvd; i++) {
+		char *pkt;
+		uint32_t idx = descs[i].idx;
+
+		mbuf = rte_pktmbuf_alloc(internals->mb_pool);
+		rte_pktmbuf_pkt_len(mbuf) =
+			rte_pktmbuf_data_len(mbuf) =
+			descs[i].len;
+		if (mbuf) {
+			pkt = get_pkt_data(internals, idx, descs[i].offset);
+			memcpy(rte_pktmbuf_mtod(mbuf, void *),
+			       pkt, descs[i].len);
+			rx_bytes += descs[i].len;
+			bufs[count++] = mbuf;
+		} else {
+			dropped++;
+		}
+		indexes[i] = (void *)((long int)idx);
+	}
+
+	rte_ring_enqueue_bulk(internals->buf_ring, indexes, rcvd, NULL);
+
+	internals->rx_pkts += (rcvd - dropped);
+	internals->rx_bytes += rx_bytes;
+	internals->rx_dropped += dropped;
+
+	return count;
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	for (;;) {
+		ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+		if (ret >= 0 || errno == ENOBUFS)
+			return;
+		if (errno == EAGAIN)
+			continue;
+	}
+}
+
+static uint16_t
+eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
+{
+	struct pmd_internals *internals = queue;
+	struct xdp_queue *txq = &internals->tx;
+	struct rte_mbuf *mbuf;
+	struct xdp_desc descs[ETH_AF_XDP_TX_BATCH_SIZE];
+	void *indexes[ETH_AF_XDP_TX_BATCH_SIZE];
+	uint16_t i, valid;
+	unsigned long tx_bytes = 0;
+
+	nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ?
+		  nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE;
+
+	if (txq->num_free < ETH_AF_XDP_TX_BATCH_SIZE * 2) {
+		int n = xq_deq(txq, descs, ETH_AF_XDP_TX_BATCH_SIZE);
+
+		for (i = 0; i < n; i++)
+			indexes[i] = (void *)((long int)descs[i].idx);
+		rte_ring_enqueue_bulk(internals->buf_ring, indexes, n, NULL);
+	}
+
+	nb_pkts = nb_pkts > txq->num_free ? txq->num_free : nb_pkts;
+	nb_pkts = rte_ring_dequeue_bulk(internals->buf_ring, indexes,
+					nb_pkts, NULL);
+
+	valid = 0;
+	for (i = 0; i < nb_pkts; i++) {
+		char *pkt;
+		unsigned int buf_len =
+			internals->umem->frame_size - ETH_AF_XDP_DATA_HEADROOM;
+		mbuf = bufs[i];
+		if (mbuf->pkt_len <= buf_len) {
+			descs[valid].idx = (uint32_t)((long int)indexes[valid]);
+			descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
+			descs[valid].flags = 0;
+			descs[valid].len = mbuf->pkt_len;
+			pkt = get_pkt_data(internals, descs[i].idx,
+					   descs[i].offset);
+			memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
+			       descs[i].len);
+			valid++;
+			tx_bytes += mbuf->pkt_len;
+		}
+		rte_pktmbuf_free(mbuf);
+	}
+
+	xq_enq(txq, descs, valid);
+	kick_tx(internals->sfd);
+
+	if (valid < nb_pkts)
+		rte_ring_enqueue_bulk(internals->buf_ring, &indexes[valid],
+				      nb_pkts - valid, NULL);
+
+	internals->err_pkts += (nb_pkts - valid);
+	internals->tx_pkts += valid;
+	internals->tx_bytes += tx_bytes;
+
+	return valid;
+}
+
+static void
+fill_rx_desc(struct pmd_internals *internals)
+{
+	int num_free = internals->rx.num_free;
+	void *p = NULL;
+	int i;
+
+	for (i = 0; i < num_free; i++) {
+		struct xdp_desc desc = {};
+
+		rte_ring_dequeue(internals->buf_ring, &p);
+		desc.idx = (uint32_t)((long int)p);
+		xq_enq(&internals->rx, &desc, 1);
+	}
+}
+
+static int
+eth_dev_start(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev->data->dev_link.link_status = ETH_LINK_UP;
+	fill_rx_desc(internals);
+
+	return 0;
+}
+
+/* This function gets called when the current port gets stopped. */
+static void
+eth_dev_stop(struct rte_eth_dev *dev)
+{
+	dev->data->dev_link.link_status = ETH_LINK_DOWN;
+}
+
+static int
+eth_dev_configure(struct rte_eth_dev *dev __rte_unused)
+{
+	return 0;
+}
+
+static void
+eth_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	dev_info->if_index = internals->if_index;
+	dev_info->max_mac_addrs = 1;
+	dev_info->max_rx_pktlen = (uint32_t)ETH_FRAME_LEN;
+	dev_info->max_rx_queues = 1;
+	dev_info->max_tx_queues = 1;
+	dev_info->min_rx_bufsize = 0;
+}
+
+static int
+eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
+{
+	const struct pmd_internals *internal = dev->data->dev_private;
+
+	stats->ipackets = stats->q_ipackets[0] =
+		internal->rx_pkts;
+	stats->ibytes = stats->q_ibytes[0] =
+		internal->rx_bytes;
+	stats->imissed =
+		internal->rx_dropped;
+
+	stats->opackets = stats->q_opackets[0]
+		= internal->tx_pkts;
+	stats->oerrors = stats->q_errors[0] =
+		internal->err_pkts;
+	stats->obytes = stats->q_obytes[0] =
+		internal->tx_bytes;
+
+	return 0;
+}
+
+static void
+eth_stats_reset(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internal = dev->data->dev_private;
+
+	internal->rx_pkts = 0;
+	internal->rx_bytes = 0;
+	internal->rx_dropped = 0;
+
+	internal->tx_pkts = 0;
+	internal->err_pkts = 0;
+	internal->tx_bytes = 0;
+}
+
+static void
+eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+{
+}
+
+static void
+eth_queue_release(void *q __rte_unused)
+{
+}
+
+static int
+eth_link_update(struct rte_eth_dev *dev __rte_unused,
+		int wait_to_complete __rte_unused)
+{
+	return 0;
+}
+
+static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd, size_t nbuffers)
+{
+	struct xdp_mr_req req = { .frame_size = ETH_AF_XDP_FRAME_SIZE,
+				  .data_headroom = ETH_AF_XDP_DATA_HEADROOM };
+	struct xdp_umem *umem;
+	void *bufs;
+	int ret;
+
+	ret = posix_memalign((void **)&bufs, getpagesize(),
+			     nbuffers * req.frame_size);
+	if (ret)
+		return NULL;
+
+	umem = calloc(1, sizeof(*umem));
+	if (!umem) {
+		free(bufs);
+		return NULL;
+	}
+
+	req.addr = (unsigned long)bufs;
+	req.len = nbuffers * req.frame_size;
+	ret = setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
+	RTE_ASSERT(ret == 0);
+
+	umem->frame_size = ETH_AF_XDP_FRAME_SIZE;
+	umem->frame_size_log2 = 11;
+	umem->buffer = bufs;
+	umem->size = nbuffers * req.frame_size;
+	umem->nframes = nbuffers;
+	umem->mr_fd = sfd;
+
+	return umem;
+}
+
+static int
+xdp_configure(struct pmd_internals *internals)
+{
+	struct sockaddr_xdp sxdp;
+	struct xdp_ring_req req;
+	char ring_name[0x100];
+	int ret = 0;
+	long int i;
+
+	snprintf(ring_name, 0x100, "%s_%s_%d", "af_xdp_ring",
+		 internals->if_name, internals->queue_idx);
+	internals->buf_ring = rte_ring_create(ring_name,
+					      ETH_AF_XDP_NUM_BUFFERS,
+					      SOCKET_ID_ANY,
+					      0x0);
+	if (!internals->buf_ring)
+		return -1;
+
+	for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++)
+		rte_ring_enqueue(internals->buf_ring, (void *)i);
+
+	internals->umem = xsk_alloc_and_mem_reg_buffers(internals->sfd,
+							ETH_AF_XDP_NUM_BUFFERS);
+	if (!internals->umem)
+		goto error;
+
+	req.mr_fd = internals->umem->mr_fd;
+	req.desc_nr = internals->ring_size;
+
+	ret = setsockopt(internals->sfd, SOL_XDP, XDP_RX_RING,
+			 &req, sizeof(req));
+
+	RTE_ASSERT(ret == 0);
+
+	ret = setsockopt(internals->sfd, SOL_XDP, XDP_TX_RING,
+			 &req, sizeof(req));
+
+	RTE_ASSERT(ret == 0);
+
+	internals->rx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
+				  PROT_READ | PROT_WRITE,
+				  MAP_SHARED | MAP_LOCKED | MAP_POPULATE,
+				  internals->sfd,
+				  XDP_PGOFF_RX_RING);
+	RTE_ASSERT(internals->rx.ring != MAP_FAILED);
+
+	internals->rx.num_free = req.desc_nr;
+	internals->rx.ring_mask = req.desc_nr - 1;
+
+	internals->tx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
+				  PROT_READ | PROT_WRITE,
+				  MAP_SHARED | MAP_LOCKED | MAP_POPULATE,
+				  internals->sfd,
+				  XDP_PGOFF_TX_RING);
+	RTE_ASSERT(internals->tx.ring != MAP_FAILED);
+
+	internals->tx.num_free = req.desc_nr;
+	internals->tx.ring_mask = req.desc_nr - 1;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = internals->if_index;
+	sxdp.sxdp_queue_id = internals->queue_idx;
+
+	ret = bind(internals->sfd, (struct sockaddr *)&sxdp, sizeof(sxdp));
+	RTE_ASSERT(ret == 0);
+
+	return ret;
+error:
+	rte_ring_free(internals->buf_ring);
+	internals->buf_ring = NULL;
+	return -1;
+}
+
+static int
+eth_rx_queue_setup(struct rte_eth_dev *dev,
+		   uint16_t rx_queue_id,
+		   uint16_t nb_rx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_rxconf *rx_conf __rte_unused,
+		   struct rte_mempool *mb_pool)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	unsigned int buf_size, data_size;
+
+	RTE_ASSERT(rx_queue_id == 0);
+	internals->mb_pool = mb_pool;
+	xdp_configure(internals);
+
+	/* Now get the space available for data in the mbuf */
+	buf_size = rte_pktmbuf_data_room_size(internals->mb_pool) -
+		RTE_PKTMBUF_HEADROOM;
+	data_size = internals->umem->frame_size;
+
+	if (data_size > buf_size) {
+		RTE_LOG(ERR, PMD,
+			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
+			dev->device->name, data_size, buf_size);
+		return -ENOMEM;
+	}
+
+	dev->data->rx_queues[rx_queue_id] = internals;
+	return 0;
+}
+
+static int
+eth_tx_queue_setup(struct rte_eth_dev *dev,
+		   uint16_t tx_queue_id,
+		   uint16_t nb_tx_desc __rte_unused,
+		   unsigned int socket_id __rte_unused,
+		   const struct rte_eth_txconf *tx_conf __rte_unused)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	RTE_ASSERT(tx_queue_id == 0);
+	dev->data->tx_queues[tx_queue_id] = internals;
+	return 0;
+}
+
+static int
+eth_dev_mtu_set(struct rte_eth_dev *dev, uint16_t mtu)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+	struct ifreq ifr = { .ifr_mtu = mtu };
+	int ret;
+	int s;
+
+	s = socket(PF_INET, SOCK_DGRAM, 0);
+	if (s < 0)
+		return -EINVAL;
+
+	snprintf(ifr.ifr_name, IFNAMSIZ, "%s", internals->if_name);
+	ret = ioctl(s, SIOCSIFMTU, &ifr);
+	close(s);
+
+	if (ret < 0)
+		return -EINVAL;
+
+	return 0;
+}
+
+static void
+eth_dev_change_flags(char *if_name, uint32_t flags, uint32_t mask)
+{
+	struct ifreq ifr;
+	int s;
+
+	s = socket(PF_INET, SOCK_DGRAM, 0);
+	if (s < 0)
+		return;
+
+	snprintf(ifr.ifr_name, IFNAMSIZ, "%s", if_name);
+	if (ioctl(s, SIOCGIFFLAGS, &ifr) < 0)
+		goto out;
+	ifr.ifr_flags &= mask;
+	ifr.ifr_flags |= flags;
+	if (ioctl(s, SIOCSIFFLAGS, &ifr) < 0)
+		goto out;
+out:
+	close(s);
+}
+
+static void
+eth_dev_promiscuous_enable(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	eth_dev_change_flags(internals->if_name, IFF_PROMISC, ~0);
+}
+
+static void
+eth_dev_promiscuous_disable(struct rte_eth_dev *dev)
+{
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	eth_dev_change_flags(internals->if_name, 0, ~IFF_PROMISC);
+}
+
+static const struct eth_dev_ops ops = {
+	.dev_start = eth_dev_start,
+	.dev_stop = eth_dev_stop,
+	.dev_close = eth_dev_close,
+	.dev_configure = eth_dev_configure,
+	.dev_infos_get = eth_dev_info,
+	.mtu_set = eth_dev_mtu_set,
+	.promiscuous_enable = eth_dev_promiscuous_enable,
+	.promiscuous_disable = eth_dev_promiscuous_disable,
+	.rx_queue_setup = eth_rx_queue_setup,
+	.tx_queue_setup = eth_tx_queue_setup,
+	.rx_queue_release = eth_queue_release,
+	.tx_queue_release = eth_queue_release,
+	.link_update = eth_link_update,
+	.stats_get = eth_stats_get,
+	.stats_reset = eth_stats_reset,
+};
+
+static struct rte_vdev_driver pmd_af_xdp_drv;
+
+static void
+parse_parameters(struct rte_kvargs *kvlist,
+		 char **if_name,
+		 int *queue_idx,
+		 int *ring_size)
+{
+	struct rte_kvargs_pair *pair = NULL;
+	unsigned int k_idx;
+
+	for (k_idx = 0; k_idx < kvlist->count; k_idx++) {
+		pair = &kvlist->pairs[k_idx];
+		if (strstr(pair->key, ETH_AF_XDP_IFACE_ARG))
+			*if_name = pair->value;
+		else if (strstr(pair->key, ETH_AF_XDP_QUEUE_IDX_ARG))
+			*queue_idx = atoi(pair->value);
+		else if (strstr(pair->key, ETH_AF_XDP_RING_SIZE_ARG))
+			*ring_size = atoi(pair->value);
+	}
+}
+
+static int
+get_iface_info(const char *if_name,
+	       struct ether_addr *eth_addr,
+	       int *if_index)
+{
+	struct ifreq ifr;
+	int sock = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
+
+	if (sock < 0)
+		return -1;
+
+	strcpy(ifr.ifr_name, if_name);
+	if (ioctl(sock, SIOCGIFINDEX, &ifr))
+		goto error;
+	*if_index = ifr.ifr_ifindex;
+
+	if (ioctl(sock, SIOCGIFHWADDR, &ifr))
+		goto error;
+
+	memcpy(eth_addr, ifr.ifr_hwaddr.sa_data, 6);
+
+	close(sock);
+	return 0;
+
+error:
+	close(sock);
+	return -1;
+}
+
+static int
+init_internals(struct rte_vdev_device *dev,
+	       const char *if_name,
+	       int queue_idx,
+	       int ring_size)
+{
+	const char *name = rte_vdev_device_name(dev);
+	struct rte_eth_dev *eth_dev = NULL;
+	struct rte_eth_dev_data *data = NULL;
+	const unsigned int numa_node = dev->device.numa_node;
+	struct pmd_internals *internals = NULL;
+	int ret;
+
+	data = rte_zmalloc_socket(name, sizeof(*internals), 0, numa_node);
+	if (!data)
+		return -1;
+
+	internals = rte_zmalloc_socket(name, sizeof(*internals), 0, numa_node);
+	if (!internals)
+		goto error_1;
+
+	internals->queue_idx = queue_idx;
+	internals->ring_size = ring_size;
+	strcpy(internals->if_name, if_name);
+	internals->sfd = socket(PF_XDP, SOCK_RAW, 0);
+	if (internals->sfd < 0)
+		goto error_2;
+
+	ret = get_iface_info(if_name, &internals->eth_addr,
+			     &internals->if_index);
+	if (ret)
+		goto error_3;
+
+	eth_dev = rte_eth_vdev_allocate(dev, 0);
+	if (!eth_dev)
+		goto error_3;
+
+	rte_memcpy(data, eth_dev->data, sizeof(*data));
+	internals->port_id = eth_dev->data->port_id;
+	data->dev_private = internals;
+	data->nb_rx_queues = 1;
+	data->nb_tx_queues = 1;
+	data->dev_link = pmd_link;
+	data->mac_addrs = &internals->eth_addr;
+
+	eth_dev->data = data;
+	eth_dev->dev_ops = &ops;
+
+	eth_dev->rx_pkt_burst = eth_af_xdp_rx;
+	eth_dev->tx_pkt_burst = eth_af_xdp_tx;
+
+	return 0;
+
+error_3:
+	close(internals->sfd);
+
+error_2:
+	rte_free(internals);
+
+error_1:
+	rte_free(data);
+	return -1;
+}
+
+static int
+rte_pmd_af_xdp_probe(struct rte_vdev_device *dev)
+{
+	struct rte_kvargs *kvlist;
+	char *if_name = NULL;
+	int ring_size = ETH_AF_XDP_DFLT_RING_SIZE;
+	int queue_idx = ETH_AF_XDP_DFLT_QUEUE_IDX;
+	int ret;
+
+	RTE_LOG(INFO, PMD, "Initializing pmd_af_packet for %s\n",
+		rte_vdev_device_name(dev));
+
+	kvlist = rte_kvargs_parse(rte_vdev_device_args(dev), valid_arguments);
+	if (!kvlist) {
+		RTE_LOG(ERR, PMD,
+			"Invalid kvargs");
+		return -1;
+	}
+
+	if (dev->device.numa_node == SOCKET_ID_ANY)
+		dev->device.numa_node = rte_socket_id();
+
+	parse_parameters(kvlist, &if_name, &queue_idx, &ring_size);
+
+	ret = init_internals(dev, if_name, queue_idx, ring_size);
+	rte_kvargs_free(kvlist);
+
+	return ret;
+}
+
+static int
+rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
+{
+	struct rte_eth_dev *eth_dev = NULL;
+	struct pmd_internals *internals;
+
+	RTE_LOG(INFO, PMD, "Closing AF_XDP ethdev on numa socket %u\n",
+		rte_socket_id());
+
+	if (!dev)
+		return -1;
+
+	/* find the ethdev entry */
+	eth_dev = rte_eth_dev_allocated(rte_vdev_device_name(dev));
+	if (!eth_dev)
+		return -1;
+
+	internals = eth_dev->data->dev_private;
+	rte_ring_free(internals->buf_ring);
+	rte_free(internals->umem);
+	rte_free(eth_dev->data->dev_private);
+	rte_free(eth_dev->data);
+	close(internals->sfd);
+
+	rte_eth_dev_release_port(eth_dev);
+
+	return 0;
+}
+
+static struct rte_vdev_driver pmd_af_xdp_drv = {
+	.probe = rte_pmd_af_xdp_probe,
+	.remove = rte_pmd_af_xdp_remove,
+};
+
+RTE_PMD_REGISTER_VDEV(net_af_xdp, pmd_af_xdp_drv);
+RTE_PMD_REGISTER_ALIAS(net_af_xdp, eth_af_xdp);
+RTE_PMD_REGISTER_PARAM_STRING(net_af_xdp,
+			      "iface=<string> "
+			      "queue=<int> "
+			      "ringsz=<int> ");
diff --git a/drivers/net/af_xdp/rte_pmd_af_xdp_version.map b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
new file mode 100644
index 000000000..ef3539840
--- /dev/null
+++ b/drivers/net/af_xdp/rte_pmd_af_xdp_version.map
@@ -0,0 +1,4 @@
+DPDK_2.0 {
+
+	local: *;
+};
diff --git a/drivers/net/af_xdp/xdpsock_queue.h b/drivers/net/af_xdp/xdpsock_queue.h
new file mode 100644
index 000000000..0dc666a08
--- /dev/null
+++ b/drivers/net/af_xdp/xdpsock_queue.h
@@ -0,0 +1,62 @@
+#ifndef __XDPSOCK_QUEUE_H
+#define __XDPSOCK_QUEUE_H
+
+static inline int xq_enq(struct xdp_queue *q,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	unsigned int avail_idx = q->avail_idx;
+	unsigned int i;
+	int j;
+
+	if (q->num_free < ndescs)
+		return -ENOSPC;
+
+	q->num_free -= ndescs;
+
+	for (i = 0; i < ndescs; i++) {
+		unsigned int idx = avail_idx++ & q->ring_mask;
+
+		q->ring[idx].idx	= descs[i].idx;
+		q->ring[idx].len	= descs[i].len;
+		q->ring[idx].offset	= descs[i].offset;
+		q->ring[idx].error	= 0;
+	}
+	rte_smp_wmb();
+
+	for (j = ndescs - 1; j >= 0; j--) {
+		unsigned int idx = (q->avail_idx + j) & q->ring_mask;
+
+		q->ring[idx].flags = descs[j].flags | XDP_DESC_KERNEL;
+	}
+	q->avail_idx += ndescs;
+
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_queue *q,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	unsigned int idx, last_used_idx = q->last_used_idx;
+	int i, entries = 0;
+
+	for (i = 0; i < ndescs; i++) {
+		idx = (last_used_idx++) & q->ring_mask;
+		if (q->ring[idx].flags & XDP_DESC_KERNEL)
+			break;
+		entries++;
+	}
+	q->num_free += entries;
+
+	rte_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = q->last_used_idx++ & q->ring_mask;
+		descs[i] = q->ring[idx];
+	}
+
+	return entries;
+}
+
+#endif /* __XDPSOCK_QUEUE_H */
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 3eb41d176..bc26e1457 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -120,6 +120,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET)  += -lrte_pmd_af_packet
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP)     += -lrte_pmd_af_xdp
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ARK_PMD)        += -lrte_pmd_ark
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVF_PMD)        += -lrte_pmd_avf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVP_PMD)        += -lrte_pmd_avp
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 2/7] lib/mbuf: enable parse flags when create mempool
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
  2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-27  9:33 ` [RFC 3/7] lib/mempool: allow page size aligned mempool Qi Zhang
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

This give the option that applicaiton can configure each
memory chunk's size precisely. (by MEMPOOL_F_NO_SPREAD).

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 lib/librte_mbuf/rte_mbuf.c | 15 ++++++++++++---
 lib/librte_mbuf/rte_mbuf.h |  8 +++++++-
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 091d388d3..5fd91c87c 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -125,7 +125,7 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 struct rte_mempool * __rte_experimental
 rte_pktmbuf_pool_create_by_ops(const char *name, unsigned int n,
 	unsigned int cache_size, uint16_t priv_size, uint16_t data_room_size,
-	int socket_id, const char *ops_name)
+	unsigned int flags, int socket_id, const char *ops_name)
 {
 	struct rte_mempool *mp;
 	struct rte_pktmbuf_pool_private mbp_priv;
@@ -145,7 +145,7 @@ rte_pktmbuf_pool_create_by_ops(const char *name, unsigned int n,
 	mbp_priv.mbuf_priv_size = priv_size;
 
 	mp = rte_mempool_create_empty(name, n, elt_size, cache_size,
-		 sizeof(struct rte_pktmbuf_pool_private), socket_id, 0);
+		 sizeof(struct rte_pktmbuf_pool_private), socket_id, flags);
 	if (mp == NULL)
 		return NULL;
 
@@ -179,9 +179,18 @@ rte_pktmbuf_pool_create(const char *name, unsigned int n,
 	int socket_id)
 {
 	return rte_pktmbuf_pool_create_by_ops(name, n, cache_size, priv_size,
-			data_room_size, socket_id, NULL);
+			data_room_size, 0, socket_id, NULL);
 }
 
+/* helper to create a mbuf pool with NO_SPREAD */
+struct rte_mempool *
+rte_pktmbuf_pool_create_with_flags(const char *name, unsigned int n,
+	unsigned int cache_size, uint16_t priv_size, uint16_t data_room_size,
+	unsigned int flags, int socket_id)
+{
+	return rte_pktmbuf_pool_create_by_ops(name, n, cache_size, priv_size,
+			data_room_size, flags, socket_id, NULL);
+}
 /* do some sanity checks on a mbuf: panic if it fails */
 void
 rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header)
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 62740254d..6f6af42a8 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -1079,6 +1079,12 @@ rte_pktmbuf_pool_create(const char *name, unsigned n,
 	unsigned cache_size, uint16_t priv_size, uint16_t data_room_size,
 	int socket_id);
 
+struct rte_mempool *
+rte_pktmbuf_pool_create_with_flags(const char *name, unsigned int n,
+	unsigned cache_size, uint16_t priv_size, uint16_t data_room_size,
+	unsigned flags, int socket_id);
+
+
 /**
  * Create a mbuf pool with a given mempool ops name
  *
@@ -1119,7 +1125,7 @@ rte_pktmbuf_pool_create(const char *name, unsigned n,
 struct rte_mempool * __rte_experimental
 rte_pktmbuf_pool_create_by_ops(const char *name, unsigned int n,
 	unsigned int cache_size, uint16_t priv_size, uint16_t data_room_size,
-	int socket_id, const char *ops_name);
+	unsigned int flags, int socket_id, const char *ops_name);
 
 /**
  * Get the data room size of mbufs stored in a pktmbuf_pool
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 3/7] lib/mempool: allow page size aligned mempool
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
  2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
  2018-02-27  9:33 ` [RFC 2/7] lib/mbuf: enable parse flags when create mempool Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-27  9:33 ` [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Allow create a mempool with page size aligned base address.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 lib/librte_mempool/rte_mempool.c | 2 ++
 lib/librte_mempool/rte_mempool.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 54f7f4ba4..f8d4814ad 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -567,6 +567,8 @@ rte_mempool_populate_default(struct rte_mempool *mp)
 		pg_shift = 0; /* not needed, zone is physically contiguous */
 		pg_sz = 0;
 		align = RTE_CACHE_LINE_SIZE;
+		if (mp->flags & MEMPOOL_F_PAGE_ALIGN)
+			align = getpagesize();
 	} else {
 		pg_sz = getpagesize();
 		pg_shift = rte_bsf32(pg_sz);
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..774ab0f66 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -245,6 +245,7 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_PHYS_CONTIG 0x0020 /**< Don't need physically contiguous objs. */
+#define MEMPOOL_F_PAGE_ALIGN     0x0040 /**< Base address is page aligned. */
 /**
  * This capability flag is advertised by a mempool handler, if the whole
  * memory area containing the objects must be physically contiguous.
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (2 preceding siblings ...)
  2018-02-27  9:33 ` [RFC 3/7] lib/mempool: allow page size aligned mempool Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-03-01  2:08   ` Stephen Hemminger
  2018-02-27  9:33 ` [RFC 5/7] net/af_xdp: enable share mempool Qi Zhang
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Now, af_xdp registered memory buffer is managed by rte_mempool.
mbuf be allocated from rte_mempool can be convert to descriptor
index and vice versa.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 165 +++++++++++++++++++++---------------
 1 file changed, 97 insertions(+), 68 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 4eb8a2c28..3c534c77c 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -43,7 +43,11 @@
 
 #define ETH_AF_XDP_FRAME_SIZE		2048
 #define ETH_AF_XDP_NUM_BUFFERS		131072
-#define ETH_AF_XDP_DATA_HEADROOM	0
+/* mempool hdrobj size (64 bytes) + sizeof(struct rte_mbuf) (128 bytes) */
+#define ETH_AF_XDP_MBUF_OVERHEAD	192
+/* data start from offset 320 (192 + 128) bytes */
+#define ETH_AF_XDP_DATA_HEADROOM \
+	(ETH_AF_XDP_MBUF_OVERHEAD + RTE_PKTMBUF_HEADROOM)
 #define ETH_AF_XDP_DFLT_RING_SIZE	1024
 #define ETH_AF_XDP_DFLT_QUEUE_IDX	0
 
@@ -57,6 +61,7 @@ struct xdp_umem {
 	unsigned int frame_size_log2;
 	unsigned int nframes;
 	int mr_fd;
+	struct rte_mempool *mb_pool;
 };
 
 struct pmd_internals {
@@ -67,7 +72,7 @@ struct pmd_internals {
 	struct xdp_queue rx;
 	struct xdp_queue tx;
 	struct xdp_umem *umem;
-	struct rte_mempool *mb_pool;
+	struct rte_mempool *ext_mb_pool;
 
 	unsigned long rx_pkts;
 	unsigned long rx_bytes;
@@ -80,7 +85,6 @@ struct pmd_internals {
 	uint16_t port_id;
 	uint16_t queue_idx;
 	int ring_size;
-	struct rte_ring *buf_ring;
 };
 
 static const char * const valid_arguments[] = {
@@ -106,6 +110,21 @@ static void *get_pkt_data(struct pmd_internals *internals,
 			   offset);
 }
 
+static uint32_t
+mbuf_to_idx(struct pmd_internals *internals, struct rte_mbuf *mbuf)
+{
+	return (uint32_t)(((uint64_t)mbuf->buf_addr -
+			   (uint64_t)internals->umem->buffer) >>
+			  internals->umem->frame_size_log2);
+}
+
+static struct rte_mbuf *
+idx_to_mbuf(struct pmd_internals *internals, uint32_t idx)
+{
+	return (struct rte_mbuf *)(void *)(internals->umem->buffer + (idx
+			<< internals->umem->frame_size_log2) + 0x40);
+}
+
 static uint16_t
 eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 {
@@ -120,17 +139,18 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
 
 	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
-	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
+	struct rte_mbuf *mbufs[ETH_AF_XDP_RX_BATCH_SIZE];
 	int rcvd, i;
 	/* fill rx ring */
 	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {
-		int n = rte_ring_dequeue_bulk(internals->buf_ring,
-					      indexes,
-					      ETH_AF_XDP_RX_BATCH_SIZE,
-					      NULL);
-		for (i = 0; i < n; i++)
-			descs[i].idx = (uint32_t)((long int)indexes[i]);
-		xq_enq(rxq, descs, n);
+		int ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+					     (void *)mbufs,
+					     ETH_AF_XDP_RX_BATCH_SIZE);
+		if (!ret) {
+			for (i = 0; i < ETH_AF_XDP_RX_BATCH_SIZE; i++)
+				descs[i].idx = mbuf_to_idx(internals, mbufs[i]);
+			xq_enq(rxq, descs, ETH_AF_XDP_RX_BATCH_SIZE);
+		}
 	}
 
 	/* read data */
@@ -142,7 +162,7 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		char *pkt;
 		uint32_t idx = descs[i].idx;
 
-		mbuf = rte_pktmbuf_alloc(internals->mb_pool);
+		mbuf = rte_pktmbuf_alloc(internals->ext_mb_pool);
 		rte_pktmbuf_pkt_len(mbuf) =
 			rte_pktmbuf_data_len(mbuf) =
 			descs[i].len;
@@ -155,11 +175,9 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		} else {
 			dropped++;
 		}
-		indexes[i] = (void *)((long int)idx);
+		rte_pktmbuf_free(idx_to_mbuf(internals, idx));
 	}
 
-	rte_ring_enqueue_bulk(internals->buf_ring, indexes, rcvd, NULL);
-
 	internals->rx_pkts += (rcvd - dropped);
 	internals->rx_bytes += rx_bytes;
 	internals->rx_dropped += dropped;
@@ -187,9 +205,10 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	struct xdp_queue *txq = &internals->tx;
 	struct rte_mbuf *mbuf;
 	struct xdp_desc descs[ETH_AF_XDP_TX_BATCH_SIZE];
-	void *indexes[ETH_AF_XDP_TX_BATCH_SIZE];
+	struct rte_mbuf *mbufs[ETH_AF_XDP_TX_BATCH_SIZE];
 	uint16_t i, valid;
 	unsigned long tx_bytes = 0;
+	int ret;
 
 	nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ?
 		  nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE;
@@ -198,13 +217,15 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		int n = xq_deq(txq, descs, ETH_AF_XDP_TX_BATCH_SIZE);
 
 		for (i = 0; i < n; i++)
-			indexes[i] = (void *)((long int)descs[i].idx);
-		rte_ring_enqueue_bulk(internals->buf_ring, indexes, n, NULL);
+			rte_pktmbuf_free(idx_to_mbuf(internals, descs[i].idx));
 	}
 
 	nb_pkts = nb_pkts > txq->num_free ? txq->num_free : nb_pkts;
-	nb_pkts = rte_ring_dequeue_bulk(internals->buf_ring, indexes,
-					nb_pkts, NULL);
+	ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+				   (void *)mbufs,
+				   nb_pkts);
+	if (ret)
+		return 0;
 
 	valid = 0;
 	for (i = 0; i < nb_pkts; i++) {
@@ -213,14 +234,14 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 			internals->umem->frame_size - ETH_AF_XDP_DATA_HEADROOM;
 		mbuf = bufs[i];
 		if (mbuf->pkt_len <= buf_len) {
-			descs[valid].idx = (uint32_t)((long int)indexes[valid]);
+			descs[valid].idx = mbuf_to_idx(internals, mbufs[i]);
 			descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
 			descs[valid].flags = 0;
 			descs[valid].len = mbuf->pkt_len;
 			pkt = get_pkt_data(internals, descs[i].idx,
 					   descs[i].offset);
 			memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
-			       descs[i].len);
+					   descs[i].len);
 			valid++;
 			tx_bytes += mbuf->pkt_len;
 		}
@@ -230,9 +251,10 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	xq_enq(txq, descs, valid);
 	kick_tx(internals->sfd);
 
-	if (valid < nb_pkts)
-		rte_ring_enqueue_bulk(internals->buf_ring, &indexes[valid],
-				      nb_pkts - valid, NULL);
+	if (valid < nb_pkts) {
+		for (i = valid; i < nb_pkts; i++)
+			rte_pktmbuf_free(mbufs[i]);
+	}
 
 	internals->err_pkts += (nb_pkts - valid);
 	internals->tx_pkts += valid;
@@ -245,14 +267,13 @@ static void
 fill_rx_desc(struct pmd_internals *internals)
 {
 	int num_free = internals->rx.num_free;
-	void *p = NULL;
 	int i;
-
 	for (i = 0; i < num_free; i++) {
 		struct xdp_desc desc = {};
+		struct rte_mbuf *mbuf =
+			rte_pktmbuf_alloc(internals->umem->mb_pool);
 
-		rte_ring_dequeue(internals->buf_ring, &p);
-		desc.idx = (uint32_t)((long int)p);
+		desc.idx = mbuf_to_idx(internals, mbuf);
 		xq_enq(&internals->rx, &desc, 1);
 	}
 }
@@ -347,33 +368,53 @@ eth_link_update(struct rte_eth_dev *dev __rte_unused,
 	return 0;
 }
 
-static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd, size_t nbuffers)
+static void *get_base_addr(struct rte_mempool *mb_pool)
+{
+	struct rte_mempool_memhdr *memhdr;
+
+	STAILQ_FOREACH(memhdr, &mb_pool->mem_list, next) {
+		return memhdr->addr;
+	}
+	return NULL;
+}
+
+static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd,
+						      size_t nbuffers,
+						      const char *pool_name)
 {
 	struct xdp_mr_req req = { .frame_size = ETH_AF_XDP_FRAME_SIZE,
 				  .data_headroom = ETH_AF_XDP_DATA_HEADROOM };
-	struct xdp_umem *umem;
-	void *bufs;
-	int ret;
+	struct xdp_umem *umem = calloc(1, sizeof(*umem));
 
-	ret = posix_memalign((void **)&bufs, getpagesize(),
-			     nbuffers * req.frame_size);
-	if (ret)
+	if (!umem)
+		return NULL;
+
+	umem->mb_pool =
+		rte_pktmbuf_pool_create_with_flags(
+			pool_name, nbuffers,
+			250, 0,
+			(ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_MBUF_OVERHEAD),
+			MEMPOOL_F_NO_SPREAD | MEMPOOL_F_PAGE_ALIGN,
+			SOCKET_ID_ANY);
+
+	if (!umem->mb_pool) {
+		free(umem);
 		return NULL;
+	}
 
-	umem = calloc(1, sizeof(*umem));
-	if (!umem) {
-		free(bufs);
+	if (umem->mb_pool->nb_mem_chunks > 1) {
+		rte_mempool_free(umem->mb_pool);
+		free(umem);
 		return NULL;
 	}
 
-	req.addr = (unsigned long)bufs;
+	req.addr = (uint64_t)get_base_addr(umem->mb_pool);
 	req.len = nbuffers * req.frame_size;
-	ret = setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
-	RTE_ASSERT(ret == 0);
+	setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
 
 	umem->frame_size = ETH_AF_XDP_FRAME_SIZE;
 	umem->frame_size_log2 = 11;
-	umem->buffer = bufs;
+	umem->buffer = (char *)req.addr;
 	umem->size = nbuffers * req.frame_size;
 	umem->nframes = nbuffers;
 	umem->mr_fd = sfd;
@@ -386,38 +427,27 @@ xdp_configure(struct pmd_internals *internals)
 {
 	struct sockaddr_xdp sxdp;
 	struct xdp_ring_req req;
-	char ring_name[0x100];
+	char pool_name[0x100];
+
 	int ret = 0;
-	long int i;
 
-	snprintf(ring_name, 0x100, "%s_%s_%d", "af_xdp_ring",
+	snprintf(pool_name, 0x100, "%s_%s_%d", "af_xdp_pool",
 		 internals->if_name, internals->queue_idx);
-	internals->buf_ring = rte_ring_create(ring_name,
-					      ETH_AF_XDP_NUM_BUFFERS,
-					      SOCKET_ID_ANY,
-					      0x0);
-	if (!internals->buf_ring)
-		return -1;
-
-	for (i = 0; i < ETH_AF_XDP_NUM_BUFFERS; i++)
-		rte_ring_enqueue(internals->buf_ring, (void *)i);
-
 	internals->umem = xsk_alloc_and_mem_reg_buffers(internals->sfd,
-							ETH_AF_XDP_NUM_BUFFERS);
+							ETH_AF_XDP_NUM_BUFFERS,
+							pool_name);
 	if (!internals->umem)
-		goto error;
+		return -1;
 
 	req.mr_fd = internals->umem->mr_fd;
 	req.desc_nr = internals->ring_size;
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_RX_RING,
 			 &req, sizeof(req));
-
 	RTE_ASSERT(ret == 0);
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_TX_RING,
 			 &req, sizeof(req));
-
 	RTE_ASSERT(ret == 0);
 
 	internals->rx.ring = mmap(0, req.desc_nr * sizeof(struct xdp_desc),
@@ -448,10 +478,6 @@ xdp_configure(struct pmd_internals *internals)
 	RTE_ASSERT(ret == 0);
 
 	return ret;
-error:
-	rte_ring_free(internals->buf_ring);
-	internals->buf_ring = NULL;
-	return -1;
 }
 
 static int
@@ -466,11 +492,11 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 	unsigned int buf_size, data_size;
 
 	RTE_ASSERT(rx_queue_id == 0);
-	internals->mb_pool = mb_pool;
+	internals->ext_mb_pool = mb_pool;
 	xdp_configure(internals);
 
 	/* Now get the space available for data in the mbuf */
-	buf_size = rte_pktmbuf_data_room_size(internals->mb_pool) -
+	buf_size = rte_pktmbuf_data_room_size(internals->ext_mb_pool) -
 		RTE_PKTMBUF_HEADROOM;
 	data_size = internals->umem->frame_size;
 
@@ -739,8 +765,11 @@ rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
 		return -1;
 
 	internals = eth_dev->data->dev_private;
-	rte_ring_free(internals->buf_ring);
-	rte_free(internals->umem);
+	if (internals->umem) {
+		if (internals->umem->mb_pool)
+			rte_mempool_free(internals->umem->mb_pool);
+		rte_free(internals->umem);
+	}
 	rte_free(eth_dev->data->dev_private);
 	rte_free(eth_dev->data);
 	close(internals->sfd);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 5/7] net/af_xdp: enable share mempool
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (3 preceding siblings ...)
  2018-02-27  9:33 ` [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-02-27  9:33 ` [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Try to check if external mempool (from rx_queue_setup) is fit for
af_xdp, if it is, it will be registered to af_xdp socket directly and
there will be no packet data copy on Rx and Tx.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 drivers/net/af_xdp/rte_eth_af_xdp.c | 191 +++++++++++++++++++++++-------------
 1 file changed, 125 insertions(+), 66 deletions(-)

diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index 3c534c77c..d0939022b 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -60,7 +60,6 @@ struct xdp_umem {
 	unsigned int frame_size;
 	unsigned int frame_size_log2;
 	unsigned int nframes;
-	int mr_fd;
 	struct rte_mempool *mb_pool;
 };
 
@@ -73,6 +72,7 @@ struct pmd_internals {
 	struct xdp_queue tx;
 	struct xdp_umem *umem;
 	struct rte_mempool *ext_mb_pool;
+	uint8_t share_mb_pool;
 
 	unsigned long rx_pkts;
 	unsigned long rx_bytes;
@@ -162,20 +162,30 @@ eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 		char *pkt;
 		uint32_t idx = descs[i].idx;
 
-		mbuf = rte_pktmbuf_alloc(internals->ext_mb_pool);
-		rte_pktmbuf_pkt_len(mbuf) =
-			rte_pktmbuf_data_len(mbuf) =
-			descs[i].len;
-		if (mbuf) {
-			pkt = get_pkt_data(internals, idx, descs[i].offset);
-			memcpy(rte_pktmbuf_mtod(mbuf, void *),
-			       pkt, descs[i].len);
-			rx_bytes += descs[i].len;
-			bufs[count++] = mbuf;
+		if (!internals->share_mb_pool) {
+			mbuf = rte_pktmbuf_alloc(internals->ext_mb_pool);
+			rte_pktmbuf_pkt_len(mbuf) =
+				rte_pktmbuf_data_len(mbuf) =
+				descs[i].len;
+			if (mbuf) {
+				pkt = get_pkt_data(internals, idx,
+						   descs[i].offset);
+				memcpy(rte_pktmbuf_mtod(mbuf, void *), pkt,
+				       descs[i].len);
+				rx_bytes += descs[i].len;
+				bufs[count++] = mbuf;
+			} else {
+				dropped++;
+			}
+			rte_pktmbuf_free(idx_to_mbuf(internals, idx));
 		} else {
-			dropped++;
+			mbuf = idx_to_mbuf(internals, idx);
+			rte_pktmbuf_pkt_len(mbuf) =
+				rte_pktmbuf_data_len(mbuf) =
+				descs[i].len;
+			bufs[count++] = mbuf;
+			rx_bytes += descs[i].len;
 		}
-		rte_pktmbuf_free(idx_to_mbuf(internals, idx));
 	}
 
 	internals->rx_pkts += (rcvd - dropped);
@@ -209,51 +219,71 @@ eth_af_xdp_tx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	uint16_t i, valid;
 	unsigned long tx_bytes = 0;
 	int ret;
+	uint8_t share_mempool = 0;
 
 	nb_pkts = nb_pkts < ETH_AF_XDP_TX_BATCH_SIZE ?
 		  nb_pkts : ETH_AF_XDP_TX_BATCH_SIZE;
 
 	if (txq->num_free < ETH_AF_XDP_TX_BATCH_SIZE * 2) {
 		int n = xq_deq(txq, descs, ETH_AF_XDP_TX_BATCH_SIZE);
-
 		for (i = 0; i < n; i++)
 			rte_pktmbuf_free(idx_to_mbuf(internals, descs[i].idx));
 	}
 
 	nb_pkts = nb_pkts > txq->num_free ? txq->num_free : nb_pkts;
-	ret = rte_mempool_get_bulk(internals->umem->mb_pool,
-				   (void *)mbufs,
-				   nb_pkts);
-	if (ret)
+	if (nb_pkts == 0)
 		return 0;
 
+	if (bufs[0]->pool == internals->ext_mb_pool && internals->share_mb_pool)
+		share_mempool = 1;
+
+	if (!share_mempool) {
+		ret = rte_mempool_get_bulk(internals->umem->mb_pool,
+					   (void *)mbufs,
+					   nb_pkts);
+		if (ret)
+			return 0;
+	}
+
 	valid = 0;
 	for (i = 0; i < nb_pkts; i++) {
 		char *pkt;
-		unsigned int buf_len =
-			internals->umem->frame_size - ETH_AF_XDP_DATA_HEADROOM;
 		mbuf = bufs[i];
-		if (mbuf->pkt_len <= buf_len) {
-			descs[valid].idx = mbuf_to_idx(internals, mbufs[i]);
-			descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
-			descs[valid].flags = 0;
-			descs[valid].len = mbuf->pkt_len;
-			pkt = get_pkt_data(internals, descs[i].idx,
-					   descs[i].offset);
-			memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
-					   descs[i].len);
-			valid++;
+		if (!share_mempool) {
+			if (mbuf->pkt_len <=
+				(internals->umem->frame_size -
+				 ETH_AF_XDP_DATA_HEADROOM)) {
+				descs[valid].idx =
+					mbuf_to_idx(internals, mbufs[i]);
+				descs[valid].offset = ETH_AF_XDP_DATA_HEADROOM;
+				descs[valid].flags = 0;
+				descs[valid].len = mbuf->pkt_len;
+				pkt = get_pkt_data(internals, descs[i].idx,
+						   descs[i].offset);
+				memcpy(pkt, rte_pktmbuf_mtod(mbuf, void *),
+				       descs[i].len);
+				valid++;
+				tx_bytes += mbuf->pkt_len;
+			}
+			rte_pktmbuf_free(mbuf);
+		} else {
+			descs[i].idx = mbuf_to_idx(internals, mbuf);
+			descs[i].offset = ETH_AF_XDP_DATA_HEADROOM;
+			descs[i].flags = 0;
+			descs[i].len = mbuf->pkt_len;
 			tx_bytes += mbuf->pkt_len;
+			valid++;
 		}
-		rte_pktmbuf_free(mbuf);
 	}
 
 	xq_enq(txq, descs, valid);
 	kick_tx(internals->sfd);
 
-	if (valid < nb_pkts) {
-		for (i = valid; i < nb_pkts; i++)
-			rte_pktmbuf_free(mbufs[i]);
+	if (!share_mempool) {
+		if (valid < nb_pkts) {
+			for (i = valid; i < nb_pkts; i++)
+				rte_pktmbuf_free(mbufs[i]);
+		}
 	}
 
 	internals->err_pkts += (nb_pkts - valid);
@@ -378,46 +408,81 @@ static void *get_base_addr(struct rte_mempool *mb_pool)
 	return NULL;
 }
 
-static struct xdp_umem *xsk_alloc_and_mem_reg_buffers(int sfd,
-						      size_t nbuffers,
-						      const char *pool_name)
+static uint8_t
+check_mempool(struct rte_mempool *mp)
+{
+	RTE_ASSERT(mp);
+
+	/* must continues */
+	if (mp->nb_mem_chunks > 1)
+		return 0;
+
+	/* check header size */
+	if (mp->header_size != RTE_CACHE_LINE_SIZE)
+		return 0;
+
+	/* check base address */
+	if ((uint64_t)get_base_addr(mp) % getpagesize() != 0)
+		return 0;
+
+	/* check chunk size */
+	if ((mp->elt_size + mp->header_size + mp->trailer_size) %
+			ETH_AF_XDP_FRAME_SIZE != 0)
+		return 0;
+
+	return 1;
+}
+
+static struct xdp_umem *
+xsk_alloc_and_mem_reg_buffers(struct pmd_internals *internals)
 {
 	struct xdp_mr_req req = { .frame_size = ETH_AF_XDP_FRAME_SIZE,
 				  .data_headroom = ETH_AF_XDP_DATA_HEADROOM };
+	char pool_name[0x100];
+	int nbuffers;
 	struct xdp_umem *umem = calloc(1, sizeof(*umem));
 
 	if (!umem)
 		return NULL;
 
-	umem->mb_pool =
-		rte_pktmbuf_pool_create_with_flags(
-			pool_name, nbuffers,
-			250, 0,
-			(ETH_AF_XDP_FRAME_SIZE - ETH_AF_XDP_MBUF_OVERHEAD),
-			MEMPOOL_F_NO_SPREAD | MEMPOOL_F_PAGE_ALIGN,
-			SOCKET_ID_ANY);
-
-	if (!umem->mb_pool) {
-		free(umem);
-		return NULL;
-	}
+	internals->share_mb_pool = check_mempool(internals->ext_mb_pool);
+	if (!internals->share_mb_pool) {
+		snprintf(pool_name, 0x100, "%s_%s_%d", "af_xdp_pool",
+			 internals->if_name, internals->queue_idx);
+		umem->mb_pool =
+			rte_pktmbuf_pool_create_with_flags(
+				pool_name,
+				ETH_AF_XDP_NUM_BUFFERS,
+				250, 0,
+				(ETH_AF_XDP_FRAME_SIZE -
+				 ETH_AF_XDP_MBUF_OVERHEAD),
+				MEMPOOL_F_NO_SPREAD | MEMPOOL_F_PAGE_ALIGN,
+				SOCKET_ID_ANY);
+		if (!umem->mb_pool) {
+			free(umem);
+			return NULL;
+		}
 
-	if (umem->mb_pool->nb_mem_chunks > 1) {
-		rte_mempool_free(umem->mb_pool);
-		free(umem);
-		return NULL;
+		if (umem->mb_pool->nb_mem_chunks > 1) {
+			rte_mempool_free(umem->mb_pool);
+			free(umem);
+			return NULL;
+		}
+		nbuffers = ETH_AF_XDP_NUM_BUFFERS;
+	} else {
+		umem->mb_pool = internals->ext_mb_pool;
+		nbuffers = umem->mb_pool->populated_size;
 	}
 
 	req.addr = (uint64_t)get_base_addr(umem->mb_pool);
-	req.len = nbuffers * req.frame_size;
-	setsockopt(sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
+	req.len = ETH_AF_XDP_NUM_BUFFERS * req.frame_size;
+	setsockopt(internals->sfd, SOL_XDP, XDP_MEM_REG, &req, sizeof(req));
 
 	umem->frame_size = ETH_AF_XDP_FRAME_SIZE;
 	umem->frame_size_log2 = 11;
 	umem->buffer = (char *)req.addr;
 	umem->size = nbuffers * req.frame_size;
 	umem->nframes = nbuffers;
-	umem->mr_fd = sfd;
 
 	return umem;
 }
@@ -427,19 +492,13 @@ xdp_configure(struct pmd_internals *internals)
 {
 	struct sockaddr_xdp sxdp;
 	struct xdp_ring_req req;
-	char pool_name[0x100];
-
 	int ret = 0;
 
-	snprintf(pool_name, 0x100, "%s_%s_%d", "af_xdp_pool",
-		 internals->if_name, internals->queue_idx);
-	internals->umem = xsk_alloc_and_mem_reg_buffers(internals->sfd,
-							ETH_AF_XDP_NUM_BUFFERS,
-							pool_name);
+	internals->umem = xsk_alloc_and_mem_reg_buffers(internals);
 	if (!internals->umem)
 		return -1;
 
-	req.mr_fd = internals->umem->mr_fd;
+	req.mr_fd = internals->sfd;
 	req.desc_nr = internals->ring_size;
 
 	ret = setsockopt(internals->sfd, SOL_XDP, XDP_RX_RING,
@@ -500,7 +559,7 @@ eth_rx_queue_setup(struct rte_eth_dev *dev,
 		RTE_PKTMBUF_HEADROOM;
 	data_size = internals->umem->frame_size;
 
-	if (data_size > buf_size) {
+	if (data_size - ETH_AF_XDP_DATA_HEADROOM > buf_size) {
 		RTE_LOG(ERR, PMD,
 			"%s: %d bytes will not fit in mbuf (%d bytes)\n",
 			dev->device->name, data_size, buf_size);
@@ -766,7 +825,7 @@ rte_pmd_af_xdp_remove(struct rte_vdev_device *dev)
 
 	internals = eth_dev->data->dev_private;
 	if (internals->umem) {
-		if (internals->umem->mb_pool)
+		if (internals->umem->mb_pool && !internals->share_mb_pool)
 			rte_mempool_free(internals->umem->mb_pool);
 		rte_free(internals->umem);
 	}
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 6/7] net/af_xdp: load BPF file
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (4 preceding siblings ...)
  2018-02-27  9:33 ` [RFC 5/7] net/af_xdp: enable share mempool Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-03-01  2:10   ` Stephen Hemminger
  2018-02-27  9:33 ` [RFC 7/7] app/testpmd: enable parameter for mempool flags Qi Zhang
  2018-03-01  2:52 ` [RFC 0/7] PMD driver for AF_XDP Jason Wang
  7 siblings, 1 reply; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Add libbpf and libelf dependency in Makefile.
Durring initialization, bpf file "xdpsock_kern.o" will be loaded.
Then the driver will always try to link XDP fd with DRV mode first,
then SKB mode if failed in previoius.
Link will be released during dev_close.

Note: this is workaround solution, af_xdp may remove BPF dependency
in future.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 drivers/net/af_xdp/Makefile         |   6 +-
 drivers/net/af_xdp/bpf_load.c       | 798 ++++++++++++++++++++++++++++++++++++
 drivers/net/af_xdp/bpf_load.h       |  65 +++
 drivers/net/af_xdp/libbpf.h         | 199 +++++++++
 drivers/net/af_xdp/rte_eth_af_xdp.c |  31 +-
 mk/rte.app.mk                       |   2 +-
 6 files changed, 1097 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/af_xdp/bpf_load.c
 create mode 100644 drivers/net/af_xdp/bpf_load.h
 create mode 100644 drivers/net/af_xdp/libbpf.h

diff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile
index ac38e20bf..a642786de 100644
--- a/drivers/net/af_xdp/Makefile
+++ b/drivers/net/af_xdp/Makefile
@@ -42,7 +42,10 @@ EXPORT_MAP := rte_pmd_af_xdp_version.map
 
 LIBABIVER := 1
 
-CFLAGS += -O3 -I/opt/af_xdp/linux_headers/include
+LINUX_HEADER_DIR := /opt/af_xdp/linux_headers/include
+TOOLS_DIR := /root/af_xdp/npg_dna-dna-linux/tools
+
+CFLAGS += -O3 -I$(LINUX_HEADER_DIR) -I$(TOOLS_DIR)/perf -I$(TOOLS_DIR)/include -Wno-error=sign-compare -Wno-error=cast-qual
 CFLAGS += $(WERROR_FLAGS)
 LDLIBS += -lrte_eal -lrte_mbuf -lrte_mempool -lrte_ring
 LDLIBS += -lrte_ethdev -lrte_net -lrte_kvargs
@@ -52,5 +55,6 @@ LDLIBS += -lrte_bus_vdev
 # all source are stored in SRCS-y
 #
 SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += rte_eth_af_xdp.c
+SRCS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP) += bpf_load.c
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/net/af_xdp/bpf_load.c b/drivers/net/af_xdp/bpf_load.c
new file mode 100644
index 000000000..aa632207f
--- /dev/null
+++ b/drivers/net/af_xdp/bpf_load.c
@@ -0,0 +1,798 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <libelf.h>
+#include <gelf.h>
+#include <errno.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/perf_event.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <linux/types.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <poll.h>
+#include <ctype.h>
+#include <assert.h>
+#include "libbpf.h"
+#include "bpf_load.h"
+#include "perf-sys.h"
+
+#define DEBUGFS "/sys/kernel/debug/tracing/"
+
+static char license[128];
+static int kern_version;
+static bool processed_sec[128];
+char bpf_log_buf[BPF_LOG_BUF_SIZE];
+int map_fd[MAX_MAPS];
+int prog_fd[MAX_PROGS];
+int event_fd[MAX_PROGS];
+int prog_cnt;
+int prog_array_fd = -1;
+
+struct bpf_map_data map_data[MAX_MAPS];
+int map_data_count = 0;
+
+static int populate_prog_array(const char *event, int prog_fd)
+{
+	int ind = atoi(event), err;
+
+	err = bpf_map_update_elem(prog_array_fd, &ind, &prog_fd, BPF_ANY);
+	if (err < 0) {
+		printf("failed to store prog_fd in prog_array\n");
+		return -1;
+	}
+	return 0;
+}
+
+static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
+{
+	bool is_socket = strncmp(event, "socket", 6) == 0;
+	bool is_kprobe = strncmp(event, "kprobe/", 7) == 0;
+	bool is_kretprobe = strncmp(event, "kretprobe/", 10) == 0;
+	bool is_tracepoint = strncmp(event, "tracepoint/", 11) == 0;
+	bool is_xdp = strncmp(event, "xdp", 3) == 0;
+	bool is_perf_event = strncmp(event, "perf_event", 10) == 0;
+	bool is_cgroup_skb = strncmp(event, "cgroup/skb", 10) == 0;
+	bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
+	bool is_sockops = strncmp(event, "sockops", 7) == 0;
+	bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
+	size_t insns_cnt = size / sizeof(struct bpf_insn);
+	enum bpf_prog_type prog_type;
+	char buf[256];
+	int fd, efd, err, id;
+	struct perf_event_attr attr = {};
+
+	attr.type = PERF_TYPE_TRACEPOINT;
+	attr.sample_type = PERF_SAMPLE_RAW;
+	attr.sample_period = 1;
+	attr.wakeup_events = 1;
+
+	if (is_socket) {
+		prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+	} else if (is_kprobe || is_kretprobe) {
+		prog_type = BPF_PROG_TYPE_KPROBE;
+	} else if (is_tracepoint) {
+		prog_type = BPF_PROG_TYPE_TRACEPOINT;
+	} else if (is_xdp) {
+		prog_type = BPF_PROG_TYPE_XDP;
+	} else if (is_perf_event) {
+		prog_type = BPF_PROG_TYPE_PERF_EVENT;
+	} else if (is_cgroup_skb) {
+		prog_type = BPF_PROG_TYPE_CGROUP_SKB;
+	} else if (is_cgroup_sk) {
+		prog_type = BPF_PROG_TYPE_CGROUP_SOCK;
+	} else if (is_sockops) {
+		prog_type = BPF_PROG_TYPE_SOCK_OPS;
+	} else if (is_sk_skb) {
+		prog_type = BPF_PROG_TYPE_SK_SKB;
+	} else {
+		printf("Unknown event '%s'\n", event);
+		return -1;
+	}
+
+	fd = bpf_load_program(prog_type, prog, insns_cnt, license, kern_version,
+			      bpf_log_buf, BPF_LOG_BUF_SIZE);
+	if (fd < 0) {
+		printf("bpf_load_program() err=%d\n%s", errno, bpf_log_buf);
+		return -1;
+	}
+
+	prog_fd[prog_cnt++] = fd;
+
+	if (is_xdp || is_perf_event || is_cgroup_skb || is_cgroup_sk)
+		return 0;
+
+	if (is_socket || is_sockops || is_sk_skb) {
+		if (is_socket)
+			event += 6;
+		else
+			event += 7;
+		if (*event != '/')
+			return 0;
+		event++;
+		if (!isdigit(*event)) {
+			printf("invalid prog number\n");
+			return -1;
+		}
+		return populate_prog_array(event, fd);
+	}
+
+	if (is_kprobe || is_kretprobe) {
+		if (is_kprobe)
+			event += 7;
+		else
+			event += 10;
+
+		if (*event == 0) {
+			printf("event name cannot be empty\n");
+			return -1;
+		}
+
+		if (isdigit(*event))
+			return populate_prog_array(event, fd);
+
+		snprintf(buf, sizeof(buf),
+			 "echo '%c:%s %s' >> /sys/kernel/debug/tracing/kprobe_events",
+			 is_kprobe ? 'p' : 'r', event, event);
+		err = system(buf);
+		if (err < 0) {
+			printf("failed to create kprobe '%s' error '%s'\n",
+			       event, strerror(errno));
+			return -1;
+		}
+
+		strcpy(buf, DEBUGFS);
+		strcat(buf, "events/kprobes/");
+		strcat(buf, event);
+		strcat(buf, "/id");
+	} else if (is_tracepoint) {
+		event += 11;
+
+		if (*event == 0) {
+			printf("event name cannot be empty\n");
+			return -1;
+		}
+		strcpy(buf, DEBUGFS);
+		strcat(buf, "events/");
+		strcat(buf, event);
+		strcat(buf, "/id");
+	}
+
+	efd = open(buf, O_RDONLY, 0);
+	if (efd < 0) {
+		printf("failed to open event %s\n", event);
+		return -1;
+	}
+
+	err = read(efd, buf, sizeof(buf));
+	if (err < 0 || err >= sizeof(buf)) {
+		printf("read from '%s' failed '%s'\n", event, strerror(errno));
+		return -1;
+	}
+
+	close(efd);
+
+	buf[err] = 0;
+	id = atoi(buf);
+	attr.config = id;
+
+	efd = sys_perf_event_open(&attr, -1/*pid*/, 0/*cpu*/, -1/*group_fd*/, 0);
+	if (efd < 0) {
+		printf("event %d fd %d err %s\n", id, efd, strerror(errno));
+		return -1;
+	}
+	event_fd[prog_cnt - 1] = efd;
+	err = ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
+	if (err < 0) {
+		printf("ioctl PERF_EVENT_IOC_ENABLE failed err %s\n",
+		       strerror(errno));
+		return -1;
+	}
+	err = ioctl(efd, PERF_EVENT_IOC_SET_BPF, fd);
+	if (err < 0) {
+		printf("ioctl PERF_EVENT_IOC_SET_BPF failed err %s\n",
+		       strerror(errno));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int load_maps(struct bpf_map_data *maps, int nr_maps,
+		     fixup_map_cb fixup_map)
+{
+	int i, numa_node;
+
+	for (i = 0; i < nr_maps; i++) {
+		if (fixup_map) {
+			fixup_map(&maps[i], i);
+			/* Allow userspace to assign map FD prior to creation */
+			if (maps[i].fd != -1) {
+				map_fd[i] = maps[i].fd;
+				continue;
+			}
+		}
+
+		numa_node = maps[i].def.map_flags & BPF_F_NUMA_NODE ?
+			maps[i].def.numa_node : -1;
+
+		if (maps[i].def.type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+		    maps[i].def.type == BPF_MAP_TYPE_HASH_OF_MAPS) {
+			int inner_map_fd = map_fd[maps[i].def.inner_map_idx];
+
+			map_fd[i] = bpf_create_map_in_map_node(maps[i].def.type,
+							maps[i].name,
+							maps[i].def.key_size,
+							inner_map_fd,
+							maps[i].def.max_entries,
+							maps[i].def.map_flags,
+							numa_node);
+		} else {
+			map_fd[i] = bpf_create_map_node(maps[i].def.type,
+							maps[i].name,
+							maps[i].def.key_size,
+							maps[i].def.value_size,
+							maps[i].def.max_entries,
+							maps[i].def.map_flags,
+							numa_node);
+		}
+		if (map_fd[i] < 0) {
+			printf("failed to create a map: %d %s\n",
+			       errno, strerror(errno));
+			return 1;
+		}
+		maps[i].fd = map_fd[i];
+
+		if (maps[i].def.type == BPF_MAP_TYPE_PROG_ARRAY)
+			prog_array_fd = map_fd[i];
+	}
+	return 0;
+}
+
+static int get_sec(Elf *elf, int i, GElf_Ehdr *ehdr, char **shname,
+		   GElf_Shdr *shdr, Elf_Data **data)
+{
+	Elf_Scn *scn;
+
+	scn = elf_getscn(elf, i);
+	if (!scn)
+		return 1;
+
+	if (gelf_getshdr(scn, shdr) != shdr)
+		return 2;
+
+	*shname = elf_strptr(elf, ehdr->e_shstrndx, shdr->sh_name);
+	if (!*shname || !shdr->sh_size)
+		return 3;
+
+	*data = elf_getdata(scn, 0);
+	if (!*data || elf_getdata(scn, *data) != NULL)
+		return 4;
+
+	return 0;
+}
+
+static int parse_relo_and_apply(Elf_Data *data, Elf_Data *symbols,
+				GElf_Shdr *shdr, struct bpf_insn *insn,
+				struct bpf_map_data *maps, int nr_maps)
+{
+	int i, nrels;
+
+	nrels = shdr->sh_size / shdr->sh_entsize;
+
+	for (i = 0; i < nrels; i++) {
+		GElf_Sym sym;
+		GElf_Rel rel;
+		unsigned int insn_idx;
+		bool match = false;
+		int map_idx;
+
+		gelf_getrel(data, i, &rel);
+
+		insn_idx = rel.r_offset / sizeof(struct bpf_insn);
+
+		gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym);
+
+		if (insn[insn_idx].code != (BPF_LD | BPF_IMM | BPF_DW)) {
+			printf("invalid relo for insn[%d].code 0x%x\n",
+			       insn_idx, insn[insn_idx].code);
+			return 1;
+		}
+		insn[insn_idx].src_reg = BPF_PSEUDO_MAP_FD;
+
+		/* Match FD relocation against recorded map_data[] offset */
+		for (map_idx = 0; map_idx < nr_maps; map_idx++) {
+			if (maps[map_idx].elf_offset == sym.st_value) {
+				match = true;
+				break;
+			}
+		}
+		if (match) {
+			insn[insn_idx].imm = maps[map_idx].fd;
+		} else {
+			printf("invalid relo for insn[%d] no map_data match\n",
+			       insn_idx);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+static int cmp_symbols(const void *l, const void *r)
+{
+	const GElf_Sym *lsym = (const GElf_Sym *)l;
+	const GElf_Sym *rsym = (const GElf_Sym *)r;
+
+	if (lsym->st_value < rsym->st_value)
+		return -1;
+	else if (lsym->st_value > rsym->st_value)
+		return 1;
+	else
+		return 0;
+}
+
+static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
+				 Elf *elf, Elf_Data *symbols, int strtabidx)
+{
+	int map_sz_elf, map_sz_copy;
+	bool validate_zero = false;
+	Elf_Data *data_maps;
+	int i, nr_maps;
+	GElf_Sym *sym;
+	Elf_Scn *scn;
+
+	if (maps_shndx < 0)
+		return -EINVAL;
+	if (!symbols)
+		return -EINVAL;
+
+	/* Get data for maps section via elf index */
+	scn = elf_getscn(elf, maps_shndx);
+	if (scn)
+		data_maps = elf_getdata(scn, NULL);
+	if (!scn || !data_maps) {
+		printf("Failed to get Elf_Data from maps section %d\n",
+		       maps_shndx);
+		return -EINVAL;
+	}
+
+	/* For each map get corrosponding symbol table entry */
+	sym = calloc(MAX_MAPS+1, sizeof(GElf_Sym));
+	for (i = 0, nr_maps = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
+		assert(nr_maps < MAX_MAPS+1);
+		if (!gelf_getsym(symbols, i, &sym[nr_maps]))
+			continue;
+		if (sym[nr_maps].st_shndx != maps_shndx)
+			continue;
+		/* Only increment iif maps section */
+		nr_maps++;
+	}
+
+	/* Align to map_fd[] order, via sort on offset in sym.st_value */
+	qsort(sym, nr_maps, sizeof(GElf_Sym), cmp_symbols);
+
+	/* Keeping compatible with ELF maps section changes
+	 * ------------------------------------------------
+	 * The program size of struct bpf_map_def is known by loader
+	 * code, but struct stored in ELF file can be different.
+	 *
+	 * Unfortunately sym[i].st_size is zero.  To calculate the
+	 * struct size stored in the ELF file, assume all struct have
+	 * the same size, and simply divide with number of map
+	 * symbols.
+	 */
+	map_sz_elf = data_maps->d_size / nr_maps;
+	map_sz_copy = sizeof(struct bpf_map_def);
+	if (map_sz_elf < map_sz_copy) {
+		/*
+		 * Backward compat, loading older ELF file with
+		 * smaller struct, keeping remaining bytes zero.
+		 */
+		map_sz_copy = map_sz_elf;
+	} else if (map_sz_elf > map_sz_copy) {
+		/*
+		 * Forward compat, loading newer ELF file with larger
+		 * struct with unknown features. Assume zero means
+		 * feature not used.  Thus, validate rest of struct
+		 * data is zero.
+		 */
+		validate_zero = true;
+	}
+
+	/* Memcpy relevant part of ELF maps data to loader maps */
+	for (i = 0; i < nr_maps; i++) {
+		unsigned char *addr, *end;
+		struct bpf_map_def *def;
+		const char *map_name;
+		size_t offset;
+
+		map_name = elf_strptr(elf, strtabidx, sym[i].st_name);
+		maps[i].name = strdup(map_name);
+		if (!maps[i].name) {
+			printf("strdup(%s): %s(%d)\n", map_name,
+			       strerror(errno), errno);
+			free(sym);
+			return -errno;
+		}
+
+		/* Symbol value is offset into ELF maps section data area */
+		offset = sym[i].st_value;
+		def = (struct bpf_map_def *)((uint8_t *)data_maps->d_buf + offset);
+		maps[i].elf_offset = offset;
+		memset(&maps[i].def, 0, sizeof(struct bpf_map_def));
+		memcpy(&maps[i].def, def, map_sz_copy);
+
+		/* Verify no newer features were requested */
+		if (validate_zero) {
+			addr = (unsigned char*) def + map_sz_copy;
+			end  = (unsigned char*) def + map_sz_elf;
+			for (; addr < end; addr++) {
+				if (*addr != 0) {
+					free(sym);
+					return -EFBIG;
+				}
+			}
+		}
+	}
+
+	free(sym);
+	return nr_maps;
+}
+
+static int do_load_bpf_file(const char *path, fixup_map_cb fixup_map)
+{
+	int fd, i, ret, maps_shndx = -1, strtabidx = -1;
+	Elf *elf;
+	GElf_Ehdr ehdr;
+	GElf_Shdr shdr, shdr_prog;
+	Elf_Data *data, *data_prog, *data_maps = NULL, *symbols = NULL;
+	char *shname, *shname_prog;
+	int nr_maps = 0;
+
+	/* reset global variables */
+	kern_version = 0;
+	memset(license, 0, sizeof(license));
+	memset(processed_sec, 0, sizeof(processed_sec));
+
+	if (elf_version(EV_CURRENT) == EV_NONE)
+		return 1;
+
+	fd = open(path, O_RDONLY, 0);
+	if (fd < 0)
+		return 1;
+
+	elf = elf_begin(fd, ELF_C_READ, NULL);
+
+	if (!elf)
+		return 1;
+
+	if (gelf_getehdr(elf, &ehdr) != &ehdr)
+		return 1;
+
+	/* clear all kprobes */
+	i = system("echo \"\" > /sys/kernel/debug/tracing/kprobe_events");
+
+	/* scan over all elf sections to get license and map info */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (0) /* helpful for llvm debugging */
+			printf("section %d:%s data %p size %zd link %d flags %d\n",
+			       i, shname, data->d_buf, data->d_size,
+			       shdr.sh_link, (int) shdr.sh_flags);
+
+		if (strcmp(shname, "license") == 0) {
+			processed_sec[i] = true;
+			memcpy(license, data->d_buf, data->d_size);
+		} else if (strcmp(shname, "version") == 0) {
+			processed_sec[i] = true;
+			if (data->d_size != sizeof(int)) {
+				printf("invalid size of version section %zd\n",
+				       data->d_size);
+				return 1;
+			}
+			memcpy(&kern_version, data->d_buf, sizeof(int));
+		} else if (strcmp(shname, "maps") == 0) {
+			int j;
+
+			maps_shndx = i;
+			data_maps = data;
+			for (j = 0; j < MAX_MAPS; j++)
+				map_data[j].fd = -1;
+		} else if (shdr.sh_type == SHT_SYMTAB) {
+			strtabidx = shdr.sh_link;
+			symbols = data;
+		}
+	}
+
+	ret = 1;
+
+	if (!symbols) {
+		printf("missing SHT_SYMTAB section\n");
+		goto done;
+	}
+
+	if (data_maps) {
+		nr_maps = load_elf_maps_section(map_data, maps_shndx,
+						elf, symbols, strtabidx);
+		if (nr_maps < 0) {
+			printf("Error: Failed loading ELF maps (errno:%d):%s\n",
+			       nr_maps, strerror(-nr_maps));
+			ret = 1;
+			goto done;
+		}
+		if (load_maps(map_data, nr_maps, fixup_map))
+			goto done;
+		map_data_count = nr_maps;
+
+		processed_sec[maps_shndx] = true;
+	}
+
+	/* process all relo sections, and rewrite bpf insns for maps */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+		if (processed_sec[i])
+			continue;
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (shdr.sh_type == SHT_REL) {
+			struct bpf_insn *insns;
+
+			/* locate prog sec that need map fixup (relocations) */
+			if (get_sec(elf, shdr.sh_info, &ehdr, &shname_prog,
+				    &shdr_prog, &data_prog))
+				continue;
+
+			if (shdr_prog.sh_type != SHT_PROGBITS ||
+			    !(shdr_prog.sh_flags & SHF_EXECINSTR))
+				continue;
+
+			insns = (struct bpf_insn *) data_prog->d_buf;
+			processed_sec[i] = true; /* relo section */
+
+			if (parse_relo_and_apply(data, symbols, &shdr, insns,
+						 map_data, nr_maps))
+				continue;
+		}
+	}
+
+	/* load programs */
+	for (i = 1; i < ehdr.e_shnum; i++) {
+
+		if (processed_sec[i])
+			continue;
+
+		if (get_sec(elf, i, &ehdr, &shname, &shdr, &data))
+			continue;
+
+		if (memcmp(shname, "kprobe/", 7) == 0 ||
+		    memcmp(shname, "kretprobe/", 10) == 0 ||
+		    memcmp(shname, "tracepoint/", 11) == 0 ||
+		    memcmp(shname, "xdp", 3) == 0 ||
+		    memcmp(shname, "perf_event", 10) == 0 ||
+		    memcmp(shname, "socket", 6) == 0 ||
+		    memcmp(shname, "cgroup/", 7) == 0 ||
+		    memcmp(shname, "sockops", 7) == 0 ||
+		    memcmp(shname, "sk_skb", 6) == 0) {
+			ret = load_and_attach(shname, data->d_buf,
+					      data->d_size);
+			if (ret != 0)
+				goto done;
+		}
+	}
+
+	ret = 0;
+done:
+	close(fd);
+	return ret;
+}
+
+int load_bpf_file(const char *path)
+{
+	return do_load_bpf_file(path, NULL);
+}
+
+int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map)
+{
+	return do_load_bpf_file(path, fixup_map);
+}
+
+void read_trace_pipe(void)
+{
+	int trace_fd;
+
+	trace_fd = open(DEBUGFS "trace_pipe", O_RDONLY, 0);
+	if (trace_fd < 0)
+		return;
+
+	while (1) {
+		static char buf[4096];
+		ssize_t sz;
+
+		sz = read(trace_fd, buf, sizeof(buf));
+		if (sz > 0) {
+			buf[sz] = 0;
+			puts(buf);
+		}
+	}
+}
+
+#define MAX_SYMS 300000
+static struct ksym syms[MAX_SYMS];
+static int sym_cnt;
+
+static int ksym_cmp(const void *p1, const void *p2)
+{
+	return ((struct ksym *)p1)->addr - ((struct ksym *)p2)->addr;
+}
+
+int load_kallsyms(void)
+{
+	FILE *f = fopen("/proc/kallsyms", "r");
+	char func[256], buf[256];
+	char symbol;
+	void *addr;
+	int i = 0;
+
+	if (!f)
+		return -ENOENT;
+
+	while (!feof(f)) {
+		if (!fgets(buf, sizeof(buf), f))
+			break;
+		if (sscanf(buf, "%p %c %s", &addr, &symbol, func) != 3)
+			break;
+		if (!addr)
+			continue;
+		syms[i].addr = (long) addr;
+		syms[i].name = strdup(func);
+		i++;
+	}
+	sym_cnt = i;
+	qsort(syms, sym_cnt, sizeof(struct ksym), ksym_cmp);
+	return 0;
+}
+
+struct ksym *ksym_search(long key)
+{
+	int start = 0, end = sym_cnt;
+	int result;
+
+	while (start < end) {
+		size_t mid = start + (end - start) / 2;
+
+		result = key - syms[mid].addr;
+		if (result < 0)
+			end = mid;
+		else if (result > 0)
+			start = mid + 1;
+		else
+			return &syms[mid];
+	}
+
+	if (start >= 1 && syms[start - 1].addr < key &&
+	    key < syms[start].addr)
+		/* valid ksym */
+		return &syms[start - 1];
+
+	/* out of range. return _stext */
+	return &syms[0];
+}
+
+int set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		printf("open netlink socket: %s\n", strerror(errno));
+		return -1;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		printf("bind to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+
+	/* started nested attribute for XDP */
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+	nla->nla_len = NLA_HDRLEN;
+
+	/* add XDP fd */
+	nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+	nla_xdp->nla_type = 1/*IFLA_XDP_FD*/;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len += nla_xdp->nla_len;
+
+	/* if user passed in any flags, add those too */
+	if (flags) {
+		nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+		nla_xdp->nla_type = 3/*IFLA_XDP_FLAGS*/;
+		nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+		memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+		nla->nla_len += nla_xdp->nla_len;
+	}
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		printf("send to netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		printf("recv from netlink: %s\n", strerror(errno));
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != getpid()) {
+			printf("Wrong pid %d, expected %d\n",
+			       nh->nlmsg_pid, getpid());
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			printf("Wrong seq %d, expected %d\n",
+			       nh->nlmsg_seq, seq);
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			printf("nlmsg error %s\n", strerror(-err->error));
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
diff --git a/drivers/net/af_xdp/bpf_load.h b/drivers/net/af_xdp/bpf_load.h
new file mode 100644
index 000000000..5450e8b19
--- /dev/null
+++ b/drivers/net/af_xdp/bpf_load.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __BPF_LOAD_H
+#define __BPF_LOAD_H
+
+#include "libbpf.h"
+
+#define MAX_MAPS 32
+#define MAX_PROGS 32
+
+struct bpf_map_def {
+	unsigned int type;
+	unsigned int key_size;
+	unsigned int value_size;
+	unsigned int max_entries;
+	unsigned int map_flags;
+	unsigned int inner_map_idx;
+	unsigned int numa_node;
+};
+
+struct bpf_map_data {
+	int fd;
+	char *name;
+	size_t elf_offset;
+	struct bpf_map_def def;
+};
+
+typedef void (*fixup_map_cb)(struct bpf_map_data *map, int idx);
+
+extern int prog_fd[MAX_PROGS];
+extern int event_fd[MAX_PROGS];
+extern char bpf_log_buf[BPF_LOG_BUF_SIZE];
+extern int prog_cnt;
+
+/* There is a one-to-one mapping between map_fd[] and map_data[].
+ * The map_data[] just contains more rich info on the given map.
+ */
+extern int map_fd[MAX_MAPS];
+extern struct bpf_map_data map_data[MAX_MAPS];
+extern int map_data_count;
+
+/* parses elf file compiled by llvm .c->.o
+ * . parses 'maps' section and creates maps via BPF syscall
+ * . parses 'license' section and passes it to syscall
+ * . parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns by
+ *   storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
+ * . loads eBPF programs via BPF syscall
+ *
+ * One ELF file can contain multiple BPF programs which will be loaded
+ * and their FDs stored stored in prog_fd array
+ *
+ * returns zero on success
+ */
+int load_bpf_file(const char *path);
+int load_bpf_file_fixup_map(const char *path, fixup_map_cb fixup_map);
+
+void read_trace_pipe(void);
+struct ksym {
+	long addr;
+	char *name;
+};
+
+int load_kallsyms(void);
+struct ksym *ksym_search(long key);
+int set_link_xdp_fd(int ifindex, int fd, __u32 flags);
+#endif
diff --git a/drivers/net/af_xdp/libbpf.h b/drivers/net/af_xdp/libbpf.h
new file mode 100644
index 000000000..18bfee5aa
--- /dev/null
+++ b/drivers/net/af_xdp/libbpf.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* eBPF mini library */
+#ifndef __LIBBPF_H
+#define __LIBBPF_H
+
+#include <bpf/bpf.h>
+
+struct bpf_insn;
+
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_ALU32_REG(OP, DST, SRC)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */
+
+#define BPF_ALU64_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_OP(OP) | BPF_K,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_ALU32_IMM(OP, DST, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Short form of mov, dst_reg = src_reg */
+
+#define BPF_MOV64_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#define BPF_MOV32_REG(DST, SRC)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+/* Short form of mov, dst_reg = imm32 */
+
+#define BPF_MOV64_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU64 | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+#define BPF_MOV32_IMM(DST, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_ALU | BPF_MOV | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */
+#define BPF_LD_IMM64(DST, IMM)					\
+	BPF_LD_IMM64_RAW(DST, 0, IMM)
+
+#define BPF_LD_IMM64_RAW(DST, SRC, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_DW | BPF_IMM,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = 0,					\
+		.imm   = (__u32) (IMM) }),			\
+	((struct bpf_insn) {					\
+		.code  = 0, /* zero is reserved opcode */	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = ((__u64) (IMM)) >> 32 })
+
+#ifndef BPF_PSEUDO_MAP_FD
+# define BPF_PSEUDO_MAP_FD	1
+#endif
+
+/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */
+#define BPF_LD_MAP_FD(DST, MAP_FD)				\
+	BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)
+
+
+/* Direct packet access, R0 = *(uint *) (skb->data + imm32) */
+
+#define BPF_LD_ABS(SIZE, IMM)					\
+	((struct bpf_insn) {					\
+		.code  = BPF_LD | BPF_SIZE(SIZE) | BPF_ABS,	\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = IMM })
+
+/* Memory load, dst_reg = *(uint *) (src_reg + off16) */
+
+#define BPF_LDX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_LDX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = src_reg */
+
+#define BPF_STX_MEM(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Atomic memory add, *(uint *)(dst_reg + off16) += src_reg */
+
+#define BPF_STX_XADD(SIZE, DST, SRC, OFF)			\
+	((struct bpf_insn) {					\
+		.code  = BPF_STX | BPF_SIZE(SIZE) | BPF_XADD,	\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Memory store, *(uint *) (dst_reg + off16) = imm32 */
+
+#define BPF_ST_MEM(SIZE, DST, OFF, IMM)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_ST | BPF_SIZE(SIZE) | BPF_MEM,	\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Conditional jumps against registers, if (dst_reg 'op' src_reg) goto pc + off16 */
+
+#define BPF_JMP_REG(OP, DST, SRC, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_X,		\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = 0 })
+
+/* Conditional jumps against immediates, if (dst_reg 'op' imm32) goto pc + off16 */
+
+#define BPF_JMP_IMM(OP, DST, IMM, OFF)				\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_OP(OP) | BPF_K,		\
+		.dst_reg = DST,					\
+		.src_reg = 0,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Raw code statement block */
+
+#define BPF_RAW_INSN(CODE, DST, SRC, OFF, IMM)			\
+	((struct bpf_insn) {					\
+		.code  = CODE,					\
+		.dst_reg = DST,					\
+		.src_reg = SRC,					\
+		.off   = OFF,					\
+		.imm   = IMM })
+
+/* Program exit */
+
+#define BPF_EXIT_INSN()						\
+	((struct bpf_insn) {					\
+		.code  = BPF_JMP | BPF_EXIT,			\
+		.dst_reg = 0,					\
+		.src_reg = 0,					\
+		.off   = 0,					\
+		.imm   = 0 })
+
+#endif
diff --git a/drivers/net/af_xdp/rte_eth_af_xdp.c b/drivers/net/af_xdp/rte_eth_af_xdp.c
index d0939022b..903ca0d01 100644
--- a/drivers/net/af_xdp/rte_eth_af_xdp.c
+++ b/drivers/net/af_xdp/rte_eth_af_xdp.c
@@ -15,6 +15,7 @@
 
 #include <linux/if_ether.h>
 #include <linux/if_xdp.h>
+#include <linux/if_link.h>
 #include <arpa/inet.h>
 #include <net/if.h>
 #include <sys/types.h>
@@ -24,6 +25,7 @@
 #include <unistd.h>
 #include <poll.h>
 #include "xdpsock_queue.h"
+#include "bpf_load.h"
 
 #ifndef SOL_XDP
 #define SOL_XDP 283
@@ -85,6 +87,8 @@ struct pmd_internals {
 	uint16_t port_id;
 	uint16_t queue_idx;
 	int ring_size;
+
+	uint32_t xdp_flags;
 };
 
 static const char * const valid_arguments[] = {
@@ -382,8 +386,12 @@ eth_stats_reset(struct rte_eth_dev *dev)
 }
 
 static void
-eth_dev_close(struct rte_eth_dev *dev __rte_unused)
+eth_dev_close(struct rte_eth_dev *dev)
 {
+	struct pmd_internals *internals = dev->data->dev_private;
+
+	if (internals->xdp_flags)
+		set_link_xdp_fd(internals->if_index, -1, internals->xdp_flags);
 }
 
 static void
@@ -745,9 +753,25 @@ init_internals(struct rte_vdev_device *dev,
 	if (ret)
 		goto error_3;
 
+	/* need fix: hard coded bpf file */
+	if (load_bpf_file("xdpsock_kern.o")) {
+		printf("load bpf file failed\n");
+		goto error_3;
+	}
+	RTE_ASSERT(prog_fd[0]);
+
+	if (!set_link_xdp_fd(internals->if_index, prog_fd[0],
+			     XDP_FLAGS_DRV_MODE))
+		internals->xdp_flags = XDP_FLAGS_DRV_MODE;
+	else if (!set_link_xdp_fd(internals->if_index, prog_fd[0],
+				  XDP_FLAGS_SKB_MODE))
+		internals->xdp_flags = XDP_FLAGS_SKB_MODE;
+	else
+		goto error_3;
+
 	eth_dev = rte_eth_vdev_allocate(dev, 0);
 	if (!eth_dev)
-		goto error_3;
+		goto error_4;
 
 	rte_memcpy(data, eth_dev->data, sizeof(*data));
 	internals->port_id = eth_dev->data->port_id;
@@ -765,6 +789,9 @@ init_internals(struct rte_vdev_device *dev,
 
 	return 0;
 
+error_4:
+	set_link_xdp_fd(internals->if_index, -1, internals->xdp_flags);
+
 error_3:
 	close(internals->sfd);
 
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index bc26e1457..d05e6c0e4 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -120,7 +120,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
 
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_PACKET)  += -lrte_pmd_af_packet
-_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP)     += -lrte_pmd_af_xdp
+_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AF_XDP)     += -lrte_pmd_af_xdp -lelf -lbpf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_ARK_PMD)        += -lrte_pmd_ark
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVF_PMD)        += -lrte_pmd_avf
 _LDLIBS-$(CONFIG_RTE_LIBRTE_AVP_PMD)        += -lrte_pmd_avp
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC 7/7] app/testpmd: enable parameter for mempool flags
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (5 preceding siblings ...)
  2018-02-27  9:33 ` [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
@ 2018-02-27  9:33 ` Qi Zhang
  2018-03-01  2:52 ` [RFC 0/7] PMD driver for AF_XDP Jason Wang
  7 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:33 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Now, it is possible for testpmd to create a af_xdp friendly mempool.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 app/test-pmd/parameters.c | 12 ++++++++++++
 app/test-pmd/testpmd.c    | 15 +++++++++------
 app/test-pmd/testpmd.h    |  1 +
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/app/test-pmd/parameters.c b/app/test-pmd/parameters.c
index 97d22b860..19675671e 100644
--- a/app/test-pmd/parameters.c
+++ b/app/test-pmd/parameters.c
@@ -61,6 +61,7 @@ usage(char* progname)
 	       "--tx-first | --stats-period=PERIOD | "
 	       "--coremask=COREMASK --portmask=PORTMASK --numa "
 	       "--mbuf-size= | --total-num-mbufs= | "
+	       "--mp-flags= | "
 	       "--nb-cores= | --nb-ports= | "
 #ifdef RTE_LIBRTE_CMDLINE
 	       "--eth-peers-configfile= | "
@@ -105,6 +106,7 @@ usage(char* progname)
 	printf("  --socket-num=N: set socket from which all memory is allocated "
 	       "in NUMA mode.\n");
 	printf("  --mbuf-size=N: set the data size of mbuf to N bytes.\n");
+	printf("  --mp-flags=N: set the flags when create mbuf memory pool.\n");
 	printf("  --total-num-mbufs=N: set the number of mbufs to be allocated "
 	       "in mbuf pools.\n");
 	printf("  --max-pkt-len=N: set the maximum size of packet to N bytes.\n");
@@ -568,6 +570,7 @@ launch_args_parse(int argc, char** argv)
 		{ "ring-numa-config",           1, 0, 0 },
 		{ "socket-num",			1, 0, 0 },
 		{ "mbuf-size",			1, 0, 0 },
+		{ "mp-flags",			1, 0, 0 },
 		{ "total-num-mbufs",		1, 0, 0 },
 		{ "max-pkt-len",		1, 0, 0 },
 		{ "pkt-filter-mode",            1, 0, 0 },
@@ -769,6 +772,15 @@ launch_args_parse(int argc, char** argv)
 					rte_exit(EXIT_FAILURE,
 						 "mbuf-size should be > 0 and < 65536\n");
 			}
+			if (!strcmp(lgopts[opt_idx].name, "mp-flags")) {
+				n = atoi(optarg);
+				if (n > 0 && n <= 0xFFFF)
+					mp_flags = (uint16_t)n;
+				else
+					rte_exit(EXIT_FAILURE,
+						 "mp-flags should be > 0 and < 65536\n");
+			}
+
 			if (!strcmp(lgopts[opt_idx].name, "total-num-mbufs")) {
 				n = atoi(optarg);
 				if (n > 1024)
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 4c0e2586c..887899919 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -171,6 +171,7 @@ uint32_t burst_tx_delay_time = BURST_TX_WAIT_US;
 uint32_t burst_tx_retry_num = BURST_TX_RETRIES;
 
 uint16_t mbuf_data_size = DEFAULT_MBUF_DATA_SIZE; /**< Mbuf data space size. */
+uint16_t mp_flags = 0; /**< flags parsed when create mempool */
 uint32_t param_total_num_mbufs = 0;  /**< number of mbufs in all pools - if
                                       * specified on command-line. */
 uint16_t stats_period; /**< Period to show statistics (disabled by default) */
@@ -486,6 +487,7 @@ set_def_fwd_config(void)
  */
 static void
 mbuf_pool_create(uint16_t mbuf_seg_size, unsigned nb_mbuf,
+		 unsigned int flags,
 		 unsigned int socket_id)
 {
 	char pool_name[RTE_MEMPOOL_NAMESIZE];
@@ -503,7 +505,7 @@ mbuf_pool_create(uint16_t mbuf_seg_size, unsigned nb_mbuf,
 		rte_mp = rte_mempool_create_empty(pool_name, nb_mbuf,
 			mb_size, (unsigned) mb_mempool_cache,
 			sizeof(struct rte_pktmbuf_pool_private),
-			socket_id, 0);
+			socket_id, flags);
 		if (rte_mp == NULL)
 			goto err;
 
@@ -518,8 +520,8 @@ mbuf_pool_create(uint16_t mbuf_seg_size, unsigned nb_mbuf,
 		/* wrapper to rte_mempool_create() */
 		TESTPMD_LOG(INFO, "preferred mempool ops selected: %s\n",
 				rte_mbuf_best_mempool_ops());
-		rte_mp = rte_pktmbuf_pool_create(pool_name, nb_mbuf,
-			mb_mempool_cache, 0, mbuf_seg_size, socket_id);
+		rte_mp = rte_pktmbuf_pool_create_with_flags(pool_name, nb_mbuf,
+			mb_mempool_cache, 0, mbuf_seg_size, flags, socket_id);
 	}
 
 err:
@@ -735,13 +737,14 @@ init_config(void)
 
 		for (i = 0; i < num_sockets; i++)
 			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool,
-					 socket_ids[i]);
+					 mp_flags, socket_ids[i]);
 	} else {
 		if (socket_num == UMA_NO_CONFIG)
-			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool, 0);
+			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool,
+					 mp_flags, 0);
 		else
 			mbuf_pool_create(mbuf_data_size, nb_mbuf_per_pool,
-						 socket_num);
+					 mp_flags, socket_num);
 	}
 
 	init_port_config();
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 153abea05..11c2ea681 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -386,6 +386,7 @@ extern uint8_t dcb_config;
 extern uint8_t dcb_test;
 
 extern uint16_t mbuf_data_size; /**< Mbuf data space size. */
+extern uint16_t mp_flags;  /**< flags for mempool creation. */
 extern uint32_t param_total_num_mbufs;
 
 extern uint16_t stats_period;
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
@ 2018-02-28 23:40   ` Stephen Hemminger
  2018-02-28 23:42   ` Stephen Hemminger
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:40 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> iff --git a/drivers/net/af_xdp/Makefile b/drivers/net/af_xdp/Makefile
> new file mode 100644
> index 000000000..ac38e20bf
> --- /dev/null
> +++ b/drivers/net/af_xdp/Makefile
> @@ -0,0 +1,56 @@
> +#   BSD LICENSE
> +#
> +#   Copyright(c) 2014 John W. Linville <linville@redhat.com>
> +#   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> +#   Copyright(c) 2014 6WIND S.A.
> +#   All rights reserved.
> +#
> +#   Redistribution and use in source and binary forms, with or without
> +#   modification, are permitted provided that the following conditions
> +#   are met:
> +#
> +#     * Redistributions of source code must retain the above copyright
> +#       notice, this list of conditions and the following disclaimer.
> +#     * Redistributions in binary form must reproduce the above copyright
> +#       notice, this list of conditions and the following disclaimer in
> +#       the documentation and/or other materials provided with the
> +#       distribution.
> +#     * Neither the name of Intel Corporation nor the names of its
> +#       contributors may be used to endorse or promote products derived
> +#       from this software without specific prior written permission.
> +#
> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> +#   OF THIS SOFTWARE, EVEN IF ADVI

Please use SPDX on new files.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
  2018-02-28 23:40   ` Stephen Hemminger
@ 2018-02-28 23:42   ` Stephen Hemminger
  2018-03-01  1:51     ` Zhang, Qi Z
  2018-02-28 23:42   ` Stephen Hemminger
  2018-02-28 23:45   ` Stephen Hemminger
  3 siblings, 1 reply; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:42 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> struct pmd_internals {
> +	int sfd;
> +	int if_index;
> +	char if_name[0x100];

why not IFNAMSIZ?

> +	struct ether_addr eth_addr;
> +	struct xdp_queue rx;
> +	struct xdp_queue tx;
> +	struct xdp_umem *umem;
> +	struct rte_mempool *mb_pool;
> +
> +	unsigned long rx_pkts;
> +	unsigned long rx_bytes;
> +	unsigned long rx_dropped;
> +
> +	unsigned long tx_pkts;
> +	unsigned long err_pkts;
> +	unsigned long tx_bytes;

why not per-queue stats? per-port stats are expensive

> +	uint16_t port_id;
> +	uint16_t queue_idx;
> +	int ring_size;
> +	struct rte_ring *buf_ring;
> +};

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
  2018-02-28 23:40   ` Stephen Hemminger
  2018-02-28 23:42   ` Stephen Hemminger
@ 2018-02-28 23:42   ` Stephen Hemminger
  2018-02-28 23:45   ` Stephen Hemminger
  3 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:42 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> +
> +static void *get_pkt_data(struct pmd_internals *internals,
> +			  uint32_t index,
> +			  uint32_t offset)
> +{
> +	return (uint8_t *)(internals->umem->buffer +
> +			   (index << internals->umem->frame_size_log2) +
> +			   offset);

You are returning void *, cast here is unnecessary

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
                     ` (2 preceding siblings ...)
  2018-02-28 23:42   ` Stephen Hemminger
@ 2018-02-28 23:45   ` Stephen Hemminger
  2018-03-01  1:59     ` Zhang, Qi Z
  3 siblings, 1 reply; 24+ messages in thread
From: Stephen Hemminger @ 2018-02-28 23:45 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:00 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> +
> +static uint16_t
> +eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> +{
> +	struct pmd_internals *internals = queue;
> +	struct xdp_queue *rxq = &internals->rx;
> +	struct rte_mbuf *mbuf;
> +	unsigned long dropped = 0;
> +	unsigned long rx_bytes = 0;
> +	uint16_t count = 0;
> +
> +	nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ?
> +		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
> +

Put declarations first.
Why not iterate if nb_pkts is huge?

> +	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
> +	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
> +	int rcvd, i;
> +	/* fill rx ring */
> +	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {

Blank line after declarations before code please.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-28 23:42   ` Stephen Hemminger
@ 2018-03-01  1:51     ` Zhang, Qi Z
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  1:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, Karlsson, Magnus, Topel, Bjorn



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, March 1, 2018 7:42 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: dev@dpdk.org; magnus.karlsson@intei.com; Topel, Bjorn
> <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
> 
> On Tue, 27 Feb 2018 17:33:00 +0800
> Qi Zhang <qi.z.zhang@intel.com> wrote:
> 
> > struct pmd_internals {
> > +	int sfd;
> > +	int if_index;
> > +	char if_name[0x100];
> 
> why not IFNAMSIZ?
> 
> > +	struct ether_addr eth_addr;
> > +	struct xdp_queue rx;
> > +	struct xdp_queue tx;
> > +	struct xdp_umem *umem;
> > +	struct rte_mempool *mb_pool;
> > +
> > +	unsigned long rx_pkts;
> > +	unsigned long rx_bytes;
> > +	unsigned long rx_dropped;
> > +
> > +	unsigned long tx_pkts;
> > +	unsigned long err_pkts;
> > +	unsigned long tx_bytes;
> 
> why not per-queue stats? per-port stats are expensive

multi-queue is not supported in this implementation, but will be considered.

Regards
Qi
> 
> > +	uint16_t port_id;
> > +	uint16_t queue_idx;
> > +	int ring_size;
> > +	struct rte_ring *buf_ring;
> > +};

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 1/7] net/af_xdp: new PMD driver
  2018-02-28 23:45   ` Stephen Hemminger
@ 2018-03-01  1:59     ` Zhang, Qi Z
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  1:59 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, magnus.karlsson, Topel, Bjorn



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, March 1, 2018 7:45 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: dev@dpdk.org; magnus.karlsson@intei.com; Topel, Bjorn
> <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 1/7] net/af_xdp: new PMD driver
> 
> On Tue, 27 Feb 2018 17:33:00 +0800
> Qi Zhang <qi.z.zhang@intel.com> wrote:
> 
> > +
> > +static uint16_t
> > +eth_af_xdp_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
> > +{
> > +	struct pmd_internals *internals = queue;
> > +	struct xdp_queue *rxq = &internals->rx;
> > +	struct rte_mbuf *mbuf;
> > +	unsigned long dropped = 0;
> > +	unsigned long rx_bytes = 0;
> > +	uint16_t count = 0;
> > +
> > +	nb_pkts = nb_pkts < ETH_AF_XDP_RX_BATCH_SIZE ?
> > +		  nb_pkts : ETH_AF_XDP_RX_BATCH_SIZE;
> > +
> 
> Put declarations first.
> Why not iterate if nb_pkts is huge?
Yes, it is not necessary to only read one batch, just for simple implementation, will be fixed.
> 
> > +	struct xdp_desc descs[ETH_AF_XDP_RX_BATCH_SIZE];
> > +	void *indexes[ETH_AF_XDP_RX_BATCH_SIZE];
> > +	int rcvd, i;
> > +	/* fill rx ring */
> > +	if (rxq->num_free >= ETH_AF_XDP_RX_BATCH_SIZE) {
> 
> Blank line after declarations before code please.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management
  2018-02-27  9:33 ` [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
@ 2018-03-01  2:08   ` Stephen Hemminger
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-03-01  2:08 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:03 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

> +static uint32_t
> +mbuf_to_idx(struct pmd_internals *internals, struct rte_mbuf *mbuf)
> +{
> +	return (uint32_t)(((uint64_t)mbuf->buf_addr -
> +			   (uint64_t)internals->umem->buffer) >>
> +			  internals->umem->frame_size_log2);
> +}
> +
> +static struct rte_mbuf *
> +idx_to_mbuf(struct pmd_internals *internals, uint32_t idx)
> +{
> +	return (struct rte_mbuf *)(void *)(internals->umem->buffer + (idx
> +			<< internals->umem->frame_size_log2) + 0x40);
> +}

More unnecessary casts's here.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 6/7] net/af_xdp: load BPF file
  2018-02-27  9:33 ` [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
@ 2018-03-01  2:10   ` Stephen Hemminger
  0 siblings, 0 replies; 24+ messages in thread
From: Stephen Hemminger @ 2018-03-01  2:10 UTC (permalink / raw)
  To: Qi Zhang; +Cc: dev, magnus.karlsson, bjorn.topel

On Tue, 27 Feb 2018 17:33:05 +0800
Qi Zhang <qi.z.zhang@intel.com> wrote:

>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/net/af_xdp/bpf_load.c b/drivers/net/af_xdp/bpf_load.c
> new file mode 100644
> index 000000000..aa632207f
> --- /dev/null
> +++ b/drivers/net/af_xdp/bpf_load.c
> @@ -0,0 +1,798 @@
> +// SPDX-License-Identifier: GPL-2.0

Sorry all DPDK drivers must be BSD licensed. You can't use GPL-2.0 code.
Either get the code dual licensed or write a new loader.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/7] PMD driver for AF_XDP
  2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
                   ` (6 preceding siblings ...)
  2018-02-27  9:33 ` [RFC 7/7] app/testpmd: enable parameter for mempool flags Qi Zhang
@ 2018-03-01  2:52 ` Jason Wang
  2018-03-01  4:18   ` Zhang, Qi Z
  7 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2018-03-01  2:52 UTC (permalink / raw)
  To: Qi Zhang, dev; +Cc: magnus.karlsson, bjorn.topel



On 2018年02月27日 17:32, Qi Zhang wrote:
> The RFC patches add a new PMD driver for AF_XDP which is a proposed
> faster version of AF_PACKET interface in Linux, see below link for
> detail AF_XDP introduction:
> https://fosdem.org/2018/schedule/event/af_xdp/
> https://lwn.net/Articles/745934/
>
> This patchset is base on v18.02.
> It also require a linux kernel that have below AF_XDP RFC patches be
> applied.
> https://patchwork.ozlabs.org/patch/867961/
> https://patchwork.ozlabs.org/patch/867960/
> https://patchwork.ozlabs.org/patch/867938/
> https://patchwork.ozlabs.org/patch/867939/
> https://patchwork.ozlabs.org/patch/867940/
> https://patchwork.ozlabs.org/patch/867941/
> https://patchwork.ozlabs.org/patch/867942/
> https://patchwork.ozlabs.org/patch/867943/
> https://patchwork.ozlabs.org/patch/867944/
> https://patchwork.ozlabs.org/patch/867945/
> https://patchwork.ozlabs.org/patch/867946/
> https://patchwork.ozlabs.org/patch/867947/
> https://patchwork.ozlabs.org/patch/867948/
> https://patchwork.ozlabs.org/patch/867949/
> https://patchwork.ozlabs.org/patch/867950/
> https://patchwork.ozlabs.org/patch/867951/
> https://patchwork.ozlabs.org/patch/867952/
> https://patchwork.ozlabs.org/patch/867953/
> https://patchwork.ozlabs.org/patch/867954/
> https://patchwork.ozlabs.org/patch/867955/
> https://patchwork.ozlabs.org/patch/867956/
> https://patchwork.ozlabs.org/patch/867957/
> https://patchwork.ozlabs.org/patch/867958/
> https://patchwork.ozlabs.org/patch/867959/
>
> There is no clean upstream target yet since kernel patch is still in
> RFC stage, The purpose of the patchset is just for anyone that want to
> eveluate af_xdp with DPDK application and get feedback for further
> improvement.
>
> To try with the new PMD
> 1. compile and install the kernel with above patches applied.
> 2. configure $LINUX_HEADER_DIR (dir of "make headers_install")
>     and $TOOLS_DIR (dir at <kernel_src>/tools) at driver/net/af_xdp/Makefile
>     before compile DPDK.
> 3. make sure libelf and libbpf is installed.
>
> BTW, performance test shows our PMD can reach 94%~98% of the orignal benchmark
> when share memory is enabled.

Hi:

Looks like zerocopy is not used in this series. Any plan to support 
that? If not, what's the advantage compared to vhost-net + tap + 
XDP_REDIRECT?

Have you measured l2fwd performance in this case? I believe the number 
you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.

Thanks

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  2:52 ` [RFC 0/7] PMD driver for AF_XDP Jason Wang
@ 2018-03-01  4:18   ` Zhang, Qi Z
  2018-03-01  4:20     ` Zhang, Qi Z
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  4:18 UTC (permalink / raw)
  To: Jason Wang, dev; +Cc: magnus.karlsson, Topel, Bjorn



> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 1, 2018 10:52 AM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; dev@dpdk.org
> Cc: magnus.karlsson@intei.com; Topel, Bjorn <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> 
> 
> On 2018年02月27日 17:32, Qi Zhang wrote:
> > The RFC patches add a new PMD driver for AF_XDP which is a proposed
> > faster version of AF_PACKET interface in Linux, see below link for
> > detail AF_XDP introduction:
> > https://fosdem.org/2018/schedule/event/af_xdp/
> > https://lwn.net/Articles/745934/
> >
> > This patchset is base on v18.02.
> > It also require a linux kernel that have below AF_XDP RFC patches be
> > applied.
> > https://patchwork.ozlabs.org/patch/867961/
> > https://patchwork.ozlabs.org/patch/867960/
> > https://patchwork.ozlabs.org/patch/867938/
> > https://patchwork.ozlabs.org/patch/867939/
> > https://patchwork.ozlabs.org/patch/867940/
> > https://patchwork.ozlabs.org/patch/867941/
> > https://patchwork.ozlabs.org/patch/867942/
> > https://patchwork.ozlabs.org/patch/867943/
> > https://patchwork.ozlabs.org/patch/867944/
> > https://patchwork.ozlabs.org/patch/867945/
> > https://patchwork.ozlabs.org/patch/867946/
> > https://patchwork.ozlabs.org/patch/867947/
> > https://patchwork.ozlabs.org/patch/867948/
> > https://patchwork.ozlabs.org/patch/867949/
> > https://patchwork.ozlabs.org/patch/867950/
> > https://patchwork.ozlabs.org/patch/867951/
> > https://patchwork.ozlabs.org/patch/867952/
> > https://patchwork.ozlabs.org/patch/867953/
> > https://patchwork.ozlabs.org/patch/867954/
> > https://patchwork.ozlabs.org/patch/867955/
> > https://patchwork.ozlabs.org/patch/867956/
> > https://patchwork.ozlabs.org/patch/867957/
> > https://patchwork.ozlabs.org/patch/867958/
> > https://patchwork.ozlabs.org/patch/867959/
> >
> > There is no clean upstream target yet since kernel patch is still in
> > RFC stage, The purpose of the patchset is just for anyone that want to
> > eveluate af_xdp with DPDK application and get feedback for further
> > improvement.
> >
> > To try with the new PMD
> > 1. compile and install the kernel with above patches applied.
> > 2. configure $LINUX_HEADER_DIR (dir of "make headers_install")
> >     and $TOOLS_DIR (dir at <kernel_src>/tools) at
> driver/net/af_xdp/Makefile
> >     before compile DPDK.
> > 3. make sure libelf and libbpf is installed.
> >
> > BTW, performance test shows our PMD can reach 94%~98% of the orignal
> > benchmark when share memory is enabled.
> 
> Hi:
> 
> Looks like zero copy is not used in this series. Any plan to support that? 

Zero copy is enabled in patch 5, if a mempool passed check_mempool, it will be registered to af_xdp socket.
so there will be no memcpy between mbuf and af_xdp.

> If not, what's the advantage compared to vhost-net + tap + XDP_REDIRECT?
> 
> Have you measured l2fwd performance in this case? I believe the number
> you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.

Actually we measure the performance on rxonly / txonly / l2fwd on i40e with XDP_SKB and XDP_DRV_ZC 

Regards
Qi

> 
> Thanks


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  4:18   ` Zhang, Qi Z
@ 2018-03-01  4:20     ` Zhang, Qi Z
  2018-03-01  7:46       ` Jason Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01  4:20 UTC (permalink / raw)
  To: Zhang, Qi Z, Jason Wang, dev; +Cc: Karlsson, Magnus, Topel, Bjorn

+Magnus, since a typo in my first batch in email address.

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhang, Qi Z
> Sent: Thursday, March 1, 2018 12:19 PM
> To: Jason Wang <jasowang@redhat.com>; dev@dpdk.org
> Cc: magnus.karlsson@intei.com; Topel, Bjorn <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> 
> 
> > -----Original Message-----
> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Thursday, March 1, 2018 10:52 AM
> > To: Zhang, Qi Z <qi.z.zhang@intel.com>; dev@dpdk.org
> > Cc: magnus.karlsson@intei.com; Topel, Bjorn <bjorn.topel@intel.com>
> > Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> >
> >
> >
> > On 2018年02月27日 17:32, Qi Zhang wrote:
> > > The RFC patches add a new PMD driver for AF_XDP which is a proposed
> > > faster version of AF_PACKET interface in Linux, see below link for
> > > detail AF_XDP introduction:
> > > https://fosdem.org/2018/schedule/event/af_xdp/
> > > https://lwn.net/Articles/745934/
> > >
> > > This patchset is base on v18.02.
> > > It also require a linux kernel that have below AF_XDP RFC patches be
> > > applied.
> > > https://patchwork.ozlabs.org/patch/867961/
> > > https://patchwork.ozlabs.org/patch/867960/
> > > https://patchwork.ozlabs.org/patch/867938/
> > > https://patchwork.ozlabs.org/patch/867939/
> > > https://patchwork.ozlabs.org/patch/867940/
> > > https://patchwork.ozlabs.org/patch/867941/
> > > https://patchwork.ozlabs.org/patch/867942/
> > > https://patchwork.ozlabs.org/patch/867943/
> > > https://patchwork.ozlabs.org/patch/867944/
> > > https://patchwork.ozlabs.org/patch/867945/
> > > https://patchwork.ozlabs.org/patch/867946/
> > > https://patchwork.ozlabs.org/patch/867947/
> > > https://patchwork.ozlabs.org/patch/867948/
> > > https://patchwork.ozlabs.org/patch/867949/
> > > https://patchwork.ozlabs.org/patch/867950/
> > > https://patchwork.ozlabs.org/patch/867951/
> > > https://patchwork.ozlabs.org/patch/867952/
> > > https://patchwork.ozlabs.org/patch/867953/
> > > https://patchwork.ozlabs.org/patch/867954/
> > > https://patchwork.ozlabs.org/patch/867955/
> > > https://patchwork.ozlabs.org/patch/867956/
> > > https://patchwork.ozlabs.org/patch/867957/
> > > https://patchwork.ozlabs.org/patch/867958/
> > > https://patchwork.ozlabs.org/patch/867959/
> > >
> > > There is no clean upstream target yet since kernel patch is still in
> > > RFC stage, The purpose of the patchset is just for anyone that want
> > > to eveluate af_xdp with DPDK application and get feedback for
> > > further improvement.
> > >
> > > To try with the new PMD
> > > 1. compile and install the kernel with above patches applied.
> > > 2. configure $LINUX_HEADER_DIR (dir of "make headers_install")
> > >     and $TOOLS_DIR (dir at <kernel_src>/tools) at
> > driver/net/af_xdp/Makefile
> > >     before compile DPDK.
> > > 3. make sure libelf and libbpf is installed.
> > >
> > > BTW, performance test shows our PMD can reach 94%~98% of the orignal
> > > benchmark when share memory is enabled.
> >
> > Hi:
> >
> > Looks like zero copy is not used in this series. Any plan to support that?
> 
> Zero copy is enabled in patch 5, if a mempool passed check_mempool, it will
> be registered to af_xdp socket.
> so there will be no memcpy between mbuf and af_xdp.
> 
> > If not, what's the advantage compared to vhost-net + tap + XDP_REDIRECT?
> >
> > Have you measured l2fwd performance in this case? I believe the number
> > you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.
> 
> Actually we measure the performance on rxonly / txonly / l2fwd on i40e with
> XDP_SKB and XDP_DRV_ZC
> 
> Regards
> Qi
> 
> >
> > Thanks


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  4:20     ` Zhang, Qi Z
@ 2018-03-01  7:46       ` Jason Wang
  2018-03-01 12:56         ` Zhang, Qi Z
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2018-03-01  7:46 UTC (permalink / raw)
  To: Zhang, Qi Z, dev; +Cc: Karlsson, Magnus, Topel, Bjorn



On 2018年03月01日 12:20, Zhang, Qi Z wrote:
> +Magnus, since a typo in my first batch in email address.
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhang, Qi Z
>> Sent: Thursday, March 1, 2018 12:19 PM
>> To: Jason Wang<jasowang@redhat.com>;dev@dpdk.org
>> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
>> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
>>
>>
>>
>>> -----Original Message-----
>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>> Sent: Thursday, March 1, 2018 10:52 AM
>>> To: Zhang, Qi Z<qi.z.zhang@intel.com>;dev@dpdk.org
>>> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
>>> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
>>>
>>>
>>>
>>> On 2018年02月27日 17:32, Qi Zhang wrote:
>>>> The RFC patches add a new PMD driver for AF_XDP which is a proposed
>>>> faster version of AF_PACKET interface in Linux, see below link for
>>>> detail AF_XDP introduction:
>>>> https://fosdem.org/2018/schedule/event/af_xdp/
>>>> https://lwn.net/Articles/745934/
>>>>
>>>> This patchset is base on v18.02.
>>>> It also require a linux kernel that have below AF_XDP RFC patches be
>>>> applied.
>>>> https://patchwork.ozlabs.org/patch/867961/
>>>> https://patchwork.ozlabs.org/patch/867960/
>>>> https://patchwork.ozlabs.org/patch/867938/
>>>> https://patchwork.ozlabs.org/patch/867939/
>>>> https://patchwork.ozlabs.org/patch/867940/
>>>> https://patchwork.ozlabs.org/patch/867941/
>>>> https://patchwork.ozlabs.org/patch/867942/
>>>> https://patchwork.ozlabs.org/patch/867943/
>>>> https://patchwork.ozlabs.org/patch/867944/
>>>> https://patchwork.ozlabs.org/patch/867945/
>>>> https://patchwork.ozlabs.org/patch/867946/
>>>> https://patchwork.ozlabs.org/patch/867947/
>>>> https://patchwork.ozlabs.org/patch/867948/
>>>> https://patchwork.ozlabs.org/patch/867949/
>>>> https://patchwork.ozlabs.org/patch/867950/
>>>> https://patchwork.ozlabs.org/patch/867951/
>>>> https://patchwork.ozlabs.org/patch/867952/
>>>> https://patchwork.ozlabs.org/patch/867953/
>>>> https://patchwork.ozlabs.org/patch/867954/
>>>> https://patchwork.ozlabs.org/patch/867955/
>>>> https://patchwork.ozlabs.org/patch/867956/
>>>> https://patchwork.ozlabs.org/patch/867957/
>>>> https://patchwork.ozlabs.org/patch/867958/
>>>> https://patchwork.ozlabs.org/patch/867959/
>>>>
>>>> There is no clean upstream target yet since kernel patch is still in
>>>> RFC stage, The purpose of the patchset is just for anyone that want
>>>> to eveluate af_xdp with DPDK application and get feedback for
>>>> further improvement.
>>>>
>>>> To try with the new PMD
>>>> 1. compile and install the kernel with above patches applied.
>>>> 2. configure $LINUX_HEADER_DIR (dir of "make headers_install")
>>>>      and $TOOLS_DIR (dir at <kernel_src>/tools) at
>>> driver/net/af_xdp/Makefile
>>>>      before compile DPDK.
>>>> 3. make sure libelf and libbpf is installed.
>>>>
>>>> BTW, performance test shows our PMD can reach 94%~98% of the orignal
>>>> benchmark when share memory is enabled.
>>> Hi:
>>>
>>> Looks like zero copy is not used in this series. Any plan to support that?
>> Zero copy is enabled in patch 5, if a mempool passed check_mempool, it will
>> be registered to af_xdp socket.
>> so there will be no memcpy between mbuf and af_xdp.

Aha, I see. So the zerocopy was limited to some specific use case. And 
if I understand it correctly, zc mode could not be used for VM.

Thanks

>>> If not, what's the advantage compared to vhost-net + tap + XDP_REDIRECT?
>>>
>>> Have you measured l2fwd performance in this case? I believe the number
>>> you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.
>> Actually we measure the performance on rxonly / txonly / l2fwd on i40e with
>> XDP_SKB and XDP_DRV_ZC
>>
>> Regards
>> Qi
>>
>>> Thanks

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/7] PMD driver for AF_XDP
  2018-03-01  7:46       ` Jason Wang
@ 2018-03-01 12:56         ` Zhang, Qi Z
  2018-03-01 13:18           ` Jason Wang
  0 siblings, 1 reply; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-01 12:56 UTC (permalink / raw)
  To: 'Jason Wang'; +Cc: Karlsson, Magnus, Topel, Bjorn, dev



> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 1, 2018 3:46 PM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>; dev@dpdk.org
> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Topel, Bjorn
> <bjorn.topel@intel.com>
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> 
> 
> On 2018年03月01日 12:20, Zhang, Qi Z wrote:
> > +Magnus, since a typo in my first batch in email address.
> >
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhang, Qi Z
> >> Sent: Thursday, March 1, 2018 12:19 PM
> >> To: Jason Wang<jasowang@redhat.com>;dev@dpdk.org
> >> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
> >> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>> Sent: Thursday, March 1, 2018 10:52 AM
> >>> To: Zhang, Qi Z<qi.z.zhang@intel.com>;dev@dpdk.org
> >>> Cc:magnus.karlsson@intei.com; Topel, Bjorn<bjorn.topel@intel.com>
> >>> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> >>>
> >>>
> >>>
> >>> On 2018年02月27日 17:32, Qi Zhang wrote:
> >>>> The RFC patches add a new PMD driver for AF_XDP which is a proposed
> >>>> faster version of AF_PACKET interface in Linux, see below link for
> >>>> detail AF_XDP introduction:
> >>>> https://fosdem.org/2018/schedule/event/af_xdp/
> >>>> https://lwn.net/Articles/745934/
> >>>>
> >>>> This patchset is base on v18.02.
> >>>> It also require a linux kernel that have below AF_XDP RFC patches
> >>>> be applied.
> >>>> https://patchwork.ozlabs.org/patch/867961/
> >>>> https://patchwork.ozlabs.org/patch/867960/
> >>>> https://patchwork.ozlabs.org/patch/867938/
> >>>> https://patchwork.ozlabs.org/patch/867939/
> >>>> https://patchwork.ozlabs.org/patch/867940/
> >>>> https://patchwork.ozlabs.org/patch/867941/
> >>>> https://patchwork.ozlabs.org/patch/867942/
> >>>> https://patchwork.ozlabs.org/patch/867943/
> >>>> https://patchwork.ozlabs.org/patch/867944/
> >>>> https://patchwork.ozlabs.org/patch/867945/
> >>>> https://patchwork.ozlabs.org/patch/867946/
> >>>> https://patchwork.ozlabs.org/patch/867947/
> >>>> https://patchwork.ozlabs.org/patch/867948/
> >>>> https://patchwork.ozlabs.org/patch/867949/
> >>>> https://patchwork.ozlabs.org/patch/867950/
> >>>> https://patchwork.ozlabs.org/patch/867951/
> >>>> https://patchwork.ozlabs.org/patch/867952/
> >>>> https://patchwork.ozlabs.org/patch/867953/
> >>>> https://patchwork.ozlabs.org/patch/867954/
> >>>> https://patchwork.ozlabs.org/patch/867955/
> >>>> https://patchwork.ozlabs.org/patch/867956/
> >>>> https://patchwork.ozlabs.org/patch/867957/
> >>>> https://patchwork.ozlabs.org/patch/867958/
> >>>> https://patchwork.ozlabs.org/patch/867959/
> >>>>
> >>>> There is no clean upstream target yet since kernel patch is still
> >>>> in RFC stage, The purpose of the patchset is just for anyone that
> >>>> want to eveluate af_xdp with DPDK application and get feedback for
> >>>> further improvement.
> >>>>
> >>>> To try with the new PMD
> >>>> 1. compile and install the kernel with above patches applied.
> >>>> 2. configure $LINUX_HEADER_DIR (dir of "make headers_install")
> >>>>      and $TOOLS_DIR (dir at <kernel_src>/tools) at
> >>> driver/net/af_xdp/Makefile
> >>>>      before compile DPDK.
> >>>> 3. make sure libelf and libbpf is installed.
> >>>>
> >>>> BTW, performance test shows our PMD can reach 94%~98% of the
> >>>> orignal benchmark when share memory is enabled.
> >>> Hi:
> >>>
> >>> Looks like zero copy is not used in this series. Any plan to support that?
> >> Zero copy is enabled in patch 5, if a mempool passed check_mempool,
> >> it will be registered to af_xdp socket.
> >> so there will be no memcpy between mbuf and af_xdp.
> 
> Aha, I see. So the zerocopy was limited to some specific use case. And if I
> understand it correctly, zc mode could not be used for VM.

I think except the limitation for mempool layout, zerocopy is transparent to DPDK application, only difference is performance.
Sorry, I may not get your point, if you could explain more about the VM usage.

Regards
Qi
> 
> Thanks
> 
> >>> If not, what's the advantage compared to vhost-net + tap +
> XDP_REDIRECT?
> >>>
> >>> Have you measured l2fwd performance in this case? I believe the
> >>> number you refer here is rxdrop (XDP_DRV) which is 11.6Mpps.
> >> Actually we measure the performance on rxonly / txonly / l2fwd on
> >> i40e with XDP_SKB and XDP_DRV_ZC
> >>
> >> Regards
> >> Qi
> >>
> >>> Thanks


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/7] PMD driver for AF_XDP
  2018-03-01 12:56         ` Zhang, Qi Z
@ 2018-03-01 13:18           ` Jason Wang
  2018-03-02  4:05             ` Zhang, Qi Z
  0 siblings, 1 reply; 24+ messages in thread
From: Jason Wang @ 2018-03-01 13:18 UTC (permalink / raw)
  To: Zhang, Qi Z; +Cc: Karlsson, Magnus, Topel, Bjorn, dev



On 2018年03月01日 20:56, Zhang, Qi Z wrote:
>>>>>> BTW, performance test shows our PMD can reach 94%~98% of the
>>>>>> orignal benchmark when share memory is enabled.
>>>>> Hi:
>>>>>
>>>>> Looks like zero copy is not used in this series. Any plan to support that?
>>>> Zero copy is enabled in patch 5, if a mempool passed check_mempool,
>>>> it will be registered to af_xdp socket.
>>>> so there will be no memcpy between mbuf and af_xdp.
>> Aha, I see. So the zerocopy was limited to some specific use case. And if I
>> understand it correctly, zc mode could not be used for VM.
> I think except the limitation for mempool layout, zerocopy is transparent to DPDK application, only difference is performance.
> Sorry, I may not get your point, if you could explain more about the VM usage.
>
> Regards
> Qi

No problem, so the question is:

Can zerocopy be used when using testpmd to foward packets between 
vhost-user and AF_XDP socket?

Thanks

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC 0/7] PMD driver for AF_XDP
  2018-03-01 13:18           ` Jason Wang
@ 2018-03-02  4:05             ` Zhang, Qi Z
  0 siblings, 0 replies; 24+ messages in thread
From: Zhang, Qi Z @ 2018-03-02  4:05 UTC (permalink / raw)
  To: Jason Wang; +Cc: Karlsson, Magnus, Topel, Bjorn, dev



> -----Original Message-----
> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, March 1, 2018 9:18 PM
> To: Zhang, Qi Z <qi.z.zhang@intel.com>
> Cc: Karlsson, Magnus <magnus.karlsson@intel.com>; Topel, Bjorn
> <bjorn.topel@intel.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [RFC 0/7] PMD driver for AF_XDP
> 
> 
> 
> On 2018年03月01日 20:56, Zhang, Qi Z wrote:
> >>>>>> BTW, performance test shows our PMD can reach 94%~98% of the
> >>>>>> orignal benchmark when share memory is enabled.
> >>>>> Hi:
> >>>>>
> >>>>> Looks like zero copy is not used in this series. Any plan to support
> that?
> >>>> Zero copy is enabled in patch 5, if a mempool passed check_mempool,
> >>>> it will be registered to af_xdp socket.
> >>>> so there will be no memcpy between mbuf and af_xdp.
> >> Aha, I see. So the zerocopy was limited to some specific use case.
> >> And if I understand it correctly, zc mode could not be used for VM.
> > I think except the limitation for mempool layout, zerocopy is transparent
> to DPDK application, only difference is performance.
> > Sorry, I may not get your point, if you could explain more about the VM
> usage.
> >
> > Regards
> > Qi
> 
> No problem, so the question is:
> 
> Can zerocopy be used when using testpmd to foward packets between
> vhost-user and AF_XDP socket?

I'm not very familiar with vhost-user, but I guess the answer should be same as the case for forward packet between vhost-user and i40e, 
(if vhost-user does not have any special requirement for mempool that conflict with af_xdp ZC's requirement)

Regards
Qi

> 
> Thanks

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC 3/7] lib/mempool: allow page size aligned mempool
  2018-02-27  9:35 Qi Zhang
@ 2018-02-27  9:35 ` Qi Zhang
  0 siblings, 0 replies; 24+ messages in thread
From: Qi Zhang @ 2018-02-27  9:35 UTC (permalink / raw)
  To: dev; +Cc: magnus.karlsson, bjorn.topel, Qi Zhang

Allow create a mempool with page size aligned base address.

Signed-off-by: Qi Zhang <qi.z.zhang@intel.com>
---
 lib/librte_mempool/rte_mempool.c | 2 ++
 lib/librte_mempool/rte_mempool.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/lib/librte_mempool/rte_mempool.c b/lib/librte_mempool/rte_mempool.c
index 54f7f4ba4..f8d4814ad 100644
--- a/lib/librte_mempool/rte_mempool.c
+++ b/lib/librte_mempool/rte_mempool.c
@@ -567,6 +567,8 @@ rte_mempool_populate_default(struct rte_mempool *mp)
 		pg_shift = 0; /* not needed, zone is physically contiguous */
 		pg_sz = 0;
 		align = RTE_CACHE_LINE_SIZE;
+		if (mp->flags & MEMPOOL_F_PAGE_ALIGN)
+			align = getpagesize();
 	} else {
 		pg_sz = getpagesize();
 		pg_shift = rte_bsf32(pg_sz);
diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
index 8b1b7f7ed..774ab0f66 100644
--- a/lib/librte_mempool/rte_mempool.h
+++ b/lib/librte_mempool/rte_mempool.h
@@ -245,6 +245,7 @@ struct rte_mempool {
 #define MEMPOOL_F_SC_GET         0x0008 /**< Default get is "single-consumer".*/
 #define MEMPOOL_F_POOL_CREATED   0x0010 /**< Internal: pool is created. */
 #define MEMPOOL_F_NO_PHYS_CONTIG 0x0020 /**< Don't need physically contiguous objs. */
+#define MEMPOOL_F_PAGE_ALIGN     0x0040 /**< Base address is page aligned. */
 /**
  * This capability flag is advertised by a mempool handler, if the whole
  * memory area containing the objects must be physically contiguous.
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-03-02  4:05 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-27  9:32 [RFC 0/7] PMD driver for AF_XDP Qi Zhang
2018-02-27  9:33 ` [RFC 1/7] net/af_xdp: new PMD driver Qi Zhang
2018-02-28 23:40   ` Stephen Hemminger
2018-02-28 23:42   ` Stephen Hemminger
2018-03-01  1:51     ` Zhang, Qi Z
2018-02-28 23:42   ` Stephen Hemminger
2018-02-28 23:45   ` Stephen Hemminger
2018-03-01  1:59     ` Zhang, Qi Z
2018-02-27  9:33 ` [RFC 2/7] lib/mbuf: enable parse flags when create mempool Qi Zhang
2018-02-27  9:33 ` [RFC 3/7] lib/mempool: allow page size aligned mempool Qi Zhang
2018-02-27  9:33 ` [RFC 4/7] net/af_xdp: use mbuf mempool for buffer management Qi Zhang
2018-03-01  2:08   ` Stephen Hemminger
2018-02-27  9:33 ` [RFC 5/7] net/af_xdp: enable share mempool Qi Zhang
2018-02-27  9:33 ` [RFC 6/7] net/af_xdp: load BPF file Qi Zhang
2018-03-01  2:10   ` Stephen Hemminger
2018-02-27  9:33 ` [RFC 7/7] app/testpmd: enable parameter for mempool flags Qi Zhang
2018-03-01  2:52 ` [RFC 0/7] PMD driver for AF_XDP Jason Wang
2018-03-01  4:18   ` Zhang, Qi Z
2018-03-01  4:20     ` Zhang, Qi Z
2018-03-01  7:46       ` Jason Wang
2018-03-01 12:56         ` Zhang, Qi Z
2018-03-01 13:18           ` Jason Wang
2018-03-02  4:05             ` Zhang, Qi Z
2018-02-27  9:35 Qi Zhang
2018-02-27  9:35 ` [RFC 3/7] lib/mempool: allow page size aligned mempool Qi Zhang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.