BPF Archive on lore.kernel.org
* [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
@ 2019-05-02  8:39 Magnus Karlsson
  2019-05-02  8:39 ` [RFC bpf-next 1/7] net: fs: make busy poll budget configurable in napi_busy_loop Magnus Karlsson
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-02  8:39 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz

This RFC proposes adding busy-poll support to AF_XDP sockets. With
busy-poll, the driver is executed in process context by calling the
poll() syscall. The main advantage of this is that all processing
occurs on a single core. This eliminates the core-to-core cache
transfers that otherwise occur between the application and the softirq
processing running on another core. From a systems point of view, it
also has the advantage that we do not have to provision extra cores in
the system to handle ksoftirqd/softirq processing, as all processing
is done on the single core that executes the application. The drawback
of busy-poll is that the max throughput seen from a single application
will be lower (due to the syscall), but on a per-core basis it will
often be higher, as the normal mode runs on two cores while busy-poll
runs on a single one.

The semantics of busy-poll from the application point of view are the
following:

* The application is required to call poll() to drive rx and tx
  processing. There is no guarantee that softirq and interrupts will
  do this for you. This is in contrast with the current
  implementations of busy-poll that are opportunistic in the sense
  that packets might be received/transmitted by busy-poll or
  softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
  loads just as in the current opportunistic implementations, but I would
  like to get to a point where this is not the case for busy-poll
  enabled XDP sockets, as this slows down performance considerably and
  starts to use one more core for the softirq processing. The end goal
  is for only poll() to drive the napi loop when busy-poll is enabled
  on an AF_XDP socket. More about this later.)

* It should be enabled on a per-socket basis with no global
  enablement, i.e. XDP socket busy-poll will not honor the current
  /proc/sys/net/core/busy_poll and busy_read global enablement
  mechanisms.

* The batch size (the number of packets processed each time the napi
  function in the driver is called, i.e. the weight parameter) should
  be configurable. Currently, the busy-poll budget of AF_INET sockets
  is hard-coded to 8, but for AF_XDP sockets this is too small, as the
  amount of processing per packet is much smaller with AF_XDP. This
  should be configurable on a per-socket basis.

* If you put multiple AF_XDP busy-poll enabled sockets into a single
  poll() call, the napi contexts of all of them should be executed.
  This is in contrast to AF_INET busy-poll, which quits after the
  first socket that finds any packets. We need all napi contexts to be
  executed due to the first requirement in this list. The behaviour we
  want is much more like regular sockets, in that all of them are
  checked in the poll call.

* It should be possible to mix AF_XDP busy-poll sockets with any
  other socket type, including busy-poll AF_INET sockets, in a single
  poll() call without any change to the semantics or behaviour of
  those socket types.

* As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
  mode return POLLERR if the fill ring is empty or the completion
  queue is full.

Busy-poll support is enabled by calling a new setsockopt,
XDP_BUSY_POLL_BATCH_SIZE, that takes the batch size as an argument. A
value between 1 and NAPI_WEIGHT (64) turns it on, 0 turns it off, and
any other value returns an error.

A typical packet processing rxdrop loop with busy-poll will look something
like this:

for (i = 0; i < num_socks; i++) {
    fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
    fds[i].events = POLLIN;
}

for (;;) {
    /* A timeout of 0 makes poll() non-blocking; each call drives the
     * rx and tx processing once, per the first requirement above.
     */
    ret = poll(fds, num_socks, 0);
    if (ret <= 0)
            continue;

    for (i = 0; i < num_socks; i++)
        rx_drop(xsks[i], fds); /* The actual application */
}

I need some advice on the following issue:

In this patch set, softirq/ksoftirqd will kick in at high loads and
render the busy-poll support useless, as execution then happens in the
same way as without busy-poll. Everything works from an application
perspective, but this defeats the purpose of the support
and also consumes an extra core. What I would like to accomplish when
XDP socket busy-poll is enabled is that softirq/ksoftirq is never
invoked for the traffic that goes to this socket. This way, we would
get better performance on a per core basis and also get the same
behaviour independent of load.

To accomplish this, the driver would need to create a napi context
containing only the busy-poll enabled XDP socket queues (not shared
with any other traffic) and with no interrupt associated with
it.

Does this sound like an avenue worth pursuing and if so, should it be
part of this RFC/PATCH or separate?

Epoll() is not supported at this point in time. Since it was designed
for handling a very large number of sockets, I thought it better to
leave this for later if the need arises.

To do:

* Move over all drivers to the new xdp_[rt]xq_info functions. If you
  think this is the right way to go forward, I will move over
  Mellanox, Netronome, etc for the proper patch.

* Performance measurements

* Test SW drivers like virtio, veth and tap. Actually, test anything
  that is not i40e.

* Test multiple sockets of different types in the same poll() call

* Check bisectability of each patch

* Versioning of the libbpf interface since we add an entry to the
  socket configuration struct

This RFC has been applied against commit 2b5bc3c8ebce ("r8169: remove manual autoneg restart workaround")

Structure of the patch set:
Patch 1: Makes the busy poll batch size configurable inside the kernel
Patch 2: Adds napi id to struct xdp_rxq_info, adds this to the
         xdp_rxq_reg_info function and changes the interface and
	 implementation so that we only need a single copy of the
	 xdp_rxq_info struct that resides in _rx. Previously there was
	 another one in the driver, but now the driver uses the one in
	 _rx. All XDP enabled drivers are converted to these new
	 functions.
Patch 3: Adds a struct xdp_txq_info, with functions corresponding to
         those of xdp_rxq_info, that is used to store information
	 about the tx queue. The use of these is added to all AF_XDP
	 enabled drivers.
Patch 4: Introduces a new setsockopt/getsockopt in the uapi for
         enabling busy_poll.
Patch 5: Implements busy poll in the xsk code.
Patch 6: Adds busy-poll support to libbpf.
Patch 7: Adds busy-poll support to the xdpsock sample application.

Thanks: Magnus

Magnus Karlsson (7):
  net: fs: make busy poll budget configurable in napi_busy_loop
  net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and
    add napi id
  net: i40e: ixgbe: add xdp_txq_info structure
  netdevice: introduce busy-poll setsockopt for AF_XDP
  net: add busy-poll support for XDP sockets
  libbpf: add busy-poll support to XDP sockets
  samples/bpf: add busy-poll support to xdpsock sample

 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 -
 drivers/net/ethernet/intel/i40e/i40e_main.c    |   8 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  37 ++++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c     |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  48 ++++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c   |   2 +-
 drivers/net/tun.c                              |  14 +-
 drivers/net/veth.c                             |  10 +-
 drivers/net/virtio_net.c                       |   8 +-
 fs/eventpoll.c                                 |   5 +-
 include/linux/netdevice.h                      |   1 +
 include/net/busy_poll.h                        |   7 +-
 include/net/xdp.h                              |  23 ++-
 include/net/xdp_sock.h                         |   3 +
 include/uapi/linux/if_xdp.h                    |   1 +
 net/core/dev.c                                 |  43 ++----
 net/core/xdp.c                                 | 103 ++++++++++---
 net/xdp/Kconfig                                |   1 +
 net/xdp/xsk.c                                  | 122 ++++++++++++++-
 net/xdp/xsk_queue.h                            |  18 ++-
 samples/bpf/xdpsock_user.c                     | 203 +++++++++++++++----------
 tools/include/uapi/linux/if_xdp.h              |   1 +
 tools/lib/bpf/xsk.c                            |  23 +--
 tools/lib/bpf/xsk.h                            |   1 +
 26 files changed, 495 insertions(+), 195 deletions(-)

--
2.7.4


* [RFC bpf-next 1/7] net: fs: make busy poll budget configurable in napi_busy_loop
  2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
@ 2019-05-02  8:39 ` Magnus Karlsson
  2019-05-02  8:39 ` [RFC bpf-next 2/7] net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and add napi id Magnus Karlsson
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-02  8:39 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz

This patch adds the possibility to set the busy poll budget in
napi_busy_loop to something other than 8. All the current users of
napi_busy_loop will still have a budget of 8, but for the XDP socket
busy-poll support we need a configurable budget that is usually
larger, since each packet requires less processing than with an
AF_INET socket.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 fs/eventpoll.c          |  5 ++++-
 include/net/busy_poll.h |  7 +++++--
 net/core/dev.c          | 21 ++++++++++-----------
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 4a0e98d..0fbbc35 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -394,6 +394,8 @@ static bool ep_busy_loop_end(void *p, unsigned long start_time)
 	return ep_events_available(ep) || busy_loop_timeout(start_time);
 }
 
+#define BUSY_POLL_BUDGET 8
+
 /*
  * Busy poll if globally on and supporting sockets found && no events,
  * busy loop will return if need_resched or ep_events_available.
@@ -405,7 +407,8 @@ static void ep_busy_loop(struct eventpoll *ep, int nonblock)
 	unsigned int napi_id = READ_ONCE(ep->napi_id);
 
 	if ((napi_id >= MIN_NAPI_ID) && net_busy_loop_on())
-		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep);
+		napi_busy_loop(napi_id, nonblock ? NULL : ep_busy_loop_end, ep,
+			       BUSY_POLL_BUDGET);
 }
 
 static inline void ep_reset_busy_poll_napi_id(struct eventpoll *ep)
diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index ba61cdd..94817e8 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -55,7 +55,7 @@ bool sk_busy_loop_end(void *p, unsigned long start_time);
 
 void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
-		    void *loop_end_arg);
+		    void *loop_end_arg, int budget);
 
 #else /* CONFIG_NET_RX_BUSY_POLL */
 static inline unsigned long net_busy_loop_on(void)
@@ -111,13 +111,16 @@ static inline bool sk_busy_loop_timeout(struct sock *sk,
 	return true;
 }
 
+#define BUSY_POLL_BUDGET 8
+
 static inline void sk_busy_loop(struct sock *sk, int nonblock)
 {
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	unsigned int napi_id = READ_ONCE(sk->sk_napi_id);
 
 	if (napi_id >= MIN_NAPI_ID)
-		napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk);
+		napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk,
+			       BUSY_POLL_BUDGET);
 #endif
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 22f2640..e82fc44 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6108,9 +6108,8 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)
 
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 
-#define BUSY_POLL_BUDGET 8
-
-static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
+static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
+			   int budget)
 {
 	int rc;
 
@@ -6131,17 +6130,17 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock)
 	/* All we really want here is to re-enable device interrupts.
 	 * Ideally, a new ndo_busy_poll_stop() could avoid another round.
 	 */
-	rc = napi->poll(napi, BUSY_POLL_BUDGET);
-	trace_napi_poll(napi, rc, BUSY_POLL_BUDGET);
+	rc = napi->poll(napi, budget);
+	trace_napi_poll(napi, rc, budget);
 	netpoll_poll_unlock(have_poll_lock);
-	if (rc == BUSY_POLL_BUDGET)
+	if (rc == budget)
 		__napi_schedule(napi);
 	local_bh_enable();
 }
 
 void napi_busy_loop(unsigned int napi_id,
 		    bool (*loop_end)(void *, unsigned long),
-		    void *loop_end_arg)
+		    void *loop_end_arg, int budget)
 {
 	unsigned long start_time = loop_end ? busy_loop_current_time() : 0;
 	int (*napi_poll)(struct napi_struct *napi, int budget);
@@ -6178,8 +6177,8 @@ void napi_busy_loop(unsigned int napi_id,
 			have_poll_lock = netpoll_poll_lock(napi);
 			napi_poll = napi->poll;
 		}
-		work = napi_poll(napi, BUSY_POLL_BUDGET);
-		trace_napi_poll(napi, work, BUSY_POLL_BUDGET);
+		work = napi_poll(napi, budget);
+		trace_napi_poll(napi, work, budget);
 count:
 		if (work > 0)
 			__NET_ADD_STATS(dev_net(napi->dev),
@@ -6191,7 +6190,7 @@ void napi_busy_loop(unsigned int napi_id,
 
 		if (unlikely(need_resched())) {
 			if (napi_poll)
-				busy_poll_stop(napi, have_poll_lock);
+				busy_poll_stop(napi, have_poll_lock, budget);
 			preempt_enable();
 			rcu_read_unlock();
 			cond_resched();
@@ -6202,7 +6201,7 @@ void napi_busy_loop(unsigned int napi_id,
 		cpu_relax();
 	}
 	if (napi_poll)
-		busy_poll_stop(napi, have_poll_lock);
+		busy_poll_stop(napi, have_poll_lock, budget);
 	preempt_enable();
 out:
 	rcu_read_unlock();
-- 
2.7.4



* [RFC bpf-next 2/7] net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and add napi id
  2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
  2019-05-02  8:39 ` [RFC bpf-next 1/7] net: fs: make busy poll budget configurable in napi_busy_loop Magnus Karlsson
@ 2019-05-02  8:39 ` Magnus Karlsson
  2019-05-02  8:39 ` [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP Magnus Karlsson
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-02  8:39 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz

This patch centralizes the xdp_rxq_info struct so that it resides in a
single place, and adds the napi id to the information contained in it.

The napi id is added because it is needed for the AF_XDP busy-poll
support. The xsk code needs to know which napi context to execute when
it gets a poll request on a socket that is bound to a specific queue
id on a netdev.

Previously, the xdp_rxq_info struct resided both in the _rx structure
and in the driver. The one in the _rx structure was used for the
XDP_SKB case and the one in the driver for the XDP_DRV case. With
busy-poll, the request to execute the napi context always comes from
the syscall path, never the driver path, so the xdp_rxq_info needs to
reside in the _rx struct for both XDP_SKB and XDP_DRV. With this,
there is no longer a need to have an extra copy in the driver that is
only valid for the XDP_DRV case. This structure has been converted to
a pointer reference to the xdp_rxq_info struct in the kernel instead,
making the code smaller and simpler.

NOTE: this patch needs to include moving over all drivers to the new
interface. I only did a handful here to demonstrate the changes. When
we agree on how to do it, I will move over all of them.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |  2 -
 drivers/net/ethernet/intel/i40e/i40e_main.c    |  8 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.c    | 16 +++++---
 drivers/net/ethernet/intel/i40e/i40e_txrx.h    |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c     |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 36 +++++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c   |  2 +-
 drivers/net/tun.c                              | 14 +++----
 drivers/net/veth.c                             | 10 ++---
 drivers/net/virtio_net.c                       |  8 ++--
 include/net/xdp.h                              | 13 ++++---
 net/core/dev.c                                 | 19 +---------
 net/core/xdp.c                                 | 51 +++++++++++++++++---------
 14 files changed, 102 insertions(+), 83 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 9eaea1b..dcb5144 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -2006,8 +2006,6 @@ static int i40e_set_ringparam(struct net_device *netdev,
 			 */
 			rx_rings[i].desc = NULL;
 			rx_rings[i].rx_bi = NULL;
-			/* Clear cloned XDP RX-queue info before setup call */
-			memset(&rx_rings[i].xdp_rxq, 0, sizeof(rx_rings[i].xdp_rxq));
 			/* this is to allow wr32 to have something to write to
 			 * during early allocation of Rx buffers
 			 */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 65c2b9d..763c48c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3238,7 +3238,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
 	if (ring->vsi->type == I40E_VSI_MAIN)
-		xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
+		xdp_rxq_info_unreg_mem_model(ring->netdev, ring->queue_index);
 
 	ring->xsk_umem = i40e_xsk_umem(ring);
 	if (ring->xsk_umem) {
@@ -3250,7 +3250,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 		 */
 		chain_len = 1;
 		ring->zca.free = i40e_zca_free;
-		ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+		ret = xdp_rxq_info_reg_mem_model(ring->netdev,
+						 ring->queue_index,
 						 MEM_TYPE_ZERO_COPY,
 						 &ring->zca);
 		if (ret)
@@ -3262,7 +3263,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
 	} else {
 		ring->rx_buf_len = vsi->rx_buf_len;
 		if (ring->vsi->type == I40E_VSI_MAIN) {
-			ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+			ret = xdp_rxq_info_reg_mem_model(ring->netdev,
+							 ring->queue_index,
 							 MEM_TYPE_PAGE_SHARED,
 							 NULL);
 			if (ret)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e193170..74132ad 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1408,8 +1408,10 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 {
 	i40e_clean_rx_ring(rx_ring);
-	if (rx_ring->vsi->type == I40E_VSI_MAIN)
-		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+	if (rx_ring->vsi->type == I40E_VSI_MAIN) {
+		xdp_rxq_info_unreg(rx_ring->vsi->netdev, rx_ring->queue_index);
+		rx_ring->xdp_rxq = NULL;
+	}
 	rx_ring->xdp_prog = NULL;
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
@@ -1460,15 +1462,19 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 
 	/* XDP RX-queue info only needed for RX rings exposed to XDP */
 	if (rx_ring->vsi->type == I40E_VSI_MAIN) {
-		err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
-				       rx_ring->queue_index);
+		err = xdp_rxq_info_reg(rx_ring->netdev, rx_ring->queue_index,
+				       rx_ring->q_vector->napi.napi_id);
 		if (err < 0)
 			goto err;
+
+		rx_ring->xdp_rxq = xdp_rxq_info_get(rx_ring->netdev,
+						    rx_ring->queue_index);
 	}
 
 	rx_ring->xdp_prog = rx_ring->vsi->xdp_prog;
 
 	return 0;
+
 err:
 	kfree(rx_ring->rx_bi);
 	rx_ring->rx_bi = NULL;
@@ -2335,7 +2341,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	bool failure = false;
 	struct xdp_buff xdp;
 
-	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.rxq = rx_ring->xdp_rxq;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
 		struct i40e_rx_buffer *rx_buffer;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 100e92d..066f616 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -417,7 +417,7 @@ struct i40e_ring {
 					 */
 
 	struct i40e_channel *ch;
-	struct xdp_rxq_info xdp_rxq;
+	struct xdp_rxq_info *xdp_rxq;
 	struct xdp_umem *xsk_umem;
 	struct zero_copy_allocator zca; /* ZC allocator anchor */
 } ____cacheline_internodealigned_in_smp;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 1b17486..2eba2bc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -536,7 +536,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 	struct sk_buff *skb;
 	struct xdp_buff xdp;
 
-	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.rxq = rx_ring->xdp_rxq;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
 		struct i40e_rx_buffer *bi;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 08d85e3..ea320b9 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -351,7 +351,7 @@ struct ixgbe_ring {
 		struct ixgbe_tx_queue_stats tx_stats;
 		struct ixgbe_rx_queue_stats rx_stats;
 	};
-	struct xdp_rxq_info xdp_rxq;
+	struct xdp_rxq_info *xdp_rxq;
 	struct xdp_umem *xsk_umem;
 	struct zero_copy_allocator zca; /* ZC allocator anchor */
 	u16 ring_idx;		/* {rx,tx,xdp}_ring back reference idx */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7b90320..3afb521 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2285,7 +2285,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 	unsigned int xdp_xmit = 0;
 	struct xdp_buff xdp;
 
-	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.rxq = rx_ring->xdp_rxq;
 
 	while (likely(total_rx_packets < budget)) {
 		union ixgbe_adv_rx_desc *rx_desc;
@@ -4066,17 +4066,19 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter *adapter,
 	u32 rxdctl;
 	u8 reg_idx = ring->reg_idx;
 
-	xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
+	xdp_rxq_info_unreg_mem_model(ring->netdev, ring->queue_index);
 	ring->xsk_umem = ixgbe_xsk_umem(adapter, ring);
 	if (ring->xsk_umem) {
 		ring->zca.free = ixgbe_zca_free;
-		WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+		(void)xdp_rxq_info_reg_mem_model(ring->netdev,
+						   ring->queue_index,
 						   MEM_TYPE_ZERO_COPY,
-						   &ring->zca));
+						   &ring->zca);
 
 	} else {
-		WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
-						   MEM_TYPE_PAGE_SHARED, NULL));
+		(void)xdp_rxq_info_reg_mem_model(ring->netdev,
+						   ring->queue_index,
+						   MEM_TYPE_PAGE_SHARED, NULL);
 	}
 
 	/* disable queue to avoid use of these values while updating state */
@@ -6514,6 +6516,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	struct device *dev = rx_ring->dev;
 	int orig_node = dev_to_node(dev);
 	int ring_node = NUMA_NO_NODE;
+	int err = -ENOMEM;
 	int size;
 
 	size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
@@ -6527,6 +6530,14 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	if (!rx_ring->rx_buffer_info)
 		goto err;
 
+	/* XDP RX-queue info */
+	err = xdp_rxq_info_reg(adapter->netdev, rx_ring->queue_index,
+			       rx_ring->q_vector->napi.napi_id);
+	if (err)
+		goto err;
+	rx_ring->xdp_rxq = xdp_rxq_info_get(rx_ring->netdev,
+					    rx_ring->queue_index);
+
 	/* Round up to nearest 4K */
 	rx_ring->size = rx_ring->count * sizeof(union ixgbe_adv_rx_desc);
 	rx_ring->size = ALIGN(rx_ring->size, 4096);
@@ -6540,17 +6551,14 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	if (!rx_ring->desc)
 		rx_ring->desc = dma_alloc_coherent(dev, rx_ring->size,
 						   &rx_ring->dma, GFP_KERNEL);
-	if (!rx_ring->desc)
+	if (!rx_ring->desc) {
+		err = -ENOMEM;
 		goto err;
+	}
 
 	rx_ring->next_to_clean = 0;
 	rx_ring->next_to_use = 0;
 
-	/* XDP RX-queue info */
-	if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, adapter->netdev,
-			     rx_ring->queue_index) < 0)
-		goto err;
-
 	rx_ring->xdp_prog = adapter->xdp_prog;
 
 	return 0;
@@ -6558,7 +6566,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	vfree(rx_ring->rx_buffer_info);
 	rx_ring->rx_buffer_info = NULL;
 	dev_err(dev, "Unable to allocate memory for the Rx descriptor ring\n");
-	return -ENOMEM;
+	return err;
 }
 
 /**
@@ -6648,7 +6656,7 @@ void ixgbe_free_rx_resources(struct ixgbe_ring *rx_ring)
 	ixgbe_clean_rx_ring(rx_ring);
 
 	rx_ring->xdp_prog = NULL;
-	xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+	xdp_rxq_info_unreg(rx_ring->netdev, rx_ring->queue_index);
 	vfree(rx_ring->rx_buffer_info);
 	rx_ring->rx_buffer_info = NULL;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index bfe95ce..9c10c93 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -487,7 +487,7 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 	struct sk_buff *skb;
 	struct xdp_buff xdp;
 
-	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.rxq = rx_ring->xdp_rxq;
 
 	while (likely(total_rx_packets < budget)) {
 		union ixgbe_adv_rx_desc *rx_desc;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 9d72f8c..b05c239 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -726,7 +726,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
 				unregister_netdevice(tun->dev);
 		}
 		if (tun)
-			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+			xdp_rxq_info_unreg(tun->dev, tfile->queue_index);
 		ptr_ring_cleanup(&tfile->tx_ring, tun_ptr_free);
 		sock_put(&tfile->sk);
 	}
@@ -774,13 +774,13 @@ static void tun_detach_all(struct net_device *dev)
 		tun_napi_del(tfile);
 		/* Drop read queue */
 		tun_queue_purge(tfile);
-		xdp_rxq_info_unreg(&tfile->xdp_rxq);
+		xdp_rxq_info_unreg(dev, tfile->queue_index);
 		sock_put(&tfile->sk);
 	}
 	list_for_each_entry_safe(tfile, tmp, &tun->disabled, next) {
 		tun_enable_queue(tfile);
 		tun_queue_purge(tfile);
-		xdp_rxq_info_unreg(&tfile->xdp_rxq);
+		xdp_rxq_info_unreg(dev, tfile->queue_index);
 		sock_put(&tfile->sk);
 	}
 	BUG_ON(tun->numdisabled != 0);
@@ -842,14 +842,14 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 			tfile->xdp_rxq.queue_index = tfile->queue_index;
 	} else {
 		/* Setup XDP RX-queue info, for new tfile getting attached */
-		err = xdp_rxq_info_reg(&tfile->xdp_rxq,
-				       tun->dev, tfile->queue_index);
+		err = xdp_rxq_info_reg(tun->dev, tfile->queue_index,
+				       tfile->napi.napi_id);
 		if (err < 0)
 			goto out;
-		err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+		err = xdp_rxq_info_reg_mem_model(dev, tfile->queue_index,
 						 MEM_TYPE_PAGE_SHARED, NULL);
 		if (err < 0) {
-			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+			xdp_rxq_info_unreg(dev, tfile->queue_index);
 			goto out;
 		}
 		err = 0;
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 09a1433..a5bc608a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -817,11 +817,11 @@ static int veth_enable_xdp(struct net_device *dev)
 		for (i = 0; i < dev->real_num_rx_queues; i++) {
 			struct veth_rq *rq = &priv->rq[i];
 
-			err = xdp_rxq_info_reg(&rq->xdp_rxq, dev, i);
+			err = xdp_rxq_info_reg(dev, i, rq->xdp_napi.napi_id);
 			if (err < 0)
 				goto err_rxq_reg;
 
-			err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+			err = xdp_rxq_info_reg_mem_model(dev, i,
 							 MEM_TYPE_PAGE_SHARED,
 							 NULL);
 			if (err < 0)
@@ -841,10 +841,10 @@ static int veth_enable_xdp(struct net_device *dev)
 
 	return 0;
 err_reg_mem:
-	xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
+	xdp_rxq_info_unreg(dev, i);
 err_rxq_reg:
 	for (i--; i >= 0; i--)
-		xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
+		xdp_rxq_info_unreg(dev, i);
 
 	return err;
 }
@@ -861,7 +861,7 @@ static void veth_disable_xdp(struct net_device *dev)
 		struct veth_rq *rq = &priv->rq[i];
 
 		rq->xdp_rxq.mem = rq->xdp_mem;
-		xdp_rxq_info_unreg(&rq->xdp_rxq);
+		xdp_rxq_info_unreg(dev, i);
 	}
 }
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 559c48e6..f15b3d5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1469,14 +1469,14 @@ static int virtnet_open(struct net_device *dev)
 			if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
 				schedule_delayed_work(&vi->refill, 0);
 
-		err = xdp_rxq_info_reg(&vi->rq[i].xdp_rxq, dev, i);
+		err = xdp_rxq_info_reg(dev, i, vi->rq[i].napi.napi_id);
 		if (err < 0)
 			return err;
 
-		err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
+		err = xdp_rxq_info_reg_mem_model(dev, i,
 						 MEM_TYPE_PAGE_SHARED, NULL);
 		if (err < 0) {
-			xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+			xdp_rxq_info_unreg(dev, i);
 			return err;
 		}
 
@@ -1817,7 +1817,7 @@ static int virtnet_close(struct net_device *dev)
 	cancel_delayed_work_sync(&vi->refill);
 
 	for (i = 0; i < vi->max_queue_pairs; i++) {
-		xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+		xdp_rxq_info_unreg(dev, i);
 		napi_disable(&vi->rq[i].napi);
 		virtnet_napi_tx_disable(&vi->sq[i].napi);
 	}
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0f25b36..d5fb5c0 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -60,6 +60,7 @@ struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
 	u32 reg_state;
+	unsigned int napi_id;
 	struct xdp_mem_info mem;
 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
 
@@ -129,14 +130,16 @@ void xdp_return_frame(struct xdp_frame *xdpf);
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf);
 void xdp_return_buff(struct xdp_buff *xdp);
 
-int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
-		     struct net_device *dev, u32 queue_index);
-void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
+void xdp_rxq_info_init(struct net_device *dev, u32 queue_index);
+int xdp_rxq_info_reg(struct net_device *dev, u32 queue_index,
+		     unsigned int napi_id);
+void xdp_rxq_info_unreg(struct net_device *net, u32 queue_index);
+struct xdp_rxq_info *xdp_rxq_info_get(struct net_device *dev, u32 queue_index);
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
-int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+int xdp_rxq_info_reg_mem_model(struct net_device *dev, u32 queue_index,
 			       enum xdp_mem_type type, void *allocator);
-void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq);
+void xdp_rxq_info_unreg_mem_model(struct net_device *dev, u32 queue_index);
 
 /* Drivers not supporting XDP metadata can use this helper, which
  * rejects any room expansion for metadata as a result.
diff --git a/net/core/dev.c b/net/core/dev.c
index e82fc44..0d6b3ed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4383,6 +4383,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 
 	rxqueue = netif_get_rxqueue(skb);
 	xdp->rxq = &rxqueue->xdp_rxq;
+	xdp->rxq->napi_id = skb->napi_id;
 
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
@@ -8530,7 +8531,6 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 	unsigned int i, count = dev->num_rx_queues;
 	struct netdev_rx_queue *rx;
 	size_t sz = count * sizeof(*rx);
-	int err = 0;
 
 	BUG_ON(count < 1);
 
@@ -8544,32 +8544,17 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 		rx[i].dev = dev;
 
 		/* XDP RX-queue setup */
-		err = xdp_rxq_info_reg(&rx[i].xdp_rxq, dev, i);
-		if (err < 0)
-			goto err_rxq_info;
+		xdp_rxq_info_init(dev, i);
 	}
 	return 0;
-
-err_rxq_info:
-	/* Rollback successful reg's and free other resources */
-	while (i--)
-		xdp_rxq_info_unreg(&rx[i].xdp_rxq);
-	kvfree(dev->_rx);
-	dev->_rx = NULL;
-	return err;
 }
 
 static void netif_free_rx_queues(struct net_device *dev)
 {
-	unsigned int i, count = dev->num_rx_queues;
-
 	/* netif_alloc_rx_queues alloc failed, resources have been unreg'ed */
 	if (!dev->_rx)
 		return;
 
-	for (i = 0; i < count; i++)
-		xdp_rxq_info_unreg(&dev->_rx[i].xdp_rxq);
-
 	kvfree(dev->_rx);
 }
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4b2b194..ed691f9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -94,8 +94,9 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	kfree(xa);
 }
 
-void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+void xdp_rxq_info_unreg_mem_model(struct net_device *dev, u32 queue_index)
 {
+	struct xdp_rxq_info *xdp_rxq = &dev->_rx[queue_index].xdp_rxq;
 	struct xdp_mem_allocator *xa;
 	int id = xdp_rxq->mem.id;
 
@@ -122,18 +123,33 @@ void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg_mem_model);
 
-void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
+static void _xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
 {
+	memset(xdp_rxq, 0, sizeof(*xdp_rxq));
+}
+
+void xdp_rxq_info_init(struct net_device *dev, u32 queue_index)
+{
+	struct xdp_rxq_info *xdp_rxq = &dev->_rx[queue_index].xdp_rxq;
+
+	_xdp_rxq_info_init(xdp_rxq);
+	xdp_rxq->dev = dev;
+	xdp_rxq->queue_index = queue_index;
+}
+
+void xdp_rxq_info_unreg(struct net_device *dev, u32 queue_index)
+{
+	struct xdp_rxq_info *xdp_rxq = &dev->_rx[queue_index].xdp_rxq;
+
 	/* Simplify driver cleanup code paths, allow unreg "unused" */
 	if (xdp_rxq->reg_state == REG_STATE_UNUSED)
 		return;
 
 	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
-	xdp_rxq_info_unreg_mem_model(xdp_rxq);
+	xdp_rxq_info_unreg_mem_model(dev, queue_index);
 
 	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
-	xdp_rxq->dev = NULL;
 
 	/* Reset mem info to defaults */
 	xdp_rxq->mem.id = 0;
@@ -141,15 +157,12 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
 
-static void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq)
-{
-	memset(xdp_rxq, 0, sizeof(*xdp_rxq));
-}
-
 /* Returns 0 on success, negative on failure */
-int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
-		     struct net_device *dev, u32 queue_index)
+int xdp_rxq_info_reg(struct net_device *dev, u32 queue_index,
+		     unsigned int napi_id)
 {
+	struct xdp_rxq_info *xdp_rxq = &dev->_rx[queue_index].xdp_rxq;
+
 	if (xdp_rxq->reg_state == REG_STATE_UNUSED) {
 		WARN(1, "Driver promised not to register this");
 		return -EINVAL;
@@ -157,7 +170,7 @@ int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 
 	if (xdp_rxq->reg_state == REG_STATE_REGISTERED) {
 		WARN(1, "Missing unregister, handled but fix driver");
-		xdp_rxq_info_unreg(xdp_rxq);
+		xdp_rxq_info_unreg(dev, queue_index);
 	}
 
 	if (!dev) {
@@ -166,15 +179,18 @@ int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 	}
 
 	/* State either UNREGISTERED or NEW */
-	xdp_rxq_info_init(xdp_rxq);
-	xdp_rxq->dev = dev;
-	xdp_rxq->queue_index = queue_index;
-
+	xdp_rxq->napi_id = napi_id;
 	xdp_rxq->reg_state = REG_STATE_REGISTERED;
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg);
 
+struct xdp_rxq_info *xdp_rxq_info_get(struct net_device *dev, u32 queue_index)
+{
+	return &dev->_rx[queue_index].xdp_rxq;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_get);
+
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq)
 {
 	xdp_rxq->reg_state = REG_STATE_UNUSED;
@@ -249,9 +265,10 @@ static bool __is_supported_mem_type(enum xdp_mem_type type)
 	return true;
 }
 
-int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+int xdp_rxq_info_reg_mem_model(struct net_device *dev, u32 queue_index,
 			       enum xdp_mem_type type, void *allocator)
 {
+	struct xdp_rxq_info *xdp_rxq = &dev->_rx[queue_index].xdp_rxq;
 	struct xdp_mem_allocator *xdp_alloc;
 	gfp_t gfp = GFP_KERNEL;
 	int id, errno, ret;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP
  2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
  2019-05-02  8:39 ` [RFC bpf-next 1/7] net: fs: make busy poll budget configurable in napi_busy_loop Magnus Karlsson
  2019-05-02  8:39 ` [RFC bpf-next 2/7] net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and add napi id Magnus Karlsson
@ 2019-05-02  8:39 ` Magnus Karlsson
  2019-05-03  0:13   ` Samudrala, Sridhar
  2019-05-02  8:39 ` [RFC bpf-next 5/7] net: add busy-poll support for XDP sockets Magnus Karlsson
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-02  8:39 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz

This patch introduces a new setsockopt that enables busy-poll for XDP
sockets. It is called XDP_BUSY_POLL_BATCH_SIZE and takes the batch
size as an argument. A value between 1 and NAPI_POLL_WEIGHT (64) turns
it on, 0 turns it off, and any other value returns an error. There is
also a corresponding getsockopt implementation.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index caed8b1..be28a78 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_FILL_RING		5
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
+#define XDP_BUSY_POLL_BATCH_SIZE	8
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC bpf-next 5/7] net: add busy-poll support for XDP sockets
  2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
                   ` (2 preceding siblings ...)
  2019-05-02  8:39 ` [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP Magnus Karlsson
@ 2019-05-02  8:39 ` Magnus Karlsson
  2019-05-03  0:23   ` Samudrala, Sridhar
  2019-05-02  8:39 ` [RFC bpf-next 6/7] libbpf: add busy-poll support to " Magnus Karlsson
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-02  8:39 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz

This patch adds busy-poll support for XDP sockets (AF_XDP). With
busy-poll, the driver is executed in process context by calling the
poll() syscall. The main advantage of this is that all processing
occurs on a single core, eliminating the core-to-core cache transfers
that otherwise occur between the application and the softirq
processing running on another core. From a systems point of view, it
also has the advantage that we do not have to provision extra cores to
handle ksoftirqd/softirq processing, as all processing is done on the
single core that executes the application. The drawback of busy-poll
is that the max throughput seen from a single application will be
lower (due to the syscall), but on a per-core basis it will often be
higher, as the normal mode runs on two cores while busy-poll runs on a
single one.

The semantics of busy-poll from the application point of view are the
following:

* The application is required to call poll() to drive rx and tx
  processing. There is no guarantee that softirq and interrupts will
  do this for you.

* It is enabled on a per-socket basis. There is no global enablement,
  i.e. XDP socket busy-poll ignores the current
  /proc/sys/net/core/busy_poll and busy_read global enablement
  mechanisms.

* The batch size (how many packets are processed every time the napi
  function in the driver is called, i.e. the weight parameter) should
  be configurable. Currently, the busy-poll batch size of AF_INET
  sockets is set to 8, but for AF_XDP sockets this is too small as the
  amount of processing per packet is much smaller with AF_XDP. This
  should be configurable on a per-socket basis.

* If multiple busy-poll enabled AF_XDP sockets are put into a single
  poll() call, the napi contexts of all of them should be executed.
  This is in contrast to AF_INET busy-poll, which quits after the
  first socket that finds any packets. We need all napi contexts to be
  executed due to the first requirement in this list. The desired
  behaviour is much more like regular sockets, where all of them are
  checked in the poll call.

* It should be possible to mix AF_XDP busy-poll sockets with any other
  sockets, including busy-poll AF_INET ones, in a single poll() call
  without any change to the semantics or behaviour of any of those
  socket types.

* As suggested by Maxim Mikityanskiy, in busy-poll mode poll() will
  return POLLERR if the fill ring is empty or the completion queue is
  full.

Busy-poll support is enabled by calling a new setsockopt called
XDP_BUSY_POLL_BATCH_SIZE that takes the batch size as an argument. A
value between 1 and NAPI_POLL_WEIGHT (64) turns it on, 0 turns it off,
and any other value returns an error.

A typical packet processing rxdrop loop with busy-poll will look something
like this:

for (i = 0; i < num_socks; i++) {
    fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
    fds[i].events = POLLIN;
}

for (;;) {
    ret = poll(fds, num_socks, 0);
    if (ret <= 0)
            continue;

    for (i = 0; i < num_socks; i++)
        rx_drop(xsks[i], fds); /* The actual application */
}

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h |   3 ++
 net/xdp/Kconfig        |   1 +
 net/xdp/xsk.c          | 122 ++++++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h    |  18 ++++++--
 4 files changed, 138 insertions(+), 6 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d074b6d..2e956b37 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -57,7 +57,10 @@ struct xdp_sock {
 	struct net_device *dev;
 	struct xdp_umem *umem;
 	struct list_head flush_node;
+	unsigned int napi_id_rx;
+	unsigned int napi_id_tx;
 	u16 queue_id;
+	u16 bp_batch_size;
 	struct xsk_queue *tx ____cacheline_aligned_in_smp;
 	struct list_head list;
 	bool zc;
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
index 0255b33..219baaa 100644
--- a/net/xdp/Kconfig
+++ b/net/xdp/Kconfig
@@ -1,6 +1,7 @@
 config XDP_SOCKETS
 	bool "XDP sockets"
 	depends on BPF_SYSCALL
+	select NET_RX_BUSY_POLL
 	default n
 	help
 	  XDP sockets allows a channel between XDP programs and
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e886..bd3d0fe 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -22,6 +22,7 @@
 #include <linux/net.h>
 #include <linux/netdevice.h>
 #include <linux/rculist.h>
+#include <net/busy_poll.h>
 #include <net/xdp_sock.h>
 #include <net/xdp.h>
 
@@ -302,16 +303,107 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
 }
 
+static unsigned int xsk_check_rx_poll_err(struct xdp_sock *xs)
+{
+	return xskq_consumer_empty(xs->umem->fq) ? POLLERR : 0;
+}
+
+static unsigned int xsk_check_tx_poll_err(struct xdp_sock *xs)
+{
+	return xskq_producer_full(xs->umem->cq) ? POLLERR : 0;
+}
+
+static bool xsk_busy_loop_end(void *p, unsigned long start_time)
+{
+	return true;
+}
+
+static unsigned int xsk_get_napi_id_rx(struct xdp_sock *xs)
+{
+	if (xs->napi_id_rx)
+		return xs->napi_id_rx;
+	if (xs->dev->_rx[xs->queue_id].xdp_rxq.napi_id) {
+		xs->napi_id_rx = xdp_rxq_info_get(xs->dev,
+						  xs->queue_id)->napi_id;
+		return xs->napi_id_rx;
+	}
+
+	WARN_ON_ONCE(true);
+	return 0;
+}
+
+static unsigned int xsk_get_napi_id_tx(struct xdp_sock *xs)
+{
+	if (xs->napi_id_tx)
+		return xs->napi_id_tx;
+	if (xs->dev->_tx[xs->queue_id].xdp_txq.napi_id) {
+		xs->napi_id_tx = xdp_txq_info_get(xs->dev,
+						  xs->queue_id)->napi_id;
+		return xs->napi_id_tx;
+	}
+
+	WARN_ON_ONCE(true);
+	return 0;
+}
+
+static void xsk_exec_poll_generic(struct sock *sk, struct xdp_sock *xs,
+				  __poll_t events)
+{
+	if (events & (POLLIN | POLLRDNORM))
+		/* NAPI id filled in by the generic XDP code */
+		napi_busy_loop(xsk_get_napi_id_rx(xs), xsk_busy_loop_end, NULL,
+			       xs->bp_batch_size);
+	if (events & (POLLOUT | POLLWRNORM))
+		/* Use the regular send path as we do not have any
+		 * NAPI id for the Tx path. It is only in the driver
+		 * and not communicated upwards in the skb case.
+		 */
+		xsk_generic_xmit(sk, NULL, 0);
+}
+
+static void xsk_exec_poll_zc(struct xdp_sock *xs, __poll_t events)
+{
+	unsigned int napi_id_rx = xsk_get_napi_id_rx(xs);
+	unsigned int napi_id_tx = xsk_get_napi_id_tx(xs);
+
+	if (events & (POLLIN | POLLRDNORM))
+		napi_busy_loop(xs->napi_id_rx, xsk_busy_loop_end, NULL,
+			       xs->bp_batch_size);
+	if (napi_id_rx != napi_id_tx)
+		if (events & (POLLOUT | POLLWRNORM))
+			/* Tx has its own napi so we need to call it too */
+			napi_busy_loop(xs->napi_id_tx, xsk_busy_loop_end, NULL,
+				       xs->bp_batch_size);
+}
+
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
 			     struct poll_table_struct *wait)
 {
 	unsigned int mask = datagram_poll(file, sock, wait);
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
+	__poll_t events = poll_requested_events(wait);
+
+	if (xs->bp_batch_size) {
+		if (xs->zc)
+			xsk_exec_poll_zc(xs, events);
+		else
+			xsk_exec_poll_generic(sk, xs, events);
+
+		if (events & (POLLIN | POLLRDNORM))
+			mask |= xsk_check_rx_poll_err(xs);
+		if (events & (POLLOUT | POLLWRNORM))
+			mask |= xsk_check_tx_poll_err(xs);
+
+		/* Clear the busy_loop flag so that any further fds in
+		 * the pollfd struct will have their napis scheduled.
+		 */
+		mask &= ~POLL_BUSY_LOOP;
+	}
 
-	if (xs->rx && !xskq_empty_desc(xs->rx))
+	if (xs->rx && !xskq_producer_empty(xs->rx))
 		mask |= POLLIN | POLLRDNORM;
-	if (xs->tx && !xskq_full_desc(xs->tx))
+	if (xs->tx && !xskq_consumer_full(xs->tx))
 		mask |= POLLOUT | POLLWRNORM;
 
 	return mask;
@@ -572,6 +664,21 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_BUSY_POLL_BATCH_SIZE:
+	{
+		u16 batch_size;
+
+		if (copy_from_user(&batch_size, optval, sizeof(batch_size)))
+			return -EFAULT;
+
+		if (batch_size == 0 || batch_size > NAPI_POLL_WEIGHT)
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		xs->bp_batch_size = batch_size;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
 	default:
 		break;
 	}
@@ -644,6 +751,17 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
 
 		return 0;
 	}
+	case XDP_BUSY_POLL_BATCH_SIZE:
+		if (len < sizeof(xs->bp_batch_size))
+			return -EINVAL;
+
+		if (copy_to_user(optval, &xs->bp_batch_size,
+				 sizeof(xs->bp_batch_size)))
+			return -EFAULT;
+		if (put_user(sizeof(xs->bp_batch_size), optlen))
+			return -EFAULT;
+
+		return 0;
 	default:
 		break;
 	}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 88b9ae2..ebbd996 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -292,14 +292,24 @@ static inline void xskq_produce_flush_desc(struct xsk_queue *q)
 	WRITE_ONCE(q->ring->producer, q->prod_tail);
 }
 
-static inline bool xskq_full_desc(struct xsk_queue *q)
+static inline bool xskq_consumer_full(struct xsk_queue *q)
 {
-	return xskq_nb_avail(q, q->nentries) == q->nentries;
+	return READ_ONCE(q->ring->producer) - q->cons_tail == q->nentries;
 }
 
-static inline bool xskq_empty_desc(struct xsk_queue *q)
+static inline bool xskq_producer_empty(struct xsk_queue *q)
 {
-	return xskq_nb_free(q, q->prod_tail, q->nentries) == q->nentries;
+	return READ_ONCE(q->ring->consumer) == q->prod_tail;
+}
+
+static inline bool xskq_consumer_empty(struct xsk_queue *q)
+{
+	return READ_ONCE(q->ring->producer) == q->cons_tail;
+}
+
+static inline bool xskq_producer_full(struct xsk_queue *q)
+{
+	return q->prod_tail - READ_ONCE(q->ring->consumer) == q->nentries;
 }
 
 void xskq_set_umem(struct xsk_queue *q, u64 size, u64 chunk_mask);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC bpf-next 6/7] libbpf: add busy-poll support to XDP sockets
  2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
                   ` (3 preceding siblings ...)
  2019-05-02  8:39 ` [RFC bpf-next 5/7] net: add busy-poll support for XDP sockets Magnus Karlsson
@ 2019-05-02  8:39 ` " Magnus Karlsson
  2019-05-02  8:39 ` [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample Magnus Karlsson
  2019-05-06 16:31 ` [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Alexei Starovoitov
  6 siblings, 0 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-02  8:39 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz

This patch adds busy-poll support for XDP sockets to libbpf. A new
option, busy_poll, is provided in the xsk_socket_config struct. Its
value is the desired batch size: a value between 1 and
NAPI_POLL_WEIGHT (64) turns busy-poll on, 0 turns it off, and any
other value returns an error.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 tools/include/uapi/linux/if_xdp.h |  1 +
 tools/lib/bpf/xsk.c               | 23 +++++++++++++----------
 tools/lib/bpf/xsk.h               |  1 +
 3 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index caed8b1..be28a78 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_FILL_RING		5
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
+#define XDP_BUSY_POLL_BATCH_SIZE	8
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index 557ef8d..b5538f1 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -120,10 +120,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
 		return;
 	}
 
-	cfg->fill_size = usr_cfg->fill_size;
-	cfg->comp_size = usr_cfg->comp_size;
-	cfg->frame_size = usr_cfg->frame_size;
-	cfg->frame_headroom = usr_cfg->frame_headroom;
+	memcpy(cfg, usr_cfg, sizeof(*usr_cfg));
 }
 
 static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
@@ -135,18 +132,14 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
 		cfg->libbpf_flags = 0;
 		cfg->xdp_flags = 0;
 		cfg->bind_flags = 0;
+		cfg->busy_poll = 0;
 		return 0;
 	}
 
 	if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
 		return -EINVAL;
 
-	cfg->rx_size = usr_cfg->rx_size;
-	cfg->tx_size = usr_cfg->tx_size;
-	cfg->libbpf_flags = usr_cfg->libbpf_flags;
-	cfg->xdp_flags = usr_cfg->xdp_flags;
-	cfg->bind_flags = usr_cfg->bind_flags;
-
+	memcpy(cfg, usr_cfg, sizeof(*usr_cfg));
 	return 0;
 }
 
@@ -632,6 +625,16 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
 	}
 	xsk->tx = tx;
 
+	if (xsk->config.busy_poll) {
+		err = setsockopt(xsk->fd, SOL_XDP, XDP_BUSY_POLL_BATCH_SIZE,
+				 &xsk->config.busy_poll,
+				 sizeof(xsk->config.busy_poll));
+		if (err) {
+			err = -errno;
+			goto out_mmap_tx;
+		}
+	}
+
 	sxdp.sxdp_family = PF_XDP;
 	sxdp.sxdp_ifindex = xsk->ifindex;
 	sxdp.sxdp_queue_id = xsk->queue_id;
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index 82ea71a..517a56a 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -187,6 +187,7 @@ struct xsk_socket_config {
 	__u32 libbpf_flags;
 	__u32 xdp_flags;
 	__u16 bind_flags;
+	__u16 busy_poll;
 };
 
 /* Set config to NULL to get the default configuration. */
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample
  2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
                   ` (4 preceding siblings ...)
  2019-05-02  8:39 ` [RFC bpf-next 6/7] libbpf: add busy-poll support to " Magnus Karlsson
@ 2019-05-02  8:39 ` Magnus Karlsson
  2019-05-24  2:51   ` Ye Xiaolong
  2019-05-06 16:31 ` [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Alexei Starovoitov
  6 siblings, 1 reply; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-02  8:39 UTC (permalink / raw)
  To: magnus.karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, sridhar.samudrala, kevin.laatz

This patch adds busy-poll support to the xdpsock sample
application. It is enabled by the "-b" or the "--busy-poll" command
line options.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 samples/bpf/xdpsock_user.c | 203 ++++++++++++++++++++++++++++-----------------
 1 file changed, 125 insertions(+), 78 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index d08ee1a..1272edf 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -66,6 +66,7 @@ static const char *opt_if = "";
 static int opt_ifindex;
 static int opt_queue;
 static int opt_poll;
+static int opt_busy_poll;
 static int opt_interval = 1;
 static u32 opt_xdp_bind_flags;
 static __u32 prog_id;
@@ -119,8 +120,11 @@ static void print_benchmark(bool running)
 	else
 		printf("	");
 
-	if (opt_poll)
+	if (opt_poll) {
+		if (opt_busy_poll)
+			printf("busy-");
 		printf("poll() ");
+	}
 
 	if (running) {
 		printf("running...");
@@ -306,7 +310,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
 	xsk->umem = umem;
 	cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 	cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
-	cfg.libbpf_flags = 0;
+	cfg.busy_poll = (opt_busy_poll ? BATCH_SIZE : 0);
 	cfg.xdp_flags = opt_xdp_flags;
 	cfg.bind_flags = opt_xdp_bind_flags;
 	ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
@@ -319,17 +323,17 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
 		exit_with_error(-ret);
 
 	ret = xsk_ring_prod__reserve(&xsk->umem->fq,
-				     XSK_RING_PROD__DEFAULT_NUM_DESCS,
+				     1024,
 				     &idx);
-	if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS)
+	if (ret != 1024)
 		exit_with_error(-ret);
 	for (i = 0;
-	     i < XSK_RING_PROD__DEFAULT_NUM_DESCS *
+	     i < 1024 *
 		     XSK_UMEM__DEFAULT_FRAME_SIZE;
 	     i += XSK_UMEM__DEFAULT_FRAME_SIZE)
 		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = i;
 	xsk_ring_prod__submit(&xsk->umem->fq,
-			      XSK_RING_PROD__DEFAULT_NUM_DESCS);
+			      1024);
 
 	return xsk;
 }
@@ -341,6 +345,7 @@ static struct option long_options[] = {
 	{"interface", required_argument, 0, 'i'},
 	{"queue", required_argument, 0, 'q'},
 	{"poll", no_argument, 0, 'p'},
+	{"busy-poll", no_argument, 0, 'b'},
 	{"xdp-skb", no_argument, 0, 'S'},
 	{"xdp-native", no_argument, 0, 'N'},
 	{"interval", required_argument, 0, 'n'},
@@ -360,6 +365,7 @@ static void usage(const char *prog)
 		"  -i, --interface=n	Run on interface n\n"
 		"  -q, --queue=n	Use queue n (default 0)\n"
 		"  -p, --poll		Use poll syscall\n"
+		"  -b, --busy-poll	Use poll syscall with busy poll\n"
 		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
 		"  -N, --xdp-native=n	Enfore XDP native mode\n"
 		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
@@ -377,7 +383,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
+		c = getopt_long(argc, argv, "Frtli:q:pbsSNn:cz", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -401,6 +407,10 @@ static void parse_command_line(int argc, char **argv)
 		case 'p':
 			opt_poll = 1;
 			break;
+		case 'b':
+			opt_busy_poll = 1;
+			opt_poll = 1;
+			break;
 		case 'S':
 			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
 			opt_xdp_bind_flags |= XDP_COPY;
@@ -444,7 +454,8 @@ static void kick_tx(struct xsk_socket_info *xsk)
 	exit_with_error(errno);
 }
 
-static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
+static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
+				     struct pollfd *fds)
 {
 	u32 idx_cq = 0, idx_fq = 0;
 	unsigned int rcvd;
@@ -453,7 +464,8 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 	if (!xsk->outstanding_tx)
 		return;
 
-	kick_tx(xsk);
+	if (!opt_poll)
+		kick_tx(xsk);
 	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
 		xsk->outstanding_tx;
 
@@ -467,6 +479,8 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 		while (ret != rcvd) {
 			if (ret < 0)
 				exit_with_error(-ret);
+			if (opt_busy_poll)
+				ret = poll(fds, num_socks, 0);
 			ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
 						     &idx_fq);
 		}
@@ -490,7 +504,8 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
 	if (!xsk->outstanding_tx)
 		return;
 
-	kick_tx(xsk);
+	if (!opt_busy_poll)
+		kick_tx(xsk);
 
 	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx);
 	if (rcvd > 0) {
@@ -500,10 +515,10 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
 	}
 }
 
-static void rx_drop(struct xsk_socket_info *xsk)
+static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
 {
-	unsigned int rcvd, i;
 	u32 idx_rx = 0, idx_fq = 0;
+	unsigned int rcvd, i;
 	int ret;
 
 	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
@@ -514,6 +529,8 @@ static void rx_drop(struct xsk_socket_info *xsk)
 	while (ret != rcvd) {
 		if (ret < 0)
 			exit_with_error(-ret);
+		if (opt_busy_poll)
+			ret = poll(fds, num_socks, 0);
 		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
 	}
 
@@ -533,43 +550,68 @@ static void rx_drop(struct xsk_socket_info *xsk)
 
 static void rx_drop_all(void)
 {
-	struct pollfd fds[MAX_SOCKS + 1];
-	int i, ret, timeout, nfds = 1;
+	struct pollfd fds[MAX_SOCKS];
+	int i, ret;
 
 	memset(fds, 0, sizeof(fds));
 
 	for (i = 0; i < num_socks; i++) {
 		fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
 		fds[i].events = POLLIN;
-		timeout = 1000; /* 1sn */
 	}
 
 	for (;;) {
 		if (opt_poll) {
-			ret = poll(fds, nfds, timeout);
+			ret = poll(fds, num_socks, 0);
 			if (ret <= 0)
 				continue;
 		}
 
 		for (i = 0; i < num_socks; i++)
-			rx_drop(xsks[i]);
+			rx_drop(xsks[i], fds);
+	}
+}
+
+static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb)
+{
+	u32 idx;
+
+	if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) ==
+	    BATCH_SIZE) {
+		unsigned int i;
+
+		for (i = 0; i < BATCH_SIZE; i++) {
+			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
+				= (frame_nb + i) <<
+				XSK_UMEM__DEFAULT_FRAME_SHIFT;
+			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
+				sizeof(pkt_data) - 1;
+		}
+
+		xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
+		xsk->outstanding_tx += BATCH_SIZE;
+		frame_nb += BATCH_SIZE;
+		frame_nb %= NUM_FRAMES;
 	}
+
+	complete_tx_only(xsk);
 }
 
-static void tx_only(struct xsk_socket_info *xsk)
+static void tx_only_all(void)
 {
-	int timeout, ret, nfds = 1;
-	struct pollfd fds[nfds + 1];
-	u32 idx, frame_nb = 0;
+	struct pollfd fds[MAX_SOCKS];
+	u32 frame_nb[MAX_SOCKS] = {};
+	int i, ret;
 
 	memset(fds, 0, sizeof(fds));
-	fds[0].fd = xsk_socket__fd(xsk->xsk);
-	fds[0].events = POLLOUT;
-	timeout = 1000; /* 1sn */
+	for (i = 0; i < num_socks; i++) {
+		fds[0].fd = xsk_socket__fd(xsks[i]->xsk);
+		fds[0].events = POLLOUT;
+	}
 
 	for (;;) {
 		if (opt_poll) {
-			ret = poll(fds, nfds, timeout);
+			ret = poll(fds, num_socks, 0);
 			if (ret <= 0)
 				continue;
 
@@ -577,70 +619,75 @@ static void tx_only(struct xsk_socket_info *xsk)
 				continue;
 		}
 
-		if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) ==
-		    BATCH_SIZE) {
-			unsigned int i;
-
-			for (i = 0; i < BATCH_SIZE; i++) {
-				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
-					= (frame_nb + i) <<
-					XSK_UMEM__DEFAULT_FRAME_SHIFT;
-				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
-					sizeof(pkt_data) - 1;
-			}
-
-			xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
-			xsk->outstanding_tx += BATCH_SIZE;
-			frame_nb += BATCH_SIZE;
-			frame_nb %= NUM_FRAMES;
-		}
-
-		complete_tx_only(xsk);
+		for (i = 0; i < num_socks; i++)
+			tx_only(xsks[i], frame_nb[i]);
 	}
 }
 
-static void l2fwd(struct xsk_socket_info *xsk)
+static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
 {
-	for (;;) {
-		unsigned int rcvd, i;
-		u32 idx_rx = 0, idx_tx = 0;
-		int ret;
+	unsigned int rcvd, i;
+	u32 idx_rx = 0, idx_tx = 0;
+	int ret;
 
-		for (;;) {
-			complete_tx_l2fwd(xsk);
+	complete_tx_l2fwd(xsk, fds);
 
-			rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE,
-						   &idx_rx);
-			if (rcvd > 0)
-				break;
-		}
+	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE,
+				   &idx_rx);
+	if (!rcvd)
+		return;
 
+	ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
+	while (ret != rcvd) {
+		if (ret < 0)
+			exit_with_error(-ret);
+		if (opt_busy_poll)
+			ret = poll(fds, num_socks, 0);
 		ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
-		while (ret != rcvd) {
-			if (ret < 0)
-				exit_with_error(-ret);
-			ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
-		}
+	}
+
+	for (i = 0; i < rcvd; i++) {
+		u64 addr = xsk_ring_cons__rx_desc(&xsk->rx,
+						  idx_rx)->addr;
+		u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
+						 idx_rx++)->len;
+		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
 
-		for (i = 0; i < rcvd; i++) {
-			u64 addr = xsk_ring_cons__rx_desc(&xsk->rx,
-							  idx_rx)->addr;
-			u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
-							 idx_rx++)->len;
-			char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+		swap_mac_addresses(pkt);
 
-			swap_mac_addresses(pkt);
+		hex_dump(pkt, len, addr);
+		xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
+		xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
+	}
 
-			hex_dump(pkt, len, addr);
-			xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
-			xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
-		}
+	xsk_ring_prod__submit(&xsk->tx, rcvd);
+	xsk_ring_cons__release(&xsk->rx, rcvd);
 
-		xsk_ring_prod__submit(&xsk->tx, rcvd);
-		xsk_ring_cons__release(&xsk->rx, rcvd);
+	xsk->rx_npkts += rcvd;
+	xsk->outstanding_tx += rcvd;
+}
 
-		xsk->rx_npkts += rcvd;
-		xsk->outstanding_tx += rcvd;
+static void l2fwd_all(void)
+{
+	struct pollfd fds[MAX_SOCKS];
+	int i, ret;
+
+	memset(fds, 0, sizeof(fds));
+
+	for (i = 0; i < num_socks; i++) {
+		fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
+		fds[i].events = POLLOUT | POLLIN;
+	}
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, num_socks, 0);
+			if (ret <= 0)
+				continue;
+		}
+
+		for (i = 0; i < num_socks; i++)
+			l2fwd(xsks[i], fds);
 	}
 }
 
@@ -693,9 +740,9 @@ int main(int argc, char **argv)
 	if (opt_bench == BENCH_RXDROP)
 		rx_drop_all();
 	else if (opt_bench == BENCH_TXONLY)
-		tx_only(xsks[0]);
+		tx_only_all();
 	else
-		l2fwd(xsks[0]);
+		l2fwd_all();
 
 	return 0;
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP
  2019-05-02  8:39 ` [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP Magnus Karlsson
@ 2019-05-03  0:13   ` Samudrala, Sridhar
  2019-05-03  6:35     ` Magnus Karlsson
  0 siblings, 1 reply; 24+ messages in thread
From: Samudrala, Sridhar @ 2019-05-03  0:13 UTC (permalink / raw)
  To: Magnus Karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, kevin.laatz



On 5/2/2019 1:39 AM, Magnus Karlsson wrote:
> This patch introduces a new setsockopt that enables busy-poll for XDP
> sockets. It is called XDP_BUSY_POLL_BATCH_SIZE and takes batch size as
> an argument. A value between 1 and NAPI_WEIGHT (64) will turn it on, 0
> will turn it off and any other value will return an error. There is
> also a corresponding getsockopt implementation.

I think this socket option should also allow specifying a timeout value
when using blocking poll() calls. Or could we use SO_BUSY_POLL to
specify this timeout value?

> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>   include/uapi/linux/if_xdp.h | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index caed8b1..be28a78 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
>   #define XDP_UMEM_FILL_RING		5
>   #define XDP_UMEM_COMPLETION_RING	6
>   #define XDP_STATISTICS			7
> +#define XDP_BUSY_POLL_BATCH_SIZE	8
>   
>   struct xdp_umem_reg {
>   	__u64 addr; /* Start of packet data area */
> 


* Re: [RFC bpf-next 5/7] net: add busy-poll support for XDP sockets
  2019-05-02  8:39 ` [RFC bpf-next 5/7] net: add busy-poll support for XDP sockets Magnus Karlsson
@ 2019-05-03  0:23   ` Samudrala, Sridhar
  0 siblings, 0 replies; 24+ messages in thread
From: Samudrala, Sridhar @ 2019-05-03  0:23 UTC (permalink / raw)
  To: Magnus Karlsson, bjorn.topel, ast, daniel, netdev, brouer
  Cc: bpf, bruce.richardson, ciara.loftus, jakub.kicinski, xiaolong.ye,
	qi.z.zhang, maximmi, kevin.laatz



On 5/2/2019 1:39 AM, Magnus Karlsson wrote:
> This patch adds busy-poll support for XDP sockets (AF_XDP). With
> busy-poll, the driver is executed in process context by calling the
> poll() syscall. The main advantage with this is that all processing
> occurs on a single core. This eliminates the core-to-core cache
> transfers that, without busy-poll, occur between the application and
> the softirq processing on another core. From a
> systems point of view, it also provides the advantage that we do not
> have to provision extra cores in the system to handle
> ksoftirqd/softirq processing, as all processing is done on the single
> core that executes the application. The drawback of busy-poll is that
> max throughput seen from a single application will be lower (due to
> the syscall), but on a per core basis it will often be higher as the
> normal mode runs on two cores and busy-poll on a single one.
> 
> The semantics of busy-poll from the application point of view are the
> following:
> 
> * The application is required to call poll() to drive rx and tx
>    processing. There is no guarantee that softirq and interrupts will
>    do this for you.
> 
> * It should be enabled on a per socket basis. No global enablement, i.e.
>    the XDP socket busy-poll will not care about the current
>    /proc/sys/net/core/busy_poll and busy_read global enablement
>    mechanisms.
> 
> * The batch size (how many packets that are processed every time the
>    napi function in the driver is called, i.e. the weight parameter)
>    should be configurable. Currently, the busy-poll size of AF_INET
>    sockets is set to 8, but for AF_XDP sockets this is too small as the
>    amount of processing per packet is much smaller with AF_XDP. This
>    should be configurable on a per socket basis.
> 
> * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
>    call the napi contexts of all of them should be executed. This is in
>    contrast to the AF_INET busy-poll that quits after the first one that
>    finds any packets. We need all napi contexts to be executed due to
>    the first requirement in this list. The behaviour we want is much more
>    like regular sockets in that all of them are checked in the poll
>    call.
> 
> * Should be possible to mix AF_XDP busy-poll sockets with any other
>    sockets including busy-poll AF_INET ones in a single poll() call
>    without any change to semantics or the behaviour of any of those
>    socket types.
> 
> * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
>    mode return POLLERR if the fill ring is empty or the completion
>    queue is full.
> 
> Busy-poll support is enabled by calling a new setsockopt called
> XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> any other value will return an error.
> 
> A typical packet processing rxdrop loop with busy-poll will look something
> like this:
> 
> for (i = 0; i < num_socks; i++) {
>      fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
>      fds[i].events = POLLIN;
> }
> 
> for (;;) {
>      ret = poll(fds, num_socks, 0);
>      if (ret <= 0)
>              continue;
> 
>      for (i = 0; i < num_socks; i++)
>          rx_drop(xsks[i], fds); /* The actual application */
> }
> 
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> ---
>   include/net/xdp_sock.h |   3 ++
>   net/xdp/Kconfig        |   1 +
>   net/xdp/xsk.c          | 122 ++++++++++++++++++++++++++++++++++++++++++++++++-
>   net/xdp/xsk_queue.h    |  18 ++++++--
>   4 files changed, 138 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index d074b6d..2e956b37 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -57,7 +57,10 @@ struct xdp_sock {
>   	struct net_device *dev;
>   	struct xdp_umem *umem;
>   	struct list_head flush_node;
> +	unsigned int napi_id_rx;
> +	unsigned int napi_id_tx;
>   	u16 queue_id;
> +	u16 bp_batch_size;
>   	struct xsk_queue *tx ____cacheline_aligned_in_smp;
>   	struct list_head list;
>   	bool zc;
> diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
> index 0255b33..219baaa 100644
> --- a/net/xdp/Kconfig
> +++ b/net/xdp/Kconfig
> @@ -1,6 +1,7 @@
>   config XDP_SOCKETS
>   	bool "XDP sockets"
>   	depends on BPF_SYSCALL
> +	select NET_RX_BUSY_POLL
>   	default n
>   	help
>   	  XDP sockets allows a channel between XDP programs and
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index a14e886..bd3d0fe 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -22,6 +22,7 @@
>   #include <linux/net.h>
>   #include <linux/netdevice.h>
>   #include <linux/rculist.h>
> +#include <net/busy_poll.h>
>   #include <net/xdp_sock.h>
>   #include <net/xdp.h>
>   
> @@ -302,16 +303,107 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>   	return (xs->zc) ? xsk_zc_xmit(sk) : xsk_generic_xmit(sk, m, total_len);
>   }
>   
> +static unsigned int xsk_check_rx_poll_err(struct xdp_sock *xs)
> +{
> +	return xskq_consumer_empty(xs->umem->fq) ? POLLERR : 0;
> +}
> +
> +static unsigned int xsk_check_tx_poll_err(struct xdp_sock *xs)
> +{
> +	return xskq_producer_full(xs->umem->cq) ? POLLERR : 0;
> +}
> +
> +static bool xsk_busy_loop_end(void *p, unsigned long start_time)
> +{
> +	return true;

This function should be updated to return true on busy-poll timeout, or
it should check for xskq_full_desc() on TX and xskq_empty_desc() on RX.

> +}
> +


* Re: [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP
  2019-05-03  0:13   ` Samudrala, Sridhar
@ 2019-05-03  6:35     ` Magnus Karlsson
  0 siblings, 0 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-03  6:35 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Magnus Karlsson, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jesper Dangaard Brouer,
	bpf, bruce.richardson, ciara.loftus, Jakub Kicinski, Ye Xiaolong,
	Zhang, Qi Z, Maxim Mikityanskiy, kevin.laatz, ilias.apalodimas

On Fri, May 3, 2019 at 2:26 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
>
>
> On 5/2/2019 1:39 AM, Magnus Karlsson wrote:
> > This patch introduces a new setsockopt that enables busy-poll for XDP
> > sockets. It is called XDP_BUSY_POLL_BATCH_SIZE and takes batch size as
> > an argument. A value between 1 and NAPI_WEIGHT (64) will turn it on, 0
> > will turn it off and any other value will return an error. There is
> > also a corresponding getsockopt implementation.
>
> I think this socket option should also allow specifying a timeout value
> when using blocking poll() calls.
> OR can we use SO_BUSY_POLL to specify this timeout value?

I think you are correct in that we need to be able to specify the
timeout. The current approach of always having a timeout of zero was
optimized for the high throughput case. But Ilias and others often
talk about using AF_XDP for time sensitive networking, and in that
case spinning in the kernel (for a max period of the timeout) waiting
for a packet would provide better latency. And with a configurable
value, we could support both cases, so why not.

I will add the timeout value to the new setsockopt I introduced, so it
will take both a batch size and a timeout value. I will also call it
something else since it should not have batch_size in its name
anymore.
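
As a rough sketch, the combined option could carry both values in one
struct; all names below are placeholders of mine, nothing is final:

```c
#include <assert.h>

#define NAPI_WEIGHT 64	/* upper bound for the batch size */

/* Placeholder layout for the combined option: a batch size plus a
 * busy-poll timeout in microseconds. Purely illustrative. */
struct xdp_busy_poll_opts {
	unsigned int batch_size;	/* 0 disables busy-poll */
	unsigned int timeout_us;	/* 0 = do not spin waiting */
};

/* Validation mirroring the semantics in the cover letter: 0 turns
 * busy-poll off, 1..NAPI_WEIGHT turns it on, anything larger is an
 * error (returns -1 here; would be -EINVAL in the kernel). */
static int validate_busy_poll_opts(const struct xdp_busy_poll_opts *opts)
{
	if (opts->batch_size > NAPI_WEIGHT)
		return -1;
	return 0;
}
```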

Thanks: Magnus

> >
> > Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> > ---
> >   include/uapi/linux/if_xdp.h | 1 +
> >   1 file changed, 1 insertion(+)
> >
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index caed8b1..be28a78 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
> >   #define XDP_UMEM_FILL_RING          5
> >   #define XDP_UMEM_COMPLETION_RING    6
> >   #define XDP_STATISTICS                      7
> > +#define XDP_BUSY_POLL_BATCH_SIZE     8
> >
> >   struct xdp_umem_reg {
> >       __u64 addr; /* Start of packet data area */
> >


* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
                   ` (5 preceding siblings ...)
  2019-05-02  8:39 ` [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample Magnus Karlsson
@ 2019-05-06 16:31 ` Alexei Starovoitov
  2019-05-07 11:51   ` Magnus Karlsson
  6 siblings, 1 reply; 24+ messages in thread
From: Alexei Starovoitov @ 2019-05-06 16:31 UTC (permalink / raw)
  To: Magnus Karlsson; +Cc: bjorn.topel, daniel, netdev, bpf, jakub.kicinski, bsd

On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> This RFC proposes to add busy-poll support to AF_XDP sockets. With
> busy-poll, the driver is executed in process context by calling the
> poll() syscall. The main advantage with this is that all processing
> occurs on a single core. This eliminates the core-to-core cache
> transfers that, without busy-poll, occur between the application and
> the softirq processing on another core. From a
> systems point of view, it also provides the advantage that we do not
> have to provision extra cores in the system to handle
> ksoftirqd/softirq processing, as all processing is done on the single
> core that executes the application. The drawback of busy-poll is that
> max throughput seen from a single application will be lower (due to
> the syscall), but on a per core basis it will often be higher as
> the normal mode runs on two cores and busy-poll on a single one.
> 
> The semantics of busy-poll from the application point of view are the
> following:
> 
> * The application is required to call poll() to drive rx and tx
>   processing. There is no guarantee that softirq and interrupts will
>   do this for you. This is in contrast with the current
>   implementations of busy-poll that are opportunistic in the sense
>   that packets might be received/transmitted by busy-poll or
>   softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
>   loads just as with the current opportunistic implementations, but I would
>   like to get to a point where this is not the case for busy-poll
>   enabled XDP sockets, as this slows down performance considerably and
>   starts to use one more core for the softirq processing. The end goal
>   is for only poll() to drive the napi loop when busy-poll is enabled
>   on an AF_XDP socket. More about this later.)
> 
> * It should be enabled on a per socket basis. No global enablement, i.e.
>   the XDP socket busy-poll will not care about the current
>   /proc/sys/net/core/busy_poll and busy_read global enablement
>   mechanisms.
> 
> * The batch size (how many packets that are processed every time the
>   napi function in the driver is called, i.e. the weight parameter)
>   should be configurable. Currently, the busy-poll size of AF_INET
>   sockets is set to 8, but for AF_XDP sockets this is too small as the
>   amount of processing per packet is much smaller with AF_XDP. This
>   should be configurable on a per socket basis.
> 
> * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
>   call the napi contexts of all of them should be executed. This is in
>   contrast to the AF_INET busy-poll that quits after the first one that
>   finds any packets. We need all napi contexts to be executed due to
>   the first requirement in this list. The behaviour we want is much more
>   like regular sockets in that all of them are checked in the poll
>   call.
> 
> * Should be possible to mix AF_XDP busy-poll sockets with any other
>   sockets including busy-poll AF_INET ones in a single poll() call
>   without any change to semantics or the behaviour of any of those
>   socket types.
> 
> * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
>   mode return POLLERR if the fill ring is empty or the completion
>   queue is full.
> 
> Busy-poll support is enabled by calling a new setsockopt called
> XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> any other value will return an error.
> 
> A typical packet processing rxdrop loop with busy-poll will look something
> like this:
> 
> for (i = 0; i < num_socks; i++) {
>     fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
>     fds[i].events = POLLIN;
> }
> 
> for (;;) {
>     ret = poll(fds, num_socks, 0);
>     if (ret <= 0)
>             continue;
> 
>     for (i = 0; i < num_socks; i++)
>         rx_drop(xsks[i], fds); /* The actual application */
> }
> 
> Need some advice around this issue please:
> 
> In this patch set, softirq/ksoftirqd will kick in at high loads and
> render the busy poll support useless as the execution is now happening
> in the same way as without busy-poll support. Everything works from an
> application perspective but this defeats the purpose of the support
> and also consumes an extra core. What I would like to accomplish when

Not sure what you mean by 'extra core'.
The above poll+rx_drop is executed for every af_xdp socket
and there are N cpus processing exactly N af_xdp sockets.
Where is 'extra core'?
Are you suggesting a model where single core will be busy-polling
all af_xdp sockets? and then waking processing threads?
or single core will process all sockets?
I think the af_xdp model should be flexible and allow easy out-of-the-box
experience, but it should be optimized for 'ideal' user that
does the-right-thing from max packet-per-second point of view.
I thought we've already converged on the model where af_xdp hw rx queues
bind one-to-one to af_xdp sockets and user space pins processing
threads one-to-one to af_xdp sockets on corresponding cpus...
If so that's the model to optimize for on the kernel side
while keeping all other user configurations functional.

> XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> invoked for the traffic that goes to this socket. This way, we would
> get better performance on a per core basis and also get the same
> behaviour independent of load.

I suspect separate rx kthreads of af_xdp socket processing is necessary
with and without busy-poll exactly because of 'high load' case
you've described.
If we do this additional rx-kthread model why differentiate
between busy-polling and polling?

af_xdp rx queue is completely different from the stack rx queue because
of target dma address setup.
Using the stack's napi ksoftirqd threads for processing af_xdp queues
creates fairness issues. Isn't it better to have separate kthreads for
them and let the scheduler deal with fairness between af_xdp processing
and the stack?

> 
> To accomplish this, the driver would need to create a napi context
> containing the busy-poll enabled XDP socket queues (not shared with
> any other traffic) that do not have an interrupt associated with
> it.

why no interrupt?

> 
> Does this sound like an avenue worth pursuing and if so, should it be
> part of this RFC/PATCH or separate?
> 
> Epoll() is not supported at this point in time. Since it was designed
> for handling a very large number of sockets, I thought it was better
> to leave this for later if the need will arise.
> 
> To do:
> 
> * Move over all drivers to the new xdp_[rt]xq_info functions. If you
>   think this is the right way to go forward, I will move over
>   Mellanox, Netronome, etc for the proper patch.
> 
> * Performance measurements
> 
> * Test SW drivers like virtio, veth and tap. Actually, test anything
>   that is not i40e.
> 
> * Test multiple sockets of different types in the same poll() call
> 
> * Check bisectability of each patch
> 
> * Versioning of the libbpf interface since we add an entry to the
>   socket configuration struct
> 
> This RFC has been applied against commit 2b5bc3c8ebce ("r8169: remove manual autoneg restart workaround")
> 
> Structure of the patch set:
> Patch 1: Makes the busy poll batch size configurable inside the kernel
> Patch 2: Adds napi id to struct xdp_rxq_info, adds this to the
>          xdp_rxq_reg_info function and changes the interface and
> 	 implementation so that we only need a single copy of the
> 	 xdp_rxq_info struct that resides in _rx. Previously there was
> 	 another one in the driver, but now the driver uses the one in
> 	 _rx. All XDP enabled drivers are converted to these new
> 	 functions.
> Patch 3: Adds a struct xdp_txq_info with corresponding functions to
>          xdp_rxq_info that is used to store information about the tx
> 	 queue. The use of these are added to all AF_XDP enabled drivers.
> Patch 4: Introduce a new setsockopt/getsockopt in the uapi for
>          enabling busy_poll.
> Patch 5: Implements busy poll in the xsk code.
> Patch 6: Add busy-poll support to libbpf.
> Patch 7: Add busy-poll support to the xdpsock sample application.
> 
> Thanks: Magnus
> 
> Magnus Karlsson (7):
>   net: fs: make busy poll budget configurable in napi_busy_loop
>   net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and
>     add napi id
>   net: i40e: ixgbe: add xdp_txq_info structure
>   netdevice: introduce busy-poll setsockopt for AF_XDP
>   net: add busy-poll support for XDP sockets
>   libbpf: add busy-poll support to XDP sockets
>   samples/bpf: add busy-poll support to xdpsock sample
> 
>  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 -
>  drivers/net/ethernet/intel/i40e/i40e_main.c    |   8 +-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  37 ++++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +-
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c     |   2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |   2 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  48 ++++--
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c   |   2 +-
>  drivers/net/tun.c                              |  14 +-
>  drivers/net/veth.c                             |  10 +-
>  drivers/net/virtio_net.c                       |   8 +-
>  fs/eventpoll.c                                 |   5 +-
>  include/linux/netdevice.h                      |   1 +
>  include/net/busy_poll.h                        |   7 +-
>  include/net/xdp.h                              |  23 ++-
>  include/net/xdp_sock.h                         |   3 +
>  include/uapi/linux/if_xdp.h                    |   1 +
>  net/core/dev.c                                 |  43 ++----
>  net/core/xdp.c                                 | 103 ++++++++++---
>  net/xdp/Kconfig                                |   1 +
>  net/xdp/xsk.c                                  | 122 ++++++++++++++-
>  net/xdp/xsk_queue.h                            |  18 ++-
>  samples/bpf/xdpsock_user.c                     | 203 +++++++++++++++----------
>  tools/include/uapi/linux/if_xdp.h              |   1 +
>  tools/lib/bpf/xsk.c                            |  23 +--
>  tools/lib/bpf/xsk.h                            |   1 +
>  26 files changed, 495 insertions(+), 195 deletions(-)
> 
> --
> 2.7.4


* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-06 16:31 ` [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Alexei Starovoitov
@ 2019-05-07 11:51   ` Magnus Karlsson
  2019-05-07 18:24     ` Alexei Starovoitov
  0 siblings, 1 reply; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-07 11:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann,
	Network Development, bpf, Jakub Kicinski, Jonathan Lemon

On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > [...]
> >
> > A typical packet processing rxdrop loop with busy-poll will look something
> > like this:
> >
> > for (i = 0; i < num_socks; i++) {
> >     fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> >     fds[i].events = POLLIN;
> > }
> >
> > for (;;) {
> >     ret = poll(fds, num_socks, 0);
> >     if (ret <= 0)
> >             continue;
> >
> >     for (i = 0; i < num_socks; i++)
> >         rx_drop(xsks[i], fds); /* The actual application */
> > }
> >
> > Need some advice around this issue please:
> >
> > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > render the busy poll support useless as the execution is now happening
> > in the same way as without busy-poll support. Everything works from an
> > application perspective but this defeats the purpose of the support
> > and also consumes an extra core. What I would like to accomplish when
>
> Not sure what you mean by 'extra core' .
> The above poll+rx_drop is executed for every af_xdp socket
> and there are N cpus processing exactly N af_xdp sockets.
> Where is 'extra core'?
> Are you suggesting a model where single core will be busy-polling
> all af_xdp sockets? and then waking processing threads?
> or single core will process all sockets?
> I think the af_xdp model should be flexible and allow easy out-of-the-box
> experience, but it should be optimized for 'ideal' user that
> does the-right-thing from max packet-per-second point of view.
> I thought we've already converged on the model where af_xdp hw rx queues
> bind one-to-one to af_xdp sockets and user space pins processing
> threads one-to-one to af_xdp sockets on corresponding cpus...
> If so that's the model to optimize for on the kernel side
> while keeping all other user configurations functional.
>
> > XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> > invoked for the traffic that goes to this socket. This way, we would
> > get better performance on a per core basis and also get the same
> > behaviour independent of load.
>
> I suspect separate rx kthreads of af_xdp socket processing is necessary
> with and without busy-poll exactly because of 'high load' case
> you've described.
> If we do this additional rx-kthread model why differentiate
> between busy-polling and polling?
>
> af_xdp rx queue is completely different form stack rx queue because
> of target dma address setup.
> Using stack's napi ksoftirqd threads for processing af_xdp queues creates
> the fairness issues. Isn't it better to have separate kthreads for them
> and let scheduler deal with fairness among af_xdp processing and stack?

When using ordinary poll() on an AF_XDP socket, the application will
run on one core and the driver processing will run on another in
softirq/ksoftirqd context. (Either due to explicit core and irq
pinning or due to the scheduler or irqbalance moving the two threads
apart.) In AF_XDP busy-poll mode of this RFC, I would like the
application and the driver processing to occur on a single core, thus
there is no "extra" driver core involved that needs to be taken into
account when sizing and/or provisioning the system. The napi context
is in this mode invoked from syscall context when executing the poll
syscall from the application.

Executing the app and the driver on the same core could of course be
accomplished already today by pinning the application and the driver
interrupt to the same core, but that would not be that efficient due
to context switching between the two. A more efficient way would be to
call the napi loop from within the poll() syscall when you are inside
the kernel anyway. This is what the classical busy-poll mechanism
operating on AF_INET sockets does. Inside the poll() call, it executes
the napi context of the driver until it finds a packet (if it is rx)
and then returns to the application that then processes the packets. I
would like to adopt something quite similar for AF_XDP sockets. (Some
of the differences can be found at the top of the original post.)

From an API point of view with busy-poll of AF_XDP sockets, the user
would bind to a queue number and taskset its application to a specific
core and both the app and the driver execution would only occur on
that core. This is in my mind simpler than with regular poll or AF_XDP
using no syscalls on rx (i.e. current state), in which you bind to a
queue, taskset your application to a core and then you also have to
take care to route the interrupt of the queue you bound to to another
core that will execute the driver part in the kernel. So the model is
in both cases still one core - one socket - one napi. (Users can of
course create multiple sockets in an app if they desire.)
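
To make that concrete, the application-side pinning is just the usual
affinity call; a minimal sketch (nothing AF_XDP specific about it, and
the cpu number is arbitrary):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to a single cpu, as an application using
 * AF_XDP busy-poll would do, instead of also having to steer the
 * queue's irq to a second core. Returns 0 on success. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}
```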

The main reasons I would like to introduce busy-poll for AF_XDP
sockets are:

* It is simpler to provision, see arguments above. Both application
  and driver runs efficiently on the same core.

* It is faster (on a per core basis) since we do not have any core to
  core communication. All header and descriptor transfers between
  kernel and application are core local which is much
  faster. Scalability will also be better. E.g., 64 bytes desc + 64
  bytes packet header = 128 bytes per packet less on the interconnect
  between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores
  this will be ~400Gbit/s of interconnect traffic less with busy-poll.

* It provides a way to seamlessly replace user-space drivers in DPDK
  with Linux drivers in kernel space. (Do not think I have to argue
  why this is a good idea on this list ;-).) The DPDK model is that
  application and driver run on the same core since they are both in
  user space. If we can provide the same model (both running
  efficiently on the same core, NOT drivers in user-space) with
  AF_XDP, it is easy for DPDK users to make the switch. Compare this
  to the current way where there are both application cores and
  driver/ksoftirqd cores. If a systems builder had 12 cores in his
  appliance box and they had 12 instances of a DPDK app, one on each
  core, how would he/she reason when repartitioning between
  application and driver cores? 8 application cores and 4 driver
  cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
  simpler to migrate if we had an efficient way to run both of them on
  the same core.

Why no interrupt? That should have been: no interrupts enabled to
start with. We would like to avoid interrupts as much as possible
since when they trigger, we will revert to the non busy-poll model,
i.e. processing on two separate cores, and the advantages from above
will disappear. How to accomplish this?

* One way would be to create a napi context with the queue we have
  bound to but with no interrupt associated with it, or with that
  interrupt disabled. The socket would in that case only be able to
  receive and send packets when calling the poll() syscall. If you do
  not call poll(), you do not get any packets, nor are any packets
  sent. It would only be possible to support this with a poll()
  timeout value of zero. This would give the best performance.

* Maybe we could support timeout values >0 by re-enabling the interrupt
  at some point. When calling poll(), the core would invoke the napi
  context repeatedly with the interrupt of that napi disabled until it
  found a packet, but at most for a period up to the busy-poll
  timeout (like regular busy poll does today). If that times out, we
  go up to the regular timeout of the poll() call, enable the
  interrupt of the queue associated with the napi and put the process
  to sleep. Once woken up by an interrupt, the interrupt of the napi
  would be disabled again and control returned to the application. With
  this scheme we would process the vast majority of packets locally
  on a core with interrupts disabled and with good performance, and
  only when we have low load and are sleeping/waiting in poll() would
  we process some packets using interrupts, on the core that the
  interrupt has been bound to.

I will produce some performance numbers for the various options and
post them in a follow-up mail. We need some numbers to ground the
discussion.

/Magnus

> >
> > To accomplish this, the driver would need to create a napi context
> > containing the busy-poll enabled XDP socket queues (not shared with
> > any other traffic) that do not have an interrupt associated with
> > it.
>
> why no interrupt?
>
> >
> > Does this sound like an avenue worth pursuing and if so, should it be
> > part of this RFC/PATCH or separate?
> >
> > Epoll() is not supported at this point in time. Since it was designed
> > for handling a very large number of sockets, I thought it was better
> > to leave this for later if the need will arise.
> >
> > To do:
> >
> > * Move over all drivers to the new xdp_[rt]xq_info functions. If you
> >   think this is the right way to go forward, I will move over
> >   Mellanox, Netronome, etc for the proper patch.
> >
> > * Performance measurements
> >
> > * Test SW drivers like virtio, veth and tap. Actually, test anything
> >   that is not i40e.
> >
> > * Test multiple sockets of different types in the same poll() call
> >
> > * Check bisectability of each patch
> >
> > * Versioning of the libbpf interface since we add an entry to the
> >   socket configuration struct
> >
> > This RFC has been applied against commit 2b5bc3c8ebce ("r8169: remove manual autoneg restart workaround")
> >
> > Structure of the patch set:
> > Patch 1: Makes the busy poll batch size configurable inside the kernel
> > Patch 2: Adds napi id to struct xdp_rxq_info, adds this to the
> >          xdp_rxq_reg_info function and changes the interface and
> >        implementation so that we only need a single copy of the
> >        xdp_rxq_info struct that resides in _rx. Previously there was
> >        another one in the driver, but now the driver uses the one in
> >        _rx. All XDP enabled drivers are converted to these new
> >        functions.
> > Patch 3: Adds a struct xdp_txq_info with corresponding functions to
> >          xdp_rxq_info that is used to store information about the tx
> >        queue. The use of these are added to all AF_XDP enabled drivers.
> > Patch 4: Introduce a new setsockopt/getsockopt in the uapi for
> >          enabling busy_poll.
> > Patch 5: Implements busy poll in the xsk code.
> > Patch 6: Add busy-poll support to libbpf.
> > Patch 7: Add busy-poll support to the xdpsock sample application.
> >
> > Thanks: Magnus
> >
> > Magnus Karlsson (7):
> >   net: fs: make busy poll budget configurable in napi_busy_loop
> >   net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and
> >     add napi id
> >   net: i40e: ixgbe: add xdp_txq_info structure
> >   netdevice: introduce busy-poll setsockopt for AF_XDP
> >   net: add busy-poll support for XDP sockets
> >   libbpf: add busy-poll support to XDP sockets
> >   samples/bpf: add busy-poll support to xdpsock sample
> >
> >  drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   2 -
> >  drivers/net/ethernet/intel/i40e/i40e_main.c    |   8 +-
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.c    |  37 ++++-
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.h    |   2 +-
> >  drivers/net/ethernet/intel/i40e/i40e_xsk.c     |   2 +-
> >  drivers/net/ethernet/intel/ixgbe/ixgbe.h       |   2 +-
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |  48 ++++--
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c   |   2 +-
> >  drivers/net/tun.c                              |  14 +-
> >  drivers/net/veth.c                             |  10 +-
> >  drivers/net/virtio_net.c                       |   8 +-
> >  fs/eventpoll.c                                 |   5 +-
> >  include/linux/netdevice.h                      |   1 +
> >  include/net/busy_poll.h                        |   7 +-
> >  include/net/xdp.h                              |  23 ++-
> >  include/net/xdp_sock.h                         |   3 +
> >  include/uapi/linux/if_xdp.h                    |   1 +
> >  net/core/dev.c                                 |  43 ++----
> >  net/core/xdp.c                                 | 103 ++++++++++---
> >  net/xdp/Kconfig                                |   1 +
> >  net/xdp/xsk.c                                  | 122 ++++++++++++++-
> >  net/xdp/xsk_queue.h                            |  18 ++-
> >  samples/bpf/xdpsock_user.c                     | 203 +++++++++++++++----------
> >  tools/include/uapi/linux/if_xdp.h              |   1 +
> >  tools/lib/bpf/xsk.c                            |  23 +--
> >  tools/lib/bpf/xsk.h                            |   1 +
> >  26 files changed, 495 insertions(+), 195 deletions(-)
> >
> > --
> > 2.7.4

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-07 11:51   ` Magnus Karlsson
@ 2019-05-07 18:24     ` Alexei Starovoitov
  2019-05-08 12:10       ` Magnus Karlsson
  2019-05-13 20:42       ` Jonathan Lemon
  0 siblings, 2 replies; 24+ messages in thread
From: Alexei Starovoitov @ 2019-05-07 18:24 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann,
	Network Development, bpf, Jakub Kicinski, Jonathan Lemon

On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote:
> On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > > busy-poll, the driver is executed in process context by calling the
> > > poll() syscall. The main advantage with this is that all processing
> > > occurs on a single core. This eliminates the core-to-core cache
> > > transfers that occur between the application and the softirqd
> > > processing on another core, that occurs without busy-poll. From a
> > > systems point of view, it also provides an advantage that we do not
> > > have to provision extra cores in the system to handle
> > > ksoftirqd/softirq processing, as all processing is done on the single
> > > core that executes the application. The drawback of busy-poll is that
> > > max throughput seen from a single application will be lower (due to
> > > the syscall), but on a per core basis it will often be higher as
> > > the normal mode runs on two cores and busy-poll on a single one.
> > >
> > > The semantics of busy-poll from the application point of view are the
> > > following:
> > >
> > > * The application is required to call poll() to drive rx and tx
> > >   processing. There is no guarantee that softirq and interrupts will
> > >   do this for you. This is in contrast with the current
> > >   implementations of busy-poll that are opportunistic in the sense
> > >   that packets might be received/transmitted by busy-poll or
> > >   softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > >   loads just as the current opportunistic implementations, but I would
> > >   like to get to a point where this is not the case for busy-poll
> > >   enabled XDP sockets, as this slows down performance considerably and
> > >   starts to use one more core for the softirq processing. The end goal
> > >   is for only poll() to drive the napi loop when busy-poll is enabled
> > >   on an AF_XDP socket. More about this later.)
> > >
> > > * It should be enabled on a per socket basis. No global enablement, i.e.
> > >   the XDP socket busy-poll will not care about the current
> > >   /proc/sys/net/core/busy_poll and busy_read global enablement
> > >   mechanisms.
> > >
> > > * The batch size (how many packets that are processed every time the
> > >   napi function in the driver is called, i.e. the weight parameter)
> > >   should be configurable. Currently, the busy-poll size of AF_INET
> > >   sockets is set to 8, but for AF_XDP sockets this is too small as the
> > >   amount of processing per packet is much smaller with AF_XDP. This
> > >   should be configurable on a per socket basis.
> > >
> > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > >   call the napi contexts of all of them should be executed. This is in
> > >   contrast to the AF_INET busy-poll that quits after the first one that
> > >   finds any packets. We need all napi contexts to be executed due to
> > >   the first requirement in this list. The behaviour we want is much more
> > >   like regular sockets in that all of them are checked in the poll
> > >   call.
> > >
> > > * Should be possible to mix AF_XDP busy-poll sockets with any other
> > >   sockets including busy-poll AF_INET ones in a single poll() call
> > >   without any change to semantics or the behaviour of any of those
> > >   socket types.
> > >
> > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > >   mode return POLLERR if the fill ring is empty or the completion
> > >   queue is full.
> > >
> > > Busy-poll support is enabled by calling a new setsockopt called
> > > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > > any other value will return an error.
> > >
> > > A typical packet processing rxdrop loop with busy-poll will look something
> > > like this:
> > >
> > > for (i = 0; i < num_socks; i++) {
> > >     fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > >     fds[i].events = POLLIN;
> > > }
> > >
> > > for (;;) {
> > >     ret = poll(fds, num_socks, 0);
> > >     if (ret <= 0)
> > >             continue;
> > >
> > >     for (i = 0; i < num_socks; i++)
> > >         rx_drop(xsks[i], fds); /* The actual application */
> > > }
> > >
> > > Need some advice around this issue please:
> > >
> > > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > > render the busy poll support useless as the execution is now happening
> > > in the same way as without busy-poll support. Everything works from an
> > > application perspective but this defeats the purpose of the support
> > > and also consumes an extra core. What I would like to accomplish when
> >
> > Not sure what you mean by 'extra core' .
> > The above poll+rx_drop is executed for every af_xdp socket
> > and there are N cpus processing exactly N af_xdp sockets.
> > Where is 'extra core'?
> > Are you suggesting a model where single core will be busy-polling
> > all af_xdp sockets? and then waking processing threads?
> > or single core will process all sockets?
> > I think the af_xdp model should be flexible and allow easy out-of-the-box
> > experience, but it should be optimized for 'ideal' user that
> > does the-right-thing from max packet-per-second point of view.
> > I thought we've already converged on the model where af_xdp hw rx queues
> > bind one-to-one to af_xdp sockets and user space pins processing
> > threads one-to-one to af_xdp sockets on corresponding cpus...
> > If so that's the model to optimize for on the kernel side
> > while keeping all other user configurations functional.
> >
> > > XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> > > invoked for the traffic that goes to this socket. This way, we would
> > > get better performance on a per core basis and also get the same
> > > behaviour independent of load.
> >
> > I suspect separate rx kthreads of af_xdp socket processing is necessary
> > with and without busy-poll exactly because of 'high load' case
> > you've described.
> > If we do this additional rx-kthread model why differentiate
> > between busy-polling and polling?
> >
> > af_xdp rx queue is completely different from stack rx queue because
> > of target dma address setup.
> > Using stack's napi ksoftirqd threads for processing af_xdp queues creates
> > the fairness issues. Isn't it better to have separate kthreads for them
> > and let scheduler deal with fairness among af_xdp processing and stack?
> 
> When using ordinary poll() on an AF_XDP socket, the application will
> run on one core and the driver processing will run on another in
> softirq/ksoftirqd context. (Either due to explicit core and irq
> pinning or due to the scheduler or irqbalance moving the two threads
> apart.) In AF_XDP busy-poll mode of this RFC, I would like the
> application and the driver processing to occur on a single core, thus
> there is no "extra" driver core involved that need to be taken into
> account when sizing and/or provisioning the system. The napi context
> is in this mode invoked from syscall context when executing the poll
> syscall from the application.
> 
> Executing the app and the driver on the same core could of course be
> accomplished already today by pinning the application and the driver
> interrupt to the same core, but that would not be that efficient due
> to context switching between the two. 

Have you benchmarked it?
I don't think context switch will be that noticeable when kpti is off.
napi processes 64 packets descriptors and switches back to user to
do payload processing of these packets.
I would think that the same job on two different cores would be
a bit more performant, with user code consuming close to 100%
and softirq a single-digit %. Say it's 10%.
I believe combining the two on single core is not 100 + 10 since
there is no cache bouncing. So Mpps from two cores setup will
reduce by 2-3% instead of 10%.
There is a cost of going to sleep and being woken up from poll(),
but 64 packets is probably large enough number to amortize.
If not, have you tried to bump napi budget to say 256 for af_xdp rx queues?
Busy-poll avoids sleep/wakeup overhead and probably can make
this scheme work with lower batching (like 64), but fundamentally
they're the same thing.
I'm not saying that we shouldn't do busy-poll. I'm saying it's
complementary, but in all cases a single core per af_xdp rx queue
with user thread pinning is preferred.

> A more efficient way would be to
> call the napi loop from within the poll() syscall when you are inside
> the kernel anyway. This is what the classical busy-poll mechanism
> operating on AF_INET sockets does. Inside the poll() call, it executes
> the napi context of the driver until it finds a packet (if it is rx)
> and then returns to the application that then processes the packets. I
> would like to adopt something quite similar for AF_XDP sockets. (Some
> of the differences can be found at the top of the original post.)
> 
> From an API point of view with busy-poll of AF_XDP sockets, the user
> would bind to a queue number and taskset its application to a specific
> core and both the app and the driver execution would only occur on
> that core. This is in my mind simpler than with regular poll or AF_XDP
> using no syscalls on rx (i.e. current state), in which you bind to a
> queue, taskset your application to a core and then you also have to
> take care to route the interrupt of the queue you bound to to another
> core that will execute the driver part in the kernel. So the model is
> in both cases still one core - one socket - one napi. (Users can of
> course create multiple sockets in an app if they desire.)
> 
> The main reasons I would like to introduce busy-poll for AF_XDP
> sockets are:
> 
> * It is simpler to provision, see arguments above. Both application
>   and driver runs efficiently on the same core.
> 
> * It is faster (on a per core basis) since we do not have any core to
>   core communication. All header and descriptor transfers between
>   kernel and application are core local which is much
>   faster. Scalability will also be better. E.g., 64 bytes desc + 64
>   bytes packet header = 128 bytes per packet less on the interconnect
>   between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores
>   this will be ~400Gbit/s of interconnect traffic less with busy-poll.

exactly. don't make cpu do this core-to-core stuff.
pin one rx to one core.

> * It provides a way to seamlessly replace user-space drivers in DPDK
>   with Linux drivers in kernel space. (Do not think I have to argue
>   why this is a good idea on this list ;-).) The DPDK model is that
>   application and driver run on the same core since they are both in
>   user space. If we can provide the same model (both running
>   efficiently on the same core, NOT drivers in user-space) with
>   AF_XDP, it is easy for DPDK users to make the switch. Compare this
>   to the current way where there are both application cores and
>   driver/ksoftirqd cores. If a systems builder had 12 cores in his
>   appliance box and they had 12 instances of a DPDK app, one on each
>   core, how would he/she reason when repartitioning between
>   application and driver cores? 8 application cores and 4 driver
>   cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
>   simpler to migrate if we had an efficient way to run both of them on
>   the same core.
> 
> Why no interrupt? That should have been: no interrupts enabled to
> start with. We would like to avoid interrupts as much as possible
> since when they trigger, we will revert to the non busy-poll model,
> i.e. processing on two separate cores, and the advantages from above
> will disappear. How to accomplish this?
> 
> * One way would be to create a napi context with the queue we have
>   bound to but with no interrupt associated with it, or it being
>   disabled. The socket would in that case only be able to receive and
>   send packets when calling the poll() syscall. If you do not call
>   poll(), you do not get any packets, nor are any packets sent. It
>   would only be possible to support this with a poll() timeout value
>   of zero. This would have the best performance
> 
> * Maybe we could support timeout values >0 by re-enabling the interrupt
>   at some point. When calling poll(), the core would invoke the napi
>   context repeatedly with the interrupt of that napi disabled until it
>   found a packet, but max for a period of time up until the busy poll
>   timeout (like regular busy poll today does). If that times out, we
>   go up to the regular timeout of the poll() call and enable
>   interrupts of the queue associated with the napi and put the process
>   to sleep. Once woken up by an interrupt, the interrupt of the napi
>   would be disabled again and control returned to the application. We
>   would with this scheme process the vast majority of packets locally
>   on a core with interrupts disabled and with good performance and
>   only when we have low load and are sleeping/waiting in poll would we
>   process some packets using interrupts on the core that the
>   interrupt has been bound to.

I think both 'no interrupt' solutions are challenging for users.
Stack rx queues and af_xdp rx queues should look almost the same from
napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
to work with both poll and busy-poll. The only difference between
poll and busy-poll will be the running context: new kthread vs user task.
If busy-poll drained the queue then new kthread napi has no work to do.
No irq approach could be marginally faster, but more error prone.
With the new kthread the user space will still work in all configurations.
Even when single user task is processing many af_xdp sockets.

I'm proposing new kthread only partially for performance reasons, but
mainly to avoid sharing stack rx and af_xdp queues within the same softirq.
Currently we share softirqd for stack napis for all NICs in the system,
but af_xdp depends on isolated processing.
Ideally we have rss into N queues for stack and rss into M af_xdp sockets.
The same host will receive traffic on both.
Even if we rss stack queues to one set of cpus and af_xdp to another
set of cpus, softirqds are doing work on all cpus.
A burst of 64 packets on stack queues or some other work in softirqd
will spike the latency for af_xdp queues if softirq is shared.
Hence the proposal for new napi_kthreads:
- user creates af_xdp socket and binds to _CPU_ X then
- driver allocates single af_xdp rq queue (queue ID doesn't need to be exposed)
- spawns kthread pinned to cpu X
- configures irq for that af_xdp queue to fire on cpu X
- user space with the help of libbpf pins its processing thread to that cpu X
- repeat above for as many af_xdp sockets as there are cpus
  (it's also ok to pick the same cpu X for a different af_xdp socket;
  then the new kthread is shared)
- user space configures hw to RSS to these set of af_xdp sockets.
  since ethtool api is a mess I propose to use af_xdp api to do this rss config

imo that would be the simplest and performant way of using af_xdp.
All configuration apis are under libbpf (or libxdp if we choose to fork it)
End result is one af_xdp rx queue - one napi - one kthread - one user thread.
All pinned to the same cpu with irq on that cpu.
Both poll and busy-poll approaches will not bounce data between cpus.
No 'shadow' queues to speak of and should solve the issues that
folks were bringing up in different threads.
How crazy does it sound?


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-07 18:24     ` Alexei Starovoitov
@ 2019-05-08 12:10       ` Magnus Karlsson
  2019-05-16 12:37         ` Magnus Karlsson
  2019-05-13 20:42       ` Jonathan Lemon
  1 sibling, 1 reply; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-08 12:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann,
	Network Development, bpf, Jakub Kicinski, Jonathan Lemon

On Tue, May 7, 2019 at 8:24 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote:
> > On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > > > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > > > busy-poll, the driver is executed in process context by calling the
> > > > poll() syscall. The main advantage with this is that all processing
> > > > occurs on a single core. This eliminates the core-to-core cache
> > > > transfers that occur between the application and the softirqd
> > > > processing on another core, that occurs without busy-poll. From a
> > > > systems point of view, it also provides an advantage that we do not
> > > > have to provision extra cores in the system to handle
> > > > ksoftirqd/softirq processing, as all processing is done on the single
> > > > core that executes the application. The drawback of busy-poll is that
> > > > max throughput seen from a single application will be lower (due to
> > > > the syscall), but on a per core basis it will often be higher as
> > > > the normal mode runs on two cores and busy-poll on a single one.
> > > >
> > > > The semantics of busy-poll from the application point of view are the
> > > > following:
> > > >
> > > > * The application is required to call poll() to drive rx and tx
> > > >   processing. There is no guarantee that softirq and interrupts will
> > > >   do this for you. This is in contrast with the current
> > > >   implementations of busy-poll that are opportunistic in the sense
> > > >   that packets might be received/transmitted by busy-poll or
> > > >   softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > > >   loads just as the current opportunistic implementations, but I would
> > > >   like to get to a point where this is not the case for busy-poll
> > > >   enabled XDP sockets, as this slows down performance considerably and
> > > >   starts to use one more core for the softirq processing. The end goal
> > > >   is for only poll() to drive the napi loop when busy-poll is enabled
> > > >   on an AF_XDP socket. More about this later.)
> > > >
> > > > * It should be enabled on a per socket basis. No global enablement, i.e.
> > > >   the XDP socket busy-poll will not care about the current
> > > >   /proc/sys/net/core/busy_poll and busy_read global enablement
> > > >   mechanisms.
> > > >
> > > > * The batch size (how many packets that are processed every time the
> > > >   napi function in the driver is called, i.e. the weight parameter)
> > > >   should be configurable. Currently, the busy-poll size of AF_INET
> > > >   sockets is set to 8, but for AF_XDP sockets this is too small as the
> > > >   amount of processing per packet is much smaller with AF_XDP. This
> > > >   should be configurable on a per socket basis.
> > > >
> > > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > > >   call the napi contexts of all of them should be executed. This is in
> > > >   contrast to the AF_INET busy-poll that quits after the first one that
> > > >   finds any packets. We need all napi contexts to be executed due to
> > > >   the first requirement in this list. The behaviour we want is much more
> > > >   like regular sockets in that all of them are checked in the poll
> > > >   call.
> > > >
> > > > * Should be possible to mix AF_XDP busy-poll sockets with any other
> > > >   sockets including busy-poll AF_INET ones in a single poll() call
> > > >   without any change to semantics or the behaviour of any of those
> > > >   socket types.
> > > >
> > > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > > >   mode return POLLERR if the fill ring is empty or the completion
> > > >   queue is full.
> > > >
> > > > Busy-poll support is enabled by calling a new setsockopt called
> > > > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> > > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > > > any other value will return an error.
> > > >
> > > > A typical packet processing rxdrop loop with busy-poll will look something
> > > > like this:
> > > >
> > > > for (i = 0; i < num_socks; i++) {
> > > >     fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > > >     fds[i].events = POLLIN;
> > > > }
> > > >
> > > > for (;;) {
> > > >     ret = poll(fds, num_socks, 0);
> > > >     if (ret <= 0)
> > > >             continue;
> > > >
> > > >     for (i = 0; i < num_socks; i++)
> > > >         rx_drop(xsks[i], fds); /* The actual application */
> > > > }
> > > >
> > > > Need some advice around this issue please:
> > > >
> > > > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > > > render the busy poll support useless as the execution is now happening
> > > > in the same way as without busy-poll support. Everything works from an
> > > > application perspective but this defeats the purpose of the support
> > > > and also consumes an extra core. What I would like to accomplish when
> > >
> > > Not sure what you mean by 'extra core' .
> > > The above poll+rx_drop is executed for every af_xdp socket
> > > and there are N cpus processing exactly N af_xdp sockets.
> > > Where is 'extra core'?
> > > Are you suggesting a model where single core will be busy-polling
> > > all af_xdp sockets? and then waking processing threads?
> > > or single core will process all sockets?
> > > I think the af_xdp model should be flexible and allow easy out-of-the-box
> > > experience, but it should be optimized for 'ideal' user that
> > > does the-right-thing from max packet-per-second point of view.
> > > I thought we've already converged on the model where af_xdp hw rx queues
> > > bind one-to-one to af_xdp sockets and user space pins processing
> > > threads one-to-one to af_xdp sockets on corresponding cpus...
> > > If so that's the model to optimize for on the kernel side
> > > while keeping all other user configurations functional.
> > >
> > > > XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> > > > invoked for the traffic that goes to this socket. This way, we would
> > > > get better performance on a per core basis and also get the same
> > > > behaviour independent of load.
> > >
> > > I suspect separate rx kthreads of af_xdp socket processing is necessary
> > > with and without busy-poll exactly because of 'high load' case
> > > you've described.
> > > If we do this additional rx-kthread model why differentiate
> > > between busy-polling and polling?
> > >
> > > af_xdp rx queue is completely different from stack rx queue because
> > > of target dma address setup.
> > > Using stack's napi ksoftirqd threads for processing af_xdp queues creates
> > > the fairness issues. Isn't it better to have separate kthreads for them
> > > and let scheduler deal with fairness among af_xdp processing and stack?
> >
> > When using ordinary poll() on an AF_XDP socket, the application will
> > run on one core and the driver processing will run on another in
> > softirq/ksoftirqd context. (Either due to explicit core and irq
> > pinning or due to the scheduler or irqbalance moving the two threads
> > apart.) In AF_XDP busy-poll mode of this RFC, I would like the
> > application and the driver processing to occur on a single core, thus
> > there is no "extra" driver core involved that need to be taken into
> > account when sizing and/or provisioning the system. The napi context
> > is in this mode invoked from syscall context when executing the poll
> > syscall from the application.
> >
> > Executing the app and the driver on the same core could of course be
> > accomplished already today by pinning the application and the driver
> > interrupt to the same core, but that would not be that efficient due
> > to context switching between the two.
>
> Have you benchmarked it?
> I don't think context switch will be that noticeable when kpti is off.
> napi processes 64 packets descriptors and switches back to user to
> do payload processing of these packets.
> I would think that the same job on two different cores would be
> a bit more performant, with user code consuming close to 100%
> and softirq a single-digit %. Say it's 10%.
> I believe combining the two on single core is not 100 + 10 since
> there is no cache bouncing. So Mpps from two cores setup will
> reduce by 2-3% instead of 10%.
> There is a cost of going to sleep and being woken up from poll(),
> but 64 packets is probably large enough number to amortize.
> If not, have you tried to bump napi budget to say 256 for af_xdp rx queues?
> Busy-poll avoids sleep/wakeup overhead and probably can make
> this scheme work with lower batching (like 64), but fundamentally
> they're the same thing.
> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complementary, but in all cases a single core per af_xdp rx queue
> with user thread pinning is preferred.
>
> > A more efficient way would be to
> > call the napi loop from within the poll() syscall when you are inside
> > the kernel anyway. This is what the classical busy-poll mechanism
> > operating on AF_INET sockets does. Inside the poll() call, it executes
> > the napi context of the driver until it finds a packet (if it is rx)
> > and then returns to the application that then processes the packets. I
> > would like to adopt something quite similar for AF_XDP sockets. (Some
> > of the differences can be found at the top of the original post.)
> >
> > From an API point of view with busy-poll of AF_XDP sockets, the user
> > would bind to a queue number and taskset its application to a specific
> > core and both the app and the driver execution would only occur on
> > that core. This is in my mind simpler than with regular poll or AF_XDP
> > using no syscalls on rx (i.e. current state), in which you bind to a
> > queue, taskset your application to a core and then you also have to
> > take care to route the interrupt of the queue you bound to to another
> > core that will execute the driver part in the kernel. So the model is
> > in both cases still one core - one socket - one napi. (Users can of
> > course create multiple sockets in an app if they desire.)
> >
> > The main reasons I would like to introduce busy-poll for AF_XDP
> > sockets are:
> >
> > * It is simpler to provision, see arguments above. Both application
> >   and driver runs efficiently on the same core.
> >
> > * It is faster (on a per core basis) since we do not have any core to
> >   core communication. All header and descriptor transfers between
> >   kernel and application are core local which is much
> >   faster. Scalability will also be better. E.g., 64 bytes desc + 64
> >   bytes packet header = 128 bytes per packet less on the interconnect
> >   between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores
> >   this will be ~400Gbit/s of interconnect traffic less with busy-poll.
>
> exactly. don't make cpu do this core-to-core stuff.
> pin one rx to one core.
>
> > * It provides a way to seamlessly replace user-space drivers in DPDK
> >   with Linux drivers in kernel space. (Do not think I have to argue
> >   why this is a good idea on this list ;-).) The DPDK model is that
> >   application and driver run on the same core since they are both in
> >   user space. If we can provide the same model (both running
> >   efficiently on the same core, NOT drivers in user-space) with
> >   AF_XDP, it is easy for DPDK users to make the switch. Compare this
> >   to the current way where there are both application cores and
> >   driver/ksoftirqd cores. If a systems builder had 12 cores in their
> >   appliance box and ran 12 instances of a DPDK app, one on each
> >   core, how would they reason when repartitioning between
> >   application and driver cores? 8 application cores and 4 driver
> >   cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
> >   simpler to migrate if we had an efficient way to run both of them on
> >   the same core.
> >
> > Why no interrupt? That should have been: no interrupts enabled to
> > start with. We would like to avoid interrupts as much as possible
> > since when they trigger, we will revert to the non busy-poll model,
> > i.e. processing on two separate cores, and the advantages from above
> > will disappear. How to accomplish this?
> >
> > * One way would be to create a napi context with the queue we have
> >   bound to but with no interrupt associated with it, or it being
> >   disabled. The socket would in that case only be able to receive and
> >   send packets when calling the poll() syscall. If you do not call
> >   poll(), you do not get any packets, nor are any packets sent. It
> >   would only be possible to support this with a poll() timeout value
> >   of zero. This would have the best performance.
> >
> > * Maybe we could support timeout values >0 by re-enabling the interrupt
> >   at some point. When calling poll(), the core would invoke the napi
> >   context repeatedly with the interrupt of that napi disabled until it
> >   found a packet, but max for a period of time up until the busy poll
> >   timeout (like regular busy poll today does). If that times out, we
> >   go up to the regular timeout of the poll() call and enable
> >   interrupts of the queue associated with the napi and put the process
> >   to sleep. Once woken up by an interrupt, the interrupt of the napi
> >   would be disabled again and control returned to the application. With
> >   this scheme, we would process the vast majority of packets locally on
> >   one core with interrupts disabled and with good performance; only at
> >   low load, while sleeping/waiting in poll, would we process some
> >   packets using interrupts on the core that the interrupt has been
> >   bound to.
>
> I think both 'no interrupt' solutions are challenging for users.
> Stack rx queues and af_xdp rx queues should look almost the same from
> napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
> to work with both poll and busy-poll. The only difference between
> poll and busy-poll will be the running context: new kthread vs user task.
> If busy-poll drained the queue then new kthread napi has no work to do.
> The no-irq approach could be marginally faster, but is more error prone.
> With the new kthread, user space will still work in all configurations,
> even when a single user task is processing many af_xdp sockets.
>
> I'm proposing new kthread only partially for performance reasons, but
> mainly to avoid sharing stack rx and af_xdp queues within the same softirq.
> Currently we share softirqd for stack napis for all NICs in the system,
> but af_xdp depends on isolated processing.
> Ideally we have rss into N queues for stack and rss into M af_xdp sockets.
> The same host will receive traffic on both.
> Even if we rss stack queues to one set of cpus and af_xdp to another set,
> softirqds are doing work on all cpus.
> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.
> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be exposed)
> - spawns kthread pinned to cpu X
> - configures irq for that af_xdp queue to fire on cpu X
> - user space with the help of libbpf pins its processing thread to that cpu X
> - repeat above for as many af_xdp sockets as there are cpus
>   (it's also ok to pick the same cpu X for different af_xdp sockets;
>   then the new kthread is shared)
> - user space configures hw to RSS to these set of af_xdp sockets.
>   since ethtool api is a mess I propose to use af_xdp api to do this rss config
>
> imo that would be the simplest and performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.
> How crazy does it sound?

Actually, it sounds remarkably sane :-). It will create something
quite similar to what I have been wanting, but you take it at least
two steps further. I had not thought about introducing a separate kthread
as a potential solution, and the user-space configuration of RSS (and
maybe other flow steering mechanisms) from AF_XDP is something Björn and
I have only talked about loosely. Anyway, I am producing performance
numbers for the options that we have talked about. I will get back to
you with them as soon as I have them and we can continue the
discussions based on those.

Thanks: Magnus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-07 18:24     ` Alexei Starovoitov
  2019-05-08 12:10       ` Magnus Karlsson
@ 2019-05-13 20:42       ` Jonathan Lemon
  2019-05-13 23:30         ` Samudrala, Sridhar
  2019-05-14  7:53         ` Björn Töpel
  1 sibling, 2 replies; 24+ messages in thread
From: Jonathan Lemon @ 2019-05-13 20:42 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann,
	Network Development, bpf, Jakub Kicinski

Tossing in my .02 cents:


I anticipate that most users of AF_XDP will want packet processing
for a given RX queue occurring on a single core - otherwise we end
up with cache delays.  The usual model is one thread, one socket,
one core, but this isn't enforced anywhere in the AF_XDP code; it is
up to the user to set this up.

On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
> I'm not saying that we shouldn't do busy-poll. I'm saying it's
> complementary, but in all cases single core per af_xdp rx queue
> with user thread pinning is preferred.

So I think we're on the same page here.

> Stack rx queues and af_xdp rx queues should look almost the same from
> napi point of view. Stack -> normal napi in softirq. af_xdp -> new 
> kthread
> to work with both poll and busy-poll. The only difference between
> poll and busy-poll will be the running context: new kthread vs user 
> task.
...
> A burst of 64 packets on stack queues or some other work in softirqd
> will spike the latency for af_xdp queues if softirq is shared.

True, but would it be shared?  This goes back to the current model,
which as used by Intel is:

    (channel == RX, TX, softirq)

MLX, on the other hand, wants:

    (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)

Which would indeed lead to sharing.  The more I look at the above, the
more I dislike it.  Perhaps this should be disallowed?

I believe there was some mention at LSF/MM that the 'channel' concept
was something specific to HW and really shouldn't be part of the SW API.

> Hence the proposal for new napi_kthreads:
> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be 
> exposed)
> - spawns kthread pinned to cpu X
> - configures irq for that af_xdp queue to fire on cpu X
> - user space with the help of libbpf pins its processing thread to 
> that cpu X
> > - repeat above for as many af_xdp sockets as there are cpus
> >   (it's also ok to pick the same cpu X for different af_xdp socket
> >   then the new kthread is shared)
> - user space configures hw to RSS to these set of af_xdp sockets.
>   since ethtool api is a mess I propose to use af_xdp api to do this 
> rss config


From a high-level point of view, this sounds quite sensible, but does
need some details ironed out.  The model above essentially enforces a
model of:

    (af_xdp = RX.af_xdp + bound_cpu)
      (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)

(temporarily ignoring TX for right now)


I foresee two issues with the above approach:
   1. hardware limitations in the number of queues/rings
   2. RSS/steering rules

> - user creates af_xdp socket and binds to _CPU_ X then
> - driver allocates single af_xdp rq queue (queue ID doesn't need to be 
> exposed)

Here, the driver may not be able to create an arbitrary RQ, but may
need to tear down/reuse an existing one used by the stack.  This may
not be an issue for modern hardware.

> - user space configures hw to RSS to these set of af_xdp sockets.
>   since ethtool api is a mess I propose to use af_xdp api to do this 
> rss config

Currently, RSS only steers default traffic.  On a system with shared
stack/af_xdp queues, there should be a way to split the traffic types,
unless we're talking about a model where all traffic goes to AF_XDP.

This classification has to be done by the NIC, since it comes before RSS
steering - which currently means sending flow match rules to the NIC,
which is less than ideal.  I agree that the ethtool interface is
non-optimal, but it does make things clear to the user what's going on.

Perhaps an af_xdp library that does some bookkeeping:
   - open af_xdp socket
   - define af_xdp_set as (classification, steering rules, other?)
   - bind socket to (cpu, af_xdp_set)
   - kernel:
     - pins calling thread to cpu
     - creates kthread if one doesn't exist, binds to irq and cpu
     - has driver create RQ.af_xdp, possibly replacing RQ.stack
     - applies (af_xdp_set) to NIC.

Seems workable, but a little complicated?  The complexity could be moved
into a separate library.


> imo that would be the simplest and performant way of using af_xdp.
> All configuration apis are under libbpf (or libxdp if we choose to 
> fork it)
> End result is one af_xdp rx queue - one napi - one kthread - one user 
> thread.
> All pinned to the same cpu with irq on that cpu.
> Both poll and busy-poll approaches will not bounce data between cpus.
> No 'shadow' queues to speak of and should solve the issues that
> folks were bringing up in different threads.

Sounds like a sensible model from my POV.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-13 20:42       ` Jonathan Lemon
@ 2019-05-13 23:30         ` Samudrala, Sridhar
  2019-05-14  7:53         ` Björn Töpel
  1 sibling, 0 replies; 24+ messages in thread
From: Samudrala, Sridhar @ 2019-05-13 23:30 UTC (permalink / raw)
  To: Jonathan Lemon, Alexei Starovoitov
  Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann,
	Network Development, bpf, Jakub Kicinski

On 5/13/2019 1:42 PM, Jonathan Lemon wrote:
> Tossing in my .02 cents:
> 
> 
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue occurring on a single core - otherwise we end
> up with cache delays.  The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code and is
> up to the user to set this up.

AF_XDP with busy-poll should allow a single thread to poll a given RX
queue and use a single core.

> 
> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
>> I'm not saying that we shouldn't do busy-poll. I'm saying it's
>> complementary, but in all cases single core per af_xdp rx queue
>> with user thread pinning is preferred.
> 
> So I think we're on the same page here.
> 
>> Stack rx queues and af_xdp rx queues should look almost the same from
>> napi point of view. Stack -> normal napi in softirq. af_xdp -> new
>> kthread
>> to work with both poll and busy-poll. The only difference between
>> poll and busy-poll will be the running context: new kthread vs user
>> task.
> ...
>> A burst of 64 packets on stack queues or some other work in softirqd
>> will spike the latency for af_xdp queues if softirq is shared.
> 
> True, but would it be shared?  This goes back to the current model,
> which
> as used by Intel is:
> 
>      (channel == RX, TX, softirq)
> 
> MLX, on the other hand, wants:
> 
>      (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
> 
> Which would indeed lead to sharing.  The more I look at the above, the
> more I dislike it.  Perhaps this should be disallowed?
> 
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
> 
>> Hence the proposal for new napi_kthreads:
>> - user creates af_xdp socket and binds to _CPU_ X then
>> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>> exposed)
>> - spawns kthread pinned to cpu X
>> - configures irq for that af_xdp queue to fire on cpu X
>> - user space with the help of libbpf pins its processing thread to
>> that cpu X
>> - repeat above for as many af_xdp sockets as there are cpus
>>    (it's also ok to pick the same cpu X for different af_xdp socket
>>    then the new kthread is shared)
>> - user space configures hw to RSS to these set of af_xdp sockets.
>>    since ethtool api is a mess I propose to use af_xdp api to do this
>> rss config
> 
> 
>   From a high level point of view, this sounds quite sensible, but does
> need
> some details ironed out.  The model above essentially enforces a model
> of:
> 
>      (af_xdp = RX.af_xdp + bound_cpu)
>        (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
> 
> (temporarily ignoring TX for right now)
> 
> 
> I foresee two issues with the above approach:
>     1. hardware limitations in the number of queues/rings
>     2. RSS/steering rules
> 
>> - user creates af_xdp socket and binds to _CPU_ X then
>> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>> exposed)
> 
> Here, the driver may not be able to create an arbitrary RQ, but may need
> to
> tear down/reuse an existing one used by the stack.  This may not be an
> issue
> for modern hardware.
> 
>> - user space configures hw to RSS to these set of af_xdp sockets.
>>    since ethtool api is a mess I propose to use af_xdp api to do this
>> rss config
> 
> Currently, RSS only steers default traffic.  On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
> 
> This classification has to be done by the NIC, since it comes before RSS
> steering - which currently means sending flow match rules to the NIC,
> which
> is less than ideal.  I agree that the ethtool interface is non optimal,
> but
> it does make things clear to the user what's going on.

'tc' provides another interface to split NIC queues into groups of
queues each with its own RSS. For ex:
tc qdisc add dev <i/f> root mqprio num_tc 3 map 0 1 2 queues 2@0 32@2 8@34 hw 1 mode channel
will split the NIC queues into 3 groups of 2, 32 and 8 queues.

By default all the packets go only to the first queue group with 2
queues. Filters can be added to redirect packets to the other queue groups.

tc filter add dev <i/f> protocol ip ingress prio 1 flower dst_ip 192.168.0.2 ip_proto tcp dst_port 1234 skip_sw hw_tc 1
tc filter add dev <i/f> protocol ip ingress prio 1 flower dst_ip 192.168.0.3 ip_proto tcp dst_port 1234 skip_sw hw_tc 2

Here hw_tc indicates the queue group.

It should be possible to run AF_XDP on queue group 3 by creating 8
af-xdp sockets and binding them to queues 34-41.

Does this look like a reasonable model to use a subset of nic queues for 
af-xdp applications?


> 
> Perhaps an af_xdp library that does some bookkeeping:
>     - open af_xdp socket
>     - define af_xdp_set as (classification, steering rules, other?)
>     - bind socket to (cpu, af_xdp_set)
>     - kernel:
>       - pins calling thread to cpu
>       - creates kthread if one doesn't exist, binds to irq and cpu
>       - has driver create RQ.af_xdp, possibly replacing RQ.stack
>       - applies (af_xdp_set) to NIC.
> 
> Seems workable, but a little complicated?  The complexity could be moved
> into a separate library.
> 
> 
>> imo that would be the simplest and performant way of using af_xdp.
>> All configuration apis are under libbpf (or libxdp if we choose to
>> fork it)
>> End result is one af_xdp rx queue - one napi - one kthread - one user
>> thread.
>> All pinned to the same cpu with irq on that cpu.
>> Both poll and busy-poll approaches will not bounce data between cpus.
>> No 'shadow' queues to speak of and should solve the issues that
>> folks were bringing up in different threads.
> 
> Sounds like a sensible model from my POV.
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-13 20:42       ` Jonathan Lemon
  2019-05-13 23:30         ` Samudrala, Sridhar
@ 2019-05-14  7:53         ` Björn Töpel
  1 sibling, 0 replies; 24+ messages in thread
From: Björn Töpel @ 2019-05-14  7:53 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Alexei Starovoitov, Magnus Karlsson, Björn Töpel,
	Daniel Borkmann, Network Development, bpf, Jakub Kicinski

On Mon, 13 May 2019 at 22:44, Jonathan Lemon <bsd@fb.com> wrote:
>
> Tossing in my .02 cents:
>
>
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue occurring on a single core - otherwise we end
> up with cache delays.  The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code and is
> up to the user to set this up.
>

Hmm, I definitely see use-cases where one would like multiple Rx
sockets per core, and say, multiple Tx sockets per core. Enforcing it
at the uapi is IMO not correct. (Maybe in libbpf, but that's another
thing.)

> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
> > I'm not saying that we shouldn't do busy-poll. I'm saying it's
> > complementary, but in all cases single core per af_xdp rx queue
> > with user thread pinning is preferred.
>
> So I think we're on the same page here.
>
> > Stack rx queues and af_xdp rx queues should look almost the same from
> > napi point of view. Stack -> normal napi in softirq. af_xdp -> new
> > kthread
> > to work with both poll and busy-poll. The only difference between
> > poll and busy-poll will be the running context: new kthread vs user
> > task.
> ...
> > A burst of 64 packets on stack queues or some other work in softirqd
> > will spike the latency for af_xdp queues if softirq is shared.
>
> True, but would it be shared?  This goes back to the current model,
> which
> as used by Intel is:
>
>     (channel == RX, TX, softirq)
>
> MLX, on the other hand, wants:
>
>     (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
>
> Which would indeed lead to sharing.  The more I look at the above, the
> more I dislike it.  Perhaps this should be disallowed?
>
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
>

I'm probably stating things people already know, but let I'll take a
detour here anyway, hijacking this thread for a queue rant.

AF_XDP sockets have two modes: zero-copy mode and copy-mode. A socket
has different flavors: Rx, Tx or both. Sockets with Rx flavors (Rx or
'both') can be attached to an Rx queue. Today, the only Rx queues are
the ones attached to the stack.

Zero-copy sockets with Rx flavors require hardware steering, and to be
useful, a mechanism to create a set of queues is needed. When stack
queues and AF_XDP sockets reside on a shared netdev, one way is to
create queues separated from the stack (how do we represent that to a
user?). Another way is creating a new netdev (say, a macvlan with
zero-copy support), and have all its Rx queues be represented by
AF_XDP sockets.

Copy-mode Rx sockets, OTOH, do not require steering. In the
copy-mode case, the XDP program is a switchboard where some packets
can go to the stack, some to user-space and some elsewhere.

So, what does AF_XDP need, that's not in place yet (from my perspective)?

* For zero-copy: a mechanism to create new sets of Rx queues, and a
mechanism direct flows (via, say, a new bpf hook)
* For zero-copy/copy: a mechanism to create new Tx queues, and from
AF_XDP select that queue to be used by a socket. This would be good
for the generic XDP redirect case as well.

Zero-copy AF_XDP Rx sockets are typically used when the hardware
supports that kind of steering, and typically a minimal XDP program
would then be used (if any, going forward). Copy-mode is for the
software fallback, where a more capable XDP program is needed. One
problem is that zero-copy and copy-mode behaves differently, so
copy-mode can't really be seen as a fallback to zero-copy today. In
copy-mode you cannot receive from Rx queue X, and redirect to socket
bound to queue Y (X != Y). In zero-copy mode, you just bind to a
queue, and do the redirection from configuration.

So, it would be nice with "unbound/anonymous queue sockets" or
"virtual/no queues ids", which I think is what Alexei is proposing?
Create a bunch of sockets. For copy-mode, the XDP program will do the
steering (receive from HW queue X, redirect to any socket). For
zero-copy, the configuration will solve that. The userspace
application doesn't have to change, which is a good abstraction. :-)
For the copy-mode it would be a performance hit (relaxing the SPSC
relationship), but maybe we can care about that later.

From my perspective, a mechanism to create Tx *and* Rx queues separate
from the stack is useful even outside the scope of AF_XDP. Create a
set of Rx queues and Tx queues, configure flows to those Rx queues
(via BPF?), and let an XDP program do, say, load-balancing using the
setup Tx queues. This makes sense without AF_XDP as well. The
anonymous queue path is OTOH simpler, but is an AF_XDP only
mechanism...

> > Hence the proposal for new napi_kthreads:
> > - user creates af_xdp socket and binds to _CPU_ X then
> > - driver allocates single af_xdp rq queue (queue ID doesn't need to be
> > exposed)
> > - spawns kthread pinned to cpu X
> > - configures irq for that af_xdp queue to fire on cpu X
> > - user space with the help of libbpf pins its processing thread to
> > that cpu X
> > - repeat above for as many af_xdp sockets as there are cpus
> >   (it's also ok to pick the same cpu X for different af_xdp socket
> >   then the new kthread is shared)
> > - user space configures hw to RSS to these set of af_xdp sockets.
> >   since ethtool api is a mess I propose to use af_xdp api to do this
> > rss config
>
>
>  From a high level point of view, this sounds quite sensible, but does
> need
> some details ironed out.  The model above essentially enforces a model
> of:
>
>     (af_xdp = RX.af_xdp + bound_cpu)
>       (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
>
> (temporarily ignoring TX for right now)
>

...and multiple Rx queues per core in there as well.

>
> I foresee two issues with the above approach:
>    1. hardware limitations in the number of queues/rings
>    2. RSS/steering rules
>
> > - user creates af_xdp socket and binds to _CPU_ X then
> > - driver allocates single af_xdp rq queue (queue ID doesn't need to be
> > exposed)
>
> Here, the driver may not be able to create an arbitrary RQ, but may need
> to
> tear down/reuse an existing one used by the stack.  This may not be an
> issue
> for modern hardware.
>

Again, let's focus on usability first. If the hardware cannot support
it efficiently, a software fallback. We don't want an API that's a
trash bin of different "stuff" hardware vendors want to put in because
they can. (Hi ethtool! ;-))

> > - user space configures hw to RSS to these set of af_xdp sockets.
> >   since ethtool api is a mess I propose to use af_xdp api to do this
> > rss config
>
> Currently, RSS only steers default traffic.  On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
>
> This classification has to be done by the NIC, since it comes before RSS
> steering - which currently means sending flow match rules to the NIC,
> which
> is less than ideal.  I agree that the ethtool interface is non optimal,
> but
> it does make things clear to the user what's going on.
>

I would also like to see a something else than ethtool, but not
limited to AF_XDP. Maybe a BPF configuration hook: Probe the HW for
capabilities; Missing support? Fallback and load an XDP program for
software emulation. Hardware support for BPF? Pass the "fallback" XDP
program to the hardware.

> Perhaps an af_xdp library that does some bookkeeping:
>    - open af_xdp socket
>    - define af_xdp_set as (classification, steering rules, other?)
>    - bind socket to (cpu, af_xdp_set)
>    - kernel:
>      - pins calling thread to cpu
>      - creates kthread if one doesn't exist, binds to irq and cpu
>      - has driver create RQ.af_xdp, possibly replacing RQ.stack
>      - applies (af_xdp_set) to NIC.
>
> Seems workable, but a little complicated?  The complexity could be moved
> into a separate library.
>

Yes. :-)


>
> > imo that would be the simplest and performant way of using af_xdp.
> > All configuration apis are under libbpf (or libxdp if we choose to
> > fork it)
> > End result is one af_xdp rx queue - one napi - one kthread - one user
> > thread.
> > All pinned to the same cpu with irq on that cpu.
> > Both poll and busy-poll approaches will not bounce data between cpus.
> > No 'shadow' queues to speak of and should solve the issues that
> > folks were bringing up in different threads.
>
> Sounds like a sensible model from my POV.

No "shadow queues", but AF_XDP-only queues. Maybe that's ok. OTOH the
XDP Tx queues are still there, and they cannot (today at least) be
configured.


Björn

> --
> Jonathan

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-08 12:10       ` Magnus Karlsson
@ 2019-05-16 12:37         ` Magnus Karlsson
  2019-05-16 23:50           ` Samudrala, Sridhar
  2019-05-17 18:20           ` Jakub Kicinski
  0 siblings, 2 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-16 12:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann,
	Network Development, bpf, Jakub Kicinski, Jonathan Lemon,
	Maciej Fijalkowski, Samudrala, Sridhar

On Wed, May 8, 2019 at 2:10 PM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Tue, May 7, 2019 at 8:24 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, May 07, 2019 at 01:51:45PM +0200, Magnus Karlsson wrote:
> > > On Mon, May 6, 2019 at 6:33 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Thu, May 02, 2019 at 10:39:16AM +0200, Magnus Karlsson wrote:
> > > > > This RFC proposes to add busy-poll support to AF_XDP sockets. With
> > > > > busy-poll, the driver is executed in process context by calling the
> > > > > poll() syscall. The main advantage with this is that all processing
> > > > > occurs on a single core. This eliminates the core-to-core cache
> > > > > transfers that occur between the application and the softirqd
> > > > > processing on another core, that occurs without busy-poll. From a
> > > > > systems point of view, it also provides an advantage that we do not
> > > > > have to provision extra cores in the system to handle
> > > > > ksoftirqd/softirq processing, as all processing is done on the single
> > > > > core that executes the application. The drawback of busy-poll is that
> > > > > max throughput seen from a single application will be lower (due to
> > > > > the syscall), but on a per core basis it will often be higher as
> > > > > the normal mode runs on two cores and busy-poll on a single one.
> > > > >
> > > > > The semantics of busy-poll from the application point of view are the
> > > > > following:
> > > > >
> > > > > * The application is required to call poll() to drive rx and tx
> > > > >   processing. There is no guarantee that softirq and interrupts will
> > > > >   do this for you. This is in contrast with the current
> > > > >   implementations of busy-poll that are opportunistic in the sense
> > > > >   that packets might be received/transmitted by busy-poll or
> > > > >   softirqd. (In this patch set, softirq/ksoftirqd will kick in at high
> > > > >   loads just as the current opportunistic implementations, but I would
> > > > >   like to get to a point where this is not the case for busy-poll
> > > > >   enabled XDP sockets, as this slows down performance considerably and
> > > > >   starts to use one more core for the softirq processing. The end goal
> > > > >   is for only poll() to drive the napi loop when busy-poll is enabled
> > > > >   on an AF_XDP socket. More about this later.)
> > > > >
> > > > > * It should be enabled on a per socket basis. No global enablement, i.e.
> > > > >   the XDP socket busy-poll will not care about the current
> > > > >   /proc/sys/net/core/busy_poll and busy_read global enablement
> > > > >   mechanisms.
> > > > >
> > > > > * The batch size (how many packets that are processed every time the
> > > > >   napi function in the driver is called, i.e. the weight parameter)
> > > > >   should be configurable. Currently, the busy-poll size of AF_INET
> > > > >   sockets is set to 8, but for AF_XDP sockets this is too small as the
> > > > >   amount of processing per packet is much smaller with AF_XDP. This
> > > > >   should be configurable on a per socket basis.
> > > > >
> > > > > * If you put multiple AF_XDP busy-poll enabled sockets into a poll()
> > > > >   call the napi contexts of all of them should be executed. This is in
> > > > >   contrast to the AF_INET busy-poll that quits after the fist one that
> > > > >   finds any packets. We need all napi contexts to be executed due to
> > > > >   the first requirement in this list. The behaviour we want is much more
> > > > >   like regular sockets in that all of them are checked in the poll
> > > > >   call.
> > > > >
> > > > > * Should be possible to mix AF_XDP busy-poll sockets with any other
> > > > >   sockets including busy-poll AF_INET ones in a single poll() call
> > > > >   without any change to semantics or the behaviour of any of those
> > > > >   socket types.
> > > > >
> > > > > * As suggested by Maxim Mikityanskiy, poll() will in the busy-poll
> > > > >   mode return POLLERR if the fill ring is empty or the completion
> > > > >   queue is full.
> > > > >
> > > > > Busy-poll support is enabled by calling a new setsockopt called
> > > > > XDP_BUSY_POLL_BATCH_SIZE that takes batch size as an argument. A value
> > > > > between 1 and NAPI_WEIGHT (64) will turn it on, 0 will turn it off and
> > > > > any other value will return an error.
> > > > >
> > > > > A typical packet processing rxdrop loop with busy-poll will look something
> > > > > like this:
> > > > >
> > > > > for (i = 0; i < num_socks; i++) {
> > > > >     fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> > > > >     fds[i].events = POLLIN;
> > > > > }
> > > > >
> > > > > for (;;) {
> > > > >     ret = poll(fds, num_socks, 0);
> > > > >     if (ret <= 0)
> > > > >             continue;
> > > > >
> > > > >     for (i = 0; i < num_socks; i++)
> > > > >         rx_drop(xsks[i], fds); /* The actual application */
> > > > > }
> > > > >
> > > > > Need some advice around this issue please:
> > > > >
> > > > > In this patch set, softirq/ksoftirqd will kick in at high loads and
> > > > > render the busy poll support useless as the execution is now happening
> > > > > in the same way as without busy-poll support. Everything works from an
> > > > > application perspective but this defeats the purpose of the support
> > > > > and also consumes an extra core. What I would like to accomplish when
> > > >
> > > > Not sure what you mean by 'extra core' .
> > > > The above poll+rx_drop is executed for every af_xdp socket
> > > > and there are N cpus processing exactly N af_xdp sockets.
> > > > Where is 'extra core'?
> > > > Are you suggesting a model where single core will be busy-polling
> > > > all af_xdp sockets? and then waking processing threads?
> > > > or single core will process all sockets?
> > > > I think the af_xdp model should be flexible and allow an easy out-of-the-box
> > > > experience, but it should be optimized for the 'ideal' user that
> > > > does the-right-thing from a max packets-per-second point of view.
> > > > I thought we've already converged on the model where af_xdp hw rx queues
> > > > bind one-to-one to af_xdp sockets and user space pins processing
> > > > threads one-to-one to af_xdp sockets on corresponding cpus...
> > > > If so that's the model to optimize for on the kernel side
> > > > while keeping all other user configurations functional.
> > > >
> > > > > XDP socket busy-poll is enabled is that softirq/ksoftirq is never
> > > > > invoked for the traffic that goes to this socket. This way, we would
> > > > > get better performance on a per core basis and also get the same
> > > > > behaviour independent of load.
> > > >
> > > > I suspect separate rx kthreads for af_xdp socket processing are necessary
> > > > with and without busy-poll, exactly because of the 'high load' case
> > > > you've described.
> > > > If we do this additional rx-kthread model why differentiate
> > > > between busy-polling and polling?
> > > >
> > > > af_xdp rx queue is completely different from stack rx queue because
> > > > of target dma address setup.
> > > > Using the stack's napi ksoftirqd threads for processing af_xdp queues creates
> > > > fairness issues. Isn't it better to have separate kthreads for them
> > > > and let scheduler deal with fairness among af_xdp processing and stack?
> > >
> > > When using ordinary poll() on an AF_XDP socket, the application will
> > > run on one core and the driver processing will run on another in
> > > softirq/ksoftirqd context. (Either due to explicit core and irq
> > > pinning or due to the scheduler or irqbalance moving the two threads
> > > apart.) In AF_XDP busy-poll mode of this RFC, I would like the
> > > application and the driver processing to occur on a single core, thus
> > > there is no "extra" driver core involved that needs to be taken into
> > > account when sizing and/or provisioning the system. The napi context
> > > is in this mode invoked from syscall context when executing the poll
> > > syscall from the application.
> > >
> > > Executing the app and the driver on the same core could of course be
> > > accomplished already today by pinning the application and the driver
> > > interrupt to the same core, but that would not be that efficient due
> > > to context switching between the two.
> >
> > Have you benchmarked it?
> > I don't think the context switch will be that noticeable when kpti is off.
> > napi processes 64 packet descriptors and switches back to user to
> > do payload processing of these packets.
> > I would think that the same job on two different cores would be
> > a bit more performant, with user code consuming close to 100%
> > and softirq a single digit %. Say it's 10%.
> > I believe combining the two on single core is not 100 + 10 since
> > there is no cache bouncing. So Mpps from two cores setup will
> > reduce by 2-3% instead of 10%.
> > There is a cost of going to sleep and being woken up from poll(),
> > but 64 packets is probably a large enough number to amortize it.
> > If not, have you tried to bump napi budget to say 256 for af_xdp rx queues?
> > Busy-poll avoids sleep/wakeup overhead and probably can make
> > this scheme work with lower batching (like 64), but fundamentally
> > they're the same thing.
> > I'm not saying that we shouldn't do busy-poll. I'm saying it's
> > complementary, but in all cases a single core per af_xdp rx queue
> > with user thread pinning is preferred.
> >
> > > A more efficient way would be to
> > > call the napi loop from within the poll() syscall when you are inside
> > > the kernel anyway. This is what the classical busy-poll mechanism
> > > operating on AF_INET sockets does. Inside the poll() call, it executes
> > > the napi context of the driver until it finds a packet (if it is rx)
> > > and then returns to the application that then processes the packets. I
> > > would like to adopt something quite similar for AF_XDP sockets. (Some
> > > of the differences can be found at the top of the original post.)
> > >
> > > From an API point of view with busy-poll of AF_XDP sockets, the user
> > > would bind to a queue number and taskset its application to a specific
> > > core and both the app and the driver execution would only occur on
> > > that core. This is in my mind simpler than with regular poll or AF_XDP
> > > using no syscalls on rx (i.e. current state), in which you bind to a
> > > queue, taskset your application to a core and then you also have to
> > > take care to route the interrupt of the queue you bound to to another
> > > core that will execute the driver part in the kernel. So the model is
> > > in both cases still one core - one socket - one napi. (Users can of
> > > course create multiple sockets in an app if they desire.)
> > >
> > > The main reasons I would like to introduce busy-poll for AF_XDP
> > > sockets are:
> > >
> > > * It is simpler to provision, see arguments above. Both application
> > >   and driver runs efficiently on the same core.
> > >
> > > * It is faster (on a per core basis) since we do not have any core to
> > >   core communication. All header and descriptor transfers between
> > >   kernel and application are core local which is much
> > >   faster. Scalability will also be better. E.g., 64 bytes desc + 64
> > >   bytes packet header = 128 bytes per packet less on the interconnect
> > >   between cores. At 20 Mpps/core, this is ~20Gbit/s and with 20 cores
> > >   this will be ~400Gbit/s of interconnect traffic less with busy-poll.
> >
> > exactly. don't make cpu do this core-to-core stuff.
> > pin one rx to one core.
> >
> > > * It provides a way to seamlessly replace user-space drivers in DPDK
> > >   with Linux drivers in kernel space. (Do not think I have to argue
> > >   why this is a good idea on this list ;-).) The DPDK model is that
> > >   application and driver run on the same core since they are both in
> > >   user space. If we can provide the same model (both running
> > >   efficiently on the same core, NOT drivers in user-space) with
> > >   AF_XDP, it is easy for DPDK users to make the switch. Compare this
> > >   to the current way where there are both application cores and
> > >   driver/ksoftirqd cores. If a systems builder had 12 cores in his
> > >   appliance box and they had 12 instances of a DPDK app, one on each
> > >   core, how would he/she reason when repartitioning between
> > >   application and driver cores? 8 application cores and 4 driver
> > >   cores, or 6 of each? Maybe it is also packet dependent? Etc. Much
> > >   simpler to migrate if we had an efficient way to run both of them on
> > >   the same core.
> > >
> > > Why no interrupt? That should have been: no interrupts enabled to
> > > start with. We would like to avoid interrupts as much as possible
> > > since when they trigger, we will revert to the non busy-poll model,
> > > i.e. processing on two separate cores, and the advantages from above
> > > will disappear. How to accomplish this?
> > >
> > > * One way would be to create a napi context with the queue we have
> > >   bound to but with no interrupt associated with it, or it being
> > >   disabled. The socket would in that case only be able to receive and
> > >   send packets when calling the poll() syscall. If you do not call
> > >   poll(), you do not get any packets, nor are any packets sent. It
> > >   would only be possible to support this with a poll() timeout value
> > >   of zero. This would have the best performance.
> > >
> > > * Maybe we could support timeout values >0 by re-enabling the interrupt
> > >   at some point. When calling poll(), the core would invoke the napi
> > >   context repeatedly with the interrupt of that napi disabled until it
> > >   found a packet, but max for a period of time up until the busy poll
> > >   timeout (like regular busy poll today does). If that times out, we
> > >   go up to the regular timeout of the poll() call and enable
> > >   interrupts of the queue associated with the napi and put the process
> > >   to sleep. Once woken up by an interrupt, the interrupt of the napi
> > >   would be disabled again and control returned to the application. We
> > >   would with this scheme process the vast majority of packets locally
> > >   on a core with interrupts disabled and with good performance and
> > >   only when we have low load and are sleeping/waiting in poll would we
> > >   process some packets using interrupts on the core that the
> > >   interrupt has been bound to.
> >
> > I think both 'no interrupt' solutions are challenging for users.
> > Stack rx queues and af_xdp rx queues should look almost the same from
> > napi point of view. Stack -> normal napi in softirq. af_xdp -> new kthread
> > to work with both poll and busy-poll. The only difference between
> > poll and busy-poll will be the running context: new kthread vs user task.
> > If busy-poll drained the queue then new kthread napi has no work to do.
> > A no-irq approach could be marginally faster, but more error prone.
> > With the new kthread, user space will still work in all configurations,
> > even when a single user task is processing many af_xdp sockets.
> >
> > I'm proposing the new kthread only partially for performance reasons, but
> > mainly to avoid sharing stack rx and af_xdp queues within the same softirq.
> > Currently we share softirqd for stack napis for all NICs in the system,
> > but af_xdp depends on isolated processing.
> > Ideally we have rss into N queues for stack and rss into M af_xdp sockets.
> > The same host will receive traffic on both.
> > Even if we rss stack queues to one set of cpus and af_xdp to another,
> > softirqds are doing work on all cpus.
> > A burst of 64 packets on stack queues or some other work in softirqd
> > will spike the latency for af_xdp queues if softirq is shared.
> > Hence the proposal for new napi_kthreads:
> > - user creates af_xdp socket and binds to _CPU_ X then
> > - driver allocates a single af_xdp rx queue (queue ID doesn't need to be exposed)
> > - spawns kthread pinned to cpu X
> > - configures irq for that af_xdp queue to fire on cpu X
> > - user space with the help of libbpf pins its processing thread to that cpu X
> > - repeat above for as many af_xdp sockets as there are cpus
> >   (it's also ok to pick the same cpu X for a different af_xdp socket;
> >   the new kthread is then shared)
> > - user space configures hw to RSS to these set of af_xdp sockets.
> >   since ethtool api is a mess I propose to use af_xdp api to do this rss config
> >
> > imo that would be the simplest and performant way of using af_xdp.
> > All configuration apis are under libbpf (or libxdp if we choose to fork it)
> > End result is one af_xdp rx queue - one napi - one kthread - one user thread.
> > All pinned to the same cpu with irq on that cpu.
> > Both poll and busy-poll approaches will not bounce data between cpus.
> > No 'shadow' queues to speak of and should solve the issues that
> > folks were bringing up in different threads.
> > How crazy does it sound?
>
> Actually, it sounds remarkably sane :-). It will create something
> quite similar to what I have been wanting, but you take it at least
> two steps further. I had not thought about introducing a separate
> kthread as a potential solution, and Björn and I have only talked
> loosely about configuring RSS (and maybe other flow steering
> mechanisms) from AF_XDP. Anyway, I am producing performance
> numbers for the options that we have talked about. I will get back to
> you with them as soon as I have them and we can continue the
> discussions based on those.
>
> Thanks: Magnus

After a number of surprises and issues in the driver, here is the
first set of results: 64-byte packets at 40 Gbit/s line rate, all
results in Mpps. Note that I just used my local system and kernel
build for these numbers, so they are not performance tuned. Jesper
would likely get better results on his setup :-). An explanation
follows after the table.

                                      Applications
method  cores  irqs        txpush        rxdrop      l2fwd
---------------------------------------------------------------
r-t-c     2     y           35.9          11.2        8.6
poll      2     y           34.2           9.4        8.3
r-t-c     1     y           18.1           N/A        6.2
poll      1     y           14.6           8.4        5.9
busypoll  2     y           31.9          10.5        7.9
busypoll  1     y           21.5           8.7        6.2
busypoll  1     n           22.0          10.3        7.3

r-t-c = Run-to-completion, the mode where Rx uses no syscalls and we
        only spin on the ring pointers.
poll = Use the regular syscall poll()
busypoll = Use the regular syscall poll() in busy-poll mode. The RFC I
           sent out.

cores == 2 means that softirq/ksoftirqd is on a different core from
           the application. 2 cores are consumed in total.
cores == 1 means that both softirq/ksoftirqd and the application run
           on the same core. Only 1 core is used in total.

irqs == 'y' is the normal case. irqs == 'n' means that I have created a
        new napi context with the AF_XDP queues inside that does not
        have any interrupts associated with it. No other traffic goes
        to this napi context.

N/A = This combination does not make sense since the application will
      never yield when using run-to-completion without any syscalls
      whatsoever. It works, but it crawls along in the 30 Kpps
      range. Creating huge rings would help, but I did not do that.

The applications are the ones from the xdpsock sample application in
samples/bpf/.

Some things I had to do to get these results:

* The current buffer allocation scheme in i40e, where we continuously
  try to access the fill queue until we find some entries, is not
  effective if we are on a single core. Instead, we try once and call
  a function that sets a flag. This flag is then checked in the xsk
  poll code, and if it is set we schedule napi so that it can try to
  allocate some buffers from the fill ring again. Note that this flag
  has to propagate all the way to user space so that the application
  knows that it has to call poll(). I currently set a flag in the Rx
  ring to indicate that the application should call poll() to resume
  the driver. This is similar to what the io_uring in the storage
  subsystem does. It is not enough to return POLLERR from poll() as
  that will only work for the case when we are using poll(). But I do
  that as well.

* Implemented Sridhar's suggestion on adding busy_loop_end callbacks
  that terminate the busy poll loop if the Rx queue is empty or the Tx
  queue is full.

* There is a race in the setup code in i40e when it is used with
  busy-poll. The fact that busy-poll calls the napi_busy_loop code
  before interrupts have been registered and enabled seems to trigger
  some bug where nothing gets transmitted. This only happens for
  busy-poll. Poll and run-to-completion only enters the napi loop of
  i40e by interrupts and only then after interrupts have been enabled,
  which is the last thing that is done after setup. I have just worked
  around it by introducing a sleep(1) in the application for these
  experiments. Ugly, but should not impact the numbers, I believe.

* The 1 core case is sensitive to the amount of work reported as done
  by the driver. This was not correct in the XDP code of i40e and
  led to bad performance. Now it reports the correct values for
  Rx. Note that i40e does not honor the napi budget on Tx; it sets
  that to 256, and this Tx work is not reported back to the napi
  infrastructure.

Some observations:

* Cannot really explain the drop in performance for txpush when going
  from 2 cores to 1. As stated before, the reporting of Tx work is not
  really propagated to the napi infrastructure. Tried reporting this
  in a correct manner (completely ignoring Rx for this experiment) but
  the results were the same. Will dig deeper into this to screen out
  any stupid mistakes.

* With the fixes above, all my driver processing is in softirq for 1
  core. It never goes over to ksoftirqd. Previously when work was
  reported incorrectly, this was the case. I would have liked
  ksoftirqd to take over as that would have been more like a separate
  thread. How to accomplish this? There might still be some reporting
  problem in the driver that hinders this, but I actually think it is
  more correct now.

* Looking at the current results for a single core, busy poll provides
  a 40% boost for Tx but only 5% for Rx. But if I instead create a
  napi context without any interrupt associated with it and drive that
  from busy-poll, I get a 15% - 20% performance improvement for Rx. Tx
  increases only marginally from the 40% improvement as there are few
  interrupts on Tx due to the completion interrupt bit being set quite
  infrequently. One question I have is: what am I breaking by creating
  a napi context not used by anyone else, only AF_XDP, that does not
  have an interrupt associated with it?

Todo:

* Explain the drop in Tx push when going from 2 cores to 1.

* Really run a separate thread for kernel processing instead of softirq.

* What other experiments would you like to see?

/Magnus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-16 12:37         ` Magnus Karlsson
@ 2019-05-16 23:50           ` Samudrala, Sridhar
  2019-05-17  7:50             ` Magnus Karlsson
  2019-05-17 18:20           ` Jakub Kicinski
  1 sibling, 1 reply; 24+ messages in thread
From: Samudrala, Sridhar @ 2019-05-16 23:50 UTC (permalink / raw)
  To: Magnus Karlsson, Alexei Starovoitov
  Cc: Magnus Karlsson, Björn Töpel, Daniel Borkmann,
	Network Development, bpf, Jakub Kicinski, Jonathan Lemon,
	Maciej Fijalkowski

On 5/16/2019 5:37 AM, Magnus Karlsson wrote:
> 
> [...]

Thanks for sharing the results.
For the busypoll tests, I guess you may have increased the busypoll
budget to 64. What is the busypoll timeout you are using?
Can you try a test that skips calling the bpf program for queues that
are associated with an af-xdp socket? I remember seeing a significant
bump in rxdrop performance with this change.
The other overhead I saw was with the dma_sync_single calls in the driver.

Thanks
Sridhar

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-16 23:50           ` Samudrala, Sridhar
@ 2019-05-17  7:50             ` Magnus Karlsson
  0 siblings, 0 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-17  7:50 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Alexei Starovoitov, Magnus Karlsson, Björn Töpel,
	Daniel Borkmann, Network Development, bpf, Jakub Kicinski,
	Jonathan Lemon, Maciej Fijalkowski

On Fri, May 17, 2019 at 1:50 AM Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
>
> On 5/16/2019 5:37 AM, Magnus Karlsson wrote:
> >
> > After a number of surprises and issues in the driver here are now the
> > first set of results. 64 byte packets at 40Gbit/s line rate. All
> > results in Mpps. Note that I just used my local system and kernel build
> > for these numbers so they are not performance tuned. Jesper would
> > likely get better results on his setup :-). Explanation follows after
> > the table.
> >
> >                                        Applications
> > method  cores  irqs        txpush        rxdrop      l2fwd
> > ---------------------------------------------------------------
> > r-t-c     2     y           35.9          11.2        8.6
> > poll      2     y           34.2           9.4        8.3
> > r-t-c     1     y           18.1           N/A        6.2
> > poll      1     y           14.6           8.4        5.9
> > busypoll  2     y           31.9          10.5        7.9
> > busypoll  1     y           21.5           8.7        6.2
> > busypoll  1     n           22.0          10.3        7.3
> >
> > r-t-c = Run-to-completion, the mode where we in Rx uses no syscalls
> >          and only spin on the pointers in the ring.
> > poll = Use the regular syscall poll()
> > busypoll = Use the regular syscall poll() in busy-poll mode. The RFC I
> >             sent out.
> >
> > cores == 2 means that softirq/ksoftirqd is one a different core from
> >             the application. 2 cores are consumed in total.
> > cores == 1 means that both softirq/ksoftirqd and the application runs
> >             on the same core. Only 1 core is used in total.
> >
> > irqs == 'y' is the normal case. irqs == 'n' means that I have created a
> >          new napi context with the AF_XDP queues inside that does not
> >          have any interrupts associated with it. No other traffic goes
> >          to this napi context.
> >
> > N/A = This combination does not make sense since the application will
> >        not yield due to run-to-completion without any syscalls
> >        whatsoever. It works, but it crawls in the 30 Kpps
> >        range. Creating huge rings would help, but did not do that.
> >
> > The applications are the ones from the xdpsock sample application in
> > samples/bpf/.
> >
> > Some things I had to do to get these results:
> >
> > * The current buffer allocation scheme in i40e where we continuously
> >    try to access the fill queue until we find some entries, is not
> >    effective if we are on a single core. Instead, we try once and call
> >    a function that sets a flag. This flag is then checked in the xsk
> >    poll code, and if it is set we schedule napi so that it can try to
> >    allocate some buffers from the fill ring again. Note that this flag
> >    has to propagate all the way to user space so that the application
> >    knows that it has to call poll(). I currently set a flag in the Rx
> >    ring to indicate that the application should call poll() to resume
> >    the driver. This is similar to what the io_uring in the storage
> >    subsystem does. It is not enough to return POLLERR from poll() as
> >    that will only work for the case when we are using poll(). But I do
> >    that as well.
> >
> > * Implemented Sridhar's suggestion on adding busy_loop_end callbacks
> >    that terminate the busy poll loop if the Rx queue is empty or the Tx
> >    queue is full.
> >
> > * There is a race in the setup code in i40e when it is used with
> >    busy-poll. The fact that busy-poll calls the napi_busy_loop code
> >    before interrupts have been registered and enabled seems to trigger
> >    some bug where nothing gets transmitted. This only happens for
> >    busy-poll. Poll and run-to-completion only enters the napi loop of
> >    i40e by interrupts and only then after interrupts have been enabled,
> >    which is the last thing that is done after setup. I have just worked
> >    around it by introducing a sleep(1) in the application for these
> >    experiments. Ugly, but should not impact the numbers, I believe.
> >
> > * The 1 core case is sensitive to the amount of work done reported
> >    from the driver. This was not correct in the XDP code of i40e and
> >    led to bad performance. Now it reports the correct values for
> >    Rx. Note that i40e does not honor the napi budget on Tx and sets
> >    that to 256, and these are not reported back to the napi
> >    library.
> >
> > Some observations:
> >
> > * Cannot really explain the drop in performance for txpush when going
> >    from 2 cores to 1. As stated before, the reporting of Tx work is not
> >    really propagated to the napi infrastructure. Tried reporting this
> >    in a correct manner (completely ignoring Rx for this experiment) but
> >    the results were the same. Will dig deeper into this to screen out
> >    any stupid mistakes.
> >
> > * With the fixes above, all my driver processing in the 1 core case
> >    happens in softirq context. It never goes over to ksoftirqd.
> >    Previously, when work was reported incorrectly, ksoftirqd did
> >    take over. I would have liked ksoftirqd to take over, as that
> >    would have been more like a separate thread. How do I accomplish
> >    this? There might still be some reporting problem in the driver
> >    that hinders this, but I actually think it is more correct now.
> >
> > * Looking at the current results for a single core, busy poll provides
> >    a 40% boost for Tx but only 5% for Rx. But if I instead create a
> >    napi context without any interrupt associated with it and drive that
> >    from busy-poll, I get a 15% - 20% performance improvement for Rx. Tx
> >    increases only marginally from the 40% improvement as there are few
> >    interrupts on Tx due to the completion interrupt bit being set quite
> >    infrequently. One question I have is: what am I breaking by creating
> >    a napi context not used by anyone else, only AF_XDP, that does not
> >    have an interrupt associated with it?
> >
> > Todo:
> >
> > * Explain the drop in Tx push when going from 2 cores to 1.
> >
> > * Really run a separate thread for kernel processing instead of softirq.
> >
> > * What other experiments would you like to see?
>
> Thanks for sharing the results.
> For the busy-poll tests, I guess you may have increased the busy-poll
> budget to 64.

Yes, I am using a batch size of 64 for all experiments, as
NAPI_POLL_WEIGHT is also 64. Note that the i40e driver batches 256
packets on Tx, as it ignores the budget parameter in the NAPI
function. Rx does honor the budget, though.

> What is the busypoll timeout you are using?

0, as I am slamming the system as hard as I can with packets. The CPU
is always at close to 100% due to this and there is always something
to do. With a busy-poll timeout value of 100, I see a performance
degradation between 2% (slowest rx) - 7% (fastest tx). But any other
value than 0 for the busy-poll timeout does not make much sense when I
am running the driver and the application on the same core. I am
better off trying to get into softirq/ksoftirqd quicker to get some
new packets and/or send my Tx ones.

Regular poll() uses a timeout of 1000 ms in the poll() syscall, as it
needs to yield to the driver processing. With a timeout of 0 there
are, to my surprise, some performance improvements of a couple of
percent. Looking at the code, the code path for a 0 timeout is
shorter, which might explain this.

> Can you try a test that skips calling bpf program for queues that are
> associated with af-xdp socket? I remember seeing a significant bump in
> rxdrop performance with this change.

Björn is working on this. In my mind, this should improve performance
much more than busy-poll.

> The other overhead i saw was with the dma_sync_single calls in the driver.

I will do a "perf top" and check out the bottlenecks in more detail.

Thanks: Magnus

> Thanks
> Sridhar

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-16 12:37         ` Magnus Karlsson
  2019-05-16 23:50           ` Samudrala, Sridhar
@ 2019-05-17 18:20           ` Jakub Kicinski
  2019-05-18  8:49             ` Magnus Karlsson
  1 sibling, 1 reply; 24+ messages in thread
From: Jakub Kicinski @ 2019-05-17 18:20 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Alexei Starovoitov, Magnus Karlsson, Björn Töpel,
	Daniel Borkmann, Network Development, bpf, Jonathan Lemon,
	Maciej Fijalkowski, Samudrala, Sridhar

On Thu, 16 May 2019 14:37:51 +0200, Magnus Karlsson wrote:
>                                       Applications
> method  cores  irqs        txpush        rxdrop      l2fwd
> ---------------------------------------------------------------
> r-t-c     2     y           35.9          11.2        8.6
> poll      2     y           34.2           9.4        8.3
> r-t-c     1     y           18.1           N/A        6.2
> poll      1     y           14.6           8.4        5.9
> busypoll  2     y           31.9          10.5        7.9
> busypoll  1     y           21.5           8.7        6.2
> busypoll  1     n           22.0          10.3        7.3

Thanks for the numbers!  One question that keeps coming to my mind 
is how do the cases compare on zero drop performance?

When I was experimenting with AF_XDP it seemed to be slightly more
prone to dropping packets than expected.  I wonder if you're seeing
a similar thing (well drops or back pressure to the traffic generator)?
Perhaps the single core busy poll would make a difference there?


* Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
  2019-05-17 18:20           ` Jakub Kicinski
@ 2019-05-18  8:49             ` Magnus Karlsson
  0 siblings, 0 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-18  8:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Magnus Karlsson, Björn Töpel,
	Daniel Borkmann, Network Development, bpf, Jonathan Lemon,
	Maciej Fijalkowski, Samudrala, Sridhar

On Fri, May 17, 2019 at 8:20 PM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Thu, 16 May 2019 14:37:51 +0200, Magnus Karlsson wrote:
> >                                       Applications
> > method  cores  irqs        txpush        rxdrop      l2fwd
> > ---------------------------------------------------------------
> > r-t-c     2     y           35.9          11.2        8.6
> > poll      2     y           34.2           9.4        8.3
> > r-t-c     1     y           18.1           N/A        6.2
> > poll      1     y           14.6           8.4        5.9
> > busypoll  2     y           31.9          10.5        7.9
> > busypoll  1     y           21.5           8.7        6.2
> > busypoll  1     n           22.0          10.3        7.3
>
> Thanks for the numbers!  One question that keeps coming to my mind
> is how do the cases compare on zero drop performance?
>
> When I was experimenting with AF_XDP it seemed to be slightly more
> prone to dropping packets than expected.  I wonder if you're seeing
> a similar thing (well drops or back pressure to the traffic generator)?
> Perhaps the single core busy poll would make a difference there?

Good question. I will run the experiments and see what we get.

/Magnus


* Re: [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample
  2019-05-02  8:39 ` [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample Magnus Karlsson
@ 2019-05-24  2:51   ` Ye Xiaolong
  2019-05-27  6:58     ` Magnus Karlsson
  0 siblings, 1 reply; 24+ messages in thread
From: Ye Xiaolong @ 2019-05-24  2:51 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: bjorn.topel, ast, daniel, netdev, brouer, bpf, bruce.richardson,
	ciara.loftus, jakub.kicinski, qi.z.zhang, maximmi,
	sridhar.samudrala, kevin.laatz

Hi, Magnus

On 05/02, Magnus Karlsson wrote:
>This patch adds busy-poll support to the xdpsock sample
>application. It is enabled by the "-b" or the "--busy-poll" command
>line options.
>
>Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
>---
> samples/bpf/xdpsock_user.c | 203 ++++++++++++++++++++++++++++-----------------
> 1 file changed, 125 insertions(+), 78 deletions(-)
>
>diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
>index d08ee1a..1272edf 100644
>--- a/samples/bpf/xdpsock_user.c
>+++ b/samples/bpf/xdpsock_user.c
>@@ -66,6 +66,7 @@ static const char *opt_if = "";
> static int opt_ifindex;
> static int opt_queue;
> static int opt_poll;
>+static int opt_busy_poll;
> static int opt_interval = 1;
> static u32 opt_xdp_bind_flags;
> static __u32 prog_id;
>@@ -119,8 +120,11 @@ static void print_benchmark(bool running)
> 	else
> 		printf("	");
> 
>-	if (opt_poll)
>+	if (opt_poll) {
>+		if (opt_busy_poll)
>+			printf("busy-");
> 		printf("poll() ");
>+	}
> 
> 	if (running) {
> 		printf("running...");
>@@ -306,7 +310,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
> 	xsk->umem = umem;
> 	cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> 	cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
>-	cfg.libbpf_flags = 0;

Is there any purpose in removing this line? Since cfg here is a local
variable, cfg.libbpf_flags can contain random data, which may lead to an
xdpsock failure with `Invalid no_argument`.

Thanks,
Xiaolong

>+	cfg.busy_poll = (opt_busy_poll ? BATCH_SIZE : 0);
> 	cfg.xdp_flags = opt_xdp_flags;
> 	cfg.bind_flags = opt_xdp_bind_flags;
> 	ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
>@@ -319,17 +323,17 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
> 		exit_with_error(-ret);
> 
> 	ret = xsk_ring_prod__reserve(&xsk->umem->fq,
>-				     XSK_RING_PROD__DEFAULT_NUM_DESCS,
>+				     1024,
> 				     &idx);
>-	if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS)
>+	if (ret != 1024)
> 		exit_with_error(-ret);
> 	for (i = 0;
>-	     i < XSK_RING_PROD__DEFAULT_NUM_DESCS *
>+	     i < 1024 *
> 		     XSK_UMEM__DEFAULT_FRAME_SIZE;
> 	     i += XSK_UMEM__DEFAULT_FRAME_SIZE)
> 		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = i;
> 	xsk_ring_prod__submit(&xsk->umem->fq,
>-			      XSK_RING_PROD__DEFAULT_NUM_DESCS);
>+			      1024);
> 
> 	return xsk;
> }
>@@ -341,6 +345,7 @@ static struct option long_options[] = {
> 	{"interface", required_argument, 0, 'i'},
> 	{"queue", required_argument, 0, 'q'},
> 	{"poll", no_argument, 0, 'p'},
>+	{"busy-poll", no_argument, 0, 'b'},
> 	{"xdp-skb", no_argument, 0, 'S'},
> 	{"xdp-native", no_argument, 0, 'N'},
> 	{"interval", required_argument, 0, 'n'},
>@@ -360,6 +365,7 @@ static void usage(const char *prog)
> 		"  -i, --interface=n	Run on interface n\n"
> 		"  -q, --queue=n	Use queue n (default 0)\n"
> 		"  -p, --poll		Use poll syscall\n"
>+		"  -b, --busy-poll	Use poll syscall with busy poll\n"
> 		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
> 		"  -N, --xdp-native=n	Enfore XDP native mode\n"
> 		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
>@@ -377,7 +383,7 @@ static void parse_command_line(int argc, char **argv)
> 	opterr = 0;
> 
> 	for (;;) {
>-		c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
>+		c = getopt_long(argc, argv, "Frtli:q:pbsSNn:cz", long_options,
> 				&option_index);
> 		if (c == -1)
> 			break;
>@@ -401,6 +407,10 @@ static void parse_command_line(int argc, char **argv)
> 		case 'p':
> 			opt_poll = 1;
> 			break;
>+		case 'b':
>+			opt_busy_poll = 1;
>+			opt_poll = 1;
>+			break;
> 		case 'S':
> 			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
> 			opt_xdp_bind_flags |= XDP_COPY;
>@@ -444,7 +454,8 @@ static void kick_tx(struct xsk_socket_info *xsk)
> 	exit_with_error(errno);
> }
> 
>-static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
>+static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
>+				     struct pollfd *fds)
> {
> 	u32 idx_cq = 0, idx_fq = 0;
> 	unsigned int rcvd;
>@@ -453,7 +464,8 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
> 	if (!xsk->outstanding_tx)
> 		return;
> 
>-	kick_tx(xsk);
>+	if (!opt_poll)
>+		kick_tx(xsk);
> 	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
> 		xsk->outstanding_tx;
> 
>@@ -467,6 +479,8 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
> 		while (ret != rcvd) {
> 			if (ret < 0)
> 				exit_with_error(-ret);
>+			if (opt_busy_poll)
>+				ret = poll(fds, num_socks, 0);
> 			ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
> 						     &idx_fq);
> 		}
>@@ -490,7 +504,8 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
> 	if (!xsk->outstanding_tx)
> 		return;
> 
>-	kick_tx(xsk);
>+	if (!opt_busy_poll)
>+		kick_tx(xsk);
> 
> 	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx);
> 	if (rcvd > 0) {
>@@ -500,10 +515,10 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
> 	}
> }
> 
>-static void rx_drop(struct xsk_socket_info *xsk)
>+static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
> {
>-	unsigned int rcvd, i;
> 	u32 idx_rx = 0, idx_fq = 0;
>+	unsigned int rcvd, i;
> 	int ret;
> 
> 	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
>@@ -514,6 +529,8 @@ static void rx_drop(struct xsk_socket_info *xsk)
> 	while (ret != rcvd) {
> 		if (ret < 0)
> 			exit_with_error(-ret);
>+		if (opt_busy_poll)
>+			ret = poll(fds, num_socks, 0);
> 		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
> 	}
> 
>@@ -533,43 +550,68 @@ static void rx_drop(struct xsk_socket_info *xsk)
> 
> static void rx_drop_all(void)
> {
>-	struct pollfd fds[MAX_SOCKS + 1];
>-	int i, ret, timeout, nfds = 1;
>+	struct pollfd fds[MAX_SOCKS];
>+	int i, ret;
> 
> 	memset(fds, 0, sizeof(fds));
> 
> 	for (i = 0; i < num_socks; i++) {
> 		fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> 		fds[i].events = POLLIN;
>-		timeout = 1000; /* 1sn */
> 	}
> 
> 	for (;;) {
> 		if (opt_poll) {
>-			ret = poll(fds, nfds, timeout);
>+			ret = poll(fds, num_socks, 0);
> 			if (ret <= 0)
> 				continue;
> 		}
> 
> 		for (i = 0; i < num_socks; i++)
>-			rx_drop(xsks[i]);
>+			rx_drop(xsks[i], fds);
>+	}
>+}
>+
>+static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb)
>+{
>+	u32 idx;
>+
>+	if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) ==
>+	    BATCH_SIZE) {
>+		unsigned int i;
>+
>+		for (i = 0; i < BATCH_SIZE; i++) {
>+			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
>+				= (frame_nb + i) <<
>+				XSK_UMEM__DEFAULT_FRAME_SHIFT;
>+			xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
>+				sizeof(pkt_data) - 1;
>+		}
>+
>+		xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
>+		xsk->outstanding_tx += BATCH_SIZE;
>+		frame_nb += BATCH_SIZE;
>+		frame_nb %= NUM_FRAMES;
> 	}
>+
>+	complete_tx_only(xsk);
> }
> 
>-static void tx_only(struct xsk_socket_info *xsk)
>+static void tx_only_all(void)
> {
>-	int timeout, ret, nfds = 1;
>-	struct pollfd fds[nfds + 1];
>-	u32 idx, frame_nb = 0;
>+	struct pollfd fds[MAX_SOCKS];
>+	u32 frame_nb[MAX_SOCKS] = {};
>+	int i, ret;
> 
> 	memset(fds, 0, sizeof(fds));
>-	fds[0].fd = xsk_socket__fd(xsk->xsk);
>-	fds[0].events = POLLOUT;
>-	timeout = 1000; /* 1sn */
>+	for (i = 0; i < num_socks; i++) {
>+		fds[0].fd = xsk_socket__fd(xsks[i]->xsk);
>+		fds[0].events = POLLOUT;
>+	}
> 
> 	for (;;) {
> 		if (opt_poll) {
>-			ret = poll(fds, nfds, timeout);
>+			ret = poll(fds, num_socks, 0);
> 			if (ret <= 0)
> 				continue;
> 
>@@ -577,70 +619,75 @@ static void tx_only(struct xsk_socket_info *xsk)
> 				continue;
> 		}
> 
>-		if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) ==
>-		    BATCH_SIZE) {
>-			unsigned int i;
>-
>-			for (i = 0; i < BATCH_SIZE; i++) {
>-				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
>-					= (frame_nb + i) <<
>-					XSK_UMEM__DEFAULT_FRAME_SHIFT;
>-				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
>-					sizeof(pkt_data) - 1;
>-			}
>-
>-			xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
>-			xsk->outstanding_tx += BATCH_SIZE;
>-			frame_nb += BATCH_SIZE;
>-			frame_nb %= NUM_FRAMES;
>-		}
>-
>-		complete_tx_only(xsk);
>+		for (i = 0; i < num_socks; i++)
>+			tx_only(xsks[i], frame_nb[i]);
> 	}
> }
> 
>-static void l2fwd(struct xsk_socket_info *xsk)
>+static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
> {
>-	for (;;) {
>-		unsigned int rcvd, i;
>-		u32 idx_rx = 0, idx_tx = 0;
>-		int ret;
>+	unsigned int rcvd, i;
>+	u32 idx_rx = 0, idx_tx = 0;
>+	int ret;
> 
>-		for (;;) {
>-			complete_tx_l2fwd(xsk);
>+	complete_tx_l2fwd(xsk, fds);
> 
>-			rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE,
>-						   &idx_rx);
>-			if (rcvd > 0)
>-				break;
>-		}
>+	rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE,
>+				   &idx_rx);
>+	if (!rcvd)
>+		return;
> 
>+	ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
>+	while (ret != rcvd) {
>+		if (ret < 0)
>+			exit_with_error(-ret);
>+		if (opt_busy_poll)
>+			ret = poll(fds, num_socks, 0);
> 		ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
>-		while (ret != rcvd) {
>-			if (ret < 0)
>-				exit_with_error(-ret);
>-			ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
>-		}
>+	}
>+
>+	for (i = 0; i < rcvd; i++) {
>+		u64 addr = xsk_ring_cons__rx_desc(&xsk->rx,
>+						  idx_rx)->addr;
>+		u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
>+						 idx_rx++)->len;
>+		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> 
>-		for (i = 0; i < rcvd; i++) {
>-			u64 addr = xsk_ring_cons__rx_desc(&xsk->rx,
>-							  idx_rx)->addr;
>-			u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
>-							 idx_rx++)->len;
>-			char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
>+		swap_mac_addresses(pkt);
> 
>-			swap_mac_addresses(pkt);
>+		hex_dump(pkt, len, addr);
>+		xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
>+		xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
>+	}
> 
>-			hex_dump(pkt, len, addr);
>-			xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
>-			xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
>-		}
>+	xsk_ring_prod__submit(&xsk->tx, rcvd);
>+	xsk_ring_cons__release(&xsk->rx, rcvd);
> 
>-		xsk_ring_prod__submit(&xsk->tx, rcvd);
>-		xsk_ring_cons__release(&xsk->rx, rcvd);
>+	xsk->rx_npkts += rcvd;
>+	xsk->outstanding_tx += rcvd;
>+}
> 
>-		xsk->rx_npkts += rcvd;
>-		xsk->outstanding_tx += rcvd;
>+static void l2fwd_all(void)
>+{
>+	struct pollfd fds[MAX_SOCKS];
>+	int i, ret;
>+
>+	memset(fds, 0, sizeof(fds));
>+
>+	for (i = 0; i < num_socks; i++) {
>+		fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
>+		fds[i].events = POLLOUT | POLLIN;
>+	}
>+
>+	for (;;) {
>+		if (opt_poll) {
>+			ret = poll(fds, num_socks, 0);
>+			if (ret <= 0)
>+				continue;
>+		}
>+
>+		for (i = 0; i < num_socks; i++)
>+			l2fwd(xsks[i], fds);
> 	}
> }
> 
>@@ -693,9 +740,9 @@ int main(int argc, char **argv)
> 	if (opt_bench == BENCH_RXDROP)
> 		rx_drop_all();
> 	else if (opt_bench == BENCH_TXONLY)
>-		tx_only(xsks[0]);
>+		tx_only_all();
> 	else
>-		l2fwd(xsks[0]);
>+		l2fwd_all();
> 
> 	return 0;
> }
>-- 
>2.7.4
>


* Re: [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample
  2019-05-24  2:51   ` Ye Xiaolong
@ 2019-05-27  6:58     ` Magnus Karlsson
  0 siblings, 0 replies; 24+ messages in thread
From: Magnus Karlsson @ 2019-05-27  6:58 UTC (permalink / raw)
  To: Ye Xiaolong
  Cc: Magnus Karlsson, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Jesper Dangaard Brouer,
	bpf, bruce.richardson, ciara.loftus, Jakub Kicinski, Zhang, Qi Z,
	Maxim Mikityanskiy, Samudrala, Sridhar, kevin.laatz

On Fri, May 24, 2019 at 4:59 AM Ye Xiaolong <xiaolong.ye@intel.com> wrote:
>
> Hi, Magnus
>
> On 05/02, Magnus Karlsson wrote:
> >This patch adds busy-poll support to the xdpsock sample
> >application. It is enabled by the "-b" or the "--busy-poll" command
> >line options.
> >
> >Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> >---
> > samples/bpf/xdpsock_user.c | 203 ++++++++++++++++++++++++++++-----------------
> > 1 file changed, 125 insertions(+), 78 deletions(-)
> >
> >diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> >index d08ee1a..1272edf 100644
> >--- a/samples/bpf/xdpsock_user.c
> >+++ b/samples/bpf/xdpsock_user.c
> >@@ -66,6 +66,7 @@ static const char *opt_if = "";
> > static int opt_ifindex;
> > static int opt_queue;
> > static int opt_poll;
> >+static int opt_busy_poll;
> > static int opt_interval = 1;
> > static u32 opt_xdp_bind_flags;
> > static __u32 prog_id;
> >@@ -119,8 +120,11 @@ static void print_benchmark(bool running)
> >       else
> >               printf("        ");
> >
> >-      if (opt_poll)
> >+      if (opt_poll) {
> >+              if (opt_busy_poll)
> >+                      printf("busy-");
> >               printf("poll() ");
> >+      }
> >
> >       if (running) {
> >               printf("running...");
> >@@ -306,7 +310,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
> >       xsk->umem = umem;
> >       cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> >       cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> >-      cfg.libbpf_flags = 0;
>
> Is there any purpose in removing this line? Since cfg here is a local
> variable, cfg.libbpf_flags can contain random data, which may lead to an
> xdpsock failure with `Invalid no_argument`.

No, that was a mistake. Thanks for catching.

I will produce a new patch set with the improvements needed to make
execution of application and softirq/ksoftirqd on the same core
efficient, make a v2 of the busy poll patch set (with all your
suggested improvements) on top of that, and then see what performance
we have.

Thanks: Magnus

> Thanks,
> Xiaolong
>
> >+      cfg.busy_poll = (opt_busy_poll ? BATCH_SIZE : 0);
> >       cfg.xdp_flags = opt_xdp_flags;
> >       cfg.bind_flags = opt_xdp_bind_flags;
> >       ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
> >@@ -319,17 +323,17 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
> >               exit_with_error(-ret);
> >
> >       ret = xsk_ring_prod__reserve(&xsk->umem->fq,
> >-                                   XSK_RING_PROD__DEFAULT_NUM_DESCS,
> >+                                   1024,
> >                                    &idx);
> >-      if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS)
> >+      if (ret != 1024)
> >               exit_with_error(-ret);
> >       for (i = 0;
> >-           i < XSK_RING_PROD__DEFAULT_NUM_DESCS *
> >+           i < 1024 *
> >                    XSK_UMEM__DEFAULT_FRAME_SIZE;
> >            i += XSK_UMEM__DEFAULT_FRAME_SIZE)
> >               *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = i;
> >       xsk_ring_prod__submit(&xsk->umem->fq,
> >-                            XSK_RING_PROD__DEFAULT_NUM_DESCS);
> >+                            1024);
> >
> >       return xsk;
> > }
> >@@ -341,6 +345,7 @@ static struct option long_options[] = {
> >       {"interface", required_argument, 0, 'i'},
> >       {"queue", required_argument, 0, 'q'},
> >       {"poll", no_argument, 0, 'p'},
> >+      {"busy-poll", no_argument, 0, 'b'},
> >       {"xdp-skb", no_argument, 0, 'S'},
> >       {"xdp-native", no_argument, 0, 'N'},
> >       {"interval", required_argument, 0, 'n'},
> >@@ -360,6 +365,7 @@ static void usage(const char *prog)
> >               "  -i, --interface=n    Run on interface n\n"
> >               "  -q, --queue=n        Use queue n (default 0)\n"
> >               "  -p, --poll           Use poll syscall\n"
> >+              "  -b, --busy-poll      Use poll syscall with busy poll\n"
> >               "  -S, --xdp-skb=n      Use XDP skb-mod\n"
> >               "  -N, --xdp-native=n   Enfore XDP native mode\n"
> >               "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
> >@@ -377,7 +383,7 @@ static void parse_command_line(int argc, char **argv)
> >       opterr = 0;
> >
> >       for (;;) {
> >-              c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
> >+              c = getopt_long(argc, argv, "Frtli:q:pbsSNn:cz", long_options,
> >                               &option_index);
> >               if (c == -1)
> >                       break;
> >@@ -401,6 +407,10 @@ static void parse_command_line(int argc, char **argv)
> >               case 'p':
> >                       opt_poll = 1;
> >                       break;
> >+              case 'b':
> >+                      opt_busy_poll = 1;
> >+                      opt_poll = 1;
> >+                      break;
> >               case 'S':
> >                       opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
> >                       opt_xdp_bind_flags |= XDP_COPY;
> >@@ -444,7 +454,8 @@ static void kick_tx(struct xsk_socket_info *xsk)
> >       exit_with_error(errno);
> > }
> >
> >-static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
> >+static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk,
> >+                                   struct pollfd *fds)
> > {
> >       u32 idx_cq = 0, idx_fq = 0;
> >       unsigned int rcvd;
> >@@ -453,7 +464,8 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
> >       if (!xsk->outstanding_tx)
> >               return;
> >
> >-      kick_tx(xsk);
> >+      if (!opt_poll)
> >+              kick_tx(xsk);
> >       ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
> >               xsk->outstanding_tx;
> >
> >@@ -467,6 +479,8 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
> >               while (ret != rcvd) {
> >                       if (ret < 0)
> >                               exit_with_error(-ret);
> >+                      if (opt_busy_poll)
> >+                              ret = poll(fds, num_socks, 0);
> >                       ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
> >                                                    &idx_fq);
> >               }
> >@@ -490,7 +504,8 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
> >       if (!xsk->outstanding_tx)
> >               return;
> >
> >-      kick_tx(xsk);
> >+      if (!opt_busy_poll)
> >+              kick_tx(xsk);
> >
> >       rcvd = xsk_ring_cons__peek(&xsk->umem->cq, BATCH_SIZE, &idx);
> >       if (rcvd > 0) {
> >@@ -500,10 +515,10 @@ static inline void complete_tx_only(struct xsk_socket_info *xsk)
> >       }
> > }
> >
> >-static void rx_drop(struct xsk_socket_info *xsk)
> >+static void rx_drop(struct xsk_socket_info *xsk, struct pollfd *fds)
> > {
> >-      unsigned int rcvd, i;
> >       u32 idx_rx = 0, idx_fq = 0;
> >+      unsigned int rcvd, i;
> >       int ret;
> >
> >       rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE, &idx_rx);
> >@@ -514,6 +529,8 @@ static void rx_drop(struct xsk_socket_info *xsk)
> >       while (ret != rcvd) {
> >               if (ret < 0)
> >                       exit_with_error(-ret);
> >+              if (opt_busy_poll)
> >+                      ret = poll(fds, num_socks, 0);
> >               ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
> >       }
> >
> >@@ -533,43 +550,68 @@ static void rx_drop(struct xsk_socket_info *xsk)
> >
> > static void rx_drop_all(void)
> > {
> >-      struct pollfd fds[MAX_SOCKS + 1];
> >-      int i, ret, timeout, nfds = 1;
> >+      struct pollfd fds[MAX_SOCKS];
> >+      int i, ret;
> >
> >       memset(fds, 0, sizeof(fds));
> >
> >       for (i = 0; i < num_socks; i++) {
> >               fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> >               fds[i].events = POLLIN;
> >-              timeout = 1000; /* 1sn */
> >       }
> >
> >       for (;;) {
> >               if (opt_poll) {
> >-                      ret = poll(fds, nfds, timeout);
> >+                      ret = poll(fds, num_socks, 0);
> >                       if (ret <= 0)
> >                               continue;
> >               }
> >
> >               for (i = 0; i < num_socks; i++)
> >-                      rx_drop(xsks[i]);
> >+                      rx_drop(xsks[i], fds);
> >+      }
> >+}
> >+
> >+static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb)
> >+{
> >+      u32 idx;
> >+
> >+      if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) ==
> >+          BATCH_SIZE) {
> >+              unsigned int i;
> >+
> >+              for (i = 0; i < BATCH_SIZE; i++) {
> >+                      xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
> >+                              = (frame_nb + i) <<
> >+                              XSK_UMEM__DEFAULT_FRAME_SHIFT;
> >+                      xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
> >+                              sizeof(pkt_data) - 1;
> >+              }
> >+
> >+              xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
> >+              xsk->outstanding_tx += BATCH_SIZE;
> >+              frame_nb += BATCH_SIZE;
> >+              frame_nb %= NUM_FRAMES;
> >       }
> >+
> >+      complete_tx_only(xsk);
> > }
> >
> >-static void tx_only(struct xsk_socket_info *xsk)
> >+static void tx_only_all(void)
> > {
> >-      int timeout, ret, nfds = 1;
> >-      struct pollfd fds[nfds + 1];
> >-      u32 idx, frame_nb = 0;
> >+      struct pollfd fds[MAX_SOCKS];
> >+      u32 frame_nb[MAX_SOCKS] = {};
> >+      int i, ret;
> >
> >       memset(fds, 0, sizeof(fds));
> >-      fds[0].fd = xsk_socket__fd(xsk->xsk);
> >-      fds[0].events = POLLOUT;
> >-      timeout = 1000; /* 1sn */
> >+      for (i = 0; i < num_socks; i++) {
> >+              fds[0].fd = xsk_socket__fd(xsks[i]->xsk);
> >+              fds[0].events = POLLOUT;
> >+      }
> >
> >       for (;;) {
> >               if (opt_poll) {
> >-                      ret = poll(fds, nfds, timeout);
> >+                      ret = poll(fds, num_socks, 0);
> >                       if (ret <= 0)
> >                               continue;
> >
> >@@ -577,70 +619,75 @@ static void tx_only(struct xsk_socket_info *xsk)
> >                               continue;
> >               }
> >
> >-              if (xsk_ring_prod__reserve(&xsk->tx, BATCH_SIZE, &idx) ==
> >-                  BATCH_SIZE) {
> >-                      unsigned int i;
> >-
> >-                      for (i = 0; i < BATCH_SIZE; i++) {
> >-                              xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
> >-                                      = (frame_nb + i) <<
> >-                                      XSK_UMEM__DEFAULT_FRAME_SHIFT;
> >-                              xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
> >-                                      sizeof(pkt_data) - 1;
> >-                      }
> >-
> >-                      xsk_ring_prod__submit(&xsk->tx, BATCH_SIZE);
> >-                      xsk->outstanding_tx += BATCH_SIZE;
> >-                      frame_nb += BATCH_SIZE;
> >-                      frame_nb %= NUM_FRAMES;
> >-              }
> >-
> >-              complete_tx_only(xsk);
> >+              for (i = 0; i < num_socks; i++)
> >+                      tx_only(xsks[i], frame_nb[i]);
> >       }
> > }
> >
> >-static void l2fwd(struct xsk_socket_info *xsk)
> >+static void l2fwd(struct xsk_socket_info *xsk, struct pollfd *fds)
> > {
> >-      for (;;) {
> >-              unsigned int rcvd, i;
> >-              u32 idx_rx = 0, idx_tx = 0;
> >-              int ret;
> >+      unsigned int rcvd, i;
> >+      u32 idx_rx = 0, idx_tx = 0;
> >+      int ret;
> >
> >-              for (;;) {
> >-                      complete_tx_l2fwd(xsk);
> >+      complete_tx_l2fwd(xsk, fds);
> >
> >-                      rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE,
> >-                                                 &idx_rx);
> >-                      if (rcvd > 0)
> >-                              break;
> >-              }
> >+      rcvd = xsk_ring_cons__peek(&xsk->rx, BATCH_SIZE,
> >+                                 &idx_rx);
> >+      if (!rcvd)
> >+              return;
> >
> >+      ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
> >+      while (ret != rcvd) {
> >+              if (ret < 0)
> >+                      exit_with_error(-ret);
> >+              if (opt_busy_poll)
> >+                      ret = poll(fds, num_socks, 0);
> >               ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
> >-              while (ret != rcvd) {
> >-                      if (ret < 0)
> >-                              exit_with_error(-ret);
> >-                      ret = xsk_ring_prod__reserve(&xsk->tx, rcvd, &idx_tx);
> >-              }
> >+      }
> >+
> >+      for (i = 0; i < rcvd; i++) {
> >+              u64 addr = xsk_ring_cons__rx_desc(&xsk->rx,
> >+                                                idx_rx)->addr;
> >+              u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
> >+                                               idx_rx++)->len;
> >+              char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> >
> >-              for (i = 0; i < rcvd; i++) {
> >-                      u64 addr = xsk_ring_cons__rx_desc(&xsk->rx,
> >-                                                        idx_rx)->addr;
> >-                      u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
> >-                                                       idx_rx++)->len;
> >-                      char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
> >+              swap_mac_addresses(pkt);
> >
> >-                      swap_mac_addresses(pkt);
> >+              hex_dump(pkt, len, addr);
> >+              xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
> >+              xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
> >+      }
> >
> >-                      hex_dump(pkt, len, addr);
> >-                      xsk_ring_prod__tx_desc(&xsk->tx, idx_tx)->addr = addr;
> >-                      xsk_ring_prod__tx_desc(&xsk->tx, idx_tx++)->len = len;
> >-              }
> >+      xsk_ring_prod__submit(&xsk->tx, rcvd);
> >+      xsk_ring_cons__release(&xsk->rx, rcvd);
> >
> >-              xsk_ring_prod__submit(&xsk->tx, rcvd);
> >-              xsk_ring_cons__release(&xsk->rx, rcvd);
> >+      xsk->rx_npkts += rcvd;
> >+      xsk->outstanding_tx += rcvd;
> >+}
> >
> >-              xsk->rx_npkts += rcvd;
> >-              xsk->outstanding_tx += rcvd;
> >+static void l2fwd_all(void)
> >+{
> >+      struct pollfd fds[MAX_SOCKS];
> >+      int i, ret;
> >+
> >+      memset(fds, 0, sizeof(fds));
> >+
> >+      for (i = 0; i < num_socks; i++) {
> >+              fds[i].fd = xsk_socket__fd(xsks[i]->xsk);
> >+              fds[i].events = POLLOUT | POLLIN;
> >+      }
> >+
> >+      for (;;) {
> >+              if (opt_poll) {
> >+                      ret = poll(fds, num_socks, 0);
> >+                      if (ret <= 0)
> >+                              continue;
> >+              }
> >+
> >+              for (i = 0; i < num_socks; i++)
> >+                      l2fwd(xsks[i], fds);
> >       }
> > }
> >
> >@@ -693,9 +740,9 @@ int main(int argc, char **argv)
> >       if (opt_bench == BENCH_RXDROP)
> >               rx_drop_all();
> >       else if (opt_bench == BENCH_TXONLY)
> >-              tx_only(xsks[0]);
> >+              tx_only_all();
> >       else
> >-              l2fwd(xsks[0]);
> >+              l2fwd_all();
> >
> >       return 0;
> > }
> >--
> >2.7.4
> >



Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 1/7] net: fs: make busy poll budget configurable in napi_busy_loop Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 2/7] net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and add napi id Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP Magnus Karlsson
2019-05-03  0:13   ` Samudrala, Sridhar
2019-05-03  6:35     ` Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 5/7] net: add busy-poll support for XDP sockets Magnus Karlsson
2019-05-03  0:23   ` Samudrala, Sridhar
2019-05-02  8:39 ` [RFC bpf-next 6/7] libbpf: add busy-poll support to " Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample Magnus Karlsson
2019-05-24  2:51   ` Ye Xiaolong
2019-05-27  6:58     ` Magnus Karlsson
2019-05-06 16:31 ` [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Alexei Starovoitov
2019-05-07 11:51   ` Magnus Karlsson
2019-05-07 18:24     ` Alexei Starovoitov
2019-05-08 12:10       ` Magnus Karlsson
2019-05-16 12:37         ` Magnus Karlsson
2019-05-16 23:50           ` Samudrala, Sridhar
2019-05-17  7:50             ` Magnus Karlsson
2019-05-17 18:20           ` Jakub Kicinski
2019-05-18  8:49             ` Magnus Karlsson
2019-05-13 20:42       ` Jonathan Lemon
2019-05-13 23:30         ` Samudrala, Sridhar
2019-05-14  7:53         ` Björn Töpel
