netdev.vger.kernel.org archive mirror
* [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
@ 2020-09-04 13:53 Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 1/6] xsk: improve xdp_do_redirect() error codes Björn Töpel
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:53 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, bjorn.topel, davem, kuba,
	hawk, john.fastabend, intel-wired-lan

This series addresses a problem that arises when AF_XDP zero-copy is
enabled, and the kernel softirq Rx processing and the userland process
are running on the same core.

In contrast to the two-core case, when the userland process and the Rx
softirq share one core, it is very important that the kernel does not do
unnecessary work, but instead lets the userland process run. This has
not been the case.

For the Intel drivers, when an XDP_REDIRECT fails due to a full Rx
ring, the NAPI loop will simply drop the packet and continue
processing the next packet. The XDP_REDIRECT operation will then fail
again, since userland has not been able to empty the full Rx ring.

The fix is to let the NAPI loop exit early if the AF_XDP socket Rx
ring is full.

The outline is as follows: the first patch cleans up the error codes
returned by xdp_do_redirect(), so that a driver can figure out when
the Rx ring is full (ENOBUFS). Patch two adds an extended
xdp_do_redirect() variant that returns what kind of map was used in
the XDP_REDIRECT action. The third patch adds an AF_XDP driver helper
to figure out if the Rx ring was full. Finally, the last three patches
implement the "early exit" support for the Intel drivers.
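To illustrate the idea outside the kernel: the sketch below is a
self-contained userspace model of the early-exit pattern, not the
actual driver code; the toy ring, the -ENOBUFS-returning redirect stub,
and the poll loop are all invented for the example.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy single-producer ring standing in for the AF_XDP Rx ring. */
struct ring {
	size_t cap;
	size_t used;
};

static bool ring_full(const struct ring *r)
{
	return r->used == r->cap;
}

/* Stand-in for xdp_do_redirect(): fails with -ENOBUFS on a full ring. */
static int redirect_to_ring(struct ring *r)
{
	if (ring_full(r))
		return -ENOBUFS;
	r->used++;
	return 0;
}

/*
 * NAPI-style poll loop. Without early exit it burns the whole budget
 * dropping packets once the ring is full; with early exit it stops at
 * the first -ENOBUFS so userland can drain the ring. Returns the
 * number of packets processed (0 on early exit, signalling "no more
 * work" to the polling core).
 */
static int poll_rx(struct ring *r, int budget, bool early_exit_enabled)
{
	int done = 0;

	while (done < budget) {
		int err = redirect_to_ring(r);

		if (err == -ENOBUFS && early_exit_enabled)
			return 0; /* yield to the userland process */
		/* on other errors the packet is simply dropped */
		done++;
	}
	return done;
}
```

With early exit enabled the loop returns as soon as the ring fills, so
the wakeup flag can be set and userland gets to run; without it, the
remaining budget is spent dropping packets that fail to redirect.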

On my machine the "one core scenario Rx drop" performance went from
~65Kpps to 21Mpps. In other words, from "not usable" to
"usable". YMMV.

I prefer to route this series via bpf-next, since it includes core
changes, and not only driver changes.


Have a nice weekend!
Björn

Björn Töpel (6):
  xsk: improve xdp_do_redirect() error codes
  xdp: introduce xdp_do_redirect_ext() function
  xsk: introduce xsk_do_redirect_rx_full() helper
  i40e, xsk: finish napi loop if AF_XDP Rx queue is full
  ice, xsk: finish napi loop if AF_XDP Rx queue is full
  ixgbe, xsk: finish napi loop if AF_XDP Rx queue is full

 drivers/net/ethernet/intel/i40e/i40e_xsk.c   | 23 ++++++++++++++------
 drivers/net/ethernet/intel/ice/ice_xsk.c     | 23 ++++++++++++++------
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 23 ++++++++++++++------
 include/linux/filter.h                       |  2 ++
 include/net/xdp_sock_drv.h                   |  9 ++++++++
 net/core/filter.c                            | 16 ++++++++++++--
 net/xdp/xsk.c                                |  2 +-
 net/xdp/xsk_queue.h                          |  2 +-
 8 files changed, 75 insertions(+), 25 deletions(-)


base-commit: 8eb629585d2231e90112148009e2a11b0979ca38
-- 
2.25.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH bpf-next 1/6] xsk: improve xdp_do_redirect() error codes
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
@ 2020-09-04 13:53 ` Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 2/6] xdp: introduce xdp_do_redirect_ext() function Björn Töpel
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:53 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, davem, kuba, hawk,
	john.fastabend, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

The error codes returned by xdp_do_redirect() when redirecting a frame
to an AF_XDP socket have not been very useful. A driver could not
distinguish between different errors. Prior to this change, the
following codes were used:

Socket not bound or incorrect queue/netdev: EINVAL
XDP frame/AF_XDP buffer size mismatch: ENOSPC
Could not allocate buffer (copy mode): ENOSPC
AF_XDP Rx buffer full: ENOSPC

After this change:

Socket not bound or incorrect queue/netdev: EINVAL
XDP frame/AF_XDP buffer size mismatch: ENOSPC
Could not allocate buffer (copy mode): ENOMEM
AF_XDP Rx buffer full: ENOBUFS

An AF_XDP zero-copy driver can now potentially determine if the
failure was due to a full Rx buffer, and if so stop processing more
frames, yielding to the userland AF_XDP application.
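With the disambiguated codes, a caller can branch on the reason for a
failure. A minimal userspace sketch of such a classification (the enum
and function names are invented for illustration; only the errno values
come from the patch):

```c
#include <errno.h>

/* Classification a zero-copy driver might apply to redirect errors. */
enum redirect_failure {
	REDIRECT_OK,
	REDIRECT_BAD_SOCKET,    /* -EINVAL: not bound / wrong queue or netdev */
	REDIRECT_SIZE_MISMATCH, /* -ENOSPC: frame vs. buffer size mismatch */
	REDIRECT_NO_MEMORY,     /* -ENOMEM: copy-mode allocation failed */
	REDIRECT_RX_FULL,       /* -ENOBUFS: Rx ring full -> stop the loop */
};

static enum redirect_failure classify(int err)
{
	switch (err) {
	case 0:        return REDIRECT_OK;
	case -EINVAL:  return REDIRECT_BAD_SOCKET;
	case -ENOSPC:  return REDIRECT_SIZE_MISMATCH;
	case -ENOMEM:  return REDIRECT_NO_MEMORY;
	case -ENOBUFS: return REDIRECT_RX_FULL;
	default:       return REDIRECT_BAD_SOCKET;
	}
}
```

Only REDIRECT_RX_FULL is actionable for the early-exit logic in the
later patches; the other cases remain plain drops.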

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c       | 2 +-
 net/xdp/xsk_queue.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 3895697f8540..db38560c4af7 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -197,7 +197,7 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len,
 	xsk_xdp = xsk_buff_alloc(xs->pool);
 	if (!xsk_xdp) {
 		xs->rx_dropped++;
-		return -ENOSPC;
+		return -ENOMEM;
 	}
 
 	xsk_copy_xdp(xsk_xdp, xdp, len);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 2d883f631c85..b76966cf122e 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -305,7 +305,7 @@ static inline int xskq_prod_reserve_desc(struct xsk_queue *q,
 	u32 idx;
 
 	if (xskq_prod_is_full(q))
-		return -ENOSPC;
+		return -ENOBUFS;
 
 	/* A, matches D */
 	idx = q->cached_prod++ & q->ring_mask;
-- 
2.25.1



* [PATCH bpf-next 2/6] xdp: introduce xdp_do_redirect_ext() function
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 1/6] xsk: improve xdp_do_redirect() error codes Björn Töpel
@ 2020-09-04 13:53 ` Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper Björn Töpel
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:53 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, davem, kuba, hawk,
	john.fastabend, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Introduce xdp_do_redirect_ext(), which returns additional information
to the caller. For now, that is the type of map the packet was
redirected to.

This enables the driver to have more fine-grained control, e.g. if the
redirect fails due to a full AF_XDP Rx queue (error code ENOBUFS and
the map is an XSKMAP), a zero-copy enabled driver should yield to
userland as soon as possible.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h |  2 ++
 net/core/filter.c      | 16 ++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 995625950cc1..0060c2c8abc3 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -942,6 +942,8 @@ static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
  */
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *prog);
+int xdp_do_redirect_ext(struct net_device *dev, struct xdp_buff *xdp,
+			struct bpf_prog *xdp_prog, enum bpf_map_type *map_type);
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
diff --git a/net/core/filter.c b/net/core/filter.c
index 47eef9a0be6a..ce6098210a23 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3596,8 +3596,8 @@ void bpf_clear_redirect_map(struct bpf_map *map)
 	}
 }
 
-int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
-		    struct bpf_prog *xdp_prog)
+int xdp_do_redirect_ext(struct net_device *dev, struct xdp_buff *xdp,
+			struct bpf_prog *xdp_prog, enum bpf_map_type *map_type)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = READ_ONCE(ri->map);
@@ -3609,6 +3609,8 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	ri->tgt_value = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
+	*map_type = BPF_MAP_TYPE_UNSPEC;
+
 	if (unlikely(!map)) {
 		fwd = dev_get_by_index_rcu(dev_net(dev), index);
 		if (unlikely(!fwd)) {
@@ -3618,6 +3620,7 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
+		*map_type = map->map_type;
 		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
 	}
 
@@ -3630,6 +3633,15 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
 	return err;
 }
+EXPORT_SYMBOL_GPL(xdp_do_redirect_ext);
+
+int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
+		    struct bpf_prog *xdp_prog)
+{
+	enum bpf_map_type dummy;
+
+	return xdp_do_redirect_ext(dev, xdp, xdp_prog, &dummy);
+}
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
-- 
2.25.1



* [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 1/6] xsk: improve xdp_do_redirect() error codes Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 2/6] xdp: introduce xdp_do_redirect_ext() function Björn Töpel
@ 2020-09-04 13:53 ` Björn Töpel
  2020-09-04 15:11   ` Jesper Dangaard Brouer
  2020-09-04 13:53 ` [PATCH bpf-next 4/6] i40e, xsk: finish napi loop if AF_XDP Rx queue is full Björn Töpel
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:53 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, davem, kuba, hawk,
	john.fastabend, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

The xsk_do_redirect_rx_full() helper can be used to check if a failure
of xdp_do_redirect() was due to the AF_XDP socket having a full Rx
ring.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock_drv.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 5b1ee8a9976d..34c58b5fbc28 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -116,6 +116,11 @@ static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
 	xp_dma_sync_for_device(pool, dma, size);
 }
 
+static inline bool xsk_do_redirect_rx_full(int err, enum bpf_map_type map_type)
+{
+	return err == -ENOBUFS && map_type == BPF_MAP_TYPE_XSKMAP;
+}
+
 #else
 
 static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
@@ -235,6 +240,10 @@ static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
 {
 }
 
+static inline bool xsk_do_redirect_rx_full(int err, enum bpf_map_type map_type)
+{
+	return false;
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_DRV_H */
-- 
2.25.1



* [PATCH bpf-next 4/6] i40e, xsk: finish napi loop if AF_XDP Rx queue is full
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
                   ` (2 preceding siblings ...)
  2020-09-04 13:53 ` [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper Björn Töpel
@ 2020-09-04 13:53 ` Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 5/6] ice, " Björn Töpel
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:53 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, davem, kuba, hawk,
	john.fastabend, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Make the AF_XDP zero-copy path aware that the reason for a redirect
failure can be a full Rx queue. If so, exit the NAPI loop as soon as
possible (exit the softirq processing), so that the userspace AF_XDP
process can hopefully empty the Rx queue. This mainly helps the "one
core scenario", where the userland process and the Rx softirq
processing are on the same core.

Note that the early exit can only be performed if the "need wakeup"
feature is enabled, because otherwise there is no notification
mechanism available from the kernel side.

This requires that the driver starts using the newly introduced
xdp_do_redirect_ext() and xsk_do_redirect_rx_full() functions.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 23 +++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 2a1153d8957b..3ac803ee8d51 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -142,13 +142,15 @@ int i40e_xsk_pool_setup(struct i40e_vsi *vsi, struct xsk_buff_pool *pool,
  * i40e_run_xdp_zc - Executes an XDP program on an xdp_buff
  * @rx_ring: Rx ring
  * @xdp: xdp_buff used as input to the XDP program
+ * @early_exit: true means that the napi loop should exit early
  *
  * Returns any of I40E_XDP_{PASS, CONSUMED, TX, REDIR}
  **/
-static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
+static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp, bool *early_exit)
 {
 	int err, result = I40E_XDP_PASS;
 	struct i40e_ring *xdp_ring;
+	enum bpf_map_type map_type;
 	struct bpf_prog *xdp_prog;
 	u32 act;
 
@@ -167,8 +169,13 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
 		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
 		break;
 	case XDP_REDIRECT:
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
+		err = xdp_do_redirect_ext(rx_ring->netdev, xdp, xdp_prog, &map_type);
+		if (err) {
+			*early_exit = xsk_do_redirect_rx_full(err, map_type);
+			result = I40E_XDP_CONSUMED;
+		} else {
+			result = I40E_XDP_REDIR;
+		}
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -268,8 +275,8 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
+	bool early_exit = false, failure = false;
 	unsigned int xdp_res, xdp_xmit = 0;
-	bool failure = false;
 	struct sk_buff *skb;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
@@ -316,7 +323,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 		(*bi)->data_end = (*bi)->data + size;
 		xsk_buff_dma_sync_for_cpu(*bi, rx_ring->xsk_pool);
 
-		xdp_res = i40e_run_xdp_zc(rx_ring, *bi);
+		xdp_res = i40e_run_xdp_zc(rx_ring, *bi, &early_exit);
 		if (xdp_res) {
 			if (xdp_res & (I40E_XDP_TX | I40E_XDP_REDIR))
 				xdp_xmit |= xdp_res;
@@ -329,6 +336,8 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 
 			cleaned_count++;
 			i40e_inc_ntc(rx_ring);
+			if (early_exit)
+				break;
 			continue;
 		}
 
@@ -363,12 +372,12 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
 	i40e_update_rx_stats(rx_ring, total_rx_bytes, total_rx_packets);
 
 	if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) {
-		if (failure || rx_ring->next_to_clean == rx_ring->next_to_use)
+		if (early_exit || failure || rx_ring->next_to_clean == rx_ring->next_to_use)
 			xsk_set_rx_need_wakeup(rx_ring->xsk_pool);
 		else
 			xsk_clear_rx_need_wakeup(rx_ring->xsk_pool);
 
-		return (int)total_rx_packets;
+		return early_exit ? 0 : (int)total_rx_packets;
 	}
 	return failure ? budget : (int)total_rx_packets;
 }
-- 
2.25.1



* [PATCH bpf-next 5/6] ice, xsk: finish napi loop if AF_XDP Rx queue is full
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
                   ` (3 preceding siblings ...)
  2020-09-04 13:53 ` [PATCH bpf-next 4/6] i40e, xsk: finish napi loop if AF_XDP Rx queue is full Björn Töpel
@ 2020-09-04 13:53 ` Björn Töpel
  2020-09-04 13:53 ` [PATCH bpf-next 6/6] ixgbe, " Björn Töpel
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:53 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, davem, kuba, hawk,
	john.fastabend, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Make the AF_XDP zero-copy path aware that the reason for a redirect
failure can be a full Rx queue. If so, exit the NAPI loop as soon as
possible (exit the softirq processing), so that the userspace AF_XDP
process can hopefully empty the Rx queue. This mainly helps the "one
core scenario", where the userland process and the Rx softirq
processing are on the same core.

Note that the early exit can only be performed if the "need wakeup"
feature is enabled, because otherwise there is no notification
mechanism available from the kernel side.

This requires that the driver starts using the newly introduced
xdp_do_redirect_ext() and xsk_do_redirect_rx_full() functions.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_xsk.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 797886524054..f698d0199b0a 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -502,13 +502,15 @@ ice_construct_skb_zc(struct ice_ring *rx_ring, struct ice_rx_buf *rx_buf)
  * ice_run_xdp_zc - Executes an XDP program in zero-copy path
  * @rx_ring: Rx ring
  * @xdp: xdp_buff used as input to the XDP program
+ * @early_exit: true means that the napi loop should exit early
  *
  * Returns any of ICE_XDP_{PASS, CONSUMED, TX, REDIR}
  */
 static int
-ice_run_xdp_zc(struct ice_ring *rx_ring, struct xdp_buff *xdp)
+ice_run_xdp_zc(struct ice_ring *rx_ring, struct xdp_buff *xdp, bool *early_exit)
 {
 	int err, result = ICE_XDP_PASS;
+	enum bpf_map_type map_type;
 	struct bpf_prog *xdp_prog;
 	struct ice_ring *xdp_ring;
 	u32 act;
@@ -529,8 +531,13 @@ ice_run_xdp_zc(struct ice_ring *rx_ring, struct xdp_buff *xdp)
 		result = ice_xmit_xdp_buff(xdp, xdp_ring);
 		break;
 	case XDP_REDIRECT:
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		result = !err ? ICE_XDP_REDIR : ICE_XDP_CONSUMED;
+		err = xdp_do_redirect_ext(rx_ring->netdev, xdp, xdp_prog, &map_type);
+		if (err) {
+			*early_exit = xsk_do_redirect_rx_full(err, map_type);
+			result = ICE_XDP_CONSUMED;
+		} else {
+			result = ICE_XDP_REDIR;
+		}
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -558,8 +565,8 @@ int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget)
 {
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	u16 cleaned_count = ICE_DESC_UNUSED(rx_ring);
+	bool early_exit = false, failure = false;
 	unsigned int xdp_xmit = 0;
-	bool failure = false;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
 		union ice_32b_rx_flex_desc *rx_desc;
@@ -597,7 +604,7 @@ int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget)
 		rx_buf->xdp->data_end = rx_buf->xdp->data + size;
 		xsk_buff_dma_sync_for_cpu(rx_buf->xdp, rx_ring->xsk_pool);
 
-		xdp_res = ice_run_xdp_zc(rx_ring, rx_buf->xdp);
+		xdp_res = ice_run_xdp_zc(rx_ring, rx_buf->xdp, &early_exit);
 		if (xdp_res) {
 			if (xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))
 				xdp_xmit |= xdp_res;
@@ -610,6 +617,8 @@ int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget)
 			cleaned_count++;
 
 			ice_bump_ntc(rx_ring);
+			if (early_exit)
+				break;
 			continue;
 		}
 
@@ -646,12 +655,12 @@ int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget)
 	ice_update_rx_ring_stats(rx_ring, total_rx_packets, total_rx_bytes);
 
 	if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) {
-		if (failure || rx_ring->next_to_clean == rx_ring->next_to_use)
+		if (early_exit || failure || rx_ring->next_to_clean == rx_ring->next_to_use)
 			xsk_set_rx_need_wakeup(rx_ring->xsk_pool);
 		else
 			xsk_clear_rx_need_wakeup(rx_ring->xsk_pool);
 
-		return (int)total_rx_packets;
+		return early_exit ? 0 : (int)total_rx_packets;
 	}
 
 	return failure ? budget : (int)total_rx_packets;
-- 
2.25.1



* [PATCH bpf-next 6/6] ixgbe, xsk: finish napi loop if AF_XDP Rx queue is full
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
                   ` (4 preceding siblings ...)
  2020-09-04 13:53 ` [PATCH bpf-next 5/6] ice, " Björn Töpel
@ 2020-09-04 13:53 ` Björn Töpel
  2020-09-04 15:35   ` Jesper Dangaard Brouer
  2020-09-04 13:59 ` [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring " Björn Töpel
  2020-09-04 14:27 ` Jesper Dangaard Brouer
  7 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:53 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, davem, kuba, hawk,
	john.fastabend, intel-wired-lan

From: Björn Töpel <bjorn.topel@intel.com>

Make the AF_XDP zero-copy path aware that the reason for a redirect
failure can be a full Rx queue. If so, exit the NAPI loop as soon as
possible (exit the softirq processing), so that the userspace AF_XDP
process can hopefully empty the Rx queue. This mainly helps the "one
core scenario", where the userland process and the Rx softirq
processing are on the same core.

Note that the early exit can only be performed if the "need wakeup"
feature is enabled, because otherwise there is no notification
mechanism available from the kernel side.

This requires that the driver starts using the newly introduced
xdp_do_redirect_ext() and xsk_do_redirect_rx_full() functions.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 23 ++++++++++++++------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 3771857cf887..a4aebfd986b3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -93,9 +93,11 @@ int ixgbe_xsk_pool_setup(struct ixgbe_adapter *adapter,
 
 static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 			    struct ixgbe_ring *rx_ring,
-			    struct xdp_buff *xdp)
+			    struct xdp_buff *xdp,
+			    bool *early_exit)
 {
 	int err, result = IXGBE_XDP_PASS;
+	enum bpf_map_type map_type;
 	struct bpf_prog *xdp_prog;
 	struct xdp_frame *xdpf;
 	u32 act;
@@ -116,8 +118,13 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
 		break;
 	case XDP_REDIRECT:
-		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
-		result = !err ? IXGBE_XDP_REDIR : IXGBE_XDP_CONSUMED;
+		err = xdp_do_redirect_ext(rx_ring->netdev, xdp, xdp_prog, &map_type);
+		if (err) {
+			*early_exit = xsk_do_redirect_rx_full(err, map_type);
+			result = IXGBE_XDP_CONSUMED;
+		} else {
+			result = IXGBE_XDP_REDIR;
+		}
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -235,8 +242,8 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 	struct ixgbe_adapter *adapter = q_vector->adapter;
 	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
+	bool early_exit = false, failure = false;
 	unsigned int xdp_res, xdp_xmit = 0;
-	bool failure = false;
 	struct sk_buff *skb;
 
 	while (likely(total_rx_packets < budget)) {
@@ -288,7 +295,7 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 
 		bi->xdp->data_end = bi->xdp->data + size;
 		xsk_buff_dma_sync_for_cpu(bi->xdp, rx_ring->xsk_pool);
-		xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp);
+		xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp, &early_exit);
 
 		if (xdp_res) {
 			if (xdp_res & (IXGBE_XDP_TX | IXGBE_XDP_REDIR))
@@ -302,6 +309,8 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 
 			cleaned_count++;
 			ixgbe_inc_ntc(rx_ring);
+			if (early_exit)
+				break;
 			continue;
 		}
 
@@ -346,12 +355,12 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
 	q_vector->rx.total_bytes += total_rx_bytes;
 
 	if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) {
-		if (failure || rx_ring->next_to_clean == rx_ring->next_to_use)
+		if (early_exit || failure || rx_ring->next_to_clean == rx_ring->next_to_use)
 			xsk_set_rx_need_wakeup(rx_ring->xsk_pool);
 		else
 			xsk_clear_rx_need_wakeup(rx_ring->xsk_pool);
 
-		return (int)total_rx_packets;
+		return early_exit ? 0 : (int)total_rx_packets;
 	}
 	return failure ? budget : (int)total_rx_packets;
 }
-- 
2.25.1



* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
                   ` (5 preceding siblings ...)
  2020-09-04 13:53 ` [PATCH bpf-next 6/6] ixgbe, " Björn Töpel
@ 2020-09-04 13:59 ` Björn Töpel
  2020-09-08 10:32   ` Maxim Mikityanskiy
  2020-09-04 14:27 ` Jesper Dangaard Brouer
  7 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 13:59 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev, bpf, Maxim Mikityanskiy
  Cc: magnus.karlsson, davem, kuba, hawk, john.fastabend, intel-wired-lan

On 2020-09-04 15:53, Björn Töpel wrote:
> This series addresses a problem that arises when AF_XDP zero-copy is 
> enabled, and the kernel softirq Rx processing and userland process
> is running on the same core.
> 
[...]
> 

@Maxim I'm not well versed in Mellanox drivers. Would this be relevant 
to mlx5 as well?


Cheers,
Björn


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
                   ` (6 preceding siblings ...)
  2020-09-04 13:59 ` [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring " Björn Töpel
@ 2020-09-04 14:27 ` Jesper Dangaard Brouer
  2020-09-04 14:32   ` Björn Töpel
  7 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-04 14:27 UTC (permalink / raw)
  To: Björn Töpel, Eric Dumazet
  Cc: brouer, ast, daniel, netdev, bpf, magnus.karlsson, bjorn.topel,
	davem, kuba, john.fastabend, intel-wired-lan

On Fri,  4 Sep 2020 15:53:25 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> On my machine the "one core scenario Rx drop" performance went from
> ~65Kpps to 21Mpps. In other words, from "not usable" to
> "usable". YMMV.

We have observed this kind of dropping off an edge before with softirq
(when userspace process runs on same RX-CPU), but I thought that Eric
Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its job").

I wonder what makes AF_XDP different, or if the problem has come back?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-04 14:27 ` Jesper Dangaard Brouer
@ 2020-09-04 14:32   ` Björn Töpel
  2020-09-04 23:58     ` Jakub Kicinski
  0 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 14:32 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet
  Cc: ast, daniel, netdev, bpf, magnus.karlsson, davem, kuba,
	john.fastabend, intel-wired-lan

On 2020-09-04 16:27, Jesper Dangaard Brouer wrote:
> On Fri,  4 Sep 2020 15:53:25 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
> 
>> On my machine the "one core scenario Rx drop" performance went from
>> ~65Kpps to 21Mpps. In other words, from "not usable" to
>> "usable". YMMV.
> 
> We have observed this kind of dropping off an edge before with softirq
> (when userspace process runs on same RX-CPU), but I thought that Eric
> Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its job").
> 
> I wonder what makes AF_XDP different or if the problem have come back?
> 

I would say this is not the same issue. The problem is that the softirq 
is busy dropping packets since the AF_XDP Rx ring is full. So, the 
cycles *are* split 50/50, which is not what we want in this case. :-)

This issue is more of an "Intel AF_XDP ZC drivers do stupid work" issue 
than a fairness one. If the Rx ring is full, then there is really no use 
in letting the NAPI loop continue.

Would you agree, or am I rambling? :-P


Björn


* Re: [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper
  2020-09-04 13:53 ` [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper Björn Töpel
@ 2020-09-04 15:11   ` Jesper Dangaard Brouer
  2020-09-04 15:39     ` Björn Töpel
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-04 15:11 UTC (permalink / raw)
  To: Björn Töpel
  Cc: brouer, ast, daniel, netdev, bpf, Björn Töpel,
	magnus.karlsson, davem, kuba, hawk, john.fastabend,
	intel-wired-lan

On Fri,  4 Sep 2020 15:53:28 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The xsk_do_redirect_rx_full() helper can be used to check if a failure
> of xdp_do_redirect() was due to the AF_XDP socket had a full Rx ring.

This is very AF_XDP specific.  I think that the cpumap could likely
benefit from a similar approach, e.g. if the cpumap kthread is
scheduled on the same CPU.

But for cpumap we only want this behavior if it is scheduled on the
same CPU as the RX-NAPI.  This could be "seen" by the cpumap code
itself: in the case where bq_flush_to_queue() drops packets, check
whether rcpu->cpu equals smp_processor_id().  Maybe I'm taking this
too far?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next 6/6] ixgbe, xsk: finish napi loop if AF_XDP Rx queue is full
  2020-09-04 13:53 ` [PATCH bpf-next 6/6] ixgbe, " Björn Töpel
@ 2020-09-04 15:35   ` Jesper Dangaard Brouer
  2020-09-04 15:54     ` Björn Töpel
  0 siblings, 1 reply; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-04 15:35 UTC (permalink / raw)
  To: Björn Töpel
  Cc: brouer, ast, daniel, netdev, bpf, Björn Töpel,
	magnus.karlsson, davem, kuba, john.fastabend, intel-wired-lan

On Fri,  4 Sep 2020 15:53:31 +0200
Björn Töpel <bjorn.topel@gmail.com> wrote:

> From: Björn Töpel <bjorn.topel@intel.com>
> 
> Make the AF_XDP zero-copy path aware that the reason for redirect
> failure was due to full Rx queue. If so, exit the napi loop as soon as
> possible (exit the softirq processing), so that the userspace AF_XDP
> process can hopefully empty the Rx queue. This mainly helps the "one
> core scenario", where the userland process and Rx softirq processing
> is on the same core.
> 
> Note that the early exit can only be performed if the "need wakeup"
> feature is enabled, because otherwise there is no notification
> mechanism available from the kernel side.
> 
> This requires that the driver starts using the newly introduced
> xdp_do_redirect_ext() and xsk_do_redirect_rx_full() functions.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 23 ++++++++++++++------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> index 3771857cf887..a4aebfd986b3 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> @@ -93,9 +93,11 @@ int ixgbe_xsk_pool_setup(struct ixgbe_adapter *adapter,
>  
>  static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
>  			    struct ixgbe_ring *rx_ring,
> -			    struct xdp_buff *xdp)
> +			    struct xdp_buff *xdp,
> +			    bool *early_exit)
>  {
>  	int err, result = IXGBE_XDP_PASS;
> +	enum bpf_map_type map_type;
>  	struct bpf_prog *xdp_prog;
>  	struct xdp_frame *xdpf;
>  	u32 act;
> @@ -116,8 +118,13 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
>  		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
>  		break;
>  	case XDP_REDIRECT:
> -		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> -		result = !err ? IXGBE_XDP_REDIR : IXGBE_XDP_CONSUMED;
> +		err = xdp_do_redirect_ext(rx_ring->netdev, xdp, xdp_prog, &map_type);
> +		if (err) {
> +			*early_exit = xsk_do_redirect_rx_full(err, map_type);

Have you tried calling xdp_do_flush (that calls __xsk_map_flush()) and
(I guess) xsk_set_rx_need_wakeup() here, instead of stopping the loop?
(Or doing this in xsk core).

Looking at the code, the AF_XDP frames are "published" in the queue
rather late for AF_XDP.  Maybe in an orthogonal optimization, have you
considered "publishing" the ring producer when e.g. the queue is
half-full?


> +			result = IXGBE_XDP_CONSUMED;
> +		} else {
> +			result = IXGBE_XDP_REDIR;
> +		}
>  		break;
>  	default:
>  		bpf_warn_invalid_xdp_action(act);
> @@ -235,8 +242,8 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
>  	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>  	struct ixgbe_adapter *adapter = q_vector->adapter;
>  	u16 cleaned_count = ixgbe_desc_unused(rx_ring);
> +	bool early_exit = false, failure = false;
>  	unsigned int xdp_res, xdp_xmit = 0;
> -	bool failure = false;
>  	struct sk_buff *skb;
>  
>  	while (likely(total_rx_packets < budget)) {
> @@ -288,7 +295,7 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
>  
>  		bi->xdp->data_end = bi->xdp->data + size;
>  		xsk_buff_dma_sync_for_cpu(bi->xdp, rx_ring->xsk_pool);
> -		xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp);
> +		xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp, &early_exit);
>  
>  		if (xdp_res) {
>  			if (xdp_res & (IXGBE_XDP_TX | IXGBE_XDP_REDIR))
> @@ -302,6 +309,8 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
>  
>  			cleaned_count++;
>  			ixgbe_inc_ntc(rx_ring);
> +			if (early_exit)
> +				break;
>  			continue;
>  		}
>  
> @@ -346,12 +355,12 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
>  	q_vector->rx.total_bytes += total_rx_bytes;
>  
>  	if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) {
> -		if (failure || rx_ring->next_to_clean == rx_ring->next_to_use)
> +		if (early_exit || failure || rx_ring->next_to_clean == rx_ring->next_to_use)
>  			xsk_set_rx_need_wakeup(rx_ring->xsk_pool);
>  		else
>  			xsk_clear_rx_need_wakeup(rx_ring->xsk_pool);
>  
> -		return (int)total_rx_packets;
> +		return early_exit ? 0 : (int)total_rx_packets;
>  	}
>  	return failure ? budget : (int)total_rx_packets;
>  }



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper
  2020-09-04 15:11   ` Jesper Dangaard Brouer
@ 2020-09-04 15:39     ` Björn Töpel
  2020-09-07 12:45       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 15:39 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Björn Töpel
  Cc: ast, daniel, netdev, bpf, magnus.karlsson, davem, kuba, hawk,
	john.fastabend, intel-wired-lan

On 2020-09-04 17:11, Jesper Dangaard Brouer wrote:
> On Fri,  4 Sep 2020 15:53:28 +0200 Björn Töpel
> <bjorn.topel@gmail.com> wrote:
> 
>> From: Björn Töpel <bjorn.topel@intel.com>
>> 
>> The xsk_do_redirect_rx_full() helper can be used to check if a
>> failure of xdp_do_redirect() was due to the AF_XDP socket had a
>> full Rx ring.
> 
> This is very AF_XDP specific.  I think that the cpumap could likely 
> benefit from similar approach? e.g. if the cpumap kthread is
> scheduled on the same CPU.
> 

At least I thought this was *very* AF_XDP specific, since the kernel is
dependent on userland running: allocation (source) and Rx ring (sink).
Maybe I was wrong! :-)

The thing with AF_XDP zero-copy is that we sort of assume that, if a
user enables it, most packets will be XDP_REDIRECTed to an AF_XDP socket.


> But for cpumap we only want this behavior if sched on the same CPU
> as RX-NAPI.  This could be "seen" by the cpumap code itself in the
> case bq_flush_to_queue() drops packets, check if rcpu->cpu equal 
> smp_processor_id().  Maybe I'm taking this too far?
> 

Interesting. So, if you're running on the same core, and redirect fails
for CPUMAP, you'd like to yield the NAPI loop? Is that really OK from a
fairness perspective? I mean, with AF_XDP zero-copy we pretty much know
that all actions will be redirects to a socket. For CPUMAP types of
applications, can that assumption be made?


Björn



* Re: [PATCH bpf-next 6/6] ixgbe, xsk: finish napi loop if AF_XDP Rx queue is full
  2020-09-04 15:35   ` Jesper Dangaard Brouer
@ 2020-09-04 15:54     ` Björn Töpel
  0 siblings, 0 replies; 26+ messages in thread
From: Björn Töpel @ 2020-09-04 15:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Björn Töpel
  Cc: ast, daniel, netdev, bpf, magnus.karlsson, davem, kuba,
	john.fastabend, intel-wired-lan

On 2020-09-04 17:35, Jesper Dangaard Brouer wrote:
> On Fri,  4 Sep 2020 15:53:31 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
> 
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> Make the AF_XDP zero-copy path aware that the reason for redirect
>> failure was due to full Rx queue. If so, exit the napi loop as soon as
>> possible (exit the softirq processing), so that the userspace AF_XDP
>> process can hopefully empty the Rx queue. This mainly helps the "one
>> core scenario", where the userland process and Rx softirq processing
>> is on the same core.
>>
>> Note that the early exit can only be performed if the "need wakeup"
>> feature is enabled, because otherwise there is no notification
>> mechanism available from the kernel side.
>>
>> This requires that the driver starts using the newly introduced
>> xdp_do_redirect_ext() and xsk_do_redirect_rx_full() functions.
>>
>> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
>> ---
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 23 ++++++++++++++------
>>   1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> index 3771857cf887..a4aebfd986b3 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
>> @@ -93,9 +93,11 @@ int ixgbe_xsk_pool_setup(struct ixgbe_adapter *adapter,
>>   
>>   static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
>>   			    struct ixgbe_ring *rx_ring,
>> -			    struct xdp_buff *xdp)
>> +			    struct xdp_buff *xdp,
>> +			    bool *early_exit)
>>   {
>>   	int err, result = IXGBE_XDP_PASS;
>> +	enum bpf_map_type map_type;
>>   	struct bpf_prog *xdp_prog;
>>   	struct xdp_frame *xdpf;
>>   	u32 act;
>> @@ -116,8 +118,13 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
>>   		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
>>   		break;
>>   	case XDP_REDIRECT:
>> -		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
>> -		result = !err ? IXGBE_XDP_REDIR : IXGBE_XDP_CONSUMED;
>> +		err = xdp_do_redirect_ext(rx_ring->netdev, xdp, xdp_prog, &map_type);
>> +		if (err) {
>> +			*early_exit = xsk_do_redirect_rx_full(err, map_type);
> 
> Have you tried calling xdp_do_flush (that calls __xsk_map_flush()) and
> (I guess) xsk_set_rx_need_wakeup() here, instead of stopping the loop?
> (Or doing this in xsk core).
>

Moving the need_wakeup logic to the xsk core/flush would be a very nice
cleanup. The driver would still need to pass some information up,
though. Still, much cleaner. I'll take a stab at that. Thanks!


> Looking at the code, the AF_XDP frames are "published" in the queue
> rather late for AF_XDP.  Maybe in an orthogonal optimization, have you
> considered "publishing" the ring producer when e.g. the queue is
> half-full?
>

Hmm, I haven't. You mean publish/submit instead of yielding? I
*think* I still prefer stopping the processing.

I'll play with this a bit!

Very nice suggestions, Jesper! Thanks!


Björn


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-04 14:32   ` Björn Töpel
@ 2020-09-04 23:58     ` Jakub Kicinski
  2020-09-07 13:37       ` Björn Töpel
  0 siblings, 1 reply; 26+ messages in thread
From: Jakub Kicinski @ 2020-09-04 23:58 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet, ast,
	daniel, netdev, bpf, magnus.karlsson, davem, john.fastabend,
	intel-wired-lan

On Fri, 4 Sep 2020 16:32:56 +0200 Björn Töpel wrote:
> On 2020-09-04 16:27, Jesper Dangaard Brouer wrote:
> > On Fri,  4 Sep 2020 15:53:25 +0200
> > Björn Töpel <bjorn.topel@gmail.com> wrote:
> >   
> >> On my machine the "one core scenario Rx drop" performance went from
> >> ~65Kpps to 21Mpps. In other words, from "not usable" to
> >> "usable". YMMV.  
> > 
> > We have observed this kind of dropping off an edge before with softirq
> > (when userspace process runs on same RX-CPU), but I thought that Eric
> > Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its job").
> > 
> > I wonder what makes AF_XDP different or if the problem have come back?
> >   
> 
> I would say this is not the same issue. The problem is that the softirq 
> is busy dropping packets since the AF_XDP Rx is full. So, the cycles 
> *are* split 50/50, which is not what we want in this case. :-)
> 
> This issue is more of a "Intel AF_XDP ZC drivers does stupid work", than 
> fairness. If the Rx ring is full, then there is really no use to let the 
> NAPI loop continue.
> 
> Would you agree, or am I rambling? :-P

I wonder if ksoftirqd never kicks in because we are able to discard 
the entire ring before we run out of softirq "slice".


I've been pondering the exact problem you're solving with Maciej
recently. The efficiency of AF_XDP on one core with the NAPI processing.

Your solution (even though it admittedly helps, and is quite simple)
still has the application potentially not able to process packets 
until the queue fills up. This will be bad for latency.

Why don't we move closer to application polling? Never re-arm the NAPI 
after RX, let the application ask for packets, re-arm if 0 polled. 
You'd get max batching, min latency.

Who's the rambling one now? :-D


* Re: [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper
  2020-09-04 15:39     ` Björn Töpel
@ 2020-09-07 12:45       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-07 12:45 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Björn Töpel, ast, daniel, netdev, bpf, magnus.karlsson,
	davem, kuba, hawk, john.fastabend, intel-wired-lan, brouer,
	Toke Høiland-Jørgensen

On Fri, 4 Sep 2020 17:39:17 +0200
Björn Töpel <bjorn.topel@intel.com> wrote:

> On 2020-09-04 17:11, Jesper Dangaard Brouer wrote:
> > On Fri,  4 Sep 2020 15:53:28 +0200 Björn Töpel
> > <bjorn.topel@gmail.com> wrote:
> >   
> >> From: Björn Töpel <bjorn.topel@intel.com>
> >> 
> >> The xsk_do_redirect_rx_full() helper can be used to check if a
> >> failure of xdp_do_redirect() was due to the AF_XDP socket had a
> >> full Rx ring.  
> > 
> > This is very AF_XDP specific.  I think that the cpumap could likely 
> > benefit from similar approach? e.g. if the cpumap kthread is
> > scheduled on the same CPU.
> >   
> 
> At least I thought this was *very* AF_XDP specific, since the kernel is
> dependent on userland running: allocation (source) and Rx ring (sink).
> Maybe I was wrong! :-)
> 
> The thing with AF_XDP zero-copy is that we sort of assume that, if a
> user enables it, most packets will be XDP_REDIRECTed to an AF_XDP socket.
> 
> 
> > But for cpumap we only want this behavior if sched on the same CPU
> > as RX-NAPI.  This could be "seen" by the cpumap code itself in the
> > case bq_flush_to_queue() drops packets, check if rcpu->cpu equal 
> > smp_processor_id().  Maybe I'm taking this too far?
> >   
> 
> Interesting. So, if you're running on the same core, and redirect fails
> for CPUMAP, you'd like to yield the NAPI loop? Is that really OK from a
> fairness perspective? I mean, with AF_XDP zero-copy we pretty much know
> that all actions will be redirects to a socket. For CPUMAP types of
> applications, can that assumption be made?

Yes, you are right.  The RX NAPI loop could be doing something else,
and yielding the NAPI loop because same-CPU cpumap delivery is stalling
might not be the correct action.

I just tested the same-CPU processing case for cpumap (result below
signature), and it doesn't exhibit the bad 'dropping-off-edge'
performance slowdown.  The cpumap code also already tries to mitigate
this, by calling wake_up_process() for every 8 packets (CPU_MAP_BULK_SIZE).

I find your patchset very interesting, as I believe we do need some
kind of general push-back "flow-control" mechanism for XDP. Maybe I
should solve this differently in our XDP-TX-QoS pipe dream ;-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Quick benchmark of cpumap.


Same CPU RX and cpumap processing:
----------------------------------

(Doing XDP_DROP on CPU)
Running XDP/eBPF prog_name:xdp_cpu_map0
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          4       9,189,700      0           0          
XDP-RX          total   9,189,700      0          
cpumap-enqueue    4:4   9,189,696      0           8.00       bulk-average
cpumap-enqueue  sum:4   9,189,696      0           8.00       bulk-average
cpumap_kthread  4       9,189,702      0           143,582    sched
cpumap_kthread  total   9,189,702      0           143,582    sched-sum
redirect_err    total   0              0          
xdp_exception   total   0              0          

2nd remote XDP/eBPF prog_name: xdp1
XDP-cpumap      CPU:to  xdp-pass       xdp-drop    xdp-redir
xdp-in-kthread  4       0              9,189,702   0         
xdp-in-kthread  total   0              9,189,702   0         

 %CPU
 51,8 ksoftirqd/4                       
 48,2 cpumap/4/map:17                   


(Doing XDP_PASS on CPU)
Running XDP/eBPF prog_name:xdp_cpu_map0
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          4       8,593,822      0           0          
XDP-RX          total   8,593,822      0          
cpumap-enqueue    4:4   8,593,888      7,714,949   8.00       bulk-average
cpumap-enqueue  sum:4   8,593,888      7,714,949   8.00       bulk-average
cpumap_kthread  4       878,930        0           13,732     sched
cpumap_kthread  total   878,930        0           13,732     sched-sum
redirect_err    total   0              0          
xdp_exception   total   0              0          

2nd remote XDP/eBPF prog_name: xdp_redirect_dummy
XDP-cpumap      CPU:to  xdp-pass       xdp-drop    xdp-redir
xdp-in-kthread  4       878,931        0           0         
xdp-in-kthread  total   878,931        0           0         



Another CPU getting cpumap redirected packets:
----------------------------------------------

(Doing XDP_DROP on CPU)
Running XDP/eBPF prog_name:xdp_cpu_map0
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          4       17,526,797     0           0          
XDP-RX          total   17,526,797     0          
cpumap-enqueue    4:0   17,526,796     245,811     8.00       bulk-average
cpumap-enqueue  sum:0   17,526,796     245,811     8.00       bulk-average
cpumap_kthread  0       17,281,001     0           16,351     sched
cpumap_kthread  total   17,281,001     0           16,351     sched-sum
redirect_err    total   0              0          
xdp_exception   total   0              0          

2nd remote XDP/eBPF prog_name: xdp1
XDP-cpumap      CPU:to  xdp-pass       xdp-drop    xdp-redir
xdp-in-kthread  0       0              17,281,001  0         
xdp-in-kthread  total   0              17,281,001  0         


(Doing XDP_PASS on CPU)
Running XDP/eBPF prog_name:xdp_cpu_map0
XDP-cpumap      CPU:to  pps            drop-pps    extra-info
XDP-RX          4       14,603,587     0           0          
XDP-RX          total   14,603,587     0          
cpumap-enqueue    4:0   14,603,582     12,999,248  8.00       bulk-average
cpumap-enqueue  sum:0   14,603,582     12,999,248  8.00       bulk-average
cpumap_kthread  0       1,604,338      0           0          
cpumap_kthread  total   1,604,338      0           0          
redirect_err    total   0              0          
xdp_exception   total   0              0          

2nd remote XDP/eBPF prog_name: xdp_redirect_dummy
XDP-cpumap      CPU:to  xdp-pass       xdp-drop    xdp-redir
xdp-in-kthread  0       1,604,338      0           0         
xdp-in-kthread  total   1,604,338      0           0         








* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-04 23:58     ` Jakub Kicinski
@ 2020-09-07 13:37       ` Björn Töpel
  2020-09-07 18:40         ` Jakub Kicinski
  0 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-07 13:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet, ast,
	daniel, netdev, bpf, magnus.karlsson, davem, john.fastabend,
	intel-wired-lan

On 2020-09-05 01:58, Jakub Kicinski wrote:
 > On Fri, 4 Sep 2020 16:32:56 +0200 Björn Töpel wrote:
 >> On 2020-09-04 16:27, Jesper Dangaard Brouer wrote:
 >>> On Fri,  4 Sep 2020 15:53:25 +0200
 >>> Björn Töpel <bjorn.topel@gmail.com> wrote:
 >>>
 >>>> On my machine the "one core scenario Rx drop" performance went from
 >>>> ~65Kpps to 21Mpps. In other words, from "not usable" to
 >>>> "usable". YMMV.
 >>>
 >>> We have observed this kind of dropping off an edge before with softirq
 >>> (when userspace process runs on same RX-CPU), but I thought that Eric
 >>> Dumazet solved it in 4cd13c21b207 ("softirq: Let ksoftirqd do its 
job").
 >>>
 >>> I wonder what makes AF_XDP different or if the problem have come back?
 >>>
 >>
 >> I would say this is not the same issue. The problem is that the softirq
 >> is busy dropping packets since the AF_XDP Rx is full. So, the cycles
 >> *are* split 50/50, which is not what we want in this case. :-)
 >>
 >> This issue is more of a "Intel AF_XDP ZC drivers does stupid work", than
 >> fairness. If the Rx ring is full, then there is really no use to let the
 >> NAPI loop continue.
 >>
 >> Would you agree, or am I rambling? :-P
 >
 > I wonder if ksoftirqd never kicks in because we are able to discard
 > the entire ring before we run out of softirq "slice".
 >

This is exactly what's happening, so we're entering busy-poll-like
behavior: syscall, softirq/NAPI on return from the syscall, userland.

 >
 > I've been pondering the exact problem you're solving with Maciej
 > recently. The efficiency of AF_XDP on one core with the NAPI processing.
 >
 > Your solution (even though it admittedly helps, and is quite simple)
 > still has the application potentially not able to process packets
 > until the queue fills up. This will be bad for latency.
 >
 > Why don't we move closer to application polling? Never re-arm the NAPI
 > after RX, let the application ask for packets, re-arm if 0 polled.
 > You'd get max batching, min latency.
 >
 > Who's the rambling one now? :-D
 >

:-D No, these are all very good ideas! We've actually experimented
with it with the busy-poll series a while back -- NAPI busy-polling
does exactly "application polling".

However, I wonder if the busy-polling would have better performance
than the scenario above (i.e. when the ksoftirqd never kicks in)?
Executing the NAPI poll *explicitly* in the syscall, or implicitly
from the softirq.

Hmm, thinking out loud here. A simple(r) patch enabling busy poll;
Exporting the napi_id to the AF_XDP socket (xdp->rxq->napi_id to
sk->sk_napi_id), and do the sk_busy_poll_loop() in sendmsg.

Or did you have something completely different in mind?

As for this patch set, I think it would make sense to pull it in since
it makes the single-core scenario *much* better, and it is pretty
simple. Then do the application polling as another, potentially,
improvement series.


Thoughts? Thanks a lot for the feedback!
Björn


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-07 13:37       ` Björn Töpel
@ 2020-09-07 18:40         ` Jakub Kicinski
  2020-09-08  6:58           ` Björn Töpel
  0 siblings, 1 reply; 26+ messages in thread
From: Jakub Kicinski @ 2020-09-07 18:40 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet, ast,
	daniel, netdev, bpf, magnus.karlsson, davem, john.fastabend,
	intel-wired-lan

On Mon, 7 Sep 2020 15:37:40 +0200 Björn Töpel wrote:
>  > I've been pondering the exact problem you're solving with Maciej
>  > recently. The efficiency of AF_XDP on one core with the NAPI processing.
>  >
>  > Your solution (even though it admittedly helps, and is quite simple)
>  > still has the application potentially not able to process packets
>  > until the queue fills up. This will be bad for latency.
>  >
>  > Why don't we move closer to application polling? Never re-arm the NAPI
>  > after RX, let the application ask for packets, re-arm if 0 polled.
>  > You'd get max batching, min latency.
>  >
>  > Who's the rambling one now? :-D
>  >  
> 
> :-D No, these are all very good ideas! We've actually experimented
> with it with the busy-poll series a while back -- NAPI busy-polling
> does exactly "application polling".
> 
> However, I wonder if the busy-polling would have better performance
> than the scenario above (i.e. when the ksoftirqd never kicks in)?
> Executing the NAPI poll *explicitly* in the syscall, or implicitly
> from the softirq.
> 
> Hmm, thinking out loud here. A simple(r) patch enabling busy poll;
> Exporting the napi_id to the AF_XDP socket (xdp->rxq->napi_id to
> sk->sk_napi_id), and do the sk_busy_poll_loop() in sendmsg.
> 
> Or did you have something completely different in mind?

My understanding is that busy-polling is allowing application to pick
up packets from the ring before the IRQ fires.

What we're more concerned about is the IRQ firing in the first place.

 application:   busy    | needs packets | idle
 -----------------------+---------------+----------------------
   standard   |         |   polls NAPI  | keep polling? sleep?
   busy poll  | IRQ on  |    IRQ off    |  IRQ off      IRQ on
 -------------+---------+---------------+----------------------
              |         |   polls once  |
    AF_XDP    | IRQ off |    IRQ off    |  IRQ on


So busy polling is pretty orthogonal. It only applies to the
"application needs packets" time. What we'd need is for the application
to be able to suppress NAPI polls, promising the kernel that it will
busy poll when appropriate.

> As for this patch set, I think it would make sense to pull it in since
> it makes the single-core scenario *much* better, and it is pretty
> simple. Then do the application polling as another, potentially,
> improvement series.

Up to you, it's extra code in the driver so mostly your code to
maintain.

I think that if we implement what I described above - everyone will
use that on a single core setup, so this set would be dead code
(assuming RQ is sized appropriately). But again, your call :)


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-07 18:40         ` Jakub Kicinski
@ 2020-09-08  6:58           ` Björn Töpel
  2020-09-08 17:24             ` Jakub Kicinski
  0 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-08  6:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet, ast,
	daniel, netdev, bpf, magnus.karlsson, davem, john.fastabend,
	intel-wired-lan

On 2020-09-07 20:40, Jakub Kicinski wrote:
> On Mon, 7 Sep 2020 15:37:40 +0200 Björn Töpel wrote:
>>   > I've been pondering the exact problem you're solving with Maciej
>>   > recently. The efficiency of AF_XDP on one core with the NAPI processing.
>>   >
>>   > Your solution (even though it admittedly helps, and is quite simple)
>>   > still has the application potentially not able to process packets
>>   > until the queue fills up. This will be bad for latency.
>>   >
>>   > Why don't we move closer to application polling? Never re-arm the NAPI
>>   > after RX, let the application ask for packets, re-arm if 0 polled.
>>   > You'd get max batching, min latency.
>>   >
>>   > Who's the rambling one now? :-D
>>   >
>>
>> :-D No, these are all very good ideas! We've actually experimented
>> with it with the busy-poll series a while back -- NAPI busy-polling
>> does exactly "application polling".
>>
>> However, I wonder if the busy-polling would have better performance
>> than the scenario above (i.e. when the ksoftirqd never kicks in)?
>> Executing the NAPI poll *explicitly* in the syscall, or implicitly
>> from the softirq.
>>
>> Hmm, thinking out loud here. A simple(r) patch enabling busy poll;
>> Exporting the napi_id to the AF_XDP socket (xdp->rxq->napi_id to
>> sk->sk_napi_id), and do the sk_busy_poll_loop() in sendmsg.
>>
>> Or did you have something completely different in mind?
> 
> My understanding is that busy-polling is allowing application to pick
> up packets from the ring before the IRQ fires.
> 
> What we're more concerned about is the IRQ firing in the first place.
> 
>   application:   busy    | needs packets | idle
>   -----------------------+---------------+----------------------
>     standard   |         |   polls NAPI  | keep polling? sleep?
>     busy poll  | IRQ on  |    IRQ off    |  IRQ off      IRQ on
>   -------------+---------+---------------+----------------------
>                |         |   polls once  |
>      AF_XDP    | IRQ off |    IRQ off    |  IRQ on
> 
> 
> So busy polling is pretty orthogonal. It only applies to the
> "application needs packets" time. What we'd need is for the application
> to be able to suppress NAPI polls, promising the kernel that it will
> busy poll when appropriate.
>

Ah, nice write-up! Thanks! A strict busy-poll mechanism, not the
opportunistic (existing) NAPI busy-poll.

This would be a new kind of mechanism, and a very much welcome one in
AF_XDP-land. More below.

>> As for this patch set, I think it would make sense to pull it in since
>> it makes the single-core scenario *much* better, and it is pretty
>> simple. Then do the application polling as another, potentially,
>> improvement series.
> 
> Up to you, it's extra code in the driver so mostly your code to
> maintain.
> 
> I think that if we implement what I described above - everyone will
> use that on a single core setup, so this set would be dead code
> (assuming RQ is sized appropriately). But again, your call :)
> 

Now, I agree that the busy-poll you describe above would be the best
option, but from my perspective it's a much larger effort that requires
experimentation. I will explore that, but I still think this series
should go in sooner to make the single-core scenario usable *today*.

Ok, back to the busy-poll ideas. I'll call your idea "strict busy-poll",
i.e. the NAPI loop is *only* driven by userland, and interrupts stay
disabled. "Syscall driven poll-mode driver". :-)

On the driver side (again, only talking Intel here, since that's what I
know the details of), the NAPI context would only cover AF_XDP queues,
so that other queues are not starved.

Any ideas how strict busy-poll would look, API/implementation-wise? An
option only for AF_XDP sockets? Would this make sense for regular
sockets? If so, maybe extend the existing NAPI busy-poll with a "strict"
mode?

I'll start playing around a bit, but again, I think this simple series
should go in just to make AF_XDP single core usable *today*.


Thanks!
Björn


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-04 13:59 ` [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring " Björn Töpel
@ 2020-09-08 10:32   ` Maxim Mikityanskiy
  2020-09-08 11:37     ` Magnus Karlsson
  2020-09-09 15:37     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 26+ messages in thread
From: Maxim Mikityanskiy @ 2020-09-08 10:32 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel
  Cc: netdev, bpf, Maxim Mikityanskiy, magnus.karlsson, davem, kuba,
	hawk, john.fastabend, intel-wired-lan

On 2020-09-04 16:59, Björn Töpel wrote:
> On 2020-09-04 15:53, Björn Töpel wrote:
>> This series addresses a problem that arises when AF_XDP zero-copy is 
>> enabled, and the kernel softirq Rx processing and userland process
>> is running on the same core.
>>
> [...]
>>
> 
> @Maxim I'm not well versed in Mellanox drivers. Would this be relevant 
> to mlx5 as well?

Thanks for letting me know about this series! So the basic idea is to 
stop processing hardware completions if the RX ring gets full, because 
the application didn't have a chance to run? Yes, I think it's also 
relevant to mlx5, the issue is not driver-specific, and a similar fix is 
applicable. However, it may lead to completion queue overflows - some 
analysis is needed to understand what happens then and how to handle it.

Regarding the feature, I think it should be opt-in (disabled by 
default), because old applications may not wake up RX after they process 
packets in the RX ring. Is it required to change xdpsock accordingly? 
Also, when need_wakeup is disabled, your driver implementation seems to 
quit NAPI anyway, but that shouldn't happen, because no one will send a 
wakeup.

Waiting until the RX ring fills up, then passing control to the 
application and waiting until the hardware completion queue fills up, 
and so on increases latency - the busy polling approach sounds more 
legit here.

The behavior may be different depending on the driver implementation:

1. If you arm the completion queue and leave interrupts enabled on early 
exit too, the application will soon be interrupted anyway and won't have 
much time to process many packets, leading to app <-> NAPI ping-pong one 
packet at a time, making NAPI inefficient.

2. If you don't arm the completion queue on early exit and wait for the 
explicit wakeup from the application, it will easily overflow the 
hardware completion queue, because we don't have a symmetric mechanism 
to stop the application on imminent hardware queue overflow. It doesn't 
feel correct and may be trickier to handle: if the application is too 
slow, such drops should happen on driver/kernel level, not in hardware.

Which behavior is used in your drivers? Or am I missing some more options?

BTW, it would be better to pass control to the application before the 
first dropped packet, not after it has been dropped.

Some workloads different from pure AF_XDP, for example 50/50 AF_XDP and 
XDP_TX, may suffer from such behavior, so it's another argument for an 
application-level knob to enable/disable it.

From the driver API perspective, I would prefer to see a simpler API if 
possible. The current API exposes things that the driver shouldn't know 
(BPF map type), and requires XSK-specific handling. It would be better 
if some specific error code returned from xdp_do_redirect was reserved 
to mean "exit NAPI early if you support it". This way we wouldn't need 
two new helpers, two xdp_do_redirect functions, and this approach would 
be extensible to other non-XSK use cases without further changes in the 
driver, and also the logic to opt-in the feature could be put inside the 
kernel.
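The reserved-error-code contract proposed here can be sketched as a small user-space model. All names below (sim_*, RING_SIZE) are illustrative, not the real kernel API; only -ENOBUFS matches the error code discussed in this series:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the proposal: a reserved error code from
 * xdp_do_redirect() ("destination ring full") tells the driver to
 * leave its NAPI poll loop early instead of dropping packets. */

#define RING_SIZE 16

struct sim_ring {
	int entries;	/* filled descriptors in the AF_XDP Rx ring */
};

/* Mock of xdp_do_redirect(): fails with -ENOBUFS once the ring is full. */
static int sim_xdp_do_redirect(struct sim_ring *rx)
{
	if (rx->entries >= RING_SIZE)
		return -ENOBUFS;	/* reserved code: "exit NAPI early" */
	rx->entries++;
	return 0;
}

/* Simplified NAPI poll: process up to @budget packets, but stop as
 * soon as the redirect target signals it cannot take more, yielding
 * the core to userland rather than dropping. */
static int sim_napi_poll(struct sim_ring *rx, int budget)
{
	int done = 0;

	while (done < budget) {
		if (sim_xdp_do_redirect(rx) == -ENOBUFS)
			break;
		done++;
	}
	return done;
}
```

Because the driver only inspects the error code, it needs no knowledge of the BPF map type, and the same contract could later be reused for other full-destination cases.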

Thanks,
Max

> 
> Cheers,
> Björn


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-08 10:32   ` Maxim Mikityanskiy
@ 2020-09-08 11:37     ` Magnus Karlsson
  2020-09-08 12:21       ` Björn Töpel
  2020-09-09 15:37     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 26+ messages in thread
From: Magnus Karlsson @ 2020-09-08 11:37 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Björn Töpel, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, bpf, Maxim Mikityanskiy,
	Karlsson, Magnus, David S. Miller, kuba, hawk, John Fastabend,
	intel-wired-lan

On Tue, Sep 8, 2020 at 12:33 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2020-09-04 16:59, Björn Töpel wrote:
> > On 2020-09-04 15:53, Björn Töpel wrote:
> >> This series addresses a problem that arises when AF_XDP zero-copy is
> >> enabled, and the kernel softirq Rx processing and userland process
> >> is running on the same core.
> >>
> > [...]
> >>
> >
> > @Maxim I'm not well versed in Mellanox drivers. Would this be relevant
> > to mlx5 as well?
>
> Thanks for letting me know about this series! So the basic idea is to
> stop processing hardware completions if the RX ring gets full, because
> the application didn't have chance to run? Yes, I think it's also
> relevant to mlx5, the issue is not driver-specific, and a similar fix is
> applicable. However, it may lead to completion queue overflows - some
> analysis is needed to understand what happens then and how to handle it.
>
> Regarding the feature, I think it should be opt-in (disabled by
> default), because old applications may not wakeup RX after they process
> packets in the RX ring.

How about need_wakeup enable/disable at bind time serving as that
opt-in, instead of a new option? It is off by default, and when it is
off, the driver busy-spins on the Rx ring until it can put an entry
there. It will not yield to the application by returning something less
than budget, and applications need not check the need_wakeup flag. If
need_wakeup is enabled by the user, the contract is that user-space
needs to check the need_wakeup flag and act on it. If it does not,
then that is a programming error, since the flag can be set for any
unspecified reason. There is no reason to modify an application that
already checks need_wakeup. Whether this patch behaves like that, I
have not checked.
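The contract Magnus describes can be modeled in a few lines. This is a toy sketch with made-up names (sim_*); in the real API the flag lives in the ring's flags field (readable via libbpf's xsk_ring_prod__needs_wakeup()) and the wakeup is a sendto()/poll() syscall:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the need_wakeup contract chosen at bind time. */
struct sim_xsk {
	bool need_wakeup_enabled;	/* opt-in, set at bind time */
	bool need_wakeup;		/* set by "kernel" on early exit */
	bool napi_running;
};

/* "Driver" side: on a full Rx ring, only exit NAPI early if userland
 * has agreed to watch the flag and kick us again; otherwise keep
 * spinning, because no wakeup would ever arrive. */
static void sim_rx_ring_full(struct sim_xsk *xs)
{
	if (xs->need_wakeup_enabled) {
		xs->need_wakeup = true;
		xs->napi_running = false;	/* early exit */
	}
}

/* Application side: after draining the ring, honor the contract by
 * waking the driver (models the sendto() wakeup). */
static void sim_app_after_rx(struct sim_xsk *xs)
{
	if (xs->need_wakeup_enabled && xs->need_wakeup) {
		xs->need_wakeup = false;
		xs->napi_running = true;
	}
}
```

The point of the model is the asymmetry: with need_wakeup disabled, the driver must never exit early, which is exactly the case the patch reportedly gets wrong.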

Good points in the rest of the mail, that I think should be addressed.

/Magnus

> Is it required to change xdpsock accordingly?
> Also, when need_wakeup is disabled, your driver implementation seems to
> quit NAPI anyway, but it shouldn't happen, because no one will send a
> wakeup.
>
> Waiting until the RX ring fills up, then passing control to the
> application and waiting until the hardware completion queue fills up,
> and so on increases latency - the busy polling approach sounds more
> legit here.
>
> The behavior may be different depending on the driver implementation:
>
> 1. If you arm the completion queue and leave interrupts enabled on early
> exit too, the application will soon be interrupted anyway and won't have
> much time to process many packets, leading to app <-> NAPI ping-pong one
> packet at a time, making NAPI inefficient.
>
> 2. If you don't arm the completion queue on early exit and wait for the
> explicit wakeup from the application, it will easily overflow the
> hardware completion queue, because we don't have a symmetric mechanism
> to stop the application on imminent hardware queue overflow. It doesn't
> feel correct and may be trickier to handle: if the application is too
> slow, such drops should happen on driver/kernel level, not in hardware.
>
> Which behavior is used in your drivers? Or am I missing some more options?
>
> BTW, it should be better to pass control to the application before the
> first dropped packet, not after it has been dropped.
>
> Some workloads different from pure AF_XDP, for example, 50/50 AF_XDP and
> XDP_TX may suffer from such behavior, so it's another point to make a
> knob on the application layer to enable/disable it.
>
>  From the driver API perspective, I would prefer to see a simpler API if
> possible. The current API exposes things that the driver shouldn't know
> (BPF map type), and requires XSK-specific handling. It would be better
> if some specific error code returned from xdp_do_redirect was reserved
> to mean "exit NAPI early if you support it". This way we wouldn't need
> two new helpers, two xdp_do_redirect functions, and this approach would
> be extensible to other non-XSK use cases without further changes in the
> driver, and also the logic to opt-in the feature could be put inside the
> kernel.
>
> Thanks,
> Max
>
> >
> > Cheers,
> > Björn
>


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-08 11:37     ` Magnus Karlsson
@ 2020-09-08 12:21       ` Björn Töpel
  0 siblings, 0 replies; 26+ messages in thread
From: Björn Töpel @ 2020-09-08 12:21 UTC (permalink / raw)
  To: Magnus Karlsson, Maxim Mikityanskiy
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Network Development, bpf, Maxim Mikityanskiy, Karlsson, Magnus,
	David S. Miller, kuba, hawk, John Fastabend, intel-wired-lan

On 2020-09-08 13:37, Magnus Karlsson wrote:
> On Tue, Sep 8, 2020 at 12:33 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>>
>> On 2020-09-04 16:59, Björn Töpel wrote:
>>> On 2020-09-04 15:53, Björn Töpel wrote:
>>>> This series addresses a problem that arises when AF_XDP zero-copy is
>>>> enabled, and the kernel softirq Rx processing and userland process
>>>> is running on the same core.
>>>>
>>> [...]
>>>>
>>>
>>> @Maxim I'm not well versed in Mellanox drivers. Would this be relevant
>>> to mlx5 as well?
>>
>> Thanks for letting me know about this series! So the basic idea is to
>> stop processing hardware completions if the RX ring gets full, because
>> the application didn't have chance to run? Yes, I think it's also
>> relevant to mlx5, the issue is not driver-specific, and a similar fix is
>> applicable. However, it may lead to completion queue overflows - some
>> analysis is needed to understand what happens then and how to handle it.
>>
>> Regarding the feature, I think it should be opt-in (disabled by
>> default), because old applications may not wakeup RX after they process
>> packets in the RX ring.
>

First of all; Max, thanks for some really good input as usual!


> How about need_wakeup enable/disable at bind time being that opt-in,
> instead of a new option? It is off by default, and when it is off, the
> driver busy-spins on the Rx ring until it can put an entry there. It
> will not yield to the application by returning something less than
> budget. Applications need not check the need_wakeup flag. If
> need_wakeup is enabled by the user, the contract is that user-space
> needs to check the need_wakeup flag and act on it. If it does not,
> then that is a programming error and it can be set for any unspecified
> reason. No reason to modify the application, if it checks need_wakeup.
> But if this patch behaves like that I have not checked.
>

It does not behave exactly like this. If need_wakeup is enabled, the
NAPI loop exits, but if not, the NAPI is rearmed, so we get a less
efficient loop. I'll need to address this.

I'm leaning towards a more explicit opt-in like Max suggested. As Max 
pointed out, AF_XDP/XDP_TX is actually a non-insane(!) way of using 
AF_XDP zero-copy, which will suffer from the early exit.


> Good points in the rest of the mail, that I think should be addressed.
>

Yeah, I agree. I will take a step back and rethink. I'll experiment with
the busy-polling suggested by Jakub, and also with an opt-in early exit.

Thanks for all the suggestions folks!


Cheers,
Björn


> /Magnus
> 
>> Is it required to change xdpsock accordingly?
>> Also, when need_wakeup is disabled, your driver implementation seems to
>> quit NAPI anyway, but it shouldn't happen, because no one will send a
>> wakeup.
>>
>> Waiting until the RX ring fills up, then passing control to the
>> application and waiting until the hardware completion queue fills up,
>> and so on increases latency - the busy polling approach sounds more
>> legit here.
>>
>> The behavior may be different depending on the driver implementation:
>>
>> 1. If you arm the completion queue and leave interrupts enabled on early
>> exit too, the application will soon be interrupted anyway and won't have
>> much time to process many packets, leading to app <-> NAPI ping-pong one
>> packet at a time, making NAPI inefficient.
>>
>> 2. If you don't arm the completion queue on early exit and wait for the
>> explicit wakeup from the application, it will easily overflow the
>> hardware completion queue, because we don't have a symmetric mechanism
>> to stop the application on imminent hardware queue overflow. It doesn't
>> feel correct and may be trickier to handle: if the application is too
>> slow, such drops should happen on driver/kernel level, not in hardware.
>>
>> Which behavior is used in your drivers? Or am I missing some more options?
>>
>> BTW, it should be better to pass control to the application before the
>> first dropped packet, not after it has been dropped.
>>
>> Some workloads different from pure AF_XDP, for example, 50/50 AF_XDP and
>> XDP_TX may suffer from such behavior, so it's another point to make a
>> knob on the application layer to enable/disable it.
>>
>>   From the driver API perspective, I would prefer to see a simpler API if
>> possible. The current API exposes things that the driver shouldn't know
>> (BPF map type), and requires XSK-specific handling. It would be better
>> if some specific error code returned from xdp_do_redirect was reserved
>> to mean "exit NAPI early if you support it". This way we wouldn't need
>> two new helpers, two xdp_do_redirect functions, and this approach would
>> be extensible to other non-XSK use cases without further changes in the
>> driver, and also the logic to opt-in the feature could be put inside the
>> kernel.
>>
>> Thanks,
>> Max
>>
>>>
>>> Cheers,
>>> Björn
>>


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-08  6:58           ` Björn Töpel
@ 2020-09-08 17:24             ` Jakub Kicinski
  2020-09-08 18:28               ` Björn Töpel
  0 siblings, 1 reply; 26+ messages in thread
From: Jakub Kicinski @ 2020-09-08 17:24 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet, ast,
	daniel, netdev, bpf, magnus.karlsson, davem, john.fastabend,
	intel-wired-lan

On Tue, 8 Sep 2020 08:58:30 +0200 Björn Töpel wrote:
> >> As for this patch set, I think it would make sense to pull it in since
> >> it makes the single-core scenario *much* better, and it is pretty
> >> simple. Then do the application polling as another, potentially,
> >> improvement series.  
> > 
> > Up to you, it's extra code in the driver so mostly your code to
> > maintain.
> > 
> > I think that if we implement what I described above - everyone will
> > use that on a single core setup, so this set would be dead code
> > (assuming RQ is sized appropriately). But again, your call :)
> 
> Now, I agree that the busy-poll you describe above would be the best
> option, but from my perspective it's a much larger set that involves
> experimenting. I will explore that, but I still think this series should
> go in sooner to make the single core scenario usable *today*.
> 
> Ok, back to the busy-poll ideas. I'll call your idea "strict busy-poll",
> i.e. the NAPI loop is *only* driven by userland, and interrupts stay
> disabled. "Syscall driven poll-mode driver". :-)
> 
> On the driver side (again, only talking Intel here, since that's what I
> know the details of), the NAPI context would only cover AF_XDP queues,
> so that other queues are not starved.
> 
> Any ideas how strict busy-poll would look, API/implmentation-wise? An
> option only for AF_XDP sockets? Would this make sense to regular
> sockets? If so, maybe extend the existing NAPI busy-poll with a "strict"
> mode?

For AF_XDP and other sockets I think it should be quite straightforward.

For AF_XDP just implement current busy poll.

Then for all socket types add a new sockopt which sets "timeout" on how
long the IRQs can be suppressed for (we don't want application crash or
hang to knock the system off the network), or just enables the feature
and the timeout is from a sysctl.

Then make sure that at the end of polling NAPI doesn't get scheduled,
and set some bit which will prevent napi_schedule_prep() from letting
normal IRQ processing schedule it, too. Set a timer for the timeout
handling to undo all this.
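The IRQ-deferral-with-watchdog idea sketched above can be modeled abstractly (times are abstract ticks; all names are illustrative, not kernel API):

```c
#include <assert.h>

/* Toy model of syscall-driven NAPI with an IRQ-suppression timeout:
 * busy polling keeps interrupts masked, but a watchdog re-arms them
 * if the application stops polling, so a crashed or hung process
 * cannot knock the system off the network. */
struct sim_napi {
	int irq_suppressed_until;	/* deadline; 0 = IRQs armed */
};

/* Each busy-poll syscall pushes the suppression deadline forward. */
static void sim_busy_poll(struct sim_napi *n, int now, int timeout)
{
	n->irq_suppressed_until = now + timeout;
}

/* Timer/housekeeping path: past the deadline, undo the suppression
 * and fall back to normal IRQ-driven processing. */
static int sim_irqs_armed(struct sim_napi *n, int now)
{
	if (n->irq_suppressed_until && now >= n->irq_suppressed_until)
		n->irq_suppressed_until = 0;	/* timeout hit: re-arm */
	return n->irq_suppressed_until == 0;
}
```

For what it's worth, this is roughly the shape of the opt-in that later work in this area took (socket options controlling preferred busy polling and its budget), though the details here are only a sketch of the idea in this mail.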

What I haven't figured out in my head is how/if this relates to the
ongoing wq/threaded NAPI polling work 🤔 but that shouldn't stop you.

> I'll start playing around a bit, but again, I think this simple series
> should go in just to make AF_XDP single core usable *today*.

No objection from me.


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-08 17:24             ` Jakub Kicinski
@ 2020-09-08 18:28               ` Björn Töpel
  2020-09-08 18:34                 ` Jakub Kicinski
  0 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2020-09-08 18:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet, ast,
	daniel, netdev, bpf, magnus.karlsson, davem, john.fastabend,
	intel-wired-lan

On 2020-09-08 19:24, Jakub Kicinski wrote:
>> I'll start playing around a bit, but again, I think this simple series
>> should go in just to make AF_XDP single core usable*today*.
> No objection from me.

Thanks Jakub, but as you (probably) noticed in the other thread Maxim 
had some valid concerns. Let's drop this for now, and I'll get back 
after some experimentation/hacking.


Again, thanks for the ideas! Very much appreciated!
Björn


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-08 18:28               ` Björn Töpel
@ 2020-09-08 18:34                 ` Jakub Kicinski
  0 siblings, 0 replies; 26+ messages in thread
From: Jakub Kicinski @ 2020-09-08 18:34 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Jesper Dangaard Brouer, Björn Töpel, Eric Dumazet, ast,
	daniel, netdev, bpf, magnus.karlsson, davem, john.fastabend,
	intel-wired-lan

On Tue, 8 Sep 2020 20:28:14 +0200 Björn Töpel wrote:
> On 2020-09-08 19:24, Jakub Kicinski wrote:
> >> I'll start playing around a bit, but again, I think this simple series
> >> should go in just to make AF_XDP single core usable*today*.  
> > No objection from me.  
> 
> Thanks Jakub, but as you (probably) noticed in the other thread Maxim 
> had some valid concerns. Let's drop this for now, and I'll get back 
> after some experimentation/hacking.

Yeah, I sort of assumed you got the wake-up problem down :S

If it gets complicated it may not be worth pursuing this optimization.


* Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
  2020-09-08 10:32   ` Maxim Mikityanskiy
  2020-09-08 11:37     ` Magnus Karlsson
@ 2020-09-09 15:37     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 26+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-09 15:37 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: brouer, Björn Töpel, Björn Töpel, ast,
	daniel, netdev, bpf, Maxim Mikityanskiy, magnus.karlsson, davem,
	kuba, hawk, john.fastabend, intel-wired-lan

On Tue, 8 Sep 2020 13:32:01 +0300
Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

>  From the driver API perspective, I would prefer to see a simpler API if 
> possible. The current API exposes things that the driver shouldn't know 
> (BPF map type), and requires XSK-specific handling. It would be better 
> if some specific error code returned from xdp_do_redirect was reserved 
> to mean "exit NAPI early if you support it". This way we wouldn't need 
> two new helpers, two xdp_do_redirect functions, and this approach would 
> be extensible to other non-XSK use cases without further changes in the 
> driver, and also the logic to opt-in the feature could be put inside the 
> kernel.

I agree.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Thread overview: 26+ messages
2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
2020-09-04 13:53 ` [PATCH bpf-next 1/6] xsk: improve xdp_do_redirect() error codes Björn Töpel
2020-09-04 13:53 ` [PATCH bpf-next 2/6] xdp: introduce xdp_do_redirect_ext() function Björn Töpel
2020-09-04 13:53 ` [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper Björn Töpel
2020-09-04 15:11   ` Jesper Dangaard Brouer
2020-09-04 15:39     ` Björn Töpel
2020-09-07 12:45       ` Jesper Dangaard Brouer
2020-09-04 13:53 ` [PATCH bpf-next 4/6] i40e, xsk: finish napi loop if AF_XDP Rx queue is full Björn Töpel
2020-09-04 13:53 ` [PATCH bpf-next 5/6] ice, " Björn Töpel
2020-09-04 13:53 ` [PATCH bpf-next 6/6] ixgbe, " Björn Töpel
2020-09-04 15:35   ` Jesper Dangaard Brouer
2020-09-04 15:54     ` Björn Töpel
2020-09-04 13:59 ` [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring " Björn Töpel
2020-09-08 10:32   ` Maxim Mikityanskiy
2020-09-08 11:37     ` Magnus Karlsson
2020-09-08 12:21       ` Björn Töpel
2020-09-09 15:37     ` Jesper Dangaard Brouer
2020-09-04 14:27 ` Jesper Dangaard Brouer
2020-09-04 14:32   ` Björn Töpel
2020-09-04 23:58     ` Jakub Kicinski
2020-09-07 13:37       ` Björn Töpel
2020-09-07 18:40         ` Jakub Kicinski
2020-09-08  6:58           ` Björn Töpel
2020-09-08 17:24             ` Jakub Kicinski
2020-09-08 18:28               ` Björn Töpel
2020-09-08 18:34                 ` Jakub Kicinski
