netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/3] net: provide macros for commonly copied lockless queue stop/wake code
@ 2023-04-01  5:12 Jakub Kicinski
  2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
                   ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-01  5:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, Jakub Kicinski

Drivers often struggle to implement lockless xmit and cleanup.
Add macros wrapping commonly copied implementation.
Convert two example drivers.

Jakub Kicinski (3):
  net: provide macros for commonly copied lockless queue stop/wake code
  ixgbe: use new queue try_stop/try_wake macros
  bnxt: use new queue try_stop/try_wake macros

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  41 +----
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  37 +---
 include/net/netdev_queues.h                   | 167 ++++++++++++++++++
 3 files changed, 184 insertions(+), 61 deletions(-)
 create mode 100644 include/net/netdev_queues.h

-- 
2.39.2


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01  5:12 [PATCH net-next 0/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
@ 2023-04-01  5:12 ` Jakub Kicinski
  2023-04-01 15:04   ` Heiner Kallweit
  2023-04-01 15:18   ` Heiner Kallweit
  2023-04-01  5:12 ` [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros Jakub Kicinski
  2023-04-01  5:12 ` [PATCH net-next 3/3] bnxt: " Jakub Kicinski
  2 siblings, 2 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-01  5:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, Jakub Kicinski

A lot of drivers follow the same scheme to stop / start queues
without introducing locks between xmit and NAPI tx completions.
I'm guessing they all copy'n'paste each other's code.

Smaller drivers shy away from the scheme and introduce a lock
which may cause deadlocks in netpoll.

Provide macros which encapsulate the necessary logic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
v2:
 - really flip the unlikely into a likely in __netif_tx_queue_maybe_wake()
 - convert if / else into pre-init of _ret
v1: https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
 - perdicate -> predicate
 - on race use start instead of wake and make a note of that
   in the doc / comment at the start
rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
---
 include/net/netdev_queues.h | 167 ++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 include/net/netdev_queues.h

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
new file mode 100644
index 000000000000..d050eb5e5bea
--- /dev/null
+++ b/include/net/netdev_queues.h
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_NET_QUEUES_H
+#define _LINUX_NET_QUEUES_H
+
+#include <linux/netdevice.h>
+
+/* Lockless queue stopping / waking helpers.
+ *
+ * These macros are designed to safely implement stopping and waking
+ * netdev queues without full lock protection. We assume that there can
+ * be no concurrent stop attempts and no concurrent wake attempts.
+ * The try-stop should happen from the xmit handler*, while wake up
+ * should be triggered from NAPI poll context. The two may run
+ * concurrently (single producer, single consumer).
+ *
+ * All descriptor ring indexes (and other relevant shared state) must
+ * be updated before invoking the macros.
+ *
+ * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
+ *   instead of netif_tx_wake_queue()) so uses outside of the xmit
+ *   handler may lead to bugs
+ */
+
+#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
+	({								\
+		int _res;						\
+									\
+		netif_tx_stop_queue(txq);				\
+									\
+		smp_mb();						\
+									\
+		/* We need to check again in a case another		\
+		 * CPU has just made room available.			\
+		 */							\
+		_res = 0;						\
+		if (unlikely(get_desc >= start_thrs)) {			\
+			netif_tx_start_queue(txq);			\
+			_res = -1;					\
+		}							\
+		_res;							\
+	})								\
+
+/**
+ * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @stop_thrs:	minimal number of available descriptors for queue to be left
+ *		enabled
+ * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
+ *		equal to @stop_thrs or higher to avoid frequent waking
+ *
+ * All arguments may be evaluated multiple times, beware of side effects.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ * Expected to be used from ndo_start_xmit, see the comment on top of the file.
+ *
+ * Returns:
+ *	 0 if the queue was stopped
+ *	 1 if the queue was left enabled
+ *	-1 if the queue was re-enabled (raced with waking)
+ */
+#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
+	({								\
+		int _res;						\
+									\
+		_res = 1;						\
+		if (unlikely(get_desc < stop_thrs))			\
+			_res = netif_tx_queue_try_stop(txq, get_desc,	\
+						       start_thrs);	\
+		_res;							\
+	})								\
+
+#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		/* Make sure that anybody stopping the queue after	\
+		 * this sees the new next_to_clean.			\
+		 */							\
+		smp_mb();						\
+		_res = 1;						\
+		if (unlikely(netif_tx_queue_stopped(txq)) && !(down_cond)) { \
+			netif_tx_wake_queue(txq);			\
+			_res = 0;					\
+		}							\
+		_res;							\
+	})
+
+#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
+
+/**
+ * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @start_thrs:	minimal number of descriptors to re-enable the queue
+ * @down_cond:	down condition, predicate indicating that the queue should
+ *		not be woken up even if descriptors are available
+ *
+ * All arguments may be evaluated multiple times.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ *
+ * Returns:
+ *	 0 if the queue was woken up
+ *	 1 if the queue was already enabled (or disabled but @down_cond is true)
+ *	-1 if the queue was left stopped
+ */
+#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		_res = -1;						\
+		if (likely(get_desc > start_thrs))			\
+			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
+							 start_thrs,	\
+							 down_cond);	\
+		_res;							\
+	})
+
+#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
+
+/* subqueue variants follow */
+
+#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
+	})
+
+#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_maybe_stop(txq, get_desc,		\
+					  stop_thrs, start_thrs);	\
+	})
+
+#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_try_wake(txq, get_desc,		\
+					  start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
+	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
+
+#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_maybe_wake(txq, get_desc,		\
+					    start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
+	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
+
+#endif
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros
  2023-04-01  5:12 [PATCH net-next 0/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
  2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
@ 2023-04-01  5:12 ` Jakub Kicinski
  2023-04-01  5:12 ` [PATCH net-next 3/3] bnxt: " Jakub Kicinski
  2 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-01  5:12 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, Jakub Kicinski, jesse.brandeburg,
	anthony.l.nguyen

Convert ixgbe to use the new macros, I think a lot of people
copy the ixgbe code. The only functional change is that the
unlikely() in ixgbe_clean_tx_irq() turns into a likely()
inside the new macro and no longer includes

  total_packets && netif_carrier_ok(tx_ring->netdev)

which is probably for the best, anyway.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jesse.brandeburg@intel.com
CC: anthony.l.nguyen@intel.com
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 37 +++++--------------
 1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 773c35fecace..db00e50a40ff 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -36,6 +36,7 @@
 #include <net/tc_act/tc_mirred.h>
 #include <net/vxlan.h>
 #include <net/mpls.h>
+#include <net/netdev_queues.h>
 #include <net/xdp_sock_drv.h>
 #include <net/xfrm.h>
 
@@ -1253,20 +1254,12 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 				  total_packets, total_bytes);
 
 #define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)
-	if (unlikely(total_packets && netif_carrier_ok(tx_ring->netdev) &&
-		     (ixgbe_desc_unused(tx_ring) >= TX_WAKE_THRESHOLD))) {
-		/* Make sure that anybody stopping the queue after this
-		 * sees the new next_to_clean.
-		 */
-		smp_mb();
-		if (__netif_subqueue_stopped(tx_ring->netdev,
-					     tx_ring->queue_index)
-		    && !test_bit(__IXGBE_DOWN, &adapter->state)) {
-			netif_wake_subqueue(tx_ring->netdev,
-					    tx_ring->queue_index);
-			++tx_ring->tx_stats.restart_queue;
-		}
-	}
+	if (total_packets && netif_carrier_ok(tx_ring->netdev) &&
+	    !__netif_subqueue_maybe_wake(tx_ring->netdev, tx_ring->queue_index,
+					 ixgbe_desc_unused(tx_ring),
+					 TX_WAKE_THRESHOLD,
+					 test_bit(__IXGBE_DOWN, &adapter->state)))
+		++tx_ring->tx_stats.restart_queue;
 
 	return !!budget;
 }
@@ -8270,22 +8263,10 @@ static void ixgbe_tx_olinfo_status(union ixgbe_adv_tx_desc *tx_desc,
 
 static int __ixgbe_maybe_stop_tx(struct ixgbe_ring *tx_ring, u16 size)
 {
-	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
-
-	/* Herbert's original patch had:
-	 *  smp_mb__after_netif_stop_queue();
-	 * but since that doesn't exist yet, just open code it.
-	 */
-	smp_mb();
-
-	/* We need to check again in a case another CPU has just
-	 * made room available.
-	 */
-	if (likely(ixgbe_desc_unused(tx_ring) < size))
+	if (!netif_subqueue_try_stop(tx_ring->netdev, tx_ring->queue_index,
+				     ixgbe_desc_unused(tx_ring), size))
 		return -EBUSY;
 
-	/* A reprieve! - use start_queue because it doesn't call schedule */
-	netif_start_subqueue(tx_ring->netdev, tx_ring->queue_index);
 	++tx_ring->tx_stats.restart_queue;
 	return 0;
 }
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH net-next 3/3] bnxt: use new queue try_stop/try_wake macros
  2023-04-01  5:12 [PATCH net-next 0/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
  2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
  2023-04-01  5:12 ` [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros Jakub Kicinski
@ 2023-04-01  5:12 ` Jakub Kicinski
  2023-04-01 18:35   ` Michael Chan
  2 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-01  5:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, Jakub Kicinski, michael.chan

Convert bnxt to use new macros rather than open code the logic.
Two differences:
(1) bnxt_tx_int() will now only issue a memory barrier if it sees
    enough space on the ring to wake the queue. This should be fine,
    the mb() is between the writes to the ring pointers and checking
    queue state.
(2) we'll start the queue instead of waking on race, this should
    be safe inside the xmit handler.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: michael.chan@broadcom.com
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 41 +++++------------------
 1 file changed, 8 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 8ff5a4f98d6f..2a5fed0da1a9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -56,6 +56,7 @@
 #include <linux/hwmon-sysfs.h>
 #include <net/page_pool.h>
 #include <linux/align.h>
+#include <net/netdev_queues.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
@@ -331,26 +332,6 @@ static void bnxt_txr_db_kick(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
 	txr->kick_pending = 0;
 }
 
-static bool bnxt_txr_netif_try_stop_queue(struct bnxt *bp,
-					  struct bnxt_tx_ring_info *txr,
-					  struct netdev_queue *txq)
-{
-	netif_tx_stop_queue(txq);
-
-	/* netif_tx_stop_queue() must be done before checking
-	 * tx index in bnxt_tx_avail() below, because in
-	 * bnxt_tx_int(), we update tx index before checking for
-	 * netif_tx_queue_stopped().
-	 */
-	smp_mb();
-	if (bnxt_tx_avail(bp, txr) >= bp->tx_wake_thresh) {
-		netif_tx_wake_queue(txq);
-		return false;
-	}
-
-	return true;
-}
-
 static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct bnxt *bp = netdev_priv(dev);
@@ -384,7 +365,8 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (net_ratelimit() && txr->kick_pending)
 			netif_warn(bp, tx_err, dev,
 				   "bnxt: ring busy w/ flush pending!\n");
-		if (bnxt_txr_netif_try_stop_queue(bp, txr, txq))
+		if (!netif_tx_queue_try_stop(txq, bnxt_tx_avail(bp, txr),
+					     bp->tx_wake_thresh))
 			return NETDEV_TX_BUSY;
 	}
 
@@ -614,7 +596,8 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (netdev_xmit_more() && !tx_buf->is_push)
 			bnxt_txr_db_kick(bp, txr, prod);
 
-		bnxt_txr_netif_try_stop_queue(bp, txr, txq);
+		netif_tx_queue_try_stop(txq, bnxt_tx_avail(bp, txr),
+					bp->tx_wake_thresh);
 	}
 	return NETDEV_TX_OK;
 
@@ -708,17 +691,9 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int nr_pkts)
 	netdev_tx_completed_queue(txq, nr_pkts, tx_bytes);
 	txr->tx_cons = cons;
 
-	/* Need to make the tx_cons update visible to bnxt_start_xmit()
-	 * before checking for netif_tx_queue_stopped().  Without the
-	 * memory barrier, there is a small possibility that bnxt_start_xmit()
-	 * will miss it and cause the queue to be stopped forever.
-	 */
-	smp_mb();
-
-	if (unlikely(netif_tx_queue_stopped(txq)) &&
-	    bnxt_tx_avail(bp, txr) >= bp->tx_wake_thresh &&
-	    READ_ONCE(txr->dev_state) != BNXT_DEV_STATE_CLOSING)
-		netif_tx_wake_queue(txq);
+	__netif_tx_queue_maybe_wake(txq, bnxt_tx_avail(bp, txr),
+				    bp->tx_wake_thresh,
+				    READ_ONCE(txr->dev_state) != BNXT_DEV_STATE_CLOSING);
 }
 
 static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
@ 2023-04-01 15:04   ` Heiner Kallweit
  2023-04-01 18:03     ` Jakub Kicinski
  2023-04-01 15:18   ` Heiner Kallweit
  1 sibling, 1 reply; 29+ messages in thread
From: Heiner Kallweit @ 2023-04-01 15:04 UTC (permalink / raw)
  To: Jakub Kicinski, davem; +Cc: netdev, edumazet, pabeni

On 01.04.2023 07:12, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> v2:
>  - really flip the unlikely into a likely in __netif_tx_queue_maybe_wake()
>  - convert if / else into pre-init of _ret
> v1: https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
> ---
>  include/net/netdev_queues.h | 167 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..d050eb5e5bea
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,167 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up
> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\

Wouldn't a smp_mb__after_atomic() be sufficient here, because netif_tx_stop_queue()
includes a set_bit()? At least on X86 this would result in a no-op.

> +									\
> +		/* We need to check again in a case another		\
> +		 * CPU has just made room available.			\
> +		 */							\
> +		_res = 0;						\
> +		if (unlikely(get_desc >= start_thrs)) {			\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +
> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> + *
> + * Returns:
> + *	 0 if the queue was stopped
> + *	 1 if the queue was left enabled
> + *	-1 if the queue was re-enabled (raced with waking)
> + */
> +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> +	({								\
> +		int _res;						\
> +									\
> +		_res = 1;						\
> +		if (unlikely(get_desc < stop_thrs))			\
> +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> +						       start_thrs);	\
> +		_res;							\
> +	})								\
> +
> +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \

Maybe I miss something, but: Why the get_desc and start_thrs parameters
if they aren't used?

> +	({								\
> +		int _res;						\
> +									\
> +		/* Make sure that anybody stopping the queue after	\
> +		 * this sees the new next_to_clean.			\
> +		 */							\
> +		smp_mb();						\
> +		_res = 1;						\
> +		if (unlikely(netif_tx_queue_stopped(txq)) && !(down_cond)) { \
> +			netif_tx_wake_queue(txq);			\
> +			_res = 0;					\
> +		}							\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
> +
> +/**
> + * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @start_thrs:	minimal number of descriptors to re-enable the queue
> + * @down_cond:	down condition, predicate indicating that the queue should
> + *		not be woken up even if descriptors are available
> + *
> + * All arguments may be evaluated multiple times.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + *
> + * Returns:
> + *	 0 if the queue was woken up
> + *	 1 if the queue was already enabled (or disabled but @down_cond is true)
> + *	-1 if the queue was left stopped
> + */
> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		_res = -1;						\
> +		if (likely(get_desc > start_thrs))			\
> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> +							 start_thrs,	\
> +							 down_cond);	\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
> +
> +/* subqueue variants follow */
> +
> +#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
> +	})
> +
> +#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_maybe_stop(txq, get_desc,		\
> +					  stop_thrs, start_thrs);	\
> +	})
> +
> +#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_try_wake(txq, get_desc,		\
> +					  start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
> +	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_maybe_wake(txq, get_desc,		\
> +					    start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
> +	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#endif


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
  2023-04-01 15:04   ` Heiner Kallweit
@ 2023-04-01 15:18   ` Heiner Kallweit
  2023-04-01 18:58     ` Jakub Kicinski
  1 sibling, 1 reply; 29+ messages in thread
From: Heiner Kallweit @ 2023-04-01 15:18 UTC (permalink / raw)
  To: Jakub Kicinski, davem; +Cc: netdev, edumazet, pabeni

On 01.04.2023 07:12, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> v2:
>  - really flip the unlikely into a likely in __netif_tx_queue_maybe_wake()
>  - convert if / else into pre-init of _ret
> v1: https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
> ---
>  include/net/netdev_queues.h | 167 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..d050eb5e5bea
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,167 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up
> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\
> +									\
> +		/* We need to check again in a case another		\
> +		 * CPU has just made room available.			\
> +		 */							\
> +		_res = 0;						\
> +		if (unlikely(get_desc >= start_thrs)) {			\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +
> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> + *
> + * Returns:
> + *	 0 if the queue was stopped
> + *	 1 if the queue was left enabled
> + *	-1 if the queue was re-enabled (raced with waking)
> + */
> +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> +	({								\
> +		int _res;						\
> +									\
> +		_res = 1;						\
> +		if (unlikely(get_desc < stop_thrs))			\
> +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> +						       start_thrs);	\
> +		_res;							\
> +	})								\
> +
> +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		/* Make sure that anybody stopping the queue after	\
> +		 * this sees the new next_to_clean.			\
> +		 */							\
> +		smp_mb();						\
> +		_res = 1;						\
> +		if (unlikely(netif_tx_queue_stopped(txq)) && !(down_cond)) { \
> +			netif_tx_wake_queue(txq);			\
> +			_res = 0;					\
> +		}							\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
> +
> +/**
> + * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @start_thrs:	minimal number of descriptors to re-enable the queue
> + * @down_cond:	down condition, predicate indicating that the queue should
> + *		not be woken up even if descriptors are available
> + *
> + * All arguments may be evaluated multiple times.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + *
> + * Returns:
> + *	 0 if the queue was woken up
> + *	 1 if the queue was already enabled (or disabled but @down_cond is true)
> + *	-1 if the queue was left stopped
> + */
> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		_res = -1;						\

One more question: Don't we need a read memory barrier here to ensure
get_desc is up-to-date?

> +		if (likely(get_desc > start_thrs))			\
> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> +							 start_thrs,	\
> +							 down_cond);	\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
> +
> +/* subqueue variants follow */
> +
> +#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
> +	})
> +
> +#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_maybe_stop(txq, get_desc,		\
> +					  stop_thrs, start_thrs);	\
> +	})
> +
> +#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_try_wake(txq, get_desc,		\
> +					  start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
> +	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_maybe_wake(txq, get_desc,		\
> +					    start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
> +	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#endif


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 15:04   ` Heiner Kallweit
@ 2023-04-01 18:03     ` Jakub Kicinski
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-01 18:03 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: davem, netdev, edumazet, pabeni

On Sat, 1 Apr 2023 17:04:17 +0200 Heiner Kallweit wrote:
> > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		netif_tx_stop_queue(txq);				\
> > +									\
> > +		smp_mb();						\  
> 
> Wouldn't a smp_mb__after_atomic() be sufficient here, because netif_tx_stop_queue()
> includes a set_bit()? At least on X86 this would result in a no-op.

Yup, good point, __after_atomic() should be perfectly fine. I didn't
think too much about the smp_mb() 'cause it's what existing code was
using and it's a slow path. I'll fix in v3.

> > +									\
> > +		/* We need to check again in a case another		\
> > +		 * CPU has just made room available.			\
> > +		 */							\
> > +		_res = 0;						\
> > +		if (unlikely(get_desc >= start_thrs)) {			\
> > +			netif_tx_start_queue(txq);			\
> > +			_res = -1;					\
> > +		}							\
> > +		_res;							\
> > +	})								\
> > +
> > +/**
> > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > + * @txq:	struct netdev_queue to stop/start
> > + * @get_desc:	get current number of free descriptors (see requirements below!)
> > + * @stop_thrs:	minimal number of available descriptors for queue to be left
> > + *		enabled
> > + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> > + *		equal to @stop_thrs or higher to avoid frequent waking
> > + *
> > + * All arguments may be evaluated multiple times, beware of side effects.
> > + * @get_desc must be a formula or a function call, it must always
> > + * return up-to-date information when evaluated!
> > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> > + *
> > + * Returns:
> > + *	 0 if the queue was stopped
> > + *	 1 if the queue was left enabled
> > + *	-1 if the queue was re-enabled (raced with waking)
> > + */
> > +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		_res = 1;						\
> > +		if (unlikely(get_desc < stop_thrs))			\
> > +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> > +						       start_thrs);	\
> > +		_res;							\
> > +	})								\
> > +
> > +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \  
> 
> Maybe I miss something, but: Why the get_desc and start_thrs parameters
> if they aren't used?

Copy'n'paste fail, will fix :(

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 3/3] bnxt: use new queue try_stop/try_wake macros
  2023-04-01  5:12 ` [PATCH net-next 3/3] bnxt: " Jakub Kicinski
@ 2023-04-01 18:35   ` Michael Chan
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Chan @ 2023-04-01 18:35 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni

[-- Attachment #1: Type: text/plain, Size: 729 bytes --]

On Fri, Mar 31, 2023 at 10:12 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Convert bnxt to use new macros rather than open code the logic.
> Two differences:
> (1) bnxt_tx_int() will now only issue a memory barrier if it sees
>     enough space on the ring to wake the queue. This should be fine,
>     the mb() is between the writes to the ring pointers and checking
>     queue state.
> (2) we'll start the queue instead of waking on race, this should
>     be safe inside the xmit handler.
>

Reviewed that these 2 differences should be fine.  Thanks.

Reviewed-by: Michael Chan <michael.chan@broadcom.com>

> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: michael.chan@broadcom.com
> ---

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 15:18   ` Heiner Kallweit
@ 2023-04-01 18:58     ` Jakub Kicinski
  2023-04-01 20:41       ` Heiner Kallweit
  2023-04-03 15:18       ` Alexander Duyck
  0 siblings, 2 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-01 18:58 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: davem, netdev, edumazet, pabeni, Alexander Duyck

On Sat, 1 Apr 2023 17:18:12 +0200 Heiner Kallweit wrote:
> > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		_res = -1;						\  
> 
> One more question: Don't we need a read memory barrier here to ensure
> get_desc is up-to-date?

CC: Alex, maybe I should not be posting after 10pm, with the missing v2
and sparse CC list.. :|

I was thinking about this too yesterday. AFAICT this implementation
could indeed result in waking even tho the queue is full on non-x86.
That's why the drivers have an extra check at the start of .xmit? :(

I *think* that the right ordering would be:

WRITE cons
mb()  # A
READ stopped
rmb() # C
READ prod, cons

And on the producer side (existing):

WRITE prod
READ prod, cons
mb()  # B
WRITE stopped
READ prod, cons

But I'm slightly afraid to change it, it's been working for over 
a decade :D

One neat thing that I noticed, which we could potentially exploit 
if we were to touch this code is that BQL already has a smp_mb() 
on the consumer side. So on any kernel config and driver which support
BQL we can use that instead of adding another barrier at #A.

It would actually be a neat optimization because right now, AFAICT,
completion will fire the # A -like barrier almost every time.

> > +		if (likely(get_desc > start_thrs))			\
> > +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> > +							 start_thrs,	\
> > +							 down_cond);	\
> > +		_res;							\
> > +	})

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 18:58     ` Jakub Kicinski
@ 2023-04-01 20:41       ` Heiner Kallweit
  2023-04-03 15:18       ` Alexander Duyck
  1 sibling, 0 replies; 29+ messages in thread
From: Heiner Kallweit @ 2023-04-01 20:41 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, Alexander Duyck

On 01.04.2023 20:58, Jakub Kicinski wrote:
> On Sat, 1 Apr 2023 17:18:12 +0200 Heiner Kallweit wrote:
>>> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
>>> +	({								\
>>> +		int _res;						\
>>> +									\
>>> +		_res = -1;						\  
>>
>> One more question: Don't we need a read memory barrier here to ensure
>> get_desc is up-to-date?
> 
> CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> and sparse CC list.. :|
> 
> I was thinking about this too yesterday. AFAICT this implementation
> could indeed result in waking even tho the queue is full on non-x86.
> That's why the drivers have an extra check at the start of .xmit? :(
> 
> I *think* that the right ordering would be:
> 
> WRITE cons
> mb()  # A
> READ stopped
> rmb() # C
> READ prod, cons
> 
> And on the producer side (existing):
> 
> WRITE prod
> READ prod, cons
> mb()  # B
> WRITE stopped
> READ prod, cons
> 
> But I'm slightly afraid to change it, it's been working for over 
> a decade :D
> 
> One neat thing that I noticed, which we could potentially exploit 
> if we were to touch this code is that BQL already has a smp_mb() 
> on the consumer side. So on any kernel config and driver which support
> BQL we can use that instead of adding another barrier at #A.
> 
To me it seems ixgbe relies on this BQL-related barrier in
netdev_tx_completed_queue, maybe unintentionally.

I wonder whether we should move the smp_mb() from __netif_tx_queue_try_wake
to __netif_tx_queue_maybe_wake, before accessing get_desc.
Answer may depend on whether there's any use case where drivers would
call the "try" version directly, instead of the "may" version.
And, just thinking out loud, maybe add a flag to the macro to disable
the smp_mb() call, for cases where the driver has the required barrier
already. Setting this flag may look like this.

bool disable_barrier_in_may_wake = IS_ENABLED(CONFIG_BQL);

However this is quite some effort just to avoid a duplicated barrier.

> It would actually be a neat optimization because right now, AFAICT,
> completion will fire the # A -like barrier almost every time.
> 
>>> +		if (likely(get_desc > start_thrs))			\
>>> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
>>> +							 start_thrs,	\
>>> +							 down_cond);	\
>>> +		_res;							\
>>> +	})


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 18:58     ` Jakub Kicinski
  2023-04-01 20:41       ` Heiner Kallweit
@ 2023-04-03 15:18       ` Alexander Duyck
  2023-04-03 15:56         ` Jakub Kicinski
  2023-04-04  6:39         ` Herbert Xu
  1 sibling, 2 replies; 29+ messages in thread
From: Alexander Duyck @ 2023-04-03 15:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu

On Sat, Apr 1, 2023 at 11:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 1 Apr 2023 17:18:12 +0200 Heiner Kallweit wrote:
> > > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > > +   ({                                                              \
> > > +           int _res;                                               \
> > > +                                                                   \
> > > +           _res = -1;                                              \
> >
> > One more question: Don't we need a read memory barrier here to ensure
> > get_desc is up-to-date?
>
> CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> and sparse CC list.. :|
>
> I was thinking about this too yesterday. AFAICT this implementation
> could indeed result in waking even tho the queue is full on non-x86.
> That's why the drivers have an extra check at the start of .xmit? :(

The extra check at the start is more historical than anything else.
Logic like that has been there since the e1000 days. I think it
addressed items like pktgen which I think didn't make use of the
stop/wake flags way back when. I'll add in Herbet who was the original
author for this code so he can add some additional history if needed.

> I *think* that the right ordering would be:
>
> WRITE cons
> mb()  # A
> READ stopped
> rmb() # C
> READ prod, cons

What would the extra rmb() get you? The mb() will have already flushed
out any writes and if stopped is set the tail should have already been
written before setting it.

One other thing to keep in mind is that the wake gives itself a pretty
good runway. We are talking about enough to transmit at least 2
frames. So if another consumer is stopping it we aren't waking it
unless there is enough space for yet another frame after the current
consumer.

> And on the producer side (existing):
>
> WRITE prod
> READ prod, cons
> mb()  # B
> WRITE stopped
> READ prod, cons
>
> But I'm slightly afraid to change it, it's been working for over
> a decade :D

I wouldn't change it. The code has predated BQL in the e1000 driver
and has been that way since the inception of it I believe in 2.6.19.

> One neat thing that I noticed, which we could potentially exploit
> if we were to touch this code is that BQL already has a smp_mb()
> on the consumer side. So on any kernel config and driver which support
> BQL we can use that instead of adding another barrier at #A.
>
> It would actually be a neat optimization because right now, AFAICT,
> completion will fire the # A -like barrier almost every time.

Yeah, the fact is the barrier in the wake path may actually be
redundant if BQL is enabled. My advice is if you are wanting to get a
better idea of how this was setup you could take a look at the e1000
driver in the 2.6.19 kernel as that was where this code originated and
I am pretty certain it predates anything in any of the other Intel
drivers other than maybe e100.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 15:18       ` Alexander Duyck
@ 2023-04-03 15:56         ` Jakub Kicinski
  2023-04-03 18:11           ` Alexander Duyck
  2023-04-04  6:39         ` Herbert Xu
  1 sibling, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-03 15:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, 3 Apr 2023 08:18:04 -0700 Alexander Duyck wrote:
> On Sat, Apr 1, 2023 at 11:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > One more question: Don't we need a read memory barrier here to ensure
> > > get_desc is up-to-date?  
> >
> > CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> > and sparse CC list.. :|
> >
> > I was thinking about this too yesterday. AFAICT this implementation
> > could indeed result in waking even tho the queue is full on non-x86.
> > That's why the drivers have an extra check at the start of .xmit? :(  
> 
> The extra check at the start is more historical than anything else.
> Logic like that has been there since the e1000 days. I think it
> addressed items like pktgen which I think didn't make use of the
> stop/wake flags way back when. I'll add in Herbet who was the original
> author for this code so he can add some additional history if needed.

Thanks for the pointer, you weren't kidding with the 2.6.19, that seems
to be when to code was added to e1000 :) Looks fairly similar to the
current code minus the BQL.

> > I *think* that the right ordering would be:
> >
> > c1. WRITE cons
> > c2. mb()  # A
> > c3. READ stopped
> > c4. rmb() # C
> > c5. READ prod, cons  
> 
> What would the extra rmb() get you? The mb() will have already flushed
> out any writes and if stopped is set the tail should have already been
> written before setting it.

I don't think in terms of flushes. Let me add line numbers to the
producer and the consumer.

 c1. WRITE cons
 c2. mb()  # A
 c3. READ stopped
 c4. rmb() # C
 c5. READ prod, cons  

 p1. WRITE prod
 p2. READ prod, cons
 p3. mb()  # B
 p4. WRITE stopped
 p5. READ prod, cons

The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..

> One other thing to keep in mind is that the wake gives itself a pretty
> good runway. We are talking about enough to transmit at least 2
> frames. So if another consumer is stopping it we aren't waking it
> unless there is enough space for yet another frame after the current
> consumer.

Ack, the race is very unlikely, basically the completing CPU would have
to take an expensive IRQ between checking the descriptor count and
checking if stopped -- to let the sending CPU queue multiple frames.

But in theory the race is there, right?

> > And on the producer side (existing):
> >
> > p1. WRITE prod
> > p2. READ prod, cons
> > p3. mb()  # B
> > p4. WRITE stopped
> > p5. READ prod, cons
> >
> > But I'm slightly afraid to change it, it's been working for over
> > a decade :D  
> 
> I wouldn't change it. The code has predated BQL in the e1000 driver
> and has been that way since the inception of it I believe in 2.6.19.
> 
> > One neat thing that I noticed, which we could potentially exploit
> > if we were to touch this code is that BQL already has a smp_mb()
> > on the consumer side. So on any kernel config and driver which support
> > BQL we can use that instead of adding another barrier at #A.
> >
> > It would actually be a neat optimization because right now, AFAICT,
> > completion will fire the # A -like barrier almost every time.  
> 
> Yeah, the fact is the barrier in the wake path may actually be
> redundant if BQL is enabled. My advice is if you are wanting to get a
> better idea of how this was setup you could take a look at the e1000
> driver in the 2.6.19 kernel as that was where this code originated and
> I am pretty certain it predates anything in any of the other Intel
> drivers other than maybe e100.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 15:56         ` Jakub Kicinski
@ 2023-04-03 18:11           ` Alexander Duyck
  2023-04-03 19:03             ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Alexander Duyck @ 2023-04-03 18:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Apr 2023 08:18:04 -0700 Alexander Duyck wrote:
> > On Sat, Apr 1, 2023 at 11:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > One more question: Don't we need a read memory barrier here to ensure
> > > > get_desc is up-to-date?
> > >
> > > CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> > > and sparse CC list.. :|
> > >
> > > I was thinking about this too yesterday. AFAICT this implementation
> > > could indeed result in waking even tho the queue is full on non-x86.
> > > That's why the drivers have an extra check at the start of .xmit? :(
> >
> > The extra check at the start is more historical than anything else.
> > Logic like that has been there since the e1000 days. I think it
> > addressed items like pktgen which I think didn't make use of the
> > stop/wake flags way back when. I'll add in Herbet who was the original
> > author for this code so he can add some additional history if needed.
>
> Thanks for the pointer, you weren't kidding with the 2.6.19, that seems
> to be when to code was added to e1000 :) Looks fairly similar to the
> current code minus the BQL.
>
> > > I *think* that the right ordering would be:
> > >
> > > c1. WRITE cons
> > > c2. mb()  # A
> > > c3. READ stopped
> > > c4. rmb() # C
> > > c5. READ prod, cons
> >
> > What would the extra rmb() get you? The mb() will have already flushed
> > out any writes and if stopped is set the tail should have already been
> > written before setting it.
>
> I don't think in terms of flushes. Let me add line numbers to the
> producer and the consumer.
>
>  c1. WRITE cons
>  c2. mb()  # A
>  c3. READ stopped
>  c4. rmb() # C
>  c5. READ prod, cons
>
>  p1. WRITE prod
>  p2. READ prod, cons
>  p3. mb()  # B
>  p4. WRITE stopped
>  p5. READ prod, cons
>
> The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..

So which function is supposed to be consumer vs producer here? I think
your write stopped is on the wrong side of the memory barrier. It
should be writing prod and stopped both before the barrier.

The maybe/try stop should essentially be:
1. write tail
2. read prod/cons
3. if unused >= 1x packet
3.a return

4. set stop
5. mb()
6. Re-read prod/cons
7. if unused >= 1x packet
7.a. test_and_clear stop

The maybe/try wake would be:
1. write head
2. read prod/cons
3. if consumed == 0 || unused < 2x packet
3.a. return

4. mb()
5. test_and_clear stop

> > One other thing to keep in mind is that the wake gives itself a pretty
> > good runway. We are talking about enough to transmit at least 2
> > frames. So if another consumer is stopping it we aren't waking it
> > unless there is enough space for yet another frame after the current
> > consumer.
>
> Ack, the race is very unlikely, basically the completing CPU would have
> to take an expensive IRQ between checking the descriptor count and
> checking if stopped -- to let the sending CPU queue multiple frames.
>
> But in theory the race is there, right?

I don't think this is so much a race as a skid. Specifically when we
wake the queue it will only run for one more packet in such a
scenario. I think it is being run more like a flow control threshold
rather than some sort of lock.

I think I see what you are getting at though. Basically if the xmit
function were to cycle several times between steps 3.a and 4 in the
maybe/try wake it could fill the queue and then trigger the wake even
though the queue is full and the unused space was already consumed.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 18:11           ` Alexander Duyck
@ 2023-04-03 19:03             ` Jakub Kicinski
  2023-04-03 20:27               ` Alexander Duyck
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-03 19:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, 3 Apr 2023 11:11:35 -0700 Alexander Duyck wrote:
> On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > I don't think in terms of flushes. Let me add line numbers to the
> > producer and the consumer.
> >
> >  c1. WRITE cons
> >  c2. mb()  # A
> >  c3. READ stopped
> >  c4. rmb() # C
> >  c5. READ prod, cons
> >
> >  p1. WRITE prod
> >  p2. READ prod, cons
> >  p3. mb()  # B
> >  p4. WRITE stopped
> >  p5. READ prod, cons
> >
> > The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> > orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..  
> 
> So which function is supposed to be consumer vs producer here? 

producer is xmit consumer is NAPI

> I think your write stopped is on the wrong side of the memory barrier. 
> It should be writing prod and stopped both before the barrier.

Indeed, Paul pointed out over chat that we need two barriers there 
to be correct :( Should be fine in practice, first one is BQL,
second one is on the slow path.

> The maybe/try stop should essentially be:
> 1. write tail
> 2. read prod/cons
> 3. if unused >= 1x packet
> 3.a return
> 
> 4. set stop
> 5. mb()
> 6. Re-read prod/cons
> 7. if unused >= 1x packet
> 7.a. test_and_clear stop
> 
> The maybe/try wake would be:
> 1. write head
> 2. read prod/cons
> 3. if consumed == 0 || unused < 2x packet
> 3.a. return
> 
> 4. mb()
> 5. test_and_clear stop
> 
> > > One other thing to keep in mind is that the wake gives itself a pretty
> > > good runway. We are talking about enough to transmit at least 2
> > > frames. So if another consumer is stopping it we aren't waking it
> > > unless there is enough space for yet another frame after the current
> > > consumer.  
> >
> > Ack, the race is very unlikely, basically the completing CPU would have
> > to take an expensive IRQ between checking the descriptor count and
> > checking if stopped -- to let the sending CPU queue multiple frames.
> >
> > But in theory the race is there, right?  
> 
> I don't think this is so much a race as a skid. Specifically when we
> wake the queue it will only run for one more packet in such a
> scenario. I think it is being run more like a flow control threshold
> rather than some sort of lock.
> 
> I think I see what you are getting at though. Basically if the xmit
> function were to cycle several times between steps 3.a and 4 in the
> maybe/try wake it could fill the queue and then trigger the wake even
> though the queue is full and the unused space was already consumed.

Yup, exactly. So we either need to sprinkle a couple more barriers 
and tests in, or document that the code is only 99.999999% safe 
against false positive restarts and drivers need to check for ring
full at the beginning of xmit.

I'm quite tempted to add the barriers, because on the NAPI/consumer
side we could use this as an opportunity to start piggy backing on
the BQL barrier.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 19:03             ` Jakub Kicinski
@ 2023-04-03 20:27               ` Alexander Duyck
  2023-04-05 22:20                 ` Paul E. McKenney
  0 siblings, 1 reply; 29+ messages in thread
From: Alexander Duyck @ 2023-04-03 20:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, Apr 3, 2023 at 12:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Apr 2023 11:11:35 -0700 Alexander Duyck wrote:
> > On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > I don't think in terms of flushes. Let me add line numbers to the
> > > producer and the consumer.
> > >
> > >  c1. WRITE cons
> > >  c2. mb()  # A
> > >  c3. READ stopped
> > >  c4. rmb() # C
> > >  c5. READ prod, cons
> > >
> > >  p1. WRITE prod
> > >  p2. READ prod, cons
> > >  p3. mb()  # B
> > >  p4. WRITE stopped
> > >  p5. READ prod, cons
> > >
> > > The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> > > orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..
> >
> > So which function is supposed to be consumer vs producer here?
>
> producer is xmit consumer is NAPI
>
> > I think your write stopped is on the wrong side of the memory barrier.
> > It should be writing prod and stopped both before the barrier.
>
> Indeed, Paul pointed out over chat that we need two barriers there
> to be correct :( Should be fine in practice, first one is BQL,
> second one is on the slow path.
>
> > The maybe/try stop should essentially be:
> > 1. write tail
> > 2. read prod/cons
> > 3. if unused >= 1x packet
> > 3.a return
> >
> > 4. set stop
> > 5. mb()
> > 6. Re-read prod/cons
> > 7. if unused >= 1x packet
> > 7.a. test_and_clear stop
> >
> > The maybe/try wake would be:
> > 1. write head
> > 2. read prod/cons
> > 3. if consumed == 0 || unused < 2x packet
> > 3.a. return
> >
> > 4. mb()
> > 5. test_and_clear stop
> >
> > > > One other thing to keep in mind is that the wake gives itself a pretty
> > > > good runway. We are talking about enough to transmit at least 2
> > > > frames. So if another consumer is stopping it we aren't waking it
> > > > unless there is enough space for yet another frame after the current
> > > > consumer.
> > >
> > > Ack, the race is very unlikely, basically the completing CPU would have
> > > to take an expensive IRQ between checking the descriptor count and
> > > checking if stopped -- to let the sending CPU queue multiple frames.
> > >
> > > But in theory the race is there, right?
> >
> > I don't think this is so much a race as a skid. Specifically when we
> > wake the queue it will only run for one more packet in such a
> > scenario. I think it is being run more like a flow control threshold
> > rather than some sort of lock.
> >
> > I think I see what you are getting at though. Basically if the xmit
> > function were to cycle several times between steps 3.a and 4 in the
> > maybe/try wake it could fill the queue and then trigger the wake even
> > though the queue is full and the unused space was already consumed.
>
> Yup, exactly. So we either need to sprinkle a couple more barriers
> and tests in, or document that the code is only 99.999999% safe
> against false positive restarts and drivers need to check for ring
> full at the beginning of xmit.
>
> I'm quite tempted to add the barriers, because on the NAPI/consumer
> side we could use this as an opportunity to start piggy backing on
> the BQL barrier.

The thing is the more barriers we add the more it will hurt
performance. I'd be tempted to just increase the runway we have as we
could afford a 1 packet skid if we had a 2 packet runway for the
start/stop thresholds.

I suspect that is probably why we haven't seen any issues as the
DESC_NEEDED is pretty generous since it is assuming worst case
scenarios.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 15:18       ` Alexander Duyck
  2023-04-03 15:56         ` Jakub Kicinski
@ 2023-04-04  6:39         ` Herbert Xu
  2023-04-04 22:36           ` Jakub Kicinski
  1 sibling, 1 reply; 29+ messages in thread
From: Herbert Xu @ 2023-04-04  6:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Heiner Kallweit, davem, netdev, edumazet, pabeni

On Mon, Apr 03, 2023 at 08:18:04AM -0700, Alexander Duyck wrote:
>
> > I *think* that the right ordering would be:
> >
> > WRITE cons
> > mb()  # A
> > READ stopped
> > rmb() # C
> > READ prod, cons
> 
> What would the extra rmb() get you? The mb() will have already flushed
> out any writes and if stopped is set the tail should have already been
> written before setting it.
> 
> One other thing to keep in mind is that the wake gives itself a pretty
> good runway. We are talking about enough to transmit at least 2
> frames. So if another consumer is stopping it we aren't waking it
> unless there is enough space for yet another frame after the current
> consumer.
> 
> > And on the producer side (existing):
> >
> > WRITE prod
> > READ prod, cons
> > mb()  # B
> > WRITE stopped
> > READ prod, cons
> >
> > But I'm slightly afraid to change it, it's been working for over
> > a decade :D
> 
> I wouldn't change it. The code has predated BQL in the e1000 driver
> and has been that way since the inception of it I believe in 2.6.19.

Thanks for adding me to this thread as otherwise I would've surely
missed it.

I see where the confusion is coming from.  The key is that we weren't
trying to stop every single race, because not all of them are fatal.

In particular, we tolerate the race where a wake is done when it
shouldn't be because the network stack copes with that by requeueing
the skb onto the qdisc.

So it's a trade-off.  We could make our code water-tight, but then
we would be incurring a penalty for every skb.  With our current
approach, the penalty is only incurred in the unlikely event of a
race which results in the unlucky skb being requeued.

The race that we do want to stop is a queue being stuck in a stopped
state when it shouldn't because that indeed is fatal.

Going back to the proposed helpers, we only need one mb because
that's all we need to fix the stuck/stopped queue race.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-04  6:39         ` Herbert Xu
@ 2023-04-04 22:36           ` Jakub Kicinski
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-04 22:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexander Duyck, Heiner Kallweit, davem, netdev, edumazet, pabeni

On Tue, 4 Apr 2023 14:39:11 +0800 Herbert Xu wrote:
> Thanks for adding me to this thread as otherwise I would've surely
> missed it.
> 
> I see where the confusion is coming from.  The key is that we weren't
> trying to stop every single race, because not all of them are fatal.
> 
> In particular, we tolerate the race where a wake is done when it
> shouldn't be because the network stack copes with that by requeueing
> the skb onto the qdisc.
> 
> So it's a trade-off.  We could make our code water-tight, but then
> we would be incurring a penalty for every skb.  With our current
> approach, the penalty is only incurred in the unlikely event of a
> race which results in the unlucky skb being requeued.
> 
> The race that we do want to stop is a queue being stuck in a stopped
> state when it shouldn't because that indeed is fatal.
> 
> Going back to the proposed helpers, we only need one mb because
> that's all we need to fix the stuck/stopped queue race.

Thanks, I'm impressed you still remember the details :)

I'll leave it racy in the next version. Re-using the BQL barrier
is a bit more tricky on the xmit path than I thought. I'll just
document that false-positive wake ups are possible.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 20:27               ` Alexander Duyck
@ 2023-04-05 22:20                 ` Paul E. McKenney
  2023-04-06  5:15                   ` Herbert Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2023-04-05 22:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Heiner Kallweit, davem, netdev, edumazet, pabeni,
	Herbert Xu

On Mon, Apr 03, 2023 at 01:27:44PM -0700, Alexander Duyck wrote:
> On Mon, Apr 3, 2023 at 12:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 3 Apr 2023 11:11:35 -0700 Alexander Duyck wrote:
> > > On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > I don't think in terms of flushes. Let me add line numbers to the
> > > > producer and the consumer.
> > > >
> > > >  c1. WRITE cons
> > > >  c2. mb()  # A
> > > >  c3. READ stopped
> > > >  c4. rmb() # C
> > > >  c5. READ prod, cons
> > > >
> > > >  p1. WRITE prod
> > > >  p2. READ prod, cons
> > > >  p3. mb()  # B
> > > >  p4. WRITE stopped
> > > >  p5. READ prod, cons
> > > >
> > > > The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> > > > orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..
> > >
> > > So which function is supposed to be consumer vs producer here?
> >
> > producer is xmit consumer is NAPI
> >
> > > I think your write stopped is on the wrong side of the memory barrier.
> > > It should be writing prod and stopped both before the barrier.
> >
> > Indeed, Paul pointed out over chat that we need two barriers there
> > to be correct :( Should be fine in practice, first one is BQL,
> > second one is on the slow path.
> >
> > > The maybe/try stop should essentially be:
> > > 1. write tail
> > > 2. read prod/cons
> > > 3. if unused >= 1x packet
> > > 3.a return
> > >
> > > 4. set stop
> > > 5. mb()
> > > 6. Re-read prod/cons
> > > 7. if unused >= 1x packet
> > > 7.a. test_and_clear stop
> > >
> > > The maybe/try wake would be:
> > > 1. write head
> > > 2. read prod/cons
> > > 3. if consumed == 0 || unused < 2x packet
> > > 3.a. return
> > >
> > > 4. mb()
> > > 5. test_and_clear stop
> > >
> > > > > One other thing to keep in mind is that the wake gives itself a pretty
> > > > > good runway. We are talking about enough to transmit at least 2
> > > > > frames. So if another consumer is stopping it we aren't waking it
> > > > > unless there is enough space for yet another frame after the current
> > > > > consumer.
> > > >
> > > > Ack, the race is very unlikely, basically the completing CPU would have
> > > > to take an expensive IRQ between checking the descriptor count and
> > > > checking if stopped -- to let the sending CPU queue multiple frames.
> > > >
> > > > But in theory the race is there, right?
> > >
> > > I don't think this is so much a race as a skid. Specifically when we
> > > wake the queue it will only run for one more packet in such a
> > > scenario. I think it is being run more like a flow control threshold
> > > rather than some sort of lock.
> > >
> > > I think I see what you are getting at though. Basically if the xmit
> > > function were to cycle several times between steps 3.a and 4 in the
> > > maybe/try wake it could fill the queue and then trigger the wake even
> > > though the queue is full and the unused space was already consumed.
> >
> > Yup, exactly. So we either need to sprinkle a couple more barriers
> > and tests in, or document that the code is only 99.999999% safe
> > against false positive restarts and drivers need to check for ring
> > full at the beginning of xmit.
> >
> > I'm quite tempted to add the barriers, because on the NAPI/consumer
> > side we could use this as an opportunity to start piggy backing on
> > the BQL barrier.
> 
> The thing is the more barriers we add the more it will hurt
> performance. I'd be tempted to just increase the runway we have as we
> could afford a 1 packet skid if we had a 2 packet runway for the
> start/stop thresholds.
> 
> I suspect that is probably why we haven't seen any issues as the
> DESC_NEEDED is pretty generous since it is assuming worst case
> scenarios.

Mightn't preemption or interrupts cause further issues?  Or are preemption
and/or interrupts disabled across the relevant sections of code?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-05 22:20                 ` Paul E. McKenney
@ 2023-04-06  5:15                   ` Herbert Xu
  2023-04-06 14:17                     ` Paul E. McKenney
  0 siblings, 1 reply; 29+ messages in thread
From: Herbert Xu @ 2023-04-06  5:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Duyck, Jakub Kicinski, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Wed, Apr 05, 2023 at 03:20:39PM -0700, Paul E. McKenney wrote:
>
> Mightn't preemption or interrupts cause further issues?  Or are preemption
> and/or interrupts disabled across the relevant sections of code?

The code in question is supposed to run in softirq context.  So
both interrupts and preemption should be disabled.

Chyeers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06  5:15                   ` Herbert Xu
@ 2023-04-06 14:17                     ` Paul E. McKenney
  2023-04-06 14:46                       ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2023-04-06 14:17 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexander Duyck, Jakub Kicinski, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, Apr 06, 2023 at 01:15:25PM +0800, Herbert Xu wrote:
> On Wed, Apr 05, 2023 at 03:20:39PM -0700, Paul E. McKenney wrote:
> >
> > Mightn't preemption or interrupts cause further issues?  Or are preemption
> > and/or interrupts disabled across the relevant sections of code?
> 
> The code in question is supposed to run in softirq context.  So
> both interrupts and preemption should be disabled.

Agreed, preemption will be enabled in softirq, but interrupts can still
happen, correct?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 14:17                     ` Paul E. McKenney
@ 2023-04-06 14:46                       ` Jakub Kicinski
  2023-04-06 15:45                         ` Paul E. McKenney
  2023-04-07  0:58                         ` Herbert Xu
  0 siblings, 2 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-06 14:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, 6 Apr 2023 07:17:09 -0700 Paul E. McKenney wrote:
> > > Mightn't preemption or interrupts cause further issues?  Or are preemption
> > > and/or interrupts disabled across the relevant sections of code?  
> > 
> > The code in question is supposed to run in softirq context.  So
> > both interrupts and preemption should be disabled.  
> 
> Agreed, preemption will be enabled in softirq, but interrupts can still
> happen, correct?

Starting the queue only happens from softirq (I hope) and stopping 
can happen from any context. So we're risking false-starts again.
I think this puts to bed any hope of making this code safe against
false-starts with just barriers :(

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 14:46                       ` Jakub Kicinski
@ 2023-04-06 15:45                         ` Paul E. McKenney
  2023-04-06 15:56                           ` Jakub Kicinski
  2023-04-07  0:58                         ` Herbert Xu
  1 sibling, 1 reply; 29+ messages in thread
From: Paul E. McKenney @ 2023-04-06 15:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, Apr 06, 2023 at 07:46:48AM -0700, Jakub Kicinski wrote:
> On Thu, 6 Apr 2023 07:17:09 -0700 Paul E. McKenney wrote:
> > > > Mightn't preemption or interrupts cause further issues?  Or are preemption
> > > > and/or interrupts disabled across the relevant sections of code?  
> > > 
> > > The code in question is supposed to run in softirq context.  So
> > > both interrupts and preemption should be disabled.  
> > 
> > Agreed, preemption will be enabled in softirq, but interrupts can still
> > happen, correct?
> 
> Starting the queue only happens from softirq (I hope) and stopping 
> can happen from any context. So we're risking false-starts again.
> I think this puts to bed any hope of making this code safe against
> false-starts with just barriers :(

Is it possible to jam all the relevant state into a single variable?
(I believe that that answer is "no", but just in case asking this question
inspires someone to come up with a good idea.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 15:45                         ` Paul E. McKenney
@ 2023-04-06 15:56                           ` Jakub Kicinski
  2023-04-06 16:25                             ` Paul E. McKenney
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-06 15:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, 6 Apr 2023 08:45:10 -0700 Paul E. McKenney wrote:
> > Starting the queue only happens from softirq (I hope) and stopping 
> > can happen from any context. So we're risking false-starts again.
> > I think this puts to bed any hope of making this code safe against
> > false-starts with just barriers :(  
> 
> Is it possible to jam all the relevant state into a single variable?
> (I believe that that answer is "no", but just in case asking this question
> inspires someone to come up with a good idea.)

Not in any obvious way, half of the state is driver-specific the other
half is flags maintained by the core :S

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 15:56                           ` Jakub Kicinski
@ 2023-04-06 16:25                             ` Paul E. McKenney
  0 siblings, 0 replies; 29+ messages in thread
From: Paul E. McKenney @ 2023-04-06 16:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, Apr 06, 2023 at 08:56:29AM -0700, Jakub Kicinski wrote:
> On Thu, 6 Apr 2023 08:45:10 -0700 Paul E. McKenney wrote:
> > > Starting the queue only happens from softirq (I hope) and stopping 
> > > can happen from any context. So we're risking false-starts again.
> > > I think this puts to bed any hope of making this code safe against
> > > false-starts with just barriers :(  
> > 
> > Is it possible to jam all the relevant state into a single variable?
> > (I believe that that answer is "no", but just in case asking this question
> > inspires someone to come up with a good idea.)
> 
> Not in any obvious way, half of the state is driver-specific the other
> half is flags maintained by the core :S

Hey, I had to ask!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 14:46                       ` Jakub Kicinski
  2023-04-06 15:45                         ` Paul E. McKenney
@ 2023-04-07  0:58                         ` Herbert Xu
  2023-04-07  1:03                           ` Jakub Kicinski
  1 sibling, 1 reply; 29+ messages in thread
From: Herbert Xu @ 2023-04-07  0:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Thu, Apr 06, 2023 at 07:46:48AM -0700, Jakub Kicinski wrote:
> On Thu, 6 Apr 2023 07:17:09 -0700 Paul E. McKenney wrote:
>
> > Agreed, preemption will be enabled in softirq, but interrupts can still
> > happen, correct?
> 
> Starting the queue only happens from softirq (I hope) and stopping 
> can happen from any context. So we're risking false-starts again.
> I think this puts to bed any hope of making this code safe against
> false-starts with just barriers :(

Packet transmit can only occur in process context or softirq context
so it can't interrupt an existing softirq context on the same CPU.

However, even if we did allow one or both to occur in IRQ context I
can't see how this can break.

I think Paul didn't see my earlier message because he wasn't part
of the original CC list.  The only race we're trying to stop is the
one which leaves the TX queue stopped forever.  We don't care about
false starts.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-07  0:58                         ` Herbert Xu
@ 2023-04-07  1:03                           ` Jakub Kicinski
  2023-04-07  1:14                             ` Herbert Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-07  1:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Fri, 7 Apr 2023 08:58:06 +0800 Herbert Xu wrote:
> Packet transmit can only occur in process context or softirq context

I couldn't find a check in netpoll/netconsole which would defer
when in hardirq. Does it defer?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-07  1:03                           ` Jakub Kicinski
@ 2023-04-07  1:14                             ` Herbert Xu
  2023-04-07  1:21                               ` Jakub Kicinski
  0 siblings, 1 reply; 29+ messages in thread
From: Herbert Xu @ 2023-04-07  1:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Thu, Apr 06, 2023 at 06:03:37PM -0700, Jakub Kicinski wrote:
>
> I couldn't find a check in netpoll/netconsole which would defer
> when in hardirq. Does it defer?

Right, netconsole is the one exception that can occur in any
context.  However, as I said in the previous email I don't see
how this can break the wake logic and leave the queue stopped
forever.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-07  1:14                             ` Herbert Xu
@ 2023-04-07  1:21                               ` Jakub Kicinski
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-04-07  1:21 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Fri, 7 Apr 2023 09:14:34 +0800 Herbert Xu wrote:
> Right, netconsole is the one exception that can occur in any
> context.  However, as I said in the previous email I don't see
> how this can break the wake logic and leave the queue stopped
> forever.

Yup, agreed! just wanted to check I'm not missing some netpoll detail :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH net-next 3/3] bnxt: use new queue try_stop/try_wake macros
  2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
@ 2023-03-22 23:30 ` Jakub Kicinski
  0 siblings, 0 replies; 29+ messages in thread
From: Jakub Kicinski @ 2023-03-22 23:30 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, willemb, alexander.duyck, Jakub Kicinski

Convert bnxt to use new macros rather than open code the logic.
Two differences:
(1) bnxt_tx_int() will now only issue a memory barrier if it sees
    enough space on the ring to wake the queue. This should be fine,
    the mb() is between the writes to the ring pointers and checking
    queue state.
(2) we'll start the queue instead of waking on race, this should
    be safe inside the xmit handler.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 41 +++++------------------
 1 file changed, 8 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index f533a8f46217..cbd48c192de3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -56,6 +56,7 @@
 #include <linux/hwmon-sysfs.h>
 #include <net/page_pool.h>
 #include <linux/align.h>
+#include <net/netdev_queues.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
@@ -331,26 +332,6 @@ static void bnxt_txr_db_kick(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
 	txr->kick_pending = 0;
 }
 
-static bool bnxt_txr_netif_try_stop_queue(struct bnxt *bp,
-					  struct bnxt_tx_ring_info *txr,
-					  struct netdev_queue *txq)
-{
-	netif_tx_stop_queue(txq);
-
-	/* netif_tx_stop_queue() must be done before checking
-	 * tx index in bnxt_tx_avail() below, because in
-	 * bnxt_tx_int(), we update tx index before checking for
-	 * netif_tx_queue_stopped().
-	 */
-	smp_mb();
-	if (bnxt_tx_avail(bp, txr) >= bp->tx_wake_thresh) {
-		netif_tx_wake_queue(txq);
-		return false;
-	}
-
-	return true;
-}
-
 static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct bnxt *bp = netdev_priv(dev);
@@ -384,7 +365,8 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (net_ratelimit() && txr->kick_pending)
 			netif_warn(bp, tx_err, dev,
 				   "bnxt: ring busy w/ flush pending!\n");
-		if (bnxt_txr_netif_try_stop_queue(bp, txr, txq))
+		if (!netif_tx_queue_try_stop(txq, bnxt_tx_avail(bp, txr),
+					     bp->tx_wake_thresh))
 			return NETDEV_TX_BUSY;
 	}
 
@@ -614,7 +596,8 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (netdev_xmit_more() && !tx_buf->is_push)
 			bnxt_txr_db_kick(bp, txr, prod);
 
-		bnxt_txr_netif_try_stop_queue(bp, txr, txq);
+		netif_tx_queue_try_stop(txq, bnxt_tx_avail(bp, txr),
+					bp->tx_wake_thresh);
 	}
 	return NETDEV_TX_OK;
 
@@ -708,17 +691,9 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int nr_pkts)
 	netdev_tx_completed_queue(txq, nr_pkts, tx_bytes);
 	txr->tx_cons = cons;
 
-	/* Need to make the tx_cons update visible to bnxt_start_xmit()
-	 * before checking for netif_tx_queue_stopped().  Without the
-	 * memory barrier, there is a small possibility that bnxt_start_xmit()
-	 * will miss it and cause the queue to be stopped forever.
-	 */
-	smp_mb();
-
-	if (unlikely(netif_tx_queue_stopped(txq)) &&
-	    bnxt_tx_avail(bp, txr) >= bp->tx_wake_thresh &&
-	    READ_ONCE(txr->dev_state) != BNXT_DEV_STATE_CLOSING)
-		netif_tx_wake_queue(txq);
+	__netif_tx_queue_maybe_wake(txq, bnxt_tx_avail(bp, txr),
+				    bp->tx_wake_thresh,
+				    READ_ONCE(txr->dev_state) != BNXT_DEV_STATE_CLOSING);
 }
 
 static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-04-07  1:21 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-01  5:12 [PATCH net-next 0/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
2023-04-01 15:04   ` Heiner Kallweit
2023-04-01 18:03     ` Jakub Kicinski
2023-04-01 15:18   ` Heiner Kallweit
2023-04-01 18:58     ` Jakub Kicinski
2023-04-01 20:41       ` Heiner Kallweit
2023-04-03 15:18       ` Alexander Duyck
2023-04-03 15:56         ` Jakub Kicinski
2023-04-03 18:11           ` Alexander Duyck
2023-04-03 19:03             ` Jakub Kicinski
2023-04-03 20:27               ` Alexander Duyck
2023-04-05 22:20                 ` Paul E. McKenney
2023-04-06  5:15                   ` Herbert Xu
2023-04-06 14:17                     ` Paul E. McKenney
2023-04-06 14:46                       ` Jakub Kicinski
2023-04-06 15:45                         ` Paul E. McKenney
2023-04-06 15:56                           ` Jakub Kicinski
2023-04-06 16:25                             ` Paul E. McKenney
2023-04-07  0:58                         ` Herbert Xu
2023-04-07  1:03                           ` Jakub Kicinski
2023-04-07  1:14                             ` Herbert Xu
2023-04-07  1:21                               ` Jakub Kicinski
2023-04-04  6:39         ` Herbert Xu
2023-04-04 22:36           ` Jakub Kicinski
2023-04-01  5:12 ` [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros Jakub Kicinski
2023-04-01  5:12 ` [PATCH net-next 3/3] bnxt: " Jakub Kicinski
2023-04-01 18:35   ` Michael Chan
  -- strict thread matches above, loose matches on Subject: below --
2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
2023-03-22 23:30 ` [PATCH net-next 3/3] bnxt: use new queue try_stop/try_wake macros Jakub Kicinski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).