* [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
@ 2023-03-22 23:30 Jakub Kicinski
  2023-03-22 23:30 ` [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros Jakub Kicinski
                   ` (5 more replies)
  0 siblings, 6 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-22 23:30 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, willemb, alexander.duyck, Jakub Kicinski

A lot of drivers follow the same scheme to stop / start queues
without introducing locks between xmit and NAPI tx completions.
I'm guessing they all copy'n'paste each other's code.

Smaller drivers shy away from the scheme and introduce a lock
which may cause deadlocks in netpoll.

Provide macros which encapsulate the necessary logic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
 - perdicate -> predicate
 - on race use start instead of wake and make a note of that
   in the doc / comment at the start
---
 include/net/netdev_queues.h | 171 ++++++++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 include/net/netdev_queues.h

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
new file mode 100644
index 000000000000..64e059647274
--- /dev/null
+++ b/include/net/netdev_queues.h
@@ -0,0 +1,171 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_NET_QUEUES_H
+#define _LINUX_NET_QUEUES_H
+
+#include <linux/netdevice.h>
+
+/* Lockless queue stopping / waking helpers.
+ *
+ * These macros are designed to safely implement stopping and waking
+ * netdev queues without full lock protection. We assume that there can
+ * be no concurrent stop attempts and no concurrent wake attempts.
+ * The try-stop should happen from the xmit handler*, while wake up
+ * should be triggered from NAPI poll context. The two may run
+ * concurrently (single producer, single consumer).
+ *
+ * All descriptor ring indexes (and other relevant shared state) must
+ * be updated before invoking the macros.
+ *
+ * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
+ *   instead of netif_tx_wake_queue()) so uses outside of the xmit
+ *   handler may lead to bugs
+ */
+
+#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
+	({								\
+		int _res;						\
+									\
+		netif_tx_stop_queue(txq);				\
+									\
+		smp_mb();						\
+									\
+		/* We need to check again in case another		\
+		 * CPU has just made room available.			\
+		 */							\
+		if (likely(get_desc < start_thrs)) {			\
+			_res = 0;					\
+		} else {						\
+			netif_tx_start_queue(txq);			\
+			_res = -1;					\
+		}							\
+		_res;							\
+	})								\
+
+/**
+ * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @stop_thrs:	minimal number of available descriptors for queue to be left
+ *		enabled
+ * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
+ *		equal to @stop_thrs or higher to avoid frequent waking
+ *
+ * All arguments may be evaluated multiple times, beware of side effects.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ * Expected to be used from ndo_start_xmit, see the comment on top of the file.
+ *
+ * Returns:
+ *	 0 if the queue was stopped
+ *	 1 if the queue was left enabled
+ *	-1 if the queue was re-enabled (raced with waking)
+ */
+#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
+	({								\
+		int _res;						\
+									\
+		if (likely(get_desc > stop_thrs))			\
+			_res = 1;					\
+		else							\
+			_res = netif_tx_queue_try_stop(txq, get_desc,	\
+						       start_thrs);	\
+		_res;							\
+	})								\
+
+#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		/* Make sure that anybody stopping the queue after	\
+		 * this sees the new next_to_clean.			\
+		 */							\
+		smp_mb();						\
+		if (netif_tx_queue_stopped(txq) && !(down_cond)) {	\
+			netif_tx_wake_queue(txq);			\
+			_res = 0;					\
+		} else {						\
+			_res = 1;					\
+		}							\
+		_res;							\
+	})
+
+#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
+
+/**
+ * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @start_thrs:	minimal number of descriptors to re-enable the queue
+ * @down_cond:	down condition, predicate indicating that the queue should
+ *		not be woken up even if descriptors are available
+ *
+ * All arguments may be evaluated multiple times.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ *
+ * Returns:
+ *	 0 if the queue was woken up
+ *	 1 if the queue was already enabled (or disabled but @down_cond is true)
+ *	-1 if the queue was left stopped
+ */
+#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		if (likely(get_desc < start_thrs))			\
+			_res = -1;					\
+		else							\
+			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
+							 start_thrs,	\
+							 down_cond);	\
+		_res;							\
+	})
+
+#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
+
+/* subqueue variants follow */
+
+#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
+	})
+
+#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_maybe_stop(txq, get_desc,		\
+					  stop_thrs, start_thrs);	\
+	})
+
+#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_try_wake(txq, get_desc,		\
+					  start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
+	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
+
+#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_maybe_wake(txq, get_desc,		\
+					    start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
+	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
+
+#endif
-- 
2.39.2
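As a rough illustration of what netif_tx_queue_try_stop() / netif_tx_queue_maybe_stop() encode, the control flow can be modeled in plain userspace C. Everything below is a sketch: the names are invented, and a C11 seq-cst fence stands in for smp_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the stop side. queue_stopped stands in for the
 * XOFF bit in txq->state; free_desc for the driver's free-descriptor
 * count. Return values mirror the macros: 1 = left enabled,
 * 0 = stopped, -1 = raced with the completion path and re-enabled.
 */
static atomic_bool queue_stopped;

static int model_try_stop(atomic_int *free_desc, int start_thrs)
{
	atomic_store(&queue_stopped, true);

	/* Stand-in for smp_mb(): order the stop above against the
	 * re-read of the free count below. */
	atomic_thread_fence(memory_order_seq_cst);

	/* Re-check in case another CPU just made room available. */
	if (atomic_load(free_desc) < start_thrs)
		return 0;

	atomic_store(&queue_stopped, false);	/* "start", not "wake" */
	return -1;
}

static int model_maybe_stop(atomic_int *free_desc,
			    int stop_thrs, int start_thrs)
{
	if (atomic_load(free_desc) > stop_thrs)
		return 1;
	return model_try_stop(free_desc, start_thrs);
}
```

Note the hysteresis the two thresholds give: a queue stopped here stays stopped until the wake side sees at least start_thrs free descriptors, and start_thrs may be set above stop_thrs to avoid waking the queue too often.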


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros
  2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
@ 2023-03-22 23:30 ` Jakub Kicinski
  2023-03-22 23:30 ` [PATCH net-next 3/3] bnxt: " Jakub Kicinski
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-22 23:30 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, willemb, alexander.duyck, Jakub Kicinski

Convert ixgbe to use the new macros; I think a lot of people
copy the ixgbe code. The only functional change is that the
unlikely() in ixgbe_clean_tx_irq() turns into a likely()
inside the new macro and no longer includes

  total_packets && netif_carrier_ok(tx_ring->netdev)

which is probably for the best, anyway.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 37 +++++--------------
 1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 773c35fecace..db00e50a40ff 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -36,6 +36,7 @@
 #include <net/tc_act/tc_mirred.h>
 #include <net/vxlan.h>
 #include <net/mpls.h>
+#include <net/netdev_queues.h>
 #include <net/xdp_sock_drv.h>
 #include <net/xfrm.h>
 
@@ -1253,20 +1254,12 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 				  total_packets, total_bytes);
 
 #define TX_WAKE_THRESHOLD (DESC_NEEDED * 2)
-	if (unlikely(total_packets && netif_carrier_ok(tx_ring->netdev) &&
-		     (ixgbe_desc_unused(tx_ring) >= TX_WAKE_THRESHOLD))) {
-		/* Make sure that anybody stopping the queue after this
-		 * sees the new next_to_clean.
-		 */
-		smp_mb();
-		if (__netif_subqueue_stopped(tx_ring->netdev,
-					     tx_ring->queue_index)
-		    && !test_bit(__IXGBE_DOWN, &adapter->state)) {
-			netif_wake_subqueue(tx_ring->netdev,
-					    tx_ring->queue_index);
-			++tx_ring->tx_stats.restart_queue;
-		}
-	}
+	if (total_packets && netif_carrier_ok(tx_ring->netdev) &&
+	    !__netif_subqueue_maybe_wake(tx_ring->netdev, tx_ring->queue_index,
+					 ixgbe_desc_unused(tx_ring),
+					 TX_WAKE_THRESHOLD,
+					 test_bit(__IXGBE_DOWN, &adapter->state)))
+		++tx_ring->tx_stats.restart_queue;
 
 	return !!budget;
 }
@@ -8270,22 +8263,10 @@ static void ixgbe_tx_olinfo_status(union ixgbe_adv_tx_desc *tx_desc,
 
 static int __ixgbe_maybe_stop_tx(struct ixgbe_ring *tx_ring, u16 size)
 {
-	netif_stop_subqueue(tx_ring->netdev, tx_ring->queue_index);
-
-	/* Herbert's original patch had:
-	 *  smp_mb__after_netif_stop_queue();
-	 * but since that doesn't exist yet, just open code it.
-	 */
-	smp_mb();
-
-	/* We need to check again in a case another CPU has just
-	 * made room available.
-	 */
-	if (likely(ixgbe_desc_unused(tx_ring) < size))
+	if (!netif_subqueue_try_stop(tx_ring->netdev, tx_ring->queue_index,
+				     ixgbe_desc_unused(tx_ring), size))
 		return -EBUSY;
 
-	/* A reprieve! - use start_queue because it doesn't call schedule */
-	netif_start_subqueue(tx_ring->netdev, tx_ring->queue_index);
 	++tx_ring->tx_stats.restart_queue;
 	return 0;
 }
-- 
2.39.2
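The converted __ixgbe_maybe_stop_tx() maps the macro's return value onto the driver's -EBUSY / restart_queue accounting. A userspace sketch of that mapping (invented names, not ixgbe APIs; the barrier placement is only noted in a comment):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Model of the converted helper: try_stop() returns 0 when the queue
 * stays stopped and -1 when the completion path raced and re-enabled
 * it. Names are illustrative, not ixgbe APIs.
 */
static bool queue_stopped;
static unsigned int restart_queue;	/* stands in for tx_stats.restart_queue */

static int model_try_stop(int free_desc, int start_thrs)
{
	queue_stopped = true;
	/* smp_mb() would go here in kernel code, before re-reading
	 * the free-descriptor count. */
	if (free_desc < start_thrs)
		return 0;
	queue_stopped = false;
	return -1;
}

static int model_maybe_stop_tx(int free_desc, int size)
{
	if (!model_try_stop(free_desc, size))
		return -EBUSY;		/* genuinely out of room */

	/* A reprieve: completions freed space while we were stopping. */
	restart_queue++;
	return 0;
}
```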



* [PATCH net-next 3/3] bnxt: use new queue try_stop/try_wake macros
  2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
  2023-03-22 23:30 ` [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros Jakub Kicinski
@ 2023-03-22 23:30 ` Jakub Kicinski
  2023-03-23  0:35 ` [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Andrew Lunn
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-22 23:30 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, willemb, alexander.duyck, Jakub Kicinski

Convert bnxt to use the new macros rather than open-coding the logic.
Two differences:
(1) bnxt_tx_int() will now only issue a memory barrier if it sees
    enough space on the ring to wake the queue. This should be fine,
    the mb() is between the writes to the ring pointers and checking
    queue state.
(2) we'll start the queue instead of waking on race, this should
    be safe inside the xmit handler.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 41 +++++------------------
 1 file changed, 8 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index f533a8f46217..cbd48c192de3 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -56,6 +56,7 @@
 #include <linux/hwmon-sysfs.h>
 #include <net/page_pool.h>
 #include <linux/align.h>
+#include <net/netdev_queues.h>
 
 #include "bnxt_hsi.h"
 #include "bnxt.h"
@@ -331,26 +332,6 @@ static void bnxt_txr_db_kick(struct bnxt *bp, struct bnxt_tx_ring_info *txr,
 	txr->kick_pending = 0;
 }
 
-static bool bnxt_txr_netif_try_stop_queue(struct bnxt *bp,
-					  struct bnxt_tx_ring_info *txr,
-					  struct netdev_queue *txq)
-{
-	netif_tx_stop_queue(txq);
-
-	/* netif_tx_stop_queue() must be done before checking
-	 * tx index in bnxt_tx_avail() below, because in
-	 * bnxt_tx_int(), we update tx index before checking for
-	 * netif_tx_queue_stopped().
-	 */
-	smp_mb();
-	if (bnxt_tx_avail(bp, txr) >= bp->tx_wake_thresh) {
-		netif_tx_wake_queue(txq);
-		return false;
-	}
-
-	return true;
-}
-
 static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct bnxt *bp = netdev_priv(dev);
@@ -384,7 +365,8 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (net_ratelimit() && txr->kick_pending)
 			netif_warn(bp, tx_err, dev,
 				   "bnxt: ring busy w/ flush pending!\n");
-		if (bnxt_txr_netif_try_stop_queue(bp, txr, txq))
+		if (!netif_tx_queue_try_stop(txq, bnxt_tx_avail(bp, txr),
+					     bp->tx_wake_thresh))
 			return NETDEV_TX_BUSY;
 	}
 
@@ -614,7 +596,8 @@ static netdev_tx_t bnxt_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		if (netdev_xmit_more() && !tx_buf->is_push)
 			bnxt_txr_db_kick(bp, txr, prod);
 
-		bnxt_txr_netif_try_stop_queue(bp, txr, txq);
+		netif_tx_queue_try_stop(txq, bnxt_tx_avail(bp, txr),
+					bp->tx_wake_thresh);
 	}
 	return NETDEV_TX_OK;
 
@@ -708,17 +691,9 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int nr_pkts)
 	netdev_tx_completed_queue(txq, nr_pkts, tx_bytes);
 	txr->tx_cons = cons;
 
-	/* Need to make the tx_cons update visible to bnxt_start_xmit()
-	 * before checking for netif_tx_queue_stopped().  Without the
-	 * memory barrier, there is a small possibility that bnxt_start_xmit()
-	 * will miss it and cause the queue to be stopped forever.
-	 */
-	smp_mb();
-
-	if (unlikely(netif_tx_queue_stopped(txq)) &&
-	    bnxt_tx_avail(bp, txr) >= bp->tx_wake_thresh &&
-	    READ_ONCE(txr->dev_state) != BNXT_DEV_STATE_CLOSING)
-		netif_tx_wake_queue(txq);
+	__netif_tx_queue_maybe_wake(txq, bnxt_tx_avail(bp, txr),
+				    bp->tx_wake_thresh,
+				    READ_ONCE(txr->dev_state) == BNXT_DEV_STATE_CLOSING);
 }
 
 static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping,
-- 
2.39.2
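The __netif_tx_queue_maybe_wake() semantics that the last hunk relies on can be modeled in a few lines of userspace C (a sketch with invented names, not kernel APIs): 0 means woken, 1 means left alone, -1 means left stopped; down_cond corresponds to bnxt's "device is closing" check.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of __netif_tx_queue_maybe_wake() from patch 1.
 * Names are illustrative, not kernel APIs.
 */
static int model_maybe_wake(bool *stopped, int free_desc, int start_thrs,
			    bool down_cond)
{
	if (free_desc < start_thrs)
		return -1;	/* not enough room, leave stopped */

	/* The real macro issues smp_mb() here, i.e. only once there is
	 * enough room, which is difference (1) in the commit message. */
	if (*stopped && !down_cond) {
		*stopped = false;
		return 0;	/* woken */
	}
	return 1;		/* already running, or held down */
}
```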



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
  2023-03-22 23:30 ` [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros Jakub Kicinski
  2023-03-22 23:30 ` [PATCH net-next 3/3] bnxt: " Jakub Kicinski
@ 2023-03-23  0:35 ` Andrew Lunn
  2023-03-23  1:04     ` [Intel-wired-lan] " Jakub Kicinski
  2023-03-23  3:05 ` Yunsheng Lin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 45+ messages in thread
From: Andrew Lunn @ 2023-03-23  0:35 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, willemb, alexander.duyck

On Wed, Mar 22, 2023 at 04:30:26PM -0700, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
>
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.

Hi Jakub

I notice there is no patch 0/X. Seems like the above would be good
material for it, along with a comment that a few drivers are converted
to make use of the new macros.

   Andrew


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-23  0:35 ` [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Andrew Lunn
@ 2023-03-23  1:04     ` Jakub Kicinski
  0 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-23  1:04 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem, netdev, edumazet, pabeni, willemb, alexander.duyck,
	michael.chan, intel-wired-lan, anthony.l.nguyen,
	jesse.brandeburg

CC: maintainers, in case there isn't a repost
https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/

On Thu, 23 Mar 2023 01:35:34 +0100 Andrew Lunn wrote:
> On Wed, Mar 22, 2023 at 04:30:26PM -0700, Jakub Kicinski wrote:
> > A lot of drivers follow the same scheme to stop / start queues
> > without introducing locks between xmit and NAPI tx completions.
> > I'm guessing they all copy'n'paste each other's code.
> >
> > Smaller drivers shy away from the scheme and introduce a lock
> > which may cause deadlocks in netpoll.  
> 
> I notice there is no patch 0/X. Seems like the above would be good
> material for it, along with a comment that a few drivers are converted
> to make use of the new macros.

Then do I repeat the same text in the commit? Or cut the commit down?
Doesn't that just take away information from the commit which will
show up in git blame?

Having a cover letter is a good default, and required if the series 
is a larger change decomposed into steps. But here there is a major
change and a bunch of loose conversions. More sample users than
meaningful part.

LMK what your preference for splitting this info is, I'm unsure.



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
                   ` (2 preceding siblings ...)
  2023-03-23  0:35 ` [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Andrew Lunn
@ 2023-03-23  3:05 ` Yunsheng Lin
  2023-03-23  3:27   ` Jakub Kicinski
  2023-03-23  4:53 ` Pavan Chebbi
  2023-03-23 16:05 ` Alexander H Duyck
  5 siblings, 1 reply; 45+ messages in thread
From: Yunsheng Lin @ 2023-03-23  3:05 UTC (permalink / raw)
  To: Jakub Kicinski, davem; +Cc: netdev, edumazet, pabeni, willemb, alexander.duyck

On 2023/3/23 7:30, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> ---
>  include/net/netdev_queues.h | 171 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 171 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..64e059647274
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,171 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up

unnecessary '*' after handler?

> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()

double '*' here?

> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\
> +									\
> +		/* We need to check again in case another		\
> +		 * CPU has just made room available.			\
> +		 */							\
> +		if (likely(get_desc < start_thrs)) {			\
> +			_res = 0;					\
> +		} else {						\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +
> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.

For now, there seem to be three ways of calling *_maybe_stop():
1. called before transmitting a skb.
2. called after transmitting a skb.
3. called both before and after transmitting a skb.

Which one should new driver follow?
Do we need to make it clear here?


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-23  3:05 ` Yunsheng Lin
@ 2023-03-23  3:27   ` Jakub Kicinski
  0 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-23  3:27 UTC (permalink / raw)
  To: Yunsheng Lin; +Cc: davem, netdev, edumazet, pabeni, willemb, alexander.duyck

On Thu, 23 Mar 2023 11:05:36 +0800 Yunsheng Lin wrote:
> > +/* Lockless queue stopping / waking helpers.
> > + *
> > + * These macros are designed to safely implement stopping and waking
> > + * netdev queues without full lock protection. We assume that there can
> > + * be no concurrent stop attempts and no concurrent wake attempts.
> > + * The try-stop should happen from the xmit handler*, while wake up  
> 
> unnecessary '*' after handler?
> 
> > + * should be triggered from NAPI poll context. The two may run
> > + * concurrently (single producer, single consumer).
> > + *
> > + * All descriptor ring indexes (and other relevant shared state) must
> > + * be updated before invoking the macros.
> > + *
> > + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()  
> 
> double '*' here?

It's a footnote..

> > + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> > + *   handler may lead to bugs
> > + */
> > +
> > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		netif_tx_stop_queue(txq);				\
> > +									\
> > +		smp_mb();						\
> > +									\
> > +		/* We need to check again in case another		\
> > +		 * CPU has just made room available.			\
> > +		 */							\
> > +		if (likely(get_desc < start_thrs)) {			\
> > +			_res = 0;					\
> > +		} else {						\
> > +			netif_tx_start_queue(txq);			\
> > +			_res = -1;					\
> > +		}							\
> > +		_res;							\
> > +	})								\
> > +
> > +/**
> > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > + * @txq:	struct netdev_queue to stop/start
> > + * @get_desc:	get current number of free descriptors (see requirements below!)
> > + * @stop_thrs:	minimal number of available descriptors for queue to be left
> > + *		enabled
> > + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> > + *		equal to @stop_thrs or higher to avoid frequent waking
> > + *
> > + * All arguments may be evaluated multiple times, beware of side effects.
> > + * @get_desc must be a formula or a function call, it must always
> > + * return up-to-date information when evaluated!
> > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.  
> 
> For now, there seem to be three ways of calling *_maybe_stop():
> 1. called before transmitting a skb.
> 2. called after transmitting a skb.
> 3. called both before and after transmitting a skb.
> 
> Which one should new driver follow?
> Do we need to make it clear here?

After, the check before is more of a sanity check.
AFAIR netdevice.rst or some other place documents the stopping
expectations.
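That ordering, a cheap sanity check before posting descriptors with the real stop decision after them, can be sketched as a simplified single-queue userspace model (all names invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE	64
#define DESC_NEEDED	4	/* worst-case descriptors per frame */

static int free_desc = RING_SIZE;
static bool stopped;

/* Sanity check before writing descriptors: should "never" fire if the
 * stop-after-xmit discipline below is followed. */
static int xmit_would_overflow(void)
{
	return free_desc < DESC_NEEDED;
}

/* Transmit one frame, then stop the queue if the *next* frame might
 * not fit: the "called after transmitting" style from the reply. */
static int model_xmit(void)
{
	if (xmit_would_overflow())
		return -1;		/* NETDEV_TX_BUSY analogue */

	free_desc -= DESC_NEEDED;	/* post descriptors */

	if (free_desc < DESC_NEEDED)
		stopped = true;		/* maybe-stop after the fact */
	return 0;
}
```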


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
                   ` (3 preceding siblings ...)
  2023-03-23  3:05 ` Yunsheng Lin
@ 2023-03-23  4:53 ` Pavan Chebbi
  2023-03-23  5:08   ` Jakub Kicinski
  2023-03-23 16:05 ` Alexander H Duyck
  5 siblings, 1 reply; 45+ messages in thread
From: Pavan Chebbi @ 2023-03-23  4:53 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, willemb, alexander.duyck

On Thu, Mar 23, 2023 at 5:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)             \
> +       ({                                                              \
> +               int _res;                                               \
> +                                                                       \
> +               netif_tx_stop_queue(txq);                               \
> +                                                                       \
> +               smp_mb();                                               \
> +                                                                       \
> +               /* We need to check again in case another               \
> +                * CPU has just made room available.                    \
> +                */                                                     \
> +               if (likely(get_desc < start_thrs)) {                    \

I am only curious to understand why initializing _res with the likely
result and having a condition that covers only the unlikely case would
not be better.
As in:
    int _res = 0;
    if (unlikely(get_desc >= start_thrs)) {
        start_queue();
        _res = -1;
    }

> +                       _res = 0;                                       \
> +               } else {                                                \
> +                       netif_tx_start_queue(txq);                      \
> +                       _res = -1;                                      \
> +               }                                                       \
> +               _res;                                                   \
> +       })                                                              \
> +
> --
> 2.39.2
>



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-23  4:53 ` Pavan Chebbi
@ 2023-03-23  5:08   ` Jakub Kicinski
  0 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-23  5:08 UTC (permalink / raw)
  To: Pavan Chebbi; +Cc: davem, netdev, edumazet, pabeni, willemb, alexander.duyck

On Thu, 23 Mar 2023 10:23:32 +0530 Pavan Chebbi wrote:
> > +               /* We need to check again in a case another             \
> > +                * CPU has just made room available.                    \
> > +                */                                                     \
> > +               if (likely(get_desc < start_thrs)) {                    \  
> 
> I am only curious to understand why initializing _res with the likely
> result and having a condition that covers only the unlikely case would
> not be better.
> As in:
>     int _res = 0;
>     if (unlikely(get_desc >= start_thrs)) {
>         start_queue();
>         _res = -1;
>     }

I don't think it matters.


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
                   ` (4 preceding siblings ...)
  2023-03-23  4:53 ` Pavan Chebbi
@ 2023-03-23 16:05 ` Alexander H Duyck
  2023-03-24  3:09   ` Jakub Kicinski
  5 siblings, 1 reply; 45+ messages in thread
From: Alexander H Duyck @ 2023-03-23 16:05 UTC (permalink / raw)
  To: Jakub Kicinski, davem; +Cc: netdev, edumazet, pabeni, willemb

On Wed, 2023-03-22 at 16:30 -0700, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> ---
>  include/net/netdev_queues.h | 171 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 171 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..64e059647274
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,171 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up
> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\
> +									\
> > +		/* We need to check again in case another		\
> +		 * CPU has just made room available.			\
> +		 */							\
> +		if (likely(get_desc < start_thrs)) {			\
> +			_res = 0;					\
> +		} else {						\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +

The issue I see here is that with this being a macro it abstracts away
the relationship between get_desc and the memory barrier. At a minimum
I think we should be treating get_desc as a function instead of just
passing it and its arguments as a single value. Maybe something more
like how read_poll_timeout passes the "op" and then uses it as a
function with args passed separately. What we want to avoid is passing
a precomputed value to this function as get_desc.

In addition I think I would prefer to see _res initialized to the
likely value so that we can drop this to one case instead of having to
have two. Same thing for the macros below.
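A rough sketch of what that read_poll_timeout-style variant might look like (the macro name, the stubs and the ring model below are invented for illustration, not taken from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stubs standing in for the kernel helpers. */
struct txq_model { bool stopped; };
#define netif_tx_stop_queue(q)  ((q)->stopped = true)
#define netif_tx_start_queue(q) ((q)->stopped = false)
#define smp_mb() do { } while (0)	/* real barrier elided in this model */

/* Hypothetical variant: the descriptor-count operation and its
 * arguments are passed separately, so the macro visibly re-invokes
 * the op after the barrier instead of taking a precomputed value.
 */
#define netif_tx_queue_try_stop_op(txq, start_thrs, get_desc_op, args...) \
	({								\
		int _res = 0;						\
									\
		netif_tx_stop_queue(txq);				\
		smp_mb();						\
		/* the op is evaluated again here, post-barrier */	\
		if ((get_desc_op)(args) >= (start_thrs)) {		\
			netif_tx_start_queue(txq);			\
			_res = -1;					\
		}							\
		_res;							\
	})

/* Example op: read the current free-descriptor count. */
static int free_desc(const int *ring_free)
{
	return *ring_free;
}
```

Compared to passing `free_desc(&ring_free)` as a single expression, the call site cannot accidentally hand over a value computed before the barrier.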

> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> + *
> + * Returns:
> + *	 0 if the queue was stopped
> + *	 1 if the queue was left enabled
> + *	-1 if the queue was re-enabled (raced with waking)
> + */

We may want to change the values here. The most likely case is "left
enabled". With that being the case we probably want to make that the 0
case. I would then probably make 1 the re-enabled case and -1 the
stopped case.

With that the decision tree becomes more straightforward as we would do
something like:
	if (result) {
		if (result < 0)
			Increment stopped stat
			return
		else
			Increment restarted stat
	}

In addition for readability we may want to consider adding an enum similar
to the netdev_tx enum as then we have the return types locked and
usable should we want to specifically pick out one.
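Such an enum (names purely illustrative, in the spirit of netdev_tx_t rather than anything in the patch) and the decision tree above might look like:

```c
#include <assert.h>

/* Hypothetical return codes shared by the stop and wake helpers;
 * 0 is the common "nothing happened" case in both directions.
 */
enum netdev_queue_op_ret {
	NETDEV_QUEUE_NOOP      = 0,	/* left enabled / already enabled */
	NETDEV_QUEUE_RESTARTED = 1,	/* raced and was re-enabled / woken */
	NETDEV_QUEUE_STOPPED   = -1,	/* queue is (still) stopped */
};

struct q_stats { int stopped; int restarted; };

/* The decision tree from the review comment, written out as code;
 * returns nonzero when the caller should stop transmitting.
 */
static int handle_stop_result(enum netdev_queue_op_ret res,
			      struct q_stats *stats)
{
	if (res) {
		if (res < 0) {
			stats->stopped++;
			return 1;
		}
		stats->restarted++;
	}
	return 0;
}
```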


> +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> +	({								\
> +		int _res;						\
> +									\
> +		if (likely(get_desc > stop_thrs))			\
> +			_res = 1;					\
> +		else							\
> +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> +						       start_thrs);	\
> +		_res;							\
> +	})								\
> +
> +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		/* Make sure that anybody stopping the queue after	\
> +		 * this sees the new next_to_clean.			\
> +		 */							\
> +		smp_mb();						\
> +		if (netif_tx_queue_stopped(txq) && !(down_cond)) {	\
> +			netif_tx_wake_queue(txq);			\
> +			_res = 0;					\
> +		} else {						\
> +			_res = 1;					\
> +		}							\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
> +
> +/**
> + * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @start_thrs:	minimal number of descriptors to re-enable the queue
> + * @down_cond:	down condition, predicate indicating that the queue should
> + *		not be woken up even if descriptors are available
> + *
> + * All arguments may be evaluated multiple times.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + *
> + * Returns:
> + *	 0 if the queue was woken up
> + *	 1 if the queue was already enabled (or disabled but @down_cond is true)
> + *	-1 if the queue was left stopped
> + */

I would go with the same here. The most common case should probably be
0 which would be "already enabled" with 1 being woken up and -1 being
stopped. In addition keeping the two consistent with each other would
allow for easier understanding of the two.

> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		if (likely(get_desc < start_thrs))			\
> +			_res = -1;					\
> +		else							\
> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> +							 start_thrs,	\
> +							 down_cond);	\
> +		_res;							\
> +	})
> +

The likely here is probably not correct. In most cases the queue will
likely have enough descriptors to enable waking since Tx cleanup can
usually run pretty fast compared to the transmit path itself since it
can run without needing to take locks.

> +#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
> +
> +/* subqueue variants follow */
> +
> +#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
> +	})
> +
> +#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_maybe_stop(txq, get_desc,		\
> +					  stop_thrs, start_thrs);	\
> +	})
> +
> +#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_try_wake(txq, get_desc,		\
> +					  start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
> +	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_maybe_wake(txq, get_desc,		\
> +					    start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
> +	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#endif



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-23  1:04     ` [Intel-wired-lan] " Jakub Kicinski
@ 2023-03-23 21:02       ` Andrew Lunn
  -1 siblings, 0 replies; 45+ messages in thread
From: Andrew Lunn @ 2023-03-23 21:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, willemb, alexander.duyck,
	michael.chan, intel-wired-lan, anthony.l.nguyen,
	jesse.brandeburg

On Wed, Mar 22, 2023 at 06:04:06PM -0700, Jakub Kicinski wrote:
> CC: maintainers, in case there isn't a repost
> https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
> 
> On Thu, 23 Mar 2023 01:35:34 +0100 Andrew Lunn wrote:
> > On Wed, Mar 22, 2023 at 04:30:26PM -0700, Jakub Kicinski wrote:
> > > A lot of drivers follow the same scheme to stop / start queues
> > > without introducing locks between xmit and NAPI tx completions.
> > > I'm guessing they all copy'n'paste each other's code.
> > >
> > > Smaller drivers shy away from the scheme and introduce a lock
> > > which may cause deadlocks in netpoll.  
> > 
> > I notice there is no patch 0/X. Seems like the above would be good
> > material for it, along with a comment that a few drivers are converted
> > to make use of the new macros.
> 
> Then do I repeat the same text in the commit? Or cut the commit down?
> Doesn't that just take away information from the commit which will
> show up in git blame?
> 
> Having a cover letter is a good default, and required if the series 
> is a larger change decomposed into steps. But here there is a major
> change and a bunch of loose conversions. More sample users than
> meaningful part.
> 
> LMK what your preference for splitting this info is, I'm unsure.

We do seem to have a policy of asking for a 0/X. And it is used for
the merge commit. That is my real point. And I don't see why the text
can't be repeated in the merge commit and the individual commits.

    Andrew



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-23 21:02       ` [Intel-wired-lan] " Andrew Lunn
@ 2023-03-23 22:46         ` Jakub Kicinski
  -1 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-23 22:46 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: davem, netdev, edumazet, pabeni, willemb, alexander.duyck,
	michael.chan, intel-wired-lan, anthony.l.nguyen,
	jesse.brandeburg

On Thu, 23 Mar 2023 22:02:29 +0100 Andrew Lunn wrote:
> We do seem to have a policy of asking for a 0/X. And it is used for
> the merge commit. That is my real point. And i don't see why the text
> can be repeated in the merge commit and the individual commits.

Alright, I'll respin and add a cover. But I think we should be critical
of the rules. The rules are just a mental shortcut, we shouldn't lose
sight of why they are in place and maintain some flexibility.



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-23 16:05 ` Alexander H Duyck
@ 2023-03-24  3:09   ` Jakub Kicinski
  2023-03-24 15:45     ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-24  3:09 UTC (permalink / raw)
  To: Alexander H Duyck; +Cc: davem, netdev, edumazet, pabeni, willemb

On Thu, 23 Mar 2023 09:05:35 -0700 Alexander H Duyck wrote:
> > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		netif_tx_stop_queue(txq);				\
> > +									\
> > +		smp_mb();						\
> > +									\
> > > +		/* We need to check again in case another		\
> > +		 * CPU has just made room available.			\
> > +		 */							\
> > +		if (likely(get_desc < start_thrs)) {			\
> > +			_res = 0;					\
> > +		} else {						\
> > +			netif_tx_start_queue(txq);			\
> > +			_res = -1;					\
> > +		}							\
> > +		_res;							\
> > +	})								\
> > +  
> 
> The issue I see here is that with this being a macro it abstracts away
> the relationship between get_desc and the memory barrier. At a minimum
> I think we should be treating get_desc as a function instead of just
> passing it and its arguments as a single value. Maybe something more
> like how read_poll_timeout passes the "op" and then uses it as a
> function with args passed seperately. What we want to avoid is passing
> a precomuted value to this function as get_desc.

The kdocs hopefully have enough warnings. The issue I see with
read_poll_timeout() is that I always have to have the definition 
open side by side to match up the arguments. I wish there was 
a way to test that something is not an lval, but I couldn't
find it :(

Let's see if anyone gets this wrong, you can tell me "I told you so"?

> In addition I think I would prefer to see _res initialized to the
> likely value so that we can drop this to one case instead of having to
> have two. Same thing for the macros below.

Alright.

> > +/**
> > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > + * @txq:	struct netdev_queue to stop/start
> > + * @get_desc:	get current number of free descriptors (see requirements below!)
> > + * @stop_thrs:	minimal number of available descriptors for queue to be left
> > + *		enabled
> > + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> > + *		equal to @stop_thrs or higher to avoid frequent waking
> > + *
> > + * All arguments may be evaluated multiple times, beware of side effects.
> > + * @get_desc must be a formula or a function call, it must always
> > + * return up-to-date information when evaluated!
> > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> > + *
> > + * Returns:
> > + *	 0 if the queue was stopped
> > + *	 1 if the queue was left enabled
> > + *	-1 if the queue was re-enabled (raced with waking)
> > + */  
> 
> We may want to change the values here. The most likely case is "left
> enabled" with that being the case we probably want to make that the 0
> case. I would then probably make 1 the re-enabled case and -1 the
> stopped case.

I chose the return values this way because the important information is
whether the queue was in fact stopped (in case the macro is used at the
start of .xmit as a safety check). If stopped is zero caller can check
!ret vs !!ret.

Seems pretty normal for the kernel function called stop() to return 0
if it did stop.
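As a sketch of that safety-check usage (modeled in userspace with invented helper names; the real macro's barrier and netif calls are stubbed):

```c
#include <assert.h>
#include <stdbool.h>

#define NETDEV_TX_OK   0
#define NETDEV_TX_BUSY 1

struct txq_model { bool stopped; };

/* Model of try_stop: 0 means the queue stayed stopped, -1 means the
 * re-check found room (raced with a wake) and the queue was restarted.
 */
static int try_stop_model(struct txq_model *q, int free_cnt, int start_thrs)
{
	q->stopped = true;
	/* smp_mb() + re-read of the free count in the real helper */
	if (free_cnt < start_thrs)
		return 0;
	q->stopped = false;
	return -1;
}

/* Hypothetical guard at the top of an ndo_start_xmit implementation:
 * !ret distinguishes "really stopped" from everything else.
 */
static int xmit_guard(struct txq_model *q, int free_cnt, int needed,
		      int start_thrs)
{
	if (free_cnt >= needed)
		return NETDEV_TX_OK;		/* room available, transmit */
	if (!try_stop_model(q, free_cnt, start_thrs))
		return NETDEV_TX_BUSY;	/* stopped: back off, retry later */
	return NETDEV_TX_OK;		/* raced: room appeared, proceed */
}
```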

> With that the decision tree becomes more straightforward as we would do
> something like:
> 	if (result) {
> 		if (result < 0)
> 			Increment stopped stat
> 			return
> 		else
> 			Increment restarted stat
> 	}

Do you see a driver where it matters? ixgbe and co. use
netif_tx_queue_try_stop() and again they just test stopped vs not stopped.

> In addition for readability we may want consider adding an enum simliar
> to the netdev_tx enum as then we have the return types locked and
> usable should we want to specifically pick out one.

Hm, I thought people generally dislike the netdev_tx enum.
Maybe it's just me.

> > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		if (likely(get_desc < start_thrs))			\
> > +			_res = -1;					\
> > +		else							\
> > +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> > +							 start_thrs,	\
> > +							 down_cond);	\
> > +		_res;							\
> > +	})
> > +  
> 
> The likely here is probably not correct. In most cases the queue will
> likely have enough descriptors to enable waking since Tx cleanup can
> usually run pretty fast compared to the transmit path itself since it
> can run without needing to take locks.

Good catch, must be a copy paste from the other direction without
inverting the likely.
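With the hint inverted as discussed, the wake side would read roughly as below (again a userspace model: the stubs and the no-op likely()/smp_mb() are stand-ins, and this is not necessarily the wording of any respin):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the kernel primitives. */
struct txq_model { bool stopped; };
#define netif_tx_queue_stopped(q) ((q)->stopped)
#define netif_tx_wake_queue(q)    ((q)->stopped = false)
#define likely(x)   (x)			/* branch hint only, no-op here */
#define smp_mb()    do { } while (0)

#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond)	\
	({								\
		int _r = 1;						\
									\
		smp_mb();						\
		if (netif_tx_queue_stopped(txq) && !(down_cond)) {	\
			netif_tx_wake_queue(txq);			\
			_r = 0;						\
		}							\
		_r;							\
	})

/* Corrected direction: having enough descriptors is the common case
 * after Tx cleanup, so that is where the likely() hint goes.
 */
#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
	({								\
		int _res = -1;						\
									\
		if (likely((get_desc) >= (start_thrs)))			\
			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
							 start_thrs,	\
							 down_cond);	\
		_res;							\
	})
```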


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-24  3:09   ` Jakub Kicinski
@ 2023-03-24 15:45     ` Alexander Duyck
  2023-03-24 21:28       ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2023-03-24 15:45 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, willemb

On Thu, Mar 23, 2023 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 23 Mar 2023 09:05:35 -0700 Alexander H Duyck wrote:
> > > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)         \
> > > +   ({                                                              \
> > > +           int _res;                                               \
> > > +                                                                   \
> > > +           netif_tx_stop_queue(txq);                               \
> > > +                                                                   \
> > > +           smp_mb();                                               \
> > > +                                                                   \
> > > +           /* We need to check again in case another               \
> > > +            * CPU has just made room available.                    \
> > > +            */                                                     \
> > > +           if (likely(get_desc < start_thrs)) {                    \
> > > +                   _res = 0;                                       \
> > > +           } else {                                                \
> > > +                   netif_tx_start_queue(txq);                      \
> > > +                   _res = -1;                                      \
> > > +           }                                                       \
> > > +           _res;                                                   \
> > > +   })                                                              \
> > > +
> >
> > The issue I see here is that with this being a macro it abstracts away
> > the relationship between get_desc and the memory barrier. At a minimum
> > I think we should be treating get_desc as a function instead of just
> > passing it and its arguments as a single value. Maybe something more
> > like how read_poll_timeout passes the "op" and then uses it as a
> > function with args passed separately. What we want to avoid is passing
> > a precomputed value to this function as get_desc.
>
> The kdocs hopefully have enough warnings. The issue I see with
> read_poll_timeout() is that I always have to have the definition
> open side by side to match up the arguments. I wish there was
> a way to test that something is not an lval, but I couldn't
> find it :(
>
> Let's see if anyone gets this wrong, you can tell me "I told you so"?

The setup for it makes me really uncomfortable. Passing a function w/
arguments as an argument itself usually implies that it is called
first, not during.

> > In addition I think I would prefer to see _res initialized to the
> > likely value so that we can drop this to one case instead of having to
> > have two. Same thing for the macros below.
>
> Alright.
>
> > > +/**
> > > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > > + * @txq:   struct netdev_queue to stop/start
> > > + * @get_desc:      get current number of free descriptors (see requirements below!)
> > > + * @stop_thrs:     minimal number of available descriptors for queue to be left
> > > + *         enabled
> > > + * @start_thrs:    minimal number of descriptors to re-enable the queue, can be
> > > + *         equal to @stop_thrs or higher to avoid frequent waking
> > > + *
> > > + * All arguments may be evaluated multiple times, beware of side effects.
> > > + * @get_desc must be a formula or a function call, it must always
> > > + * return up-to-date information when evaluated!
> > > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> > > + *
> > > + * Returns:
> > > + *  0 if the queue was stopped
> > > + *  1 if the queue was left enabled
> > > + * -1 if the queue was re-enabled (raced with waking)
> > > + */
> >
> > We may want to change the values here. The most likely case is "left
> > enabled". With that being the case we probably want to make that the 0
> > case. I would then probably make 1 the re-enabled case and -1 the
> > stopped case.
>
> I chose the return values this way because the important information is
> whether the queue was in fact stopped (in case the macro is used at the
> start of .xmit as a safety check). If stopped is zero caller can check
> !ret vs !!ret.
>
> Seems pretty normal for the kernel function called stop() to return 0
> if it did stop.

Except this isn't "stop", this is "maybe stop". Maybe we should just
do away with the stop/wake messaging and go with something such as a
RTS/CTS type setup. Basically this function is acting as a RTS to
verify that we have room on the ring to place the frame. If we don't
we are stopped. The "wake" function is on what is essentially the
receiving end on the other side of the hardware after it has DMAed the
frames and is providing the CTS signal back.

> > With that the decision tree becomes more straightforward as we would do
> > something like:
> >       if (result) {
> >               if (result < 0)
> >                       Increment stopped stat
> >                       return
> >               else
> >                       Increment restarted stat
> >       }
>
> Do you see a driver where it matters? ixgbe and co. use
> netif_tx_queue_try_stop() and again they just test stopped vs not stopped.

The thing is, in order to make this work for the ixgbe patch you didn't
use the maybe_stop; instead you went with the try_stop. If you replaced
the ixgbe_maybe_stop_tx with your maybe_stop you would have to do
something such as the code above to make it work. That is what I am
getting at. From what I can tell the only real difference between
ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
move the restart_queue stat increment out.

The general thought is I would prefer to keep it so that 0 is the
default most likely case in both where the queue is enabled and is
still enabled. By moving the "take action" items into the 1/-1 values
then it becomes much easier to sort them out with 1 being a stat
increment and -1 being an indication to stop transmitting and prep for
watchdog hang if we don't clear this in the next watchdog period.

Also in general it makes it easier to understand if these all work
with the same logic.

> > In addition for readability we may want to consider adding an enum similar
> > to the netdev_tx enum as then we have the return types locked and
> > usable should we want to specifically pick out one.
>
> Hm, I thought people generally dislike the netdev_tx enum.
> Maybe it's just me.

The thought I had with the enum is to more easily connect the outcomes
with the sources. It would also help to prevent any confusion on what
is what. Having the two stop/wake functions return different values is
a potential source for errors since 0/1 means different things in the
different functions. Basically since we have 3 possible outcomes using
the enum would make it very clear what the mapping is between the two.

> > > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > > +   ({                                                              \
> > > +           int _res;                                               \
> > > +                                                                   \
> > > +           if (likely(get_desc < start_thrs))                      \
> > > +                   _res = -1;                                      \
> > > +           else                                                    \
> > > +                   _res = __netif_tx_queue_try_wake(txq, get_desc, \
> > > +                                                    start_thrs,    \
> > > +                                                    down_cond);    \
> > > +           _res;                                                   \
> > > +   })
> > > +
> >
> > The likely here is probably not correct. In most cases the queue will
> > likely have enough descriptors to enable waking since Tx cleanup can
> > usually run pretty fast compared to the transmit path itself since it
> > can run without needing to take locks.
>
> Good catch, must be a copy paste from the other direction without
> inverting the likely.

Actually the original code had it there in ixgbe although the check
also included total_packets in it which implies that maybe the
assumption was that Tx cleanup wasn't occurring as often as Rx? Either
that or it was a carry-over and had a check for
__netif_subqueue_stopped included in there previously.


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-24 15:45     ` Alexander Duyck
@ 2023-03-24 21:28       ` Jakub Kicinski
  2023-03-26 21:23         ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-24 21:28 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: davem, netdev, edumazet, pabeni, willemb

On Fri, 24 Mar 2023 08:45:23 -0700 Alexander Duyck wrote:
> On Thu, Mar 23, 2023 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > We may want to change the values here. The most likely case is "left
> > > enabled" with that being the case we probably want to make that the 0
> > > case. I would then probably make 1 the re-enabled case and -1 the
> > > stopped case.  
> >
> > I chose the return values this way because the important information is
> > whether the queue was in fact stopped (in case the macro is used at the
> > start of .xmit as a safety check). If stopped is zero caller can check
> > !ret vs !!ret.
> >
> > Seems pretty normal for the kernel function called stop() to return 0
> > if it did stop.  
> 
> Except this isn't "stop", this is "maybe stop".

So the return value from try_stop and maybe_stop would be different?
try_stop needs to return 0 if it stopped - the same semantics as
trylock(), AFAIR. Not that I love those semantics, but it's a fairly
strong precedent.

> Maybe we should just
> do away with the stop/wake messaging and go with something such as a
> RTS/CTS type setup. Basically this function is acting as a RTS to
> verify that we have room on the ring to place the frame. If we don't
> we are stopped. The "wake" function is on what is essentially the
> receiving end on the other side of the hardware after it has DMAed the
> frames and is providing the CTS signal back.

I'm definitely open to different naming but wouldn't calling RTS _after_
send be even more confusing than maybe_stop?

> > > With that the decision tree becomes more straightforward as we would do
> > > something like:
> > >       if (result) {
> > >               if (result < 0)
> > >                       Increment stopped stat
> > >                       return
> > >               else
> > >                       Increment restarted stat
> > >       }  
> >
> > Do you see a driver where it matters? ixgbe and co. use
> > netif_tx_queue_try_stop() and again they just test stopped vs not stopped.  
> 
> The thing is, in order to make this work for the ixgbe patch you didn't
> use the maybe_stop; instead you went with the try_stop. If you replaced
> the ixgbe_maybe_stop_tx with your maybe_stop you would have to do
> something such as the code above to make it work. That is what I am
> getting at. From what I can tell the only real difference between
> ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> move the restart_queue stat increment out.

I can convert ixgbe further, true, but I needed the try_stop, anyway,
because bnxt does:

if (/* need to stop */) {
	if (xmit_more())
		flush_db_write();
	netif_tx_queue_try_stop();
}

which seems reasonable.

> The general thought is I would prefer to keep it so that 0 is the
> default most likely case in both where the queue is enabled and is
> still enabled. By moving the "take action" items into the 1/-1 values
> then it becomes much easier to sort them out with 1 being a stat
> increment and -1 being an indication to stop transmitting and prep for
> watchdog hang if we don't clear this in the next watchdog period.

Maybe worth taking a step back - the restart stat which ixgbe
maintains made perfect sense when you pioneered this approach but
I think we had a decade of use, and have kprobes now, so we don't
really need to maintain a statistic for a condition with no impact 
to the user? New drivers should not care about 1 vs -1.

> Also in general it makes it easier to understand if these all work
> with the same logic.
> 
> > > In addition for readability we may want to consider adding an enum similar
> > > to the netdev_tx enum as then we have the return types locked and
> > > usable should we want to specifically pick out one.  
> >
> > Hm, I thought people generally dislike the netdev_tx enum.
> > Maybe it's just me.  
> 
> The thought I had with the enum is to more easily connect the outcomes
> with the sources. It would also help to prevent any confusion on what
> is what. Having the two stop/wake functions return different values is
> a potential source for errors since 0/1 means different things in the
> different functions. Basically since we have 3 possible outcomes using
> the enum would make it very clear what the mapping is between the two.

IMO only two outcomes matter in practice (as mentioned above).
I really like the ability to treat the return value as a bool, if only
we had negative zero we would have a perfect compromise :(

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-24 21:28       ` Jakub Kicinski
@ 2023-03-26 21:23         ` Alexander Duyck
  2023-03-29  0:56           ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2023-03-26 21:23 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, willemb

On Fri, Mar 24, 2023 at 2:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 24 Mar 2023 08:45:23 -0700 Alexander Duyck wrote:
> > On Thu, Mar 23, 2023 at 8:09 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > We may want to change the values here. The most likely case is "left
> > > > enabled" with that being the case we probably want to make that the 0
> > > > case. I would then probably make 1 the re-enabled case and -1 the
> > > > stopped case.
> > >
> > > I chose the return values this way because the important information is
> > > whether the queue was in fact stopped (in case the macro is used at the
> > > start of .xmit as a safety check). If stopped is zero caller can check
> > > !ret vs !!ret.
> > >
> > > Seems pretty normal for the kernel function called stop() to return 0
> > > if it did stop.
> >
> > Except this isn't "stop", this is "maybe stop".
>
> So the return value from try_stop and maybe_stop would be different?
> try_stop needs to return 0 if it stopped - the same semantics as
> trylock(), AFAIR. Not that I love those semantics, but it's a fairly
> strong precedent.

The problem is this isn't a lock. Ideally with this we aren't taking
the action. So if anything this functions in my mind more like the
inverse where if this does stop we have to abort more like trylock
failing.

This is why I mentioned that maybe this should be renamed. I view this
more as a check to verify we are good to proceed. In addition there is
the problem that there are 3 possible outcomes with maybe_stop versus
the two from try_stop.
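
To make the two-versus-three outcome point concrete, a userspace sketch (the struct, helper names and exact return values are illustrative, not the macros from the patch; 0 == stopped follows the trylock-style semantics discussed above):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative userspace model of the two helpers being debated.
 * struct txq and txq_room() stand in for real ring state.
 */
struct txq {
	unsigned int prod, cons, size;
	bool stopped;
};

static unsigned int txq_room(const struct txq *q)
{
	return q->size - (q->prod - q->cons);
}

/* try_stop: two outcomes, trylock-style.
 *  0 -> queue was stopped
 *  1 -> the race re-check found room, so the stop was undone
 */
static int txq_try_stop(struct txq *q, unsigned int needed)
{
	q->stopped = true;
	/* smp_mb() in the real lockless version */
	if (txq_room(q) < needed)
		return 0;
	q->stopped = false;
	return 1;
}

/* maybe_stop: three outcomes -- the common "left enabled" case plus
 * the two try_stop outcomes.
 *  1 -> queue left enabled (common case)
 *  0 -> queue stopped, caller must return
 * -1 -> raced: stopped, then re-enabled by the re-check
 */
static int txq_maybe_stop(struct txq *q, unsigned int needed)
{
	if (txq_room(q) >= needed)
		return 1;
	return txq_try_stop(q, needed) ? -1 : 0;
}
```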

> > Maybe we should just
> > do away with the stop/wake messaging and go with something such as a
> > RTS/CTS type setup. Basically this function is acting as a RTS to
> > verify that we have room on the ring to place the frame. If we don't
> > we are stopped. The "wake" function is on what is essentially the
> > receiving end on the other side of the hardware after it has DMAed the
> > frames and is providing the CTS signal back.
>
> I'm definitely open to different naming but wouldn't calling RTS _after_
> send be even more confusing than maybe_stop?

We don't call maybe_stop after the send. For that we would be calling
try_stop. The difference being in the case of maybe_stop we might have
to return busy.

> > > > With that the decision tree becomes more straightforward as we would do
> > > > something like:
> > > >       if (result) {
> > > >               if (result < 0)
> > > >                       Increment stopped stat
> > > >                       return
> > > >               else
> > > >                       Increment restarted stat
> > > >       }
> > >
> > > Do you see a driver where it matters? ixgbe and co. use
> > > netif_tx_queue_try_stop() and again they just test stopped vs not stopped.
> >
> > The thing is in order to make this work for the ixgbe patch you didn't
> > use the maybe_stop instead you went with the try_stop. If you replaced
> the ixgbe_maybe_stop_tx with your maybe_stop you would have to do
> > something such as the code above to make it work. That is what I am
> > getting at. From what I can tell the only real difference between
> > ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> > move the restart_queue stat increment out.
>
> I can convert ixgbe further, true, but I needed the try_stop, anyway,
> because bnxt does:
>
> if (/* need to stop */) {
>         if (xmit_more())
>                 flush_db_write();
>         netif_tx_queue_try_stop();
> }
>
> which seems reasonable.

I wasn't saying we didn't need try_stop. However the logic here
doesn't care about the return value. In the ixgbe case we track the
queue restarts so we would want a 0 on success and a non-zero if we
have to increment the stat. I would be okay with the 0 (success) / -1
(queue restarted) in this case.

> > The general thought is I would prefer to keep it so that 0 is the
> > default most likely case in both where the queue is enabled and is
> > still enabled. By moving the "take action" items into the 1/-1 values
> > then it becomes much easier to sort them out with 1 being a stat
> > increment and -1 being an indication to stop transmitting and prep for
> > watchdog hang if we don't clear this in the next watchdog period.
>
> Maybe worth taking a step back - the restart stat which ixgbe
> maintains made perfect sense when you pioneered this approach but
> I think we had a decade of use, and have kprobes now, so we don't
> really need to maintain a statistic for a condition with no impact
> to the user? New drivers should not care about 1 vs -1.

Actually the restart_queue stat is VERY useful for debugging. It tells
us we are seeing backlogs develop in the Tx queue. We track it any
time we wake up the queue, not just in the maybe_stop case.

Without that we are then having to break out kprobes and the like
which we could only add after-the-fact which makes things much harder
to debug when issues occur. For example, a common case to use it is to
monitor it when we see a system with slow Tx connections. With that
stat we can tell if we are building a backlog in the qdisc or if it is
something else such as a limited amount of socket memory is limiting
the transmits.

> > Also in general it makes it easier to understand if these all work
> > with the same logic.
> >
> > > > In addition for readability we may want to consider adding an enum similar
> > > > to the netdev_tx enum as then we have the return types locked and
> > > > usable should we want to specifically pick out one.
> > >
> > > Hm, I thought people generally dislike the netdev_tx enum.
> > > Maybe it's just me.
> >
> > The thought I had with the enum is to more easily connect the outcomes
> > with the sources. It would also help to prevent any confusion on what
> > is what. Having the two stop/wake functions return different values is
> > a potential source for errors since 0/1 means different things in the
> > different functions. Basically since we have 3 possible outcomes using
> > the enum would make it very clear what the mapping is between the two.
>
> IMO only two outcomes matter in practice (as mentioned above).
> I really like the ability to treat the return value as a bool, if only
> we had negative zero we would have a perfect compromise :(

I think we are just thinking about two different things. I am focusing
on the "maybe" calls that have 3 outcomes whereas I think you are
mostly focused on the "try" calls. My thought is to treat it something
like the msix allocation calls where a negative indicates a failure
forcing us to stop since the ring is full, 0 is a success, and a value
indicates that there are resources but they are/were limited.
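
A sketch of the calling convention described here (0 for the common case, a positive value meaning "count it and continue", a negative value meaning "abort"); the function and stat names are hypothetical, chosen only to mirror the decision tree quoted earlier in the thread:

```c
#include <assert.h>

/* Per-queue counters as a driver like ixgbe might keep them */
struct txq_stats {
	unsigned long stopped;
	unsigned long restarted;
};

/* Dispatch on a three-valued maybe_stop result. Returns 0 to keep
 * transmitting, non-0 when the caller must back off. */
static int handle_maybe_stop(int maybe_stop_ret, struct txq_stats *stats)
{
	if (maybe_stop_ret) {
		if (maybe_stop_ret < 0) {
			stats->stopped++;	/* queue stopped: abort xmit */
			return -1;
		}
		stats->restarted++;		/* raced and re-enabled */
	}
	return 0;				/* common case: keep going */
}
```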


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-26 21:23         ` Alexander Duyck
@ 2023-03-29  0:56           ` Jakub Kicinski
  2023-03-30 14:56             ` Paolo Abeni
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-03-29  0:56 UTC (permalink / raw)
  To: Alexander Duyck, pabeni; +Cc: davem, netdev, edumazet, willemb

On Sun, 26 Mar 2023 14:23:07 -0700 Alexander Duyck wrote:
> > > Except this isn't "stop", this is "maybe stop".  
> >
> > So the return value from try_stop and maybe_stop would be different?
> > try_stop needs to return 0 if it stopped - the same semantics as
> > trylock(), AFAIR. Not that I love those semantics, but it's a fairly
> > strong precedent.  
> 
> The problem is this isn't a lock. Ideally with this we aren't taking
> the action. So if anything this functions in my mind more like the
> inverse where if this does stop we have to abort more like trylock
> failing.

No.. for try_stop we are trying to stop.

> This is why I mentioned that maybe this should be renamed. I view this
> more as a check to verify we are good to proceed. In addition there is
> the problem that there are 3 possible outcomes with maybe_stop versus
> the two from try_stop.

I'm open to other names :S

> > > The thing is in order to make this work for the ixgbe patch you didn't
> > > use the maybe_stop instead you went with the try_stop. If you replaced
> > the ixgbe_maybe_stop_tx with your maybe_stop you would have to do
> > > something such as the code above to make it work. That is what I am
> > > getting at. From what I can tell the only real difference between
> > > ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> > > move the restart_queue stat increment out.  
> >
> > I can convert ixgbe further, true, but I needed the try_stop, anyway,
> > because bnxt does:
> >
> > if (/* need to stop */) {
> >         if (xmit_more())
> >                 flush_db_write();
> >         netif_tx_queue_try_stop();
> > }
> >
> > which seems reasonable.  
> 
> I wasn't saying we didn't need try_stop. However the logic here
> doesn't care about the return value. In the ixgbe case we track the
> queue restarts so we would want a 0 on success and a non-zero if we
> have to increment the stat. I would be okay with the 0 (success) / -1
> (queue restarted) in this case.
> 
> > > The general thought is I would prefer to keep it so that 0 is the
> > > default most likely case in both where the queue is enabled and is
> > > still enabled. By moving the "take action" items into the 1/-1 values
> > > then it becomes much easier to sort them out with 1 being a stat
> > > increment and -1 being an indication to stop transmitting and prep for
> > > watchdog hang if we don't clear this in the next watchdog period.  
> >
> > Maybe worth taking a step back - the restart stat which ixgbe
> > maintains made perfect sense when you pioneered this approach but
> > I think we had a decade of use, and have kprobes now, so we don't
> > really need to maintain a statistic for a condition with no impact
> > to the user? New drivers should not care about 1 vs -1.
> 
> Actually the restart_queue stat is VERY useful for debugging. It tells
> us we are seeing backlogs develop in the Tx queue. We track it any
> time we wake up the queue, not just in the maybe_stop case.
> 
> Without that we are then having to break out kprobes and the like
> which we could only add after-the-fact which makes things much harder
> to debug when issues occur. For example, a common case to use it is to
> monitor it when we see a system with slow Tx connections. With that
> stat we can tell if we are building a backlog in the qdisc or if it is
> something else such as a limited amount of socket memory is limiting
> the transmits.

Oh, I missed that wake uses the same stat. Let me clarify - the
stop/start counter is definitely useful. What I thought the restart
counter is counting is just the race cases. I don't think the race
cases are worth counting in any way.

> > > The thought I had with the enum is to more easily connect the outcomes
> > > with the sources. It would also help to prevent any confusion on what
> > > is what. Having the two stop/wake functions return different values is
> > > a potential source for errors since 0/1 means different things in the
> > > different functions. Basically since we have 3 possible outcomes using
> > > the enum would make it very clear what the mapping is between the two.  
> >
> > IMO only two outcomes matter in practice (as mentioned above).
> > I really like the ability to treat the return value as a bool, if only
> > we had negative zero we would have a perfect compromise :(  
> 
> I think we are just thinking about two different things. I am focusing
> on the "maybe" calls that have 3 outcomes whereas I think you are
> mostly focused on the "try" calls. My thought is to treat it something
> like the msix allocation calls where a negative indicates a failure
> forcing us to stop since the ring is full, 0 is a success, and a value
> indicates that there are resources but they are/were limited.

I don't see a strong analogy to PCI resource allocation :(

I prefer to keep the 0 vs non-0 distinction to indicate whether 
the action was performed.

Paolo, Eric, any opinion? Other than the one likely vs unlikely
flip -- is this good enough to merge for you?


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-03-29  0:56           ` Jakub Kicinski
@ 2023-03-30 14:56             ` Paolo Abeni
  0 siblings, 0 replies; 45+ messages in thread
From: Paolo Abeni @ 2023-03-30 14:56 UTC (permalink / raw)
  To: Jakub Kicinski, Alexander Duyck; +Cc: davem, netdev, edumazet, willemb

Hi,

On Tue, 2023-03-28 at 17:56 -0700, Jakub Kicinski wrote:
> On Sun, 26 Mar 2023 14:23:07 -0700 Alexander Duyck wrote:
> > > > Except this isn't "stop", this is "maybe stop".  
> > > 
> > > So the return value from try_stop and maybe_stop would be different?
> > > try_stop needs to return 0 if it stopped - the same semantics as
> > > trylock(), AFAIR. Not that I love those semantics, but it's a fairly
> > > strong precedent.  
> > 
> > The problem is this isn't a lock. Ideally with this we aren't taking
> > the action. So if anything this functions in my mind more like the
> > inverse where if this does stop we have to abort more like trylock
> > failing.
> 
> No.. for try_stop we are trying to stop.
> 
> > This is why I mentioned that maybe this should be renamed. I view this
> > more as a check to verify we are good to proceed. In addition there is
> > the problem that there are 3 possible outcomes with maybe_stop versus
> > the two from try_stop.
> 
> I'm open to other names :S
> 
> > > > The thing is in order to make this work for the ixgbe patch you didn't
> > > > use the maybe_stop instead you went with the try_stop. If you replaced
> > > > the ixgbe_maybe_stop_tx with your maybe_stop you would have to do
> > > > something such as the code above to make it work. That is what I am
> > > > getting at. From what I can tell the only real difference between
> > > > ixgbe_maybe_stop_tx and your maybe_stop is that you avoided having to
> > > > move the restart_queue stat increment out.  
> > > 
> > > I can convert ixgbe further, true, but I needed the try_stop, anyway,
> > > because bnxt does:
> > > 
> > > if (/* need to stop */) {
> > >         if (xmit_more())
> > >                 flush_db_write();
> > >         netif_tx_queue_try_stop();
> > > }
> > > 
> > > which seems reasonable.  
> > 
> > I wasn't saying we didn't need try_stop. However the logic here
> > doesn't care about the return value. In the ixgbe case we track the
> > queue restarts so we would want a 0 on success and a non-zero if we
> > have to increment the stat. I would be okay with the 0 (success) / -1
> > (queue restarted) in this case.
> > 
> > > > The general thought is I would prefer to keep it so that 0 is the
> > > > default most likely case in both where the queue is enabled and is
> > > > still enabled. By moving the "take action" items into the 1/-1 values
> > > > then it becomes much easier to sort them out with 1 being a stat
> > > > increment and -1 being an indication to stop transmitting and prep for
> > > > watchdog hang if we don't clear this in the next watchdog period.  
> > > 
> > > Maybe worth taking a step back - the restart stat which ixgbe
> > > maintains made perfect sense when you pioneered this approach but
> > > I think we had a decade of use, and have kprobes now, so we don't
> > > really need to maintain a statistic for a condition with no impact
> > > to the user? New drivers should not care about 1 vs -1.
> > 
> > Actually the restart_queue stat is VERY useful for debugging. It tells
> > us we are seeing backlogs develop in the Tx queue. We track it any
> > time we wake up the queue, not just in the maybe_stop case.
> > 
> > Without that we are then having to break out kprobes and the like
> > which we could only add after-the-fact which makes things much harder
> > to debug when issues occur. For example, a common case to use it is to
> > monitor it when we see a system with slow Tx connections. With that
> > stat we can tell if we are building a backlog in the qdisc or if it is
> > something else such as a limited amount of socket memory is limiting
> > the transmits.
> 
> Oh, I missed that wake uses the same stat. Let me clarify - the
> stop/start counter is definitely useful. What I thought the restart
> counter is counting is just the race cases. I don't think the race
> cases are worth counting in any way.
> 
> > > > The thought I had with the enum is to more easily connect the outcomes
> > > > with the sources. It would also help to prevent any confusion on what
> > > > is what. Having the two stop/wake functions return different values is
> > > > a potential source for errors since 0/1 means different things in the
> > > > different functions. Basically since we have 3 possible outcomes using
> > > > the enum would make it very clear what the mapping is between the two.  
> > > 
> > > IMO only two outcomes matter in practice (as mentioned above).
> > > I really like the ability to treat the return value as a bool, if only
> > > we had negative zero we would have a perfect compromise :(  
> > 
> > I think we are just thinking about two different things. I am focusing
> > on the "maybe" calls that have 3 outcomes whereas I think you are
> > mostly focused on the "try" calls. My thought is to treat it something
> > like the msix allocation calls where a negative indicates a failure
> > forcing us to stop since the ring is full, 0 is a success, and a value
> > indicates that there are resources but they are/were limited.
> 
> I don't see a strong analogy to PCI resource allocation :(
> 
> I prefer to keep the 0 vs non-0 distinction to indicate whether 
> the action was performed.
> 
> Paolo, Eric, any opinion? Other than the one likely vs unlikely
> flip -- is this good enough to merge for you?

As you know I'm usually horrible at name related choice, but you asked,
so...

I'm personally ok with the current naming, and AFAICS the coding style
guidelines suggest returning 0 when imperative functions complete
successfully. 

I think we should apply the guidelines here, even if we are talking
about macros.

That means netif_tx_queue_maybe_stop() and netif_tx_queue_try_stop()
should return 0 when the queue is actually stopped.

I'm personally fine with the current implementation.

Cheers,

Paolo



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-07  1:14                             ` Herbert Xu
@ 2023-04-07  1:21                               ` Jakub Kicinski
  0 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-07  1:21 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Fri, 7 Apr 2023 09:14:34 +0800 Herbert Xu wrote:
> Right, netconsole is the one exception that can occur in any
> context.  However, as I said in the previous email I don't see
> how this can break the wake logic and leave the queue stopped
> forever.

Yup, agreed! just wanted to check I'm not missing some netpoll detail :)


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-07  1:03                           ` Jakub Kicinski
@ 2023-04-07  1:14                             ` Herbert Xu
  2023-04-07  1:21                               ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Herbert Xu @ 2023-04-07  1:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Thu, Apr 06, 2023 at 06:03:37PM -0700, Jakub Kicinski wrote:
>
> I couldn't find a check in netpoll/netconsole which would defer
> when in hardirq. Does it defer?

Right, netconsole is the one exception that can occur in any
context.  However, as I said in the previous email I don't see
how this can break the wake logic and leave the queue stopped
forever.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-07  0:58                         ` Herbert Xu
@ 2023-04-07  1:03                           ` Jakub Kicinski
  2023-04-07  1:14                             ` Herbert Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-07  1:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Fri, 7 Apr 2023 08:58:06 +0800 Herbert Xu wrote:
> Packet transmit can only occur in process context or softirq context

I couldn't find a check in netpoll/netconsole which would defer
when in hardirq. Does it defer?


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 14:46                       ` Jakub Kicinski
  2023-04-06 15:45                         ` Paul E. McKenney
@ 2023-04-07  0:58                         ` Herbert Xu
  2023-04-07  1:03                           ` Jakub Kicinski
  1 sibling, 1 reply; 45+ messages in thread
From: Herbert Xu @ 2023-04-07  0:58 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paul E. McKenney, Alexander Duyck, Heiner Kallweit, davem,
	netdev, edumazet, pabeni

On Thu, Apr 06, 2023 at 07:46:48AM -0700, Jakub Kicinski wrote:
> On Thu, 6 Apr 2023 07:17:09 -0700 Paul E. McKenney wrote:
>
> > Agreed, preemption will be enabled in softirq, but interrupts can still
> > happen, correct?
> 
> Starting the queue only happens from softirq (I hope) and stopping 
> can happen from any context. So we're risking false-starts again.
> I think this puts to bed any hope of making this code safe against
> false-starts with just barriers :(

Packet transmit can only occur in process context or softirq context
so it can't interrupt an existing softirq context on the same CPU.

However, even if we did allow one or both to occur in IRQ context I
can't see how this can break.

I think Paul didn't see my earlier message because he wasn't part
of the original CC list.  The only race we're trying to stop is the
one which leaves the TX queue stopped forever.  We don't care about
false starts.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 15:56                           ` Jakub Kicinski
@ 2023-04-06 16:25                             ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2023-04-06 16:25 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, Apr 06, 2023 at 08:56:29AM -0700, Jakub Kicinski wrote:
> On Thu, 6 Apr 2023 08:45:10 -0700 Paul E. McKenney wrote:
> > > Starting the queue only happens from softirq (I hope) and stopping 
> > > can happen from any context. So we're risking false-starts again.
> > > I think this puts to bed any hope of making this code safe against
> > > false-starts with just barriers :(  
> > 
> > Is it possible to jam all the relevant state into a single variable?
> > (I believe that that answer is "no", but just in case asking this question
> > inspires someone to come up with a good idea.)
> 
> Not in any obvious way, half of the state is driver-specific, the other
> half is flags maintained by the core :S

Hey, I had to ask!  ;-)

							Thanx, Paul


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 15:45                         ` Paul E. McKenney
@ 2023-04-06 15:56                           ` Jakub Kicinski
  2023-04-06 16:25                             ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-06 15:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, 6 Apr 2023 08:45:10 -0700 Paul E. McKenney wrote:
> > Starting the queue only happens from softirq (I hope) and stopping 
> > can happen from any context. So we're risking false-starts again.
> > I think this puts to bed any hope of making this code safe against
> > false-starts with just barriers :(  
> 
> Is it possible to jam all the relevant state into a single variable?
> (I believe that that answer is "no", but just in case asking this question
> inspires someone to come up with a good idea.)

Not in any obvious way, half of the state is driver-specific, the other
half is flags maintained by the core :S


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 14:46                       ` Jakub Kicinski
@ 2023-04-06 15:45                         ` Paul E. McKenney
  2023-04-06 15:56                           ` Jakub Kicinski
  2023-04-07  0:58                         ` Herbert Xu
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2023-04-06 15:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, Apr 06, 2023 at 07:46:48AM -0700, Jakub Kicinski wrote:
> On Thu, 6 Apr 2023 07:17:09 -0700 Paul E. McKenney wrote:
> > > > Mightn't preemption or interrupts cause further issues?  Or are preemption
> > > > and/or interrupts disabled across the relevant sections of code?  
> > > 
> > > The code in question is supposed to run in softirq context.  So
> > > both interrupts and preemption should be disabled.  
> > 
> > Agreed, preemption will be enabled in softirq, but interrupts can still
> > happen, correct?
> 
> Starting the queue only happens from softirq (I hope) and stopping 
> can happen from any context. So we're risking false-starts again.
> I think this puts to bed any hope of making this code safe against
> false-starts with just barriers :(

Is it possible to jam all the relevant state into a single variable?
(I believe that that answer is "no", but just in case asking this question
inspires someone to come up with a good idea.)

							Thanx, Paul


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06 14:17                     ` Paul E. McKenney
@ 2023-04-06 14:46                       ` Jakub Kicinski
  2023-04-06 15:45                         ` Paul E. McKenney
  2023-04-07  0:58                         ` Herbert Xu
  0 siblings, 2 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-06 14:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Herbert Xu, Alexander Duyck, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, 6 Apr 2023 07:17:09 -0700 Paul E. McKenney wrote:
> > > Mightn't preemption or interrupts cause further issues?  Or are preemption
> > > and/or interrupts disabled across the relevant sections of code?  
> > 
> > The code in question is supposed to run in softirq context.  So
> > both interrupts and preemption should be disabled.  
> 
> Agreed, preemption will be enabled in softirq, but interrupts can still
> happen, correct?

Starting the queue only happens from softirq (I hope) and stopping 
can happen from any context. So we're risking false-starts again.
I think this puts to bed any hope of making this code safe against
false-starts with just barriers :(
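
A single-threaded replay of one interleaving behind the false-start concern: a stop issued from hardirq (netpoll) landing between the softirq wake path's room check and its wake. All names are illustrative stand-ins; the real code uses the queue state bits and barriers, not a plain bool:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ring state, as in the earlier sketches */
struct txq {
	unsigned int prod, cons, size;
	bool stopped;
};

static unsigned int txq_room(const struct txq *q)
{
	return q->size - (q->prod - q->cons);
}

/* Returns true if the interleaving ends with the queue started while
 * the ring is actually full -- the false start. */
static bool replay_false_start(void)
{
	struct txq q = { .prod = 2, .cons = 0, .size = 4, .stopped = false };
	unsigned int room_snapshot;

	/* NAPI (softirq): reclaim completed work, snapshot the room */
	q.cons = 2;
	room_snapshot = txq_room(&q);		/* 4 free slots */

	/* hardirq xmit (netpoll) interrupts: fills the ring and stops */
	q.prod = 6;
	q.stopped = true;

	/* NAPI resumes: sees the queue stopped, but trusts its stale
	 * room snapshot and wakes -- the false start */
	if (q.stopped && room_snapshot >= 1)
		q.stopped = false;

	/* Queue running with zero room. Harmless, as the thread
	 * concludes: the next xmit's maybe_stop re-check stops it. */
	return !q.stopped && txq_room(&q) == 0;
}
```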


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-06  5:15                   ` Herbert Xu
@ 2023-04-06 14:17                     ` Paul E. McKenney
  2023-04-06 14:46                       ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2023-04-06 14:17 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexander Duyck, Jakub Kicinski, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Thu, Apr 06, 2023 at 01:15:25PM +0800, Herbert Xu wrote:
> On Wed, Apr 05, 2023 at 03:20:39PM -0700, Paul E. McKenney wrote:
> >
> > Mightn't preemption or interrupts cause further issues?  Or are preemption
> > and/or interrupts disabled across the relevant sections of code?
> 
> The code in question is supposed to run in softirq context.  So
> both interrupts and preemption should be disabled.

Agreed, preemption will be enabled in softirq, but interrupts can still
happen, correct?

							Thanx, Paul


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-05 22:20                 ` Paul E. McKenney
@ 2023-04-06  5:15                   ` Herbert Xu
  2023-04-06 14:17                     ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Herbert Xu @ 2023-04-06  5:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alexander Duyck, Jakub Kicinski, Heiner Kallweit, davem, netdev,
	edumazet, pabeni

On Wed, Apr 05, 2023 at 03:20:39PM -0700, Paul E. McKenney wrote:
>
> Mightn't preemption or interrupts cause further issues?  Or are preemption
> and/or interrupts disabled across the relevant sections of code?

The code in question is supposed to run in softirq context.  So
both interrupts and preemption should be disabled.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 20:27               ` Alexander Duyck
@ 2023-04-05 22:20                 ` Paul E. McKenney
  2023-04-06  5:15                   ` Herbert Xu
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2023-04-05 22:20 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Heiner Kallweit, davem, netdev, edumazet, pabeni,
	Herbert Xu

On Mon, Apr 03, 2023 at 01:27:44PM -0700, Alexander Duyck wrote:
> On Mon, Apr 3, 2023 at 12:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 3 Apr 2023 11:11:35 -0700 Alexander Duyck wrote:
> > > On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > I don't think in terms of flushes. Let me add line numbers to the
> > > > producer and the consumer.
> > > >
> > > >  c1. WRITE cons
> > > >  c2. mb()  # A
> > > >  c3. READ stopped
> > > >  c4. rmb() # C
> > > >  c5. READ prod, cons
> > > >
> > > >  p1. WRITE prod
> > > >  p2. READ prod, cons
> > > >  p3. mb()  # B
> > > >  p4. WRITE stopped
> > > >  p5. READ prod, cons
> > > >
> > > > The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> > > > orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..
> > >
> > > So which function is supposed to be consumer vs producer here?
> >
> > producer is xmit consumer is NAPI
> >
> > > I think your write stopped is on the wrong side of the memory barrier.
> > > It should be writing prod and stopped both before the barrier.
> >
> > Indeed, Paul pointed out over chat that we need two barriers there
> > to be correct :( Should be fine in practice, first one is BQL,
> > second one is on the slow path.
> >
> > > The maybe/try stop should essentially be:
> > > 1. write tail
> > > 2. read prod/cons
> > > 3. if unused >= 1x packet
> > > 3.a return
> > >
> > > 4. set stop
> > > 5. mb()
> > > 6. Re-read prod/cons
> > > 7. if unused >= 1x packet
> > > 7.a. test_and_clear stop
> > >
> > > The maybe/try wake would be:
> > > 1. write head
> > > 2. read prod/cons
> > > 3. if consumed == 0 || unused < 2x packet
> > > 3.a. return
> > >
> > > 4. mb()
> > > 5. test_and_clear stop
> > >
> > > > > One other thing to keep in mind is that the wake gives itself a pretty
> > > > > good runway. We are talking about enough to transmit at least 2
> > > > > frames. So if another consumer is stopping it we aren't waking it
> > > > > unless there is enough space for yet another frame after the current
> > > > > consumer.
> > > >
> > > > Ack, the race is very unlikely, basically the completing CPU would have
> > > > to take an expensive IRQ between checking the descriptor count and
> > > > checking if stopped -- to let the sending CPU queue multiple frames.
> > > >
> > > > But in theory the race is there, right?
> > >
> > > I don't think this is so much a race as a skid. Specifically when we
> > > wake the queue it will only run for one more packet in such a
> > > scenario. I think it is being run more like a flow control threshold
> > > rather than some sort of lock.
> > >
> > > I think I see what you are getting at though. Basically if the xmit
> > > function were to cycle several times between steps 3.a and 4 in the
> > > maybe/try wake it could fill the queue and then trigger the wake even
> > > though the queue is full and the unused space was already consumed.
> >
> > Yup, exactly. So we either need to sprinkle a couple more barriers
> > and tests in, or document that the code is only 99.999999% safe
> > against false positive restarts and drivers need to check for ring
> > full at the beginning of xmit.
> >
> > I'm quite tempted to add the barriers, because on the NAPI/consumer
> > side we could use this as an opportunity to start piggy backing on
> > the BQL barrier.
> 
> The thing is the more barriers we add the more it will hurt
> performance. I'd be tempted to just increase the runway we have as we
> could afford a 1 packet skid if we had a 2 packet runway for the
> start/stop thresholds.
> 
> I suspect that is probably why we haven't seen any issues as the
> DESC_NEEDED is pretty generous since it is assuming worst case
> scenarios.

Mightn't preemption or interrupts cause further issues?  Or are preemption
and/or interrupts disabled across the relevant sections of code?

							Thanx, Paul


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-04  6:39         ` Herbert Xu
@ 2023-04-04 22:36           ` Jakub Kicinski
  0 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-04 22:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Alexander Duyck, Heiner Kallweit, davem, netdev, edumazet, pabeni

On Tue, 4 Apr 2023 14:39:11 +0800 Herbert Xu wrote:
> Thanks for adding me to this thread as otherwise I would've surely
> missed it.
> 
> I see where the confusion is coming from.  The key is that we weren't
> trying to stop every single race, because not all of them are fatal.
> 
> In particular, we tolerate the race where a wake is done when it
> shouldn't be because the network stack copes with that by requeueing
> the skb onto the qdisc.
> 
> So it's a trade-off.  We could make our code water-tight, but then
> we would be incurring a penalty for every skb.  With our current
> approach, the penalty is only incurred in the unlikely event of a
> race which results in the unlucky skb being requeued.
> 
> The race that we do want to stop is a queue being stuck in a stopped
> state when it shouldn't because that indeed is fatal.
> 
> Going back to the proposed helpers, we only need one mb because
> that's all we need to fix the stuck/stopped queue race.

Thanks, I'm impressed you still remember the details :)

I'll leave it racy in the next version. Re-using the BQL barrier
is a bit more tricky on the xmit path than I thought. I'll just
document that false-positive wake ups are possible.


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 15:18       ` Alexander Duyck
  2023-04-03 15:56         ` Jakub Kicinski
@ 2023-04-04  6:39         ` Herbert Xu
  2023-04-04 22:36           ` Jakub Kicinski
  1 sibling, 1 reply; 45+ messages in thread
From: Herbert Xu @ 2023-04-04  6:39 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jakub Kicinski, Heiner Kallweit, davem, netdev, edumazet, pabeni

On Mon, Apr 03, 2023 at 08:18:04AM -0700, Alexander Duyck wrote:
>
> > I *think* that the right ordering would be:
> >
> > WRITE cons
> > mb()  # A
> > READ stopped
> > rmb() # C
> > READ prod, cons
> 
> What would the extra rmb() get you? The mb() will have already flushed
> out any writes and if stopped is set the tail should have already been
> written before setting it.
> 
> One other thing to keep in mind is that the wake gives itself a pretty
> good runway. We are talking about enough to transmit at least 2
> frames. So if another consumer is stopping it we aren't waking it
> unless there is enough space for yet another frame after the current
> consumer.
> 
> > And on the producer side (existing):
> >
> > WRITE prod
> > READ prod, cons
> > mb()  # B
> > WRITE stopped
> > READ prod, cons
> >
> > But I'm slightly afraid to change it, it's been working for over
> > a decade :D
> 
> I wouldn't change it. The code has predated BQL in the e1000 driver
> and has been that way since the inception of it I believe in 2.6.19.

Thanks for adding me to this thread as otherwise I would've surely
missed it.

I see where the confusion is coming from.  The key is that we weren't
trying to stop every single race, because not all of them are fatal.

In particular, we tolerate the race where a wake is done when it
shouldn't be because the network stack copes with that by requeueing
the skb onto the qdisc.

So it's a trade-off.  We could make our code water-tight, but then
we would be incurring a penalty for every skb.  With our current
approach, the penalty is only incurred in the unlikely event of a
race which results in the unlucky skb being requeued.

The race that we do want to stop is a queue being stuck in a stopped
state when it shouldn't because that indeed is fatal.

Going back to the proposed helpers, we only need one mb because
that's all we need to fix the stuck/stopped queue race.

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 19:03             ` Jakub Kicinski
@ 2023-04-03 20:27               ` Alexander Duyck
  2023-04-05 22:20                 ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2023-04-03 20:27 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, Apr 3, 2023 at 12:03 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Apr 2023 11:11:35 -0700 Alexander Duyck wrote:
> > On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > I don't think in terms of flushes. Let me add line numbers to the
> > > producer and the consumer.
> > >
> > >  c1. WRITE cons
> > >  c2. mb()  # A
> > >  c3. READ stopped
> > >  c4. rmb() # C
> > >  c5. READ prod, cons
> > >
> > >  p1. WRITE prod
> > >  p2. READ prod, cons
> > >  p3. mb()  # B
> > >  p4. WRITE stopped
> > >  p5. READ prod, cons
> > >
> > > The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> > > orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..
> >
> > So which function is supposed to be consumer vs producer here?
>
> producer is xmit consumer is NAPI
>
> > I think your write stopped is on the wrong side of the memory barrier.
> > It should be writing prod and stopped both before the barrier.
>
> Indeed, Paul pointed out over chat that we need two barriers there
> to be correct :( Should be fine in practice, first one is BQL,
> second one is on the slow path.
>
> > The maybe/try stop should essentially be:
> > 1. write tail
> > 2. read prod/cons
> > 3. if unused >= 1x packet
> > 3.a return
> >
> > 4. set stop
> > 5. mb()
> > 6. Re-read prod/cons
> > 7. if unused >= 1x packet
> > 7.a. test_and_clear stop
> >
> > The maybe/try wake would be:
> > 1. write head
> > 2. read prod/cons
> > 3. if consumed == 0 || unused < 2x packet
> > 3.a. return
> >
> > 4. mb()
> > 5. test_and_clear stop
> >
> > > > One other thing to keep in mind is that the wake gives itself a pretty
> > > > good runway. We are talking about enough to transmit at least 2
> > > > frames. So if another consumer is stopping it we aren't waking it
> > > > unless there is enough space for yet another frame after the current
> > > > consumer.
> > >
> > > Ack, the race is very unlikely, basically the completing CPU would have
> > > to take an expensive IRQ between checking the descriptor count and
> > > checking if stopped -- to let the sending CPU queue multiple frames.
> > >
> > > But in theory the race is there, right?
> >
> > I don't think this is so much a race as a skid. Specifically when we
> > wake the queue it will only run for one more packet in such a
> > scenario. I think it is being run more like a flow control threshold
> > rather than some sort of lock.
> >
> > I think I see what you are getting at though. Basically if the xmit
> > function were to cycle several times between steps 3.a and 4 in the
> > maybe/try wake it could fill the queue and then trigger the wake even
> > though the queue is full and the unused space was already consumed.
>
> Yup, exactly. So we either need to sprinkle a couple more barriers
> and tests in, or document that the code is only 99.999999% safe
> against false positive restarts and drivers need to check for ring
> full at the beginning of xmit.
>
> I'm quite tempted to add the barriers, because on the NAPI/consumer
> side we could use this as an opportunity to start piggy backing on
> the BQL barrier.

The thing is the more barriers we add the more it will hurt
performance. I'd be tempted to just increase the runway we have as we
could afford a 1 packet skid if we had a 2 packet runway for the
start/stop thresholds.

I suspect that is probably why we haven't seen any issues as the
DESC_NEEDED is pretty generous since it is assuming worst case
scenarios.


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 18:11           ` Alexander Duyck
@ 2023-04-03 19:03             ` Jakub Kicinski
  2023-04-03 20:27               ` Alexander Duyck
  0 siblings, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-03 19:03 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, 3 Apr 2023 11:11:35 -0700 Alexander Duyck wrote:
> On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > I don't think in terms of flushes. Let me add line numbers to the
> > producer and the consumer.
> >
> >  c1. WRITE cons
> >  c2. mb()  # A
> >  c3. READ stopped
> >  c4. rmb() # C
> >  c5. READ prod, cons
> >
> >  p1. WRITE prod
> >  p2. READ prod, cons
> >  p3. mb()  # B
> >  p4. WRITE stopped
> >  p5. READ prod, cons
> >
> > The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> > orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..  
> 
> So which function is supposed to be consumer vs producer here? 

producer is xmit consumer is NAPI

> I think your write stopped is on the wrong side of the memory barrier. 
> It should be writing prod and stopped both before the barrier.

Indeed, Paul pointed out over chat that we need two barriers there 
to be correct :( Should be fine in practice, first one is BQL,
second one is on the slow path.

> The maybe/try stop should essentially be:
> 1. write tail
> 2. read prod/cons
> 3. if unused >= 1x packet
> 3.a return
> 
> 4. set stop
> 5. mb()
> 6. Re-read prod/cons
> 7. if unused >= 1x packet
> 7.a. test_and_clear stop
> 
> The maybe/try wake would be:
> 1. write head
> 2. read prod/cons
> 3. if consumed == 0 || unused < 2x packet
> 3.a. return
> 
> 4. mb()
> 5. test_and_clear stop
> 
> > > One other thing to keep in mind is that the wake gives itself a pretty
> > > good runway. We are talking about enough to transmit at least 2
> > > frames. So if another consumer is stopping it we aren't waking it
> > > unless there is enough space for yet another frame after the current
> > > consumer.  
> >
> > Ack, the race is very unlikely, basically the completing CPU would have
> > to take an expensive IRQ between checking the descriptor count and
> > checking if stopped -- to let the sending CPU queue multiple frames.
> >
> > But in theory the race is there, right?  
> 
> I don't think this is so much a race as a skid. Specifically when we
> wake the queue it will only run for one more packet in such a
> scenario. I think it is being run more like a flow control threshold
> rather than some sort of lock.
> 
> I think I see what you are getting at though. Basically if the xmit
> function were to cycle several times between steps 3.a and 4 in the
> maybe/try wake it could fill the queue and then trigger the wake even
> though the queue is full and the unused space was already consumed.

Yup, exactly. So we either need to sprinkle a couple more barriers 
and tests in, or document that the code is only 99.999999% safe 
against false positive restarts and drivers need to check for ring
full at the beginning of xmit.

I'm quite tempted to add the barriers, because on the NAPI/consumer
side we could use this as an opportunity to start piggy backing on
the BQL barrier.


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 15:56         ` Jakub Kicinski
@ 2023-04-03 18:11           ` Alexander Duyck
  2023-04-03 19:03             ` Jakub Kicinski
  0 siblings, 1 reply; 45+ messages in thread
From: Alexander Duyck @ 2023-04-03 18:11 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, Apr 3, 2023 at 8:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Apr 2023 08:18:04 -0700 Alexander Duyck wrote:
> > On Sat, Apr 1, 2023 at 11:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > > One more question: Don't we need a read memory barrier here to ensure
> > > > get_desc is up-to-date?
> > >
> > > CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> > > and sparse CC list.. :|
> > >
> > > I was thinking about this too yesterday. AFAICT this implementation
> > > could indeed result in waking even tho the queue is full on non-x86.
> > > That's why the drivers have an extra check at the start of .xmit? :(
> >
> > The extra check at the start is more historical than anything else.
> > Logic like that has been there since the e1000 days. I think it
> > addressed items like pktgen which I think didn't make use of the
stop/wake flags way back when. I'll add in Herbert who was the original
> > author for this code so he can add some additional history if needed.
>
> Thanks for the pointer, you weren't kidding with the 2.6.19, that seems
> to be when the code was added to e1000 :) Looks fairly similar to the
> current code minus the BQL.
>
> > > I *think* that the right ordering would be:
> > >
> > > c1. WRITE cons
> > > c2. mb()  # A
> > > c3. READ stopped
> > > c4. rmb() # C
> > > c5. READ prod, cons
> >
> > What would the extra rmb() get you? The mb() will have already flushed
> > out any writes and if stopped is set the tail should have already been
> > written before setting it.
>
> I don't think in terms of flushes. Let me add line numbers to the
> producer and the consumer.
>
>  c1. WRITE cons
>  c2. mb()  # A
>  c3. READ stopped
>  c4. rmb() # C
>  c5. READ prod, cons
>
>  p1. WRITE prod
>  p2. READ prod, cons
>  p3. mb()  # B
>  p4. WRITE stopped
>  p5. READ prod, cons
>
> The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
> orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..

So which function is supposed to be consumer vs producer here? I think
your write stopped is on the wrong side of the memory barrier. It
should be writing prod and stopped both before the barrier.

The maybe/try stop should essentially be:
1. write tail
2. read prod/cons
3. if unused >= 1x packet
3.a return

4. set stop
5. mb()
6. Re-read prod/cons
7. if unused >= 1x packet
7.a. test_and_clear stop

The maybe/try wake would be:
1. write head
2. read prod/cons
3. if consumed == 0 || unused < 2x packet
3.a. return

4. mb()
5. test_and_clear stop

> > One other thing to keep in mind is that the wake gives itself a pretty
> > good runway. We are talking about enough to transmit at least 2
> > frames. So if another consumer is stopping it we aren't waking it
> > unless there is enough space for yet another frame after the current
> > consumer.
>
> Ack, the race is very unlikely, basically the completing CPU would have
> to take an expensive IRQ between checking the descriptor count and
> checking if stopped -- to let the sending CPU queue multiple frames.
>
> But in theory the race is there, right?

I don't think this is so much a race as a skid. Specifically when we
wake the queue it will only run for one more packet in such a
scenario. I think it is being run more like a flow control threshold
rather than some sort of lock.

I think I see what you are getting at though. Basically if the xmit
function were to cycle several times between steps 3.a and 4 in the
maybe/try wake it could fill the queue and then trigger the wake even
though the queue is full and the unused space was already consumed.



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-03 15:18       ` Alexander Duyck
@ 2023-04-03 15:56         ` Jakub Kicinski
  2023-04-03 18:11           ` Alexander Duyck
  2023-04-04  6:39         ` Herbert Xu
  1 sibling, 1 reply; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-03 15:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu,
	Paul E. McKenney

On Mon, 3 Apr 2023 08:18:04 -0700 Alexander Duyck wrote:
> On Sat, Apr 1, 2023 at 11:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > > One more question: Don't we need a read memory barrier here to ensure
> > > get_desc is up-to-date?  
> >
> > CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> > and sparse CC list.. :|
> >
> > I was thinking about this too yesterday. AFAICT this implementation
> > could indeed result in waking even tho the queue is full on non-x86.
> > That's why the drivers have an extra check at the start of .xmit? :(  
> 
> The extra check at the start is more historical than anything else.
> Logic like that has been there since the e1000 days. I think it
> addressed items like pktgen which I think didn't make use of the
> stop/wake flags way back when. I'll add in Herbert who was the original
> author for this code so he can add some additional history if needed.

Thanks for the pointer, you weren't kidding with the 2.6.19, that seems
to be when the code was added to e1000 :) Looks fairly similar to the
current code minus the BQL.

> > I *think* that the right ordering would be:
> >
> > c1. WRITE cons
> > c2. mb()  # A
> > c3. READ stopped
> > c4. rmb() # C
> > c5. READ prod, cons  
> 
> What would the extra rmb() get you? The mb() will have already flushed
> out any writes and if stopped is set the tail should have already been
> written before setting it.

I don't think in terms of flushes. Let me add line numbers to the
producer and the consumer.

 c1. WRITE cons
 c2. mb()  # A
 c3. READ stopped
 c4. rmb() # C
 c5. READ prod, cons  

 p1. WRITE prod
 p2. READ prod, cons
 p3. mb()  # B
 p4. WRITE stopped
 p5. READ prod, cons

The way I think the mb() orders c1 and c3 vs p2 and p4. The rmb()
orders c3 and c5 vs p1 and p4. Let me impenitently add Paul..

> One other thing to keep in mind is that the wake gives itself a pretty
> good runway. We are talking about enough to transmit at least 2
> frames. So if another consumer is stopping it we aren't waking it
> unless there is enough space for yet another frame after the current
> consumer.

Ack, the race is very unlikely, basically the completing CPU would have
to take an expensive IRQ between checking the descriptor count and
checking if stopped -- to let the sending CPU queue multiple frames.

But in theory the race is there, right?

> > And on the producer side (existing):
> >
> > p1. WRITE prod
> > p2. READ prod, cons
> > p3. mb()  # B
> > p4. WRITE stopped
> > p5. READ prod, cons
> >
> > But I'm slightly afraid to change it, it's been working for over
> > a decade :D  
> 
> I wouldn't change it. The code has predated BQL in the e1000 driver
> and has been that way since the inception of it I believe in 2.6.19.
> 
> > One neat thing that I noticed, which we could potentially exploit
> > if we were to touch this code is that BQL already has a smp_mb()
> > on the consumer side. So on any kernel config and driver which support
> > BQL we can use that instead of adding another barrier at #A.
> >
> > It would actually be a neat optimization because right now, AFAICT,
> > completion will fire the # A -like barrier almost every time.  
> 
> Yeah, the fact is the barrier in the wake path may actually be
> redundant if BQL is enabled. My advice is if you are wanting to get a
> better idea of how this was setup you could take a look at the e1000
> driver in the 2.6.19 kernel as that was where this code originated and
> I am pretty certain it predates anything in any of the other Intel
> drivers other than maybe e100.



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 18:58     ` Jakub Kicinski
  2023-04-01 20:41       ` Heiner Kallweit
@ 2023-04-03 15:18       ` Alexander Duyck
  2023-04-03 15:56         ` Jakub Kicinski
  2023-04-04  6:39         ` Herbert Xu
  1 sibling, 2 replies; 45+ messages in thread
From: Alexander Duyck @ 2023-04-03 15:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Heiner Kallweit, davem, netdev, edumazet, pabeni, Herbert Xu

On Sat, Apr 1, 2023 at 11:58 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 1 Apr 2023 17:18:12 +0200 Heiner Kallweit wrote:
> > > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > > +   ({                                                              \
> > > +           int _res;                                               \
> > > +                                                                   \
> > > +           _res = -1;                                              \
> >
> > One more question: Don't we need a read memory barrier here to ensure
> > get_desc is up-to-date?
>
> CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> and sparse CC list.. :|
>
> I was thinking about this too yesterday. AFAICT this implementation
> could indeed result in waking even tho the queue is full on non-x86.
> That's why the drivers have an extra check at the start of .xmit? :(

The extra check at the start is more historical than anything else.
Logic like that has been there since the e1000 days. I think it
addressed items like pktgen which I think didn't make use of the
stop/wake flags way back when. I'll add in Herbert who was the original
author for this code so he can add some additional history if needed.

> I *think* that the right ordering would be:
>
> WRITE cons
> mb()  # A
> READ stopped
> rmb() # C
> READ prod, cons

What would the extra rmb() get you? The mb() will have already flushed
out any writes and if stopped is set the tail should have already been
written before setting it.

One other thing to keep in mind is that the wake gives itself a pretty
good runway. We are talking about enough to transmit at least 2
frames. So if another consumer is stopping it we aren't waking it
unless there is enough space for yet another frame after the current
consumer.

> And on the producer side (existing):
>
> WRITE prod
> READ prod, cons
> mb()  # B
> WRITE stopped
> READ prod, cons
>
> But I'm slightly afraid to change it, it's been working for over
> a decade :D

I wouldn't change it. The code has predated BQL in the e1000 driver
and has been that way since the inception of it I believe in 2.6.19.

> One neat thing that I noticed, which we could potentially exploit
> if we were to touch this code is that BQL already has a smp_mb()
> on the consumer side. So on any kernel config and driver which support
> BQL we can use that instead of adding another barrier at #A.
>
> It would actually be a neat optimization because right now, AFAICT,
> completion will fire the # A -like barrier almost every time.

Yeah, the fact is the barrier in the wake path may actually be
redundant if BQL is enabled. My advice is if you are wanting to get a
better idea of how this was setup you could take a look at the e1000
driver in the 2.6.19 kernel as that was where this code originated and
I am pretty certain it predates anything in any of the other Intel
drivers other than maybe e100.


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 18:58     ` Jakub Kicinski
@ 2023-04-01 20:41       ` Heiner Kallweit
  2023-04-03 15:18       ` Alexander Duyck
  1 sibling, 0 replies; 45+ messages in thread
From: Heiner Kallweit @ 2023-04-01 20:41 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, Alexander Duyck

On 01.04.2023 20:58, Jakub Kicinski wrote:
> On Sat, 1 Apr 2023 17:18:12 +0200 Heiner Kallweit wrote:
>>> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
>>> +	({								\
>>> +		int _res;						\
>>> +									\
>>> +		_res = -1;						\  
>>
>> One more question: Don't we need a read memory barrier here to ensure
>> get_desc is up-to-date?
> 
> CC: Alex, maybe I should not be posting after 10pm, with the missing v2
> and sparse CC list.. :|
> 
> I was thinking about this too yesterday. AFAICT this implementation
> could indeed result in waking even tho the queue is full on non-x86.
> That's why the drivers have an extra check at the start of .xmit? :(
> 
> I *think* that the right ordering would be:
> 
> WRITE cons
> mb()  # A
> READ stopped
> rmb() # C
> READ prod, cons
> 
> And on the producer side (existing):
> 
> WRITE prod
> READ prod, cons
> mb()  # B
> WRITE stopped
> READ prod, cons
> 
> But I'm slightly afraid to change it, it's been working for over 
> a decade :D
> 
> One neat thing that I noticed, which we could potentially exploit 
> if we were to touch this code is that BQL already has a smp_mb() 
> on the consumer side. So on any kernel config and driver which support
> BQL we can use that instead of adding another barrier at #A.
> 
To me it seems ixgbe relies on this BQL-related barrier in
netdev_tx_completed_queue, maybe unintentionally.

I wonder whether we should move the smp_mb() from __netif_tx_queue_try_wake
to __netif_tx_queue_maybe_wake, before accessing get_desc.
Answer may depend on whether there's any use case where drivers would
call the "try" version directly, instead of the "maybe" version.
And, just thinking out loud, maybe add a flag to the macro to disable
the smp_mb() call, for cases where the driver has the required barrier
already. Setting this flag may look like this.

bool disable_barrier_in_may_wake = IS_ENABLED(CONFIG_BQL);

However this is quite some effort just to avoid a duplicated barrier.

> It would actually be a neat optimization because right now, AFAICT,
> completion will fire the # A -like barrier almost every time.
> 
>>> +		if (likely(get_desc > start_thrs))			\
>>> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
>>> +							 start_thrs,	\
>>> +							 down_cond);	\
>>> +		_res;							\
>>> +	})



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 15:18   ` Heiner Kallweit
@ 2023-04-01 18:58     ` Jakub Kicinski
  2023-04-01 20:41       ` Heiner Kallweit
  2023-04-03 15:18       ` Alexander Duyck
  0 siblings, 2 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-01 18:58 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: davem, netdev, edumazet, pabeni, Alexander Duyck

On Sat, 1 Apr 2023 17:18:12 +0200 Heiner Kallweit wrote:
> > +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		_res = -1;						\  
> 
> One more question: Don't we need a read memory barrier here to ensure
> get_desc is up-to-date?

CC: Alex, maybe I should not be posting after 10pm, with the missing v2
and sparse CC list.. :|

I was thinking about this too yesterday. AFAICT this implementation
could indeed result in waking even tho the queue is full on non-x86.
That's why the drivers have an extra check at the start of .xmit? :(

I *think* that the right ordering would be:

WRITE cons
mb()  # A
READ stopped
rmb() # C
READ prod, cons

And on the producer side (existing):

WRITE prod
READ prod, cons
mb()  # B
WRITE stopped
READ prod, cons

But I'm slightly afraid to change it, it's been working for over 
a decade :D

One neat thing that I noticed, which we could potentially exploit 
if we were to touch this code, is that BQL already has an smp_mb() 
on the consumer side. So on any kernel config and driver which support
BQL we can use that instead of adding another barrier at #A.

It would actually be a neat optimization because right now, AFAICT,
completion will fire the # A -like barrier almost every time.

> > +		if (likely(get_desc > start_thrs))			\
> > +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> > +							 start_thrs,	\
> > +							 down_cond);	\
> > +		_res;							\
> > +	})


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01 15:04   ` Heiner Kallweit
@ 2023-04-01 18:03     ` Jakub Kicinski
  0 siblings, 0 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-01 18:03 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: davem, netdev, edumazet, pabeni

On Sat, 1 Apr 2023 17:04:17 +0200 Heiner Kallweit wrote:
> > +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		netif_tx_stop_queue(txq);				\
> > +									\
> > +		smp_mb();						\  
> 
> Wouldn't a smp_mb__after_atomic() be sufficient here, because netif_tx_stop_queue()
> includes a set_bit()? At least on X86 this would result in a no-op.

Yup, good point, __after_atomic() should be perfectly fine. I didn't
think too much about the smp_mb() 'cause it's what existing code was
using and it's a slow path. I'll fix in v3.
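The v3 change being agreed to here can be sketched with a userspace C11 analogue. This is a hedged model, not the kernel implementation: `txq_model`, `free_desc`, and `QUEUE_STOPPED` are made-up names; the relaxed RMW stands in for netif_tx_stop_queue()'s set_bit(), and the fence for smp_mb__after_atomic(), which is a no-op on x86 because the locked RMW already acts as a full barrier.

```c
#include <assert.h>
#include <stdatomic.h>

#define QUEUE_STOPPED 0x1u

/* Illustrative stand-ins for the real netdev_queue state and the
 * driver's get_desc expression. */
struct txq_model {
	atomic_uint state;
	atomic_uint free_desc;
};

/* Models netif_tx_queue_try_stop() with the proposed barrier change.
 * Returns 0 if the queue stays stopped, -1 if it raced and re-enabled. */
static int try_stop(struct txq_model *q, unsigned int start_thrs)
{
	atomic_fetch_or_explicit(&q->state, QUEUE_STOPPED,
				 memory_order_relaxed);    /* set_bit() */
	atomic_thread_fence(memory_order_seq_cst); /* smp_mb__after_atomic() */

	/* Re-check in case another CPU just made room available. */
	if (atomic_load(&q->free_desc) >= start_thrs) {
		atomic_fetch_and_explicit(&q->state, ~QUEUE_STOPPED,
					  memory_order_relaxed);
		return -1;
	}
	return 0;
}
```

The point of the optimization is only the barrier instruction emitted, not the logic: the store/fence/recheck sequence is unchanged from the patch above.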

> > +									\
> > +		/* We need to check again in case another		\
> > +		 * CPU has just made room available.			\
> > +		 */							\
> > +		_res = 0;						\
> > +		if (unlikely(get_desc >= start_thrs)) {			\
> > +			netif_tx_start_queue(txq);			\
> > +			_res = -1;					\
> > +		}							\
> > +		_res;							\
> > +	})								\
> > +
> > +/**
> > + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> > + * @txq:	struct netdev_queue to stop/start
> > + * @get_desc:	get current number of free descriptors (see requirements below!)
> > + * @stop_thrs:	minimal number of available descriptors for queue to be left
> > + *		enabled
> > + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> > + *		equal to @stop_thrs or higher to avoid frequent waking
> > + *
> > + * All arguments may be evaluated multiple times, beware of side effects.
> > + * @get_desc must be a formula or a function call, it must always
> > + * return up-to-date information when evaluated!
> > + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> > + *
> > + * Returns:
> > + *	 0 if the queue was stopped
> > + *	 1 if the queue was left enabled
> > + *	-1 if the queue was re-enabled (raced with waking)
> > + */
> > +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> > +	({								\
> > +		int _res;						\
> > +									\
> > +		_res = 1;						\
> > +		if (unlikely(get_desc < stop_thrs))			\
> > +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> > +						       start_thrs);	\
> > +		_res;							\
> > +	})								\
> > +
> > +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \  
> 
> Maybe I miss something, but: Why the get_desc and start_thrs parameters
> if they aren't used?

Copy'n'paste fail, will fix :(


* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
  2023-04-01 15:04   ` Heiner Kallweit
@ 2023-04-01 15:18   ` Heiner Kallweit
  2023-04-01 18:58     ` Jakub Kicinski
  1 sibling, 1 reply; 45+ messages in thread
From: Heiner Kallweit @ 2023-04-01 15:18 UTC (permalink / raw)
  To: Jakub Kicinski, davem; +Cc: netdev, edumazet, pabeni

On 01.04.2023 07:12, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> v2:
>  - really flip the unlikely into a likely in __netif_tx_queue_maybe_wake()
>  - convert if / else into pre-init of _ret
> v1: https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
> ---
>  include/net/netdev_queues.h | 167 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..d050eb5e5bea
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,167 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up
> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\
> +									\
> > +		/* We need to check again in case another		\
> > +		 * CPU has just made room available.			\
> +		 */							\
> +		_res = 0;						\
> +		if (unlikely(get_desc >= start_thrs)) {			\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +
> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> + *
> + * Returns:
> + *	 0 if the queue was stopped
> + *	 1 if the queue was left enabled
> + *	-1 if the queue was re-enabled (raced with waking)
> + */
> +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> +	({								\
> +		int _res;						\
> +									\
> +		_res = 1;						\
> +		if (unlikely(get_desc < stop_thrs))			\
> +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> +						       start_thrs);	\
> +		_res;							\
> +	})								\
> +
> +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		/* Make sure that anybody stopping the queue after	\
> +		 * this sees the new next_to_clean.			\
> +		 */							\
> +		smp_mb();						\
> +		_res = 1;						\
> +		if (unlikely(netif_tx_queue_stopped(txq)) && !(down_cond)) { \
> +			netif_tx_wake_queue(txq);			\
> +			_res = 0;					\
> +		}							\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
> +
> +/**
> + * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @start_thrs:	minimal number of descriptors to re-enable the queue
> + * @down_cond:	down condition, predicate indicating that the queue should
> + *		not be woken up even if descriptors are available
> + *
> + * All arguments may be evaluated multiple times.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + *
> + * Returns:
> + *	 0 if the queue was woken up
> + *	 1 if the queue was already enabled (or disabled but @down_cond is true)
> + *	-1 if the queue was left stopped
> + */
> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		_res = -1;						\

One more question: Don't we need a read memory barrier here to ensure
get_desc is up-to-date?

> +		if (likely(get_desc > start_thrs))			\
> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> +							 start_thrs,	\
> +							 down_cond);	\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
> +
> +/* subqueue variants follow */
> +
> +#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
> +	})
> +
> +#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_maybe_stop(txq, get_desc,		\
> +					  stop_thrs, start_thrs);	\
> +	})
> +
> +#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_try_wake(txq, get_desc,		\
> +					  start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
> +	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_maybe_wake(txq, get_desc,		\
> +					    start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
> +	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#endif



* Re: [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
@ 2023-04-01 15:04   ` Heiner Kallweit
  2023-04-01 18:03     ` Jakub Kicinski
  2023-04-01 15:18   ` Heiner Kallweit
  1 sibling, 1 reply; 45+ messages in thread
From: Heiner Kallweit @ 2023-04-01 15:04 UTC (permalink / raw)
  To: Jakub Kicinski, davem; +Cc: netdev, edumazet, pabeni

On 01.04.2023 07:12, Jakub Kicinski wrote:
> A lot of drivers follow the same scheme to stop / start queues
> without introducing locks between xmit and NAPI tx completions.
> I'm guessing they all copy'n'paste each other's code.
> 
> Smaller drivers shy away from the scheme and introduce a lock
> which may cause deadlocks in netpoll.
> 
> Provide macros which encapsulate the necessary logic.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> v2:
>  - really flip the unlikely into a likely in __netif_tx_queue_maybe_wake()
>  - convert if / else into pre-init of _ret
> v1: https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
>  - perdicate -> predicate
>  - on race use start instead of wake and make a note of that
>    in the doc / comment at the start
> rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
> ---
>  include/net/netdev_queues.h | 167 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 167 insertions(+)
>  create mode 100644 include/net/netdev_queues.h
> 
> diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
> new file mode 100644
> index 000000000000..d050eb5e5bea
> --- /dev/null
> +++ b/include/net/netdev_queues.h
> @@ -0,0 +1,167 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_NET_QUEUES_H
> +#define _LINUX_NET_QUEUES_H
> +
> +#include <linux/netdevice.h>
> +
> +/* Lockless queue stopping / waking helpers.
> + *
> + * These macros are designed to safely implement stopping and waking
> + * netdev queues without full lock protection. We assume that there can
> + * be no concurrent stop attempts and no concurrent wake attempts.
> + * The try-stop should happen from the xmit handler*, while wake up
> + * should be triggered from NAPI poll context. The two may run
> + * concurrently (single producer, single consumer).
> + *
> + * All descriptor ring indexes (and other relevant shared state) must
> + * be updated before invoking the macros.
> + *
> + * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
> + *   instead of netif_tx_wake_queue()) so uses outside of the xmit
> + *   handler may lead to bugs
> + */
> +
> +#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
> +	({								\
> +		int _res;						\
> +									\
> +		netif_tx_stop_queue(txq);				\
> +									\
> +		smp_mb();						\

Wouldn't a smp_mb__after_atomic() be sufficient here, because netif_tx_stop_queue()
includes a set_bit()? At least on X86 this would result in a no-op.

> +									\
> > +		/* We need to check again in case another		\
> > +		 * CPU has just made room available.			\
> +		 */							\
> +		_res = 0;						\
> +		if (unlikely(get_desc >= start_thrs)) {			\
> +			netif_tx_start_queue(txq);			\
> +			_res = -1;					\
> +		}							\
> +		_res;							\
> +	})								\
> +
> +/**
> + * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @stop_thrs:	minimal number of available descriptors for queue to be left
> + *		enabled
> + * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
> + *		equal to @stop_thrs or higher to avoid frequent waking
> + *
> + * All arguments may be evaluated multiple times, beware of side effects.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + * Expected to be used from ndo_start_xmit, see the comment on top of the file.
> + *
> + * Returns:
> + *	 0 if the queue was stopped
> + *	 1 if the queue was left enabled
> + *	-1 if the queue was re-enabled (raced with waking)
> + */
> +#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
> +	({								\
> +		int _res;						\
> +									\
> +		_res = 1;						\
> +		if (unlikely(get_desc < stop_thrs))			\
> +			_res = netif_tx_queue_try_stop(txq, get_desc,	\
> +						       start_thrs);	\
> +		_res;							\
> +	})								\
> +
> +#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \

Maybe I miss something, but: Why the get_desc and start_thrs parameters
if they aren't used?

> +	({								\
> +		int _res;						\
> +									\
> +		/* Make sure that anybody stopping the queue after	\
> +		 * this sees the new next_to_clean.			\
> +		 */							\
> +		smp_mb();						\
> +		_res = 1;						\
> +		if (unlikely(netif_tx_queue_stopped(txq)) && !(down_cond)) { \
> +			netif_tx_wake_queue(txq);			\
> +			_res = 0;					\
> +		}							\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
> +
> +/**
> + * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
> + * @txq:	struct netdev_queue to stop/start
> + * @get_desc:	get current number of free descriptors (see requirements below!)
> + * @start_thrs:	minimal number of descriptors to re-enable the queue
> + * @down_cond:	down condition, predicate indicating that the queue should
> + *		not be woken up even if descriptors are available
> + *
> + * All arguments may be evaluated multiple times.
> + * @get_desc must be a formula or a function call, it must always
> + * return up-to-date information when evaluated!
> + *
> + * Returns:
> + *	 0 if the queue was woken up
> + *	 1 if the queue was already enabled (or disabled but @down_cond is true)
> + *	-1 if the queue was left stopped
> + */
> +#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
> +	({								\
> +		int _res;						\
> +									\
> +		_res = -1;						\
> +		if (likely(get_desc > start_thrs))			\
> +			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
> +							 start_thrs,	\
> +							 down_cond);	\
> +		_res;							\
> +	})
> +
> +#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
> +	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
> +
> +/* subqueue variants follow */
> +
> +#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
> +	})
> +
> +#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		netif_tx_queue_maybe_stop(txq, get_desc,		\
> +					  stop_thrs, start_thrs);	\
> +	})
> +
> +#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_try_wake(txq, get_desc,		\
> +					  start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
> +	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
> +	({								\
> +		struct netdev_queue *txq;				\
> +									\
> +		txq = netdev_get_tx_queue(dev, idx);			\
> +		__netif_tx_queue_maybe_wake(txq, get_desc,		\
> +					    start_thrs, down_cond);	\
> +	})
> +
> +#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
> +	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
> +
> +#endif



* [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code
  2023-04-01  5:12 [PATCH net-next 0/3] " Jakub Kicinski
@ 2023-04-01  5:12 ` Jakub Kicinski
  2023-04-01 15:04   ` Heiner Kallweit
  2023-04-01 15:18   ` Heiner Kallweit
  0 siblings, 2 replies; 45+ messages in thread
From: Jakub Kicinski @ 2023-04-01  5:12 UTC (permalink / raw)
  To: davem; +Cc: netdev, edumazet, pabeni, Jakub Kicinski

A lot of drivers follow the same scheme to stop / start queues
without introducing locks between xmit and NAPI tx completions.
I'm guessing they all copy'n'paste each other's code.

Smaller drivers shy away from the scheme and introduce a lock
which may cause deadlocks in netpoll.

Provide macros which encapsulate the necessary logic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
v2:
 - really flip the unlikely into a likely in __netif_tx_queue_maybe_wake()
 - convert if / else into pre-init of _ret
v1: https://lore.kernel.org/all/20230322233028.269410-1-kuba@kernel.org/
 - perdicate -> predicate
 - on race use start instead of wake and make a note of that
   in the doc / comment at the start
rfc: https://lore.kernel.org/all/20230311050130.115138-1-kuba@kernel.org/
---
 include/net/netdev_queues.h | 167 ++++++++++++++++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 include/net/netdev_queues.h

diff --git a/include/net/netdev_queues.h b/include/net/netdev_queues.h
new file mode 100644
index 000000000000..d050eb5e5bea
--- /dev/null
+++ b/include/net/netdev_queues.h
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_NET_QUEUES_H
+#define _LINUX_NET_QUEUES_H
+
+#include <linux/netdevice.h>
+
+/* Lockless queue stopping / waking helpers.
+ *
+ * These macros are designed to safely implement stopping and waking
+ * netdev queues without full lock protection. We assume that there can
+ * be no concurrent stop attempts and no concurrent wake attempts.
+ * The try-stop should happen from the xmit handler*, while wake up
+ * should be triggered from NAPI poll context. The two may run
+ * concurrently (single producer, single consumer).
+ *
+ * All descriptor ring indexes (and other relevant shared state) must
+ * be updated before invoking the macros.
+ *
+ * * the try-stop side does not reschedule Tx (netif_tx_start_queue()
+ *   instead of netif_tx_wake_queue()) so uses outside of the xmit
+ *   handler may lead to bugs
+ */
+
+#define netif_tx_queue_try_stop(txq, get_desc, start_thrs)		\
+	({								\
+		int _res;						\
+									\
+		netif_tx_stop_queue(txq);				\
+									\
+		smp_mb();						\
+									\
+		/* We need to check again in case another		\
+		 * CPU has just made room available.			\
+		 */							\
+		_res = 0;						\
+		if (unlikely(get_desc >= start_thrs)) {			\
+			netif_tx_start_queue(txq);			\
+			_res = -1;					\
+		}							\
+		_res;							\
+	})								\
+
+/**
+ * netif_tx_queue_maybe_stop() - locklessly stop a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @stop_thrs:	minimal number of available descriptors for queue to be left
+ *		enabled
+ * @start_thrs:	minimal number of descriptors to re-enable the queue, can be
+ *		equal to @stop_thrs or higher to avoid frequent waking
+ *
+ * All arguments may be evaluated multiple times, beware of side effects.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ * Expected to be used from ndo_start_xmit, see the comment on top of the file.
+ *
+ * Returns:
+ *	 0 if the queue was stopped
+ *	 1 if the queue was left enabled
+ *	-1 if the queue was re-enabled (raced with waking)
+ */
+#define netif_tx_queue_maybe_stop(txq, get_desc, stop_thrs, start_thrs)	\
+	({								\
+		int _res;						\
+									\
+		_res = 1;						\
+		if (unlikely(get_desc < stop_thrs))			\
+			_res = netif_tx_queue_try_stop(txq, get_desc,	\
+						       start_thrs);	\
+		_res;							\
+	})								\
+
+#define __netif_tx_queue_try_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		/* Make sure that anybody stopping the queue after	\
+		 * this sees the new next_to_clean.			\
+		 */							\
+		smp_mb();						\
+		_res = 1;						\
+		if (unlikely(netif_tx_queue_stopped(txq)) && !(down_cond)) { \
+			netif_tx_wake_queue(txq);			\
+			_res = 0;					\
+		}							\
+		_res;							\
+	})
+
+#define netif_tx_queue_try_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_try_wake(txq, get_desc, start_thrs, false)
+
+/**
+ * __netif_tx_queue_maybe_wake() - locklessly wake a Tx queue, if needed
+ * @txq:	struct netdev_queue to stop/start
+ * @get_desc:	get current number of free descriptors (see requirements below!)
+ * @start_thrs:	minimal number of descriptors to re-enable the queue
+ * @down_cond:	down condition, predicate indicating that the queue should
+ *		not be woken up even if descriptors are available
+ *
+ * All arguments may be evaluated multiple times.
+ * @get_desc must be a formula or a function call, it must always
+ * return up-to-date information when evaluated!
+ *
+ * Returns:
+ *	 0 if the queue was woken up
+ *	 1 if the queue was already enabled (or disabled but @down_cond is true)
+ *	-1 if the queue was left stopped
+ */
+#define __netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, down_cond) \
+	({								\
+		int _res;						\
+									\
+		_res = -1;						\
+		if (likely(get_desc > start_thrs))			\
+			_res = __netif_tx_queue_try_wake(txq, get_desc,	\
+							 start_thrs,	\
+							 down_cond);	\
+		_res;							\
+	})
+
+#define netif_tx_queue_maybe_wake(txq, get_desc, start_thrs)		\
+	__netif_tx_queue_maybe_wake(txq, get_desc, start_thrs, false)
+
+/* subqueue variants follow */
+
+#define netif_subqueue_try_stop(dev, idx, get_desc, start_thrs)		\
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_try_stop(txq, get_desc, start_thrs);	\
+	})
+
+#define netif_subqueue_maybe_stop(dev, idx, get_desc, stop_thrs, start_thrs) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		netif_tx_queue_maybe_stop(txq, get_desc,		\
+					  stop_thrs, start_thrs);	\
+	})
+
+#define __netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_try_wake(txq, get_desc,		\
+					  start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_try_wake(dev, idx, get_desc, start_thrs)		\
+	__netif_subqueue_try_wake(dev, idx, get_desc, start_thrs, false)
+
+#define __netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, down_cond) \
+	({								\
+		struct netdev_queue *txq;				\
+									\
+		txq = netdev_get_tx_queue(dev, idx);			\
+		__netif_tx_queue_maybe_wake(txq, get_desc,		\
+					    start_thrs, down_cond);	\
+	})
+
+#define netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs)	\
+	__netif_subqueue_maybe_wake(dev, idx, get_desc, start_thrs, false)
+
+#endif
-- 
2.39.2



end of thread, other threads:[~2023-04-07  1:21 UTC | newest]

Thread overview: 45+ messages
2023-03-22 23:30 [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Jakub Kicinski
2023-03-22 23:30 ` [PATCH net-next 2/3] ixgbe: use new queue try_stop/try_wake macros Jakub Kicinski
2023-03-22 23:30 ` [PATCH net-next 3/3] bnxt: " Jakub Kicinski
2023-03-23  0:35 ` [PATCH net-next 1/3] net: provide macros for commonly copied lockless queue stop/wake code Andrew Lunn
2023-03-23  1:04   ` Jakub Kicinski
2023-03-23  1:04     ` [Intel-wired-lan] " Jakub Kicinski
2023-03-23 21:02     ` Andrew Lunn
2023-03-23 21:02       ` [Intel-wired-lan] " Andrew Lunn
2023-03-23 22:46       ` Jakub Kicinski
2023-03-23 22:46         ` [Intel-wired-lan] " Jakub Kicinski
2023-03-23  3:05 ` Yunsheng Lin
2023-03-23  3:27   ` Jakub Kicinski
2023-03-23  4:53 ` Pavan Chebbi
2023-03-23  5:08   ` Jakub Kicinski
2023-03-23 16:05 ` Alexander H Duyck
2023-03-24  3:09   ` Jakub Kicinski
2023-03-24 15:45     ` Alexander Duyck
2023-03-24 21:28       ` Jakub Kicinski
2023-03-26 21:23         ` Alexander Duyck
2023-03-29  0:56           ` Jakub Kicinski
2023-03-30 14:56             ` Paolo Abeni
2023-04-01  5:12 [PATCH net-next 0/3] " Jakub Kicinski
2023-04-01  5:12 ` [PATCH net-next 1/3] " Jakub Kicinski
2023-04-01 15:04   ` Heiner Kallweit
2023-04-01 18:03     ` Jakub Kicinski
2023-04-01 15:18   ` Heiner Kallweit
2023-04-01 18:58     ` Jakub Kicinski
2023-04-01 20:41       ` Heiner Kallweit
2023-04-03 15:18       ` Alexander Duyck
2023-04-03 15:56         ` Jakub Kicinski
2023-04-03 18:11           ` Alexander Duyck
2023-04-03 19:03             ` Jakub Kicinski
2023-04-03 20:27               ` Alexander Duyck
2023-04-05 22:20                 ` Paul E. McKenney
2023-04-06  5:15                   ` Herbert Xu
2023-04-06 14:17                     ` Paul E. McKenney
2023-04-06 14:46                       ` Jakub Kicinski
2023-04-06 15:45                         ` Paul E. McKenney
2023-04-06 15:56                           ` Jakub Kicinski
2023-04-06 16:25                             ` Paul E. McKenney
2023-04-07  0:58                         ` Herbert Xu
2023-04-07  1:03                           ` Jakub Kicinski
2023-04-07  1:14                             ` Herbert Xu
2023-04-07  1:21                               ` Jakub Kicinski
2023-04-04  6:39         ` Herbert Xu
2023-04-04 22:36           ` Jakub Kicinski
