linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH for-rc 0/2] AIP fixes
@ 2020-06-23 20:43 Dennis Dalessandro
  2020-06-23 20:43 ` [PATCH for-rc 1/2] IB/hfi1: Correct -EBUSY handling in tx code Dennis Dalessandro
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Dennis Dalessandro @ 2020-06-23 20:43 UTC (permalink / raw)
  To: jgg, dledford; +Cc: linux-rdma

Two fixes for AIP which was accepted this cycle.
---

Mike Marciniszyn (2):
      IB/hfi1: Correct -EBUSY handling in tx code
      IB/hfi1: Add atomic triggered sleep/wakeup


 drivers/infiniband/hw/hfi1/ipoib.h    |    6 ++
 drivers/infiniband/hw/hfi1/ipoib_tx.c |  102 +++++++++++++++++++++------------
 2 files changed, 71 insertions(+), 37 deletions(-)

--
-Denny

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH for-rc 1/2] IB/hfi1: Correct -EBUSY handling in tx code
  2020-06-23 20:43 [PATCH for-rc 0/2] AIP fixes Dennis Dalessandro
@ 2020-06-23 20:43 ` Dennis Dalessandro
  2020-06-23 20:43 ` [PATCH for-rc 2/2] IB/hfi1: Add atomic triggered sleep/wakeup Dennis Dalessandro
  2020-06-24 19:14 ` [PATCH for-rc 0/2] AIP fixes Jason Gunthorpe
  2 siblings, 0 replies; 4+ messages in thread
From: Dennis Dalessandro @ 2020-06-23 20:43 UTC (permalink / raw)
  To: jgg, dledford; +Cc: linux-rdma, Mike Marciniszyn, Kaike Wan

From: Mike Marciniszyn <mike.marciniszyn@intel.com>

The current code mishandles -EBUSY in two ways:
- The flow change doesn't test the return from the flush
  and runs on to process the current packet racing
  with the wakeup processing
- The -EBUSY handling for a single packet inserts the
  tx into the txlist after the submit call, racing
  with the same wakeup processing

Fix the first by dropping the skb and returning NETDEV_TX_OK.

Fix the second by insuring the the list entry within the txreq
is inited when allocated.   This enables the sleep routine to
detect that the txreq has used the non-list api and queue the packet
to the txlist.

Both flaws can lead to having the flushing thread executing in
causing two threads to manipulate the txlist.

Fixes: d99dc602e2a5 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/ipoib_tx.c |   35 +++++++++++++++++----------------
 1 file changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/ipoib_tx.c b/drivers/infiniband/hw/hfi1/ipoib_tx.c
index 175290c5..76dcb56 100644
--- a/drivers/infiniband/hw/hfi1/ipoib_tx.c
+++ b/drivers/infiniband/hw/hfi1/ipoib_tx.c
@@ -369,6 +369,7 @@ static struct ipoib_txreq *hfi1_ipoib_send_dma_common(struct net_device *dev,
 	tx->priv = priv;
 	tx->txq = txp->txq;
 	tx->skb = skb;
+	INIT_LIST_HEAD(&tx->txreq.list);
 
 	hfi1_ipoib_build_ib_tx_headers(tx, txp);
 
@@ -469,6 +470,7 @@ static int hfi1_ipoib_send_dma_single(struct net_device *dev,
 
 	ret = hfi1_ipoib_submit_tx(txq, tx);
 	if (likely(!ret)) {
+tx_ok:
 		trace_sdma_output_ibhdr(tx->priv->dd,
 					&tx->sdma_hdr.hdr,
 					ib_is_sc5(txp->flow.sc5));
@@ -478,20 +480,8 @@ static int hfi1_ipoib_send_dma_single(struct net_device *dev,
 
 	txq->pkts_sent = false;
 
-	if (ret == -EBUSY) {
-		list_add_tail(&tx->txreq.list, &txq->tx_list);
-
-		trace_sdma_output_ibhdr(tx->priv->dd,
-					&tx->sdma_hdr.hdr,
-					ib_is_sc5(txp->flow.sc5));
-		hfi1_ipoib_check_queue_depth(txq);
-		return NETDEV_TX_OK;
-	}
-
-	if (ret == -ECOMM) {
-		hfi1_ipoib_check_queue_depth(txq);
-		return NETDEV_TX_OK;
-	}
+	if (ret == -EBUSY || ret == -ECOMM)
+		goto tx_ok;
 
 	sdma_txclean(priv->dd, &tx->txreq);
 	dev_kfree_skb_any(skb);
@@ -509,9 +499,17 @@ static int hfi1_ipoib_send_dma_list(struct net_device *dev,
 	struct ipoib_txreq *tx;
 
 	/* Has the flow change ? */
-	if (txq->flow.as_int != txp->flow.as_int)
-		(void)hfi1_ipoib_flush_tx_list(dev, txq);
-
+	if (txq->flow.as_int != txp->flow.as_int) {
+		int ret;
+
+		ret = hfi1_ipoib_flush_tx_list(dev, txq);
+		if (unlikely(ret)) {
+			if (ret == -EBUSY)
+				++dev->stats.tx_dropped;
+			dev_kfree_skb_any(skb);
+			return NETDEV_TX_OK;
+		}
+	}
 	tx = hfi1_ipoib_send_dma_common(dev, skb, txp);
 	if (IS_ERR(tx)) {
 		int ret = PTR_ERR(tx);
@@ -612,6 +610,9 @@ static int hfi1_ipoib_sdma_sleep(struct sdma_engine *sde,
 
 		netif_stop_subqueue(txq->priv->netdev, txq->q_idx);
 
+		if (list_empty(&txreq->list))
+			/* came from non-list submit */
+			list_add_tail(&txreq->list, &txq->tx_list);
 		if (list_empty(&txq->wait.list))
 			iowait_queue(pkts_sent, wait->iow, &sde->dmawait);
 


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH for-rc 2/2] IB/hfi1: Add atomic triggered sleep/wakeup
  2020-06-23 20:43 [PATCH for-rc 0/2] AIP fixes Dennis Dalessandro
  2020-06-23 20:43 ` [PATCH for-rc 1/2] IB/hfi1: Correct -EBUSY handling in tx code Dennis Dalessandro
@ 2020-06-23 20:43 ` Dennis Dalessandro
  2020-06-24 19:14 ` [PATCH for-rc 0/2] AIP fixes Jason Gunthorpe
  2 siblings, 0 replies; 4+ messages in thread
From: Dennis Dalessandro @ 2020-06-23 20:43 UTC (permalink / raw)
  To: jgg, dledford; +Cc: linux-rdma, Mike Marciniszyn, Kaike Wan

From: Mike Marciniszyn <mike.marciniszyn@intel.com>

When running iperf in a two host configuration the
following trace can occur:

[  319.728730] NETDEV WATCHDOG: ib0 (hfi1): transmit queue 0 timed out

The issue happens because the current implementation relies on the
netif txq being stopped to control the flushing of the tx list.

There are two resources that the transmit logic can
wait on and stop the txq:
- SDMA descriptors
- Ring space to hold completions

The ring space is tested on the sending side and relieved
when the ring is consumed in the napi tx reaping.

Unfortunately, that reaping can run conncurrently with the
workqueue flushing of the txlist.   If the txq is started
just before the workitem executes, the txlist will never be
flushed, leading to the txq being stuck.

Fix by:
- Adding sleep/wakeup wrappers
  * Use an atomic to control the call to the netif routines
    inside the wrappers
- Use another atomic to record ring space exhaustion
  * Only wakeup when the a ring space exhaustion has
    happened and it relieved

Add additional wrappers to clarify the ring space
resource handling.

Fixes: d99dc602e2a5 ("IB/hfi1: Add functions to transmit datagram ipoib packets")
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/ipoib.h    |    6 +++
 drivers/infiniband/hw/hfi1/ipoib_tx.c |   67 +++++++++++++++++++++++----------
 2 files changed, 53 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/ipoib.h b/drivers/infiniband/hw/hfi1/ipoib.h
index 185c9b0..b8c9d0a 100644
--- a/drivers/infiniband/hw/hfi1/ipoib.h
+++ b/drivers/infiniband/hw/hfi1/ipoib.h
@@ -67,6 +67,9 @@ struct hfi1_ipoib_circ_buf {
  * @sde: sdma engine
  * @tx_list: tx request list
  * @sent_txreqs: count of txreqs posted to sdma
+ * @stops: count of stops of queue
+ * @ring_full: ring has been filled
+ * @no_desc: descriptor shortage seen
  * @flow: tracks when list needs to be flushed for a flow change
  * @q_idx: ipoib Tx queue index
  * @pkts_sent: indicator packets have been sent from this queue
@@ -80,6 +83,9 @@ struct hfi1_ipoib_txq {
 	struct sdma_engine *sde;
 	struct list_head tx_list;
 	u64 sent_txreqs;
+	atomic_t stops;
+	atomic_t ring_full;
+	atomic_t no_desc;
 	union hfi1_ipoib_flow flow;
 	u8 q_idx;
 	bool pkts_sent;
diff --git a/drivers/infiniband/hw/hfi1/ipoib_tx.c b/drivers/infiniband/hw/hfi1/ipoib_tx.c
index 76dcb56..9df292b5 100644
--- a/drivers/infiniband/hw/hfi1/ipoib_tx.c
+++ b/drivers/infiniband/hw/hfi1/ipoib_tx.c
@@ -55,23 +55,48 @@ static u64 hfi1_ipoib_txreqs(const u64 sent, const u64 completed)
 	return sent - completed;
 }
 
-static void hfi1_ipoib_check_queue_depth(struct hfi1_ipoib_txq *txq)
+static u64 hfi1_ipoib_used(struct hfi1_ipoib_txq *txq)
+{
+	return hfi1_ipoib_txreqs(txq->sent_txreqs,
+				 atomic64_read(&txq->complete_txreqs));
+}
+
+static void hfi1_ipoib_stop_txq(struct hfi1_ipoib_txq *txq)
 {
-	if (unlikely(hfi1_ipoib_txreqs(++txq->sent_txreqs,
-				       atomic64_read(&txq->complete_txreqs)) >=
-	    min_t(unsigned int, txq->priv->netdev->tx_queue_len,
-		  txq->tx_ring.max_items - 1)))
+	if (atomic_inc_return(&txq->stops) == 1)
 		netif_stop_subqueue(txq->priv->netdev, txq->q_idx);
 }
 
+static void hfi1_ipoib_wake_txq(struct hfi1_ipoib_txq *txq)
+{
+	if (atomic_dec_and_test(&txq->stops))
+		netif_wake_subqueue(txq->priv->netdev, txq->q_idx);
+}
+
+static uint hfi1_ipoib_ring_hwat(struct hfi1_ipoib_txq *txq)
+{
+	return min_t(uint, txq->priv->netdev->tx_queue_len,
+		     txq->tx_ring.max_items - 1);
+}
+
+static uint hfi1_ipoib_ring_lwat(struct hfi1_ipoib_txq *txq)
+{
+	return min_t(uint, txq->priv->netdev->tx_queue_len,
+		     txq->tx_ring.max_items) >> 1;
+}
+
+static void hfi1_ipoib_check_queue_depth(struct hfi1_ipoib_txq *txq)
+{
+	++txq->sent_txreqs;
+	if (hfi1_ipoib_used(txq) >= hfi1_ipoib_ring_hwat(txq) &&
+	    !atomic_xchg(&txq->ring_full, 1))
+		hfi1_ipoib_stop_txq(txq);
+}
+
 static void hfi1_ipoib_check_queue_stopped(struct hfi1_ipoib_txq *txq)
 {
 	struct net_device *dev = txq->priv->netdev;
 
-	/* If the queue is already running just return */
-	if (likely(!__netif_subqueue_stopped(dev, txq->q_idx)))
-		return;
-
 	/* If shutting down just return as queue state is irrelevant */
 	if (unlikely(dev->reg_state != NETREG_REGISTERED))
 		return;
@@ -86,11 +111,9 @@ static void hfi1_ipoib_check_queue_stopped(struct hfi1_ipoib_txq *txq)
 	 * Use the minimum of the current tx_queue_len or the rings max txreqs
 	 * to protect against ring overflow.
 	 */
-	if (hfi1_ipoib_txreqs(txq->sent_txreqs,
-			      atomic64_read(&txq->complete_txreqs))
-	    < min_t(unsigned int, dev->tx_queue_len,
-		    txq->tx_ring.max_items) >> 1)
-		netif_wake_subqueue(dev, txq->q_idx);
+	if (hfi1_ipoib_used(txq) < hfi1_ipoib_ring_lwat(txq) &&
+	    atomic_xchg(&txq->ring_full, 0))
+		hfi1_ipoib_wake_txq(txq);
 }
 
 static void hfi1_ipoib_free_tx(struct ipoib_txreq *tx, int budget)
@@ -608,13 +631,14 @@ static int hfi1_ipoib_sdma_sleep(struct sdma_engine *sde,
 			return -EAGAIN;
 		}
 
-		netif_stop_subqueue(txq->priv->netdev, txq->q_idx);
-
 		if (list_empty(&txreq->list))
 			/* came from non-list submit */
 			list_add_tail(&txreq->list, &txq->tx_list);
-		if (list_empty(&txq->wait.list))
+		if (list_empty(&txq->wait.list)) {
+			if (!atomic_xchg(&txq->no_desc, 1))
+				hfi1_ipoib_stop_txq(txq);
 			iowait_queue(pkts_sent, wait->iow, &sde->dmawait);
+		}
 
 		write_sequnlock(&sde->waitlock);
 		return -EBUSY;
@@ -649,9 +673,9 @@ static void hfi1_ipoib_flush_txq(struct work_struct *work)
 	struct net_device *dev = txq->priv->netdev;
 
 	if (likely(dev->reg_state == NETREG_REGISTERED) &&
-	    likely(__netif_subqueue_stopped(dev, txq->q_idx)) &&
 	    likely(!hfi1_ipoib_flush_tx_list(dev, txq)))
-		netif_wake_subqueue(dev, txq->q_idx);
+		if (atomic_xchg(&txq->no_desc, 0))
+			hfi1_ipoib_wake_txq(txq);
 }
 
 int hfi1_ipoib_txreq_init(struct hfi1_ipoib_dev_priv *priv)
@@ -705,6 +729,9 @@ int hfi1_ipoib_txreq_init(struct hfi1_ipoib_dev_priv *priv)
 		txq->sde = NULL;
 		INIT_LIST_HEAD(&txq->tx_list);
 		atomic64_set(&txq->complete_txreqs, 0);
+		atomic_set(&txq->stops, 0);
+		atomic_set(&txq->ring_full, 0);
+		atomic_set(&txq->no_desc, 0);
 		txq->q_idx = i;
 		txq->flow.tx_queue = 0xff;
 		txq->flow.sc5 = 0xff;
@@ -770,7 +797,7 @@ static void hfi1_ipoib_drain_tx_list(struct hfi1_ipoib_txq *txq)
 		atomic64_inc(complete_txreqs);
 	}
 
-	if (hfi1_ipoib_txreqs(txq->sent_txreqs, atomic64_read(complete_txreqs)))
+	if (hfi1_ipoib_used(txq))
 		dd_dev_warn(txq->priv->dd,
 			    "txq %d not empty found %llu requests\n",
 			    txq->q_idx,


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH for-rc 0/2] AIP fixes
  2020-06-23 20:43 [PATCH for-rc 0/2] AIP fixes Dennis Dalessandro
  2020-06-23 20:43 ` [PATCH for-rc 1/2] IB/hfi1: Correct -EBUSY handling in tx code Dennis Dalessandro
  2020-06-23 20:43 ` [PATCH for-rc 2/2] IB/hfi1: Add atomic triggered sleep/wakeup Dennis Dalessandro
@ 2020-06-24 19:14 ` Jason Gunthorpe
  2 siblings, 0 replies; 4+ messages in thread
From: Jason Gunthorpe @ 2020-06-24 19:14 UTC (permalink / raw)
  To: Dennis Dalessandro; +Cc: dledford, linux-rdma

On Tue, Jun 23, 2020 at 04:43:16PM -0400, Dennis Dalessandro wrote:
> Two fixes for AIP which was accepted this cycle.
> ---
> 
> Mike Marciniszyn (2):
>       IB/hfi1: Correct -EBUSY handling in tx code
>       IB/hfi1: Add atomic triggered sleep/wakeup

Applied to for-rc

Thanks
Jason

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-06-24 19:14 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-23 20:43 [PATCH for-rc 0/2] AIP fixes Dennis Dalessandro
2020-06-23 20:43 ` [PATCH for-rc 1/2] IB/hfi1: Correct -EBUSY handling in tx code Dennis Dalessandro
2020-06-23 20:43 ` [PATCH for-rc 2/2] IB/hfi1: Add atomic triggered sleep/wakeup Dennis Dalessandro
2020-06-24 19:14 ` [PATCH for-rc 0/2] AIP fixes Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).