linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 01/18] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 02/18] ixgbe: eliminate " Sinan Kaya
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 8 ++++----
 drivers/net/ethernet/intel/i40evf/i40e_txrx.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e554aa6cf..9455869 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -185,7 +185,7 @@ static int i40e_program_fdir_filter(struct i40e_fdir_filter *fdir_data,
 	/* Mark the data descriptor to be watched */
 	first->next_to_watch = tx_desc;
 
-	writel(tx_ring->next_to_use, tx_ring->tail);
+	writel_relaxed(tx_ring->next_to_use, tx_ring->tail);
 	return 0;
 
 dma_fail:
@@ -1375,7 +1375,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2258,7 +2258,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 		 */
 		wmb();
 
-		writel(xdp_ring->next_to_use, xdp_ring->tail);
+		writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
 	}
 
 	rx_ring->skb = skb;
@@ -3286,7 +3286,7 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
 	/* notify HW of packet */
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
diff --git a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
index 357d605..56eea20 100644
--- a/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40evf/i40e_txrx.c
@@ -667,7 +667,7 @@ static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 	 * such as IA-64).
 	 */
 	wmb();
-	writel(val, rx_ring->tail);
+	writel_relaxed(val, rx_ring->tail);
 }
 
 /**
@@ -2243,7 +2243,7 @@ static inline void i40evf_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
 
 	/* notify HW of packet */
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 02/18] ixgbe: eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
  2018-03-16 16:16 ` [PATCH v3 01/18] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 03/18] igbvf: " Sinan Kaya
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0da5aa2..58ed70f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1692,7 +1692,7 @@ void ixgbe_alloc_rx_buffers(struct ixgbe_ring *rx_ring, u16 cleaned_count)
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
@@ -2453,7 +2453,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		 * know there are new descriptors to fetch.
 		 */
 		wmb();
-		writel(ring->next_to_use, ring->tail);
+		writel_relaxed(ring->next_to_use, ring->tail);
 
 		xdp_do_flush_map();
 	}
@@ -8078,7 +8078,7 @@ static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
 	ixgbe_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
@@ -10014,7 +10014,7 @@ static void ixgbe_xdp_flush(struct net_device *dev)
 	 * are new descriptors to fetch.
 	 */
 	wmb();
-	writel(ring->next_to_use, ring->tail);
+	writel_relaxed(ring->next_to_use, ring->tail);
 
 	return;
 }
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 03/18] igbvf: eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
  2018-03-16 16:16 ` [PATCH v3 01/18] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 02/18] ixgbe: eliminate " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 04/18] igb: " Sinan Kaya
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/igbvf/netdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 4214c15..edb1c34 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -251,7 +251,7 @@ static void igbvf_alloc_rx_buffers(struct igbvf_ring *rx_ring,
 		 * such as IA-64).
 		*/
 		wmb();
-		writel(i, adapter->hw.hw_addr + rx_ring->tail);
+		writel_relaxed(i, adapter->hw.hw_addr + rx_ring->tail);
 	}
 }
 
@@ -2297,7 +2297,7 @@ static inline void igbvf_tx_queue_adv(struct igbvf_adapter *adapter,
 
 	tx_ring->buffer_info[first].next_to_watch = tx_desc;
 	tx_ring->next_to_use = i;
-	writel(i, adapter->hw.hw_addr + tx_ring->tail);
+	writel_relaxed(i, adapter->hw.hw_addr + tx_ring->tail);
 	/* we need this if more than one processor can write to our tail
 	 * at a time, it synchronizes IO on IA64/Altix systems
 	 */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 04/18] igb: eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (2 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 03/18] igbvf: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 05/18] ixgbevf: keep writel() closer to wmb() Sinan Kaya
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/igb/igb_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index b88fae7..82aea92 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -5671,7 +5671,7 @@ static int igb_tx_map(struct igb_ring *tx_ring,
 	igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
 
 	if (netif_xmit_stopped(txring_txq(tx_ring)) || !skb->xmit_more) {
-		writel(i, tx_ring->tail);
+		writel_relaxed(i, tx_ring->tail);
 
 		/* we need this if more than one processor can write to our tail
 		 * at a time, it synchronizes IO on IA64/Altix systems
@@ -8072,7 +8072,7 @@ void igb_alloc_rx_buffers(struct igb_ring *rx_ring, u16 cleaned_count)
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 05/18] ixgbevf: keep writel() closer to wmb()
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (3 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 04/18] igb: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 06/18] ixgbevf: eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Remove ixgbevf_write_tail() in favor of moving writel() close to
wmb().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      | 5 -----
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index f695242..11e893e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -244,11 +244,6 @@ static inline u16 ixgbevf_desc_unused(struct ixgbevf_ring *ring)
 	return ((ntc > ntu) ? 0 : ring->count) + ntc - ntu - 1;
 }
 
-static inline void ixgbevf_write_tail(struct ixgbevf_ring *ring, u32 value)
-{
-	writel(value, ring->tail);
-}
-
 #define IXGBEVF_RX_DESC(R, i)	\
 	(&(((union ixgbe_adv_rx_desc *)((R)->desc))[i]))
 #define IXGBEVF_TX_DESC(R, i)	\
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 9b3d43d..6bf778a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
 		 * such as IA-64).
 		 */
 		wmb();
-		ixgbevf_write_tail(rx_ring, i);
+		writel(i, rx_ring->tail);
 	}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 	tx_ring->next_to_use = i;
 
 	/* notify HW of packet */
-	ixgbevf_write_tail(tx_ring, i);
+	writel(i, tx_ring->tail);
 
 	return;
 dma_error:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 06/18] ixgbevf: eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (4 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 05/18] ixgbevf: keep writel() closer to wmb() Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 07/18] drivers: net: cxgb: Eliminate " Sinan Kaya
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 6bf778a..774b2a6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -659,7 +659,7 @@ static void ixgbevf_alloc_rx_buffers(struct ixgbevf_ring *rx_ring,
 		 * such as IA-64).
 		 */
 		wmb();
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
@@ -3644,7 +3644,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 	tx_ring->next_to_use = i;
 
 	/* notify HW of packet */
-	writel(i, tx_ring->tail);
+	writel_relaxed(i, tx_ring->tail);
 
 	return;
 dma_error:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 07/18] drivers: net: cxgb: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (5 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 06/18] ixgbevf: eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 08/18] scsi: hpsa: " Sinan Kaya
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, David S. Miller,
	Kees Cook, Johannes Berg, Allen Pais, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/chelsio/cxgb/sge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb/sge.c b/drivers/net/ethernet/chelsio/cxgb/sge.c
index 30de26e..57891bd6 100644
--- a/drivers/net/ethernet/chelsio/cxgb/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb/sge.c
@@ -495,7 +495,7 @@ static struct sk_buff *sched_skb(struct sge *sge, struct sk_buff *skb,
 static inline void doorbell_pio(struct adapter *adapter, u32 val)
 {
 	wmb();
-	writel(val, adapter->regs + A_SG_DOORBELL);
+	writel_relaxed(val, adapter->regs + A_SG_DOORBELL);
 }
 
 /*
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 08/18] scsi: hpsa: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (6 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 07/18] drivers: net: cxgb: Eliminate " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 09/18] fm10k: " Sinan Kaya
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Don Brace,
	James E.J. Bottomley, Martin K. Petersen, esc.storagedev,
	linux-scsi, linux-kernel

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/scsi/hpsa.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
index 018f980..c7d7e6a 100644
--- a/drivers/scsi/hpsa.h
+++ b/drivers/scsi/hpsa.h
@@ -599,7 +599,7 @@ static unsigned long SA5_ioaccel_mode1_completed(struct ctlr_info *h, u8 q)
 		 * but with current driver design this is easiest.
 		 */
 		wmb();
-		writel((q << 24) | rq->current_entry, h->vaddr +
+		writel_relaxed((q << 24) | rq->current_entry, h->vaddr +
 				IOACCEL_MODE1_CONSUMER_INDEX);
 		atomic_dec(&h->commands_outstanding);
 	}
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 09/18] fm10k: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (7 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 08/18] scsi: hpsa: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:30   ` [Intel-wired-lan] " Alexander Duyck
  2018-03-16 16:16 ` [PATCH v3 10/18] net: qla3xxx: " Sinan Kaya
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Jeff Kirsher,
	intel-wired-lan, linux-kernel

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 8e12aae..8d04e26 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -179,7 +179,7 @@ void fm10k_alloc_rx_buffers(struct fm10k_ring *rx_ring, u16 cleaned_count)
 		wmb();
 
 		/* notify hardware of new descriptors */
-		writel(i, rx_ring->tail);
+		writel_relaxed(i, rx_ring->tail);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 10/18] net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (8 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 09/18] fm10k: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 11/18] qlcnic: " Sinan Kaya
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Dept-GELinuxNICDev,
	linux-kernel

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qla3xxx.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c b/drivers/net/ethernet/qlogic/qla3xxx.c
index 9e5264d..0e71b74 100644
--- a/drivers/net/ethernet/qlogic/qla3xxx.c
+++ b/drivers/net/ethernet/qlogic/qla3xxx.c
@@ -1858,8 +1858,8 @@ static void ql_update_small_bufq_prod_index(struct ql3_adapter *qdev)
 			qdev->small_buf_release_cnt -= 8;
 		}
 		wmb();
-		writel(qdev->small_buf_q_producer_index,
-			&port_regs->CommonRegs.rxSmallQProducerIndex);
+		writel_relaxed(qdev->small_buf_q_producer_index,
+			       &port_regs->CommonRegs.rxSmallQProducerIndex);
 	}
 }
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 11/18] qlcnic: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (9 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 10/18] net: qla3xxx: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-19 20:10   ` Chopra, Manish
  2018-03-16 16:16 ` [PATCH v3 12/18] bnx2x: " Sinan Kaya
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Harish Patil,
	Manish Chopra, Dept-GELinuxNICDev, linux-kernel

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index 46b0372..97c146e7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct qlcnic_adapter *adapter)
 	wmb();
 
 	/* clear the interrupt trigger control register */
-	writel(0, adapter->isr_int_vec);
+	writel_relaxed(0, adapter->isr_int_vec);
 	intr_val = readl(adapter->isr_int_vec);
 	do {
 		intr_val = readl(adapter->tgt_status_reg);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 12/18] bnx2x: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (10 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 11/18] qlcnic: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 13/18] net: cxgb4/cxgb4vf: " Sinan Kaya
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Ariel Elior,
	everest-linux-l2, linux-kernel

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index 76a4668..3b2f1bd 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -170,7 +170,7 @@ static int bnx2x_send_msg2pf(struct bnx2x *bp, u8 *done, dma_addr_t msg_mapping)
 	wmb();
 
 	/* Trigger the PF FW */
-	writeb(1, &zone_data->trigger.vf_pf_channel.addr_valid);
+	writeb_relaxed(1, &zone_data->trigger.vf_pf_channel.addr_valid);
 
 	/* Wait for PF to complete */
 	while ((tout >= 0) && (!*done)) {
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 13/18] net: cxgb4/cxgb4vf: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (11 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 12/18] bnx2x: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 14/18] net: cxgb3: " Sinan Kaya
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Ganesh Goudar,
	Casey Leedom, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |  6 ++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 13 +++++++------
 drivers/net/ethernet/chelsio/cxgb4/sge.c        |  8 ++++----
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |  2 +-
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h  | 14 ++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c      | 16 +++++++++-------
 6 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 9040e13..6bde0b9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1202,6 +1202,12 @@ static inline void t4_write_reg(struct adapter *adap, u32 reg_addr, u32 val)
 	writel(val, adap->regs + reg_addr);
 }
 
+static inline void t4_write_reg_relaxed(struct adapter *adap, u32 reg_addr,
+					u32 val)
+{
+	writel_relaxed(val, adap->regs + reg_addr);
+}
+
 #ifndef readq
 static inline u64 readq(const volatile void __iomem *addr)
 {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e8..276472d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -1723,8 +1723,8 @@ int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, u16 pidx,
 		else
 			val = PIDX_T5_V(delta);
 		wmb();
-		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-			     QID_V(qid) | val);
+		t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+				     QID_V(qid) | val);
 	}
 out:
 	return ret;
@@ -1902,8 +1902,9 @@ static void enable_txq_db(struct adapter *adap, struct sge_txq *q)
 		 * are committed before we tell HW about them.
 		 */
 		wmb();
-		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-			     QID_V(q->cntxt_id) | PIDX_V(q->db_pidx_inc));
+		t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+				     QID_V(q->cntxt_id) |
+						PIDX_V(q->db_pidx_inc));
 		q->db_pidx_inc = 0;
 	}
 	q->db_disabled = 0;
@@ -2003,8 +2004,8 @@ static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
 		else
 			val = PIDX_T5_V(delta);
 		wmb();
-		t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-			     QID_V(q->cntxt_id) | val);
+		t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+				     QID_V(q->cntxt_id) | val);
 	}
 out:
 	q->db_disabled = 0;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 6e310a0..1a1738a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -530,11 +530,11 @@ static inline void ring_fl_db(struct adapter *adap, struct sge_fl *q)
 		 * mechanism.
 		 */
 		if (unlikely(q->bar2_addr == NULL)) {
-			t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
-				     val | QID_V(q->cntxt_id));
+			t4_write_reg_relaxed(adap, MYPF_REG(SGE_PF_KDOORBELL_A),
+					     val | QID_V(q->cntxt_id));
 		} else {
-			writel(val | QID_V(q->bar2_qid),
-			       q->bar2_addr + SGE_UDB_KDOORBELL);
+			writel_relaxed(val | QID_V(q->bar2_qid),
+				       q->bar2_addr + SGE_UDB_KDOORBELL);
 
 			/* This Write memory Barrier will force the write to
 			 * the User Doorbell area to be flushed.
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 920bccd..8b723a0 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -139,7 +139,7 @@ void t4_write_indirect(struct adapter *adap, unsigned int addr_reg,
 {
 	while (nregs--) {
 		t4_write_reg(adap, addr_reg, start_idx++);
-		t4_write_reg(adap, data_reg, *vals++);
+		t4_write_reg_relaxed(adap, data_reg, *vals++);
 	}
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
index 5883f09..00247be4 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
@@ -442,6 +442,20 @@ static inline void t4_write_reg(struct adapter *adapter, u32 reg_addr, u32 val)
 	writel(val, adapter->regs + reg_addr);
 }
 
+/**
+ * t4_write_reg_relaxed - write a HW register without ordering guarantees
+ * @adapter: the adapter
+ * @reg_addr: the register address
+ * @val: the value to write
+ *
+ * Write a 32-bit value into the given HW register.
+ */
+static inline void t4_write_reg_relaxed(struct adapter *adapter, u32 reg_addr,
+					u32 val)
+{
+	writel_relaxed(val, adapter->regs + reg_addr);
+}
+
 #ifndef readq
 static inline u64 readq(const volatile void __iomem *addr)
 {
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index dfce5df..1d98387 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -546,12 +546,13 @@ static inline void ring_fl_db(struct adapter *adapter, struct sge_fl *fl)
 		 * mechanism.
 		 */
 		if (unlikely(fl->bar2_addr == NULL)) {
-			t4_write_reg(adapter,
-				     T4VF_SGE_BASE_ADDR + SGE_VF_KDOORBELL,
-				     QID_V(fl->cntxt_id) | val);
+			t4_write_reg_relaxed(adapter,
+					     T4VF_SGE_BASE_ADDR +
+							SGE_VF_KDOORBELL,
+					     QID_V(fl->cntxt_id) | val);
 		} else {
-			writel(val | QID_V(fl->bar2_qid),
-			       fl->bar2_addr + SGE_UDB_KDOORBELL);
+			writel_relaxed(val | QID_V(fl->bar2_qid),
+				       fl->bar2_addr + SGE_UDB_KDOORBELL);
 
 			/* This Write memory Barrier will force the write to
 			 * the User Doorbell area to be flushed.
@@ -980,8 +981,9 @@ static inline void ring_tx_db(struct adapter *adapter, struct sge_txq *tq,
 	if (unlikely(tq->bar2_addr == NULL)) {
 		u32 val = PIDX_V(n);
 
-		t4_write_reg(adapter, T4VF_SGE_BASE_ADDR + SGE_VF_KDOORBELL,
-			     QID_V(tq->cntxt_id) | val);
+		t4_write_reg_relaxed(adapter,
+				     T4VF_SGE_BASE_ADDR + SGE_VF_KDOORBELL,
+				     QID_V(tq->cntxt_id) | val);
 	} else {
 		u32 val = PIDX_T5_V(n);
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 14/18] net: cxgb3: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (12 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 13/18] net: cxgb4/cxgb4vf: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 15/18] RDMA/bnxt_re: " Sinan Kaya
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Santosh Raspatur,
	linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/net/ethernet/chelsio/cxgb3/adapter.h |  7 +++++++
 drivers/net/ethernet/chelsio/cxgb3/sge.c     | 19 ++++++++++---------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb3/adapter.h b/drivers/net/ethernet/chelsio/cxgb3/adapter.h
index 087ff0f..0e21e66 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/adapter.h
+++ b/drivers/net/ethernet/chelsio/cxgb3/adapter.h
@@ -281,6 +281,13 @@ static inline void t3_write_reg(struct adapter *adapter, u32 reg_addr, u32 val)
 	writel(val, adapter->regs + reg_addr);
 }
 
+static inline void t3_write_reg_relaxed(struct adapter *adapter, u32 reg_addr,
+					u32 val)
+{
+	CH_DBG(adapter, MMIO, "setting register 0x%x to 0x%x\n", reg_addr, val);
+	writel_relaxed(val, adapter->regs + reg_addr);
+}
+
 static inline struct port_info *adap2pinfo(struct adapter *adap, int idx)
 {
 	return netdev_priv(adap->port[idx]);
diff --git a/drivers/net/ethernet/chelsio/cxgb3/sge.c b/drivers/net/ethernet/chelsio/cxgb3/sge.c
index e988caa..0baab06 100644
--- a/drivers/net/ethernet/chelsio/cxgb3/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb3/sge.c
@@ -487,7 +487,8 @@ static inline void ring_fl_db(struct adapter *adap, struct sge_fl *q)
 	if (q->pend_cred >= q->credits / 4) {
 		q->pend_cred = 0;
 		wmb();
-		t3_write_reg(adap, A_SG_KDOORBELL, V_EGRCNTX(q->cntxt_id));
+		t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+				     V_EGRCNTX(q->cntxt_id));
 	}
 }
 
@@ -1058,8 +1059,8 @@ static inline void check_ring_tx_db(struct adapter *adap, struct sge_txq *q)
 	}
 #else
 	wmb();			/* write descriptors before telling HW */
-	t3_write_reg(adap, A_SG_KDOORBELL,
-		     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+	t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+			     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
 #endif
 }
 
@@ -1510,8 +1511,8 @@ static int ctrl_xmit(struct adapter *adap, struct sge_txq *q,
 	}
 	spin_unlock(&q->lock);
 	wmb();
-	t3_write_reg(adap, A_SG_KDOORBELL,
-		     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+	t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+			     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
 	return NET_XMIT_SUCCESS;
 }
 
@@ -1554,8 +1555,8 @@ static void restart_ctrlq(unsigned long data)
 
 	spin_unlock(&q->lock);
 	wmb();
-	t3_write_reg(qs->adap, A_SG_KDOORBELL,
-		     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+	t3_write_reg_relaxed(qs->adap, A_SG_KDOORBELL,
+			     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
 }
 
 /*
@@ -1793,8 +1794,8 @@ again:	reclaim_completed_tx(adap, q, TX_RECLAIM_CHUNK);
 #endif
 	wmb();
 	if (likely(written))
-		t3_write_reg(adap, A_SG_KDOORBELL,
-			     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
+		t3_write_reg_relaxed(adap, A_SG_KDOORBELL,
+				     F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id));
 }
 
 /**
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 15/18] RDMA/bnxt_re: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (13 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 14/18] net: cxgb3: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 16/18] IB/mlx4: " Sinan Kaya
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Selvin Xavier,
	Devesh Sharma, Somnath Kotur, Sriharsha Basavapatna,
	Doug Ledford, Jason Gunthorpe, linux-rdma, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/bnxt_re/qplib_rcfw.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
index 8329ec6..4a6b981 100644
--- a/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
+++ b/drivers/infiniband/hw/bnxt_re/qplib_rcfw.c
@@ -181,10 +181,10 @@ static int __send_message(struct bnxt_qplib_rcfw *rcfw, struct cmdq_base *req,
 
 	/* ring CMDQ DB */
 	wmb();
-	writel(cmdq_prod, rcfw->cmdq_bar_reg_iomem +
-	       rcfw->cmdq_bar_reg_prod_off);
-	writel(RCFW_CMDQ_TRIG_VAL, rcfw->cmdq_bar_reg_iomem +
-	       rcfw->cmdq_bar_reg_trig_off);
+	writel_relaxed(cmdq_prod, rcfw->cmdq_bar_reg_iomem +
+		       rcfw->cmdq_bar_reg_prod_off);
+	writel_relaxed(RCFW_CMDQ_TRIG_VAL, rcfw->cmdq_bar_reg_iomem +
+		       rcfw->cmdq_bar_reg_trig_off);
 done:
 	spin_unlock_irqrestore(&cmdq->lock, flags);
 	/* Return the CREQ response pointer */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 16/18] IB/mlx4: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (14 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 15/18] RDMA/bnxt_re: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 17/18] RDMA/i40iw: " Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Yishai Hadas,
	Doug Ledford, Jason Gunthorpe, linux-rdma, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/mlx4/qp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index f045491..74b27b0 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -3880,8 +3880,8 @@ int mlx4_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
 		 */
 		wmb();
 
-		writel(qp->doorbell_qpn,
-		       to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
+		writel_relaxed(qp->doorbell_qpn,
+			to_mdev(ibqp->device)->uar_map + MLX4_SEND_DOORBELL);
 
 		/*
 		 * Make sure doorbells don't leak out of SQ spinlock
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 17/18] RDMA/i40iw: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (15 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 16/18] IB/mlx4: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
  17 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Faisal Latif,
	Shiraz Saleem, Doug Ledford, Jason Gunthorpe, linux-rdma,
	linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/i40iw/i40iw_ctrl.c  |  6 ++++--
 drivers/infiniband/hw/i40iw/i40iw_osdep.h |  1 +
 drivers/infiniband/hw/i40iw/i40iw_uk.c    |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_utils.c | 11 +++++++++++
 4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_ctrl.c b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
index c74fd33..47f473e 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
@@ -706,9 +706,11 @@ static void i40iw_sc_ccq_arm(struct i40iw_sc_cq *ccq)
 	wmb();       /* make sure shadow area is updated before arming */
 
 	if (ccq->dev->is_pf)
-		i40iw_wr32(ccq->dev->hw, I40E_PFPE_CQARM, ccq->cq_uk.cq_id);
+		i40iw_wr32_relaxed(ccq->dev->hw, I40E_PFPE_CQARM,
+				   ccq->cq_uk.cq_id);
 	else
-		i40iw_wr32(ccq->dev->hw, I40E_VFPE_CQARM1, ccq->cq_uk.cq_id);
+		i40iw_wr32_relaxed(ccq->dev->hw, I40E_VFPE_CQARM1,
+				   ccq->cq_uk.cq_id);
 }
 
 /**
diff --git a/drivers/infiniband/hw/i40iw/i40iw_osdep.h b/drivers/infiniband/hw/i40iw/i40iw_osdep.h
index f27be3e..e06f4b9 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_osdep.h
+++ b/drivers/infiniband/hw/i40iw/i40iw_osdep.h
@@ -213,5 +213,6 @@ void i40iw_hw_stats_start_timer(struct i40iw_sc_vsi *vsi);
 void i40iw_hw_stats_stop_timer(struct i40iw_sc_vsi *vsi);
 #define i40iw_mmiowb() mmiowb()
 void i40iw_wr32(struct i40iw_hw *hw, u32 reg, u32 value);
+void i40iw_wr32_relaxed(struct i40iw_hw *hw, u32 reg, u32 value);
 u32  i40iw_rd32(struct i40iw_hw *hw, u32 reg);
 #endif				/* _I40IW_OSDEP_H_ */
diff --git a/drivers/infiniband/hw/i40iw/i40iw_uk.c b/drivers/infiniband/hw/i40iw/i40iw_uk.c
index 8afa5a6..7f0ebed 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_uk.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_uk.c
@@ -723,7 +723,7 @@ static void i40iw_cq_request_notification(struct i40iw_cq_uk *cq,
 
 	wmb(); /* make sure WQE is populated before valid bit is set */
 
-	writel(cq->cq_id, cq->cqe_alloc_reg);
+	writel_relaxed(cq->cq_id, cq->cqe_alloc_reg);
 }
 
 /**
diff --git a/drivers/infiniband/hw/i40iw/i40iw_utils.c b/drivers/infiniband/hw/i40iw/i40iw_utils.c
index ddc1056..99aa6f8 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_utils.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_utils.c
@@ -125,6 +125,17 @@ inline void i40iw_wr32(struct i40iw_hw *hw, u32 reg, u32 value)
 }
 
 /**
+ * i40iw_wr32_relaxed - write 32 bits to hw register without ordering
+ * @hw: hardware information including registers
+ * @reg: register offset
+ * @value: vvalue to write to register
+ */
+inline void i40iw_wr32_relaxed(struct i40iw_hw *hw, u32 reg, u32 value)
+{
+	writel_relaxed(value, hw->hw_addr + reg);
+}
+
+/**
  * i40iw_rd32 - read a 32 bit hw register
  * @hw: hardware information including registers
  * @reg: register offset
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
       [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
                   ` (16 preceding siblings ...)
  2018-03-16 16:16 ` [PATCH v3 17/18] RDMA/i40iw: " Sinan Kaya
@ 2018-03-16 16:16 ` Sinan Kaya
  2018-03-16 21:05   ` Steve Wise
  17 siblings, 1 reply; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:16 UTC (permalink / raw)
  To: netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Sinan Kaya, Steve Wise,
	Doug Ledford, Jason Gunthorpe, linux-rdma, linux-kernel

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
---
 drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..7a48c9e 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
 				 (u64 *)wqe);
 		} else {
 			pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-			writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-			       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+				       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
 		}
 
 		/* Flush user doorbell area writes. */
 		wmb();
 		return;
 	}
-	writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+	writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
 				 (void *)wqe);
 		} else {
 			pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-			writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-			       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+				       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
 		}
 
 		/* Flush user doorbell area writes. */
 		wmb();
 		return;
 	}
-	writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+	writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline int t4_wq_in_error(struct t4_wq *wq)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [Intel-wired-lan] [PATCH v3 09/18] fm10k: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 16:16 ` [PATCH v3 09/18] fm10k: " Sinan Kaya
@ 2018-03-16 16:30   ` Alexander Duyck
  2018-03-16 16:33     ` Sinan Kaya
  0 siblings, 1 reply; 36+ messages in thread
From: Alexander Duyck @ 2018-03-16 16:30 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Netdev, Timur Tabi, sulrich, linux-arm-msm, LKML,
	intel-wired-lan, linux-arm-kernel

On Fri, Mar 16, 2018 at 9:16 AM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Code includes wmb() followed by writel(). writel() already has a
> barrier on some architectures like arm64.
>
> This ends up CPU observing two barriers back to back before executing
> the register write.
>
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
>
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>

You can update the writel call in fm10k_tx_map as well.

Of the drivers updated in drivers/net/ethernet/intel/* it looks like
this is the only one that still requires any additional changes.

Thanks.

- Alex

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Intel-wired-lan] [PATCH v3 09/18] fm10k: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 16:30   ` [Intel-wired-lan] " Alexander Duyck
@ 2018-03-16 16:33     ` Sinan Kaya
  0 siblings, 0 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 16:33 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, Timur Tabi, sulrich, linux-arm-msm, LKML,
	intel-wired-lan, linux-arm-kernel

On 3/16/2018 12:30 PM, Alexander Duyck wrote:
> On Fri, Mar 16, 2018 at 9:16 AM, Sinan Kaya <okaya@codeaurora.org> wrote:
>> Code includes wmb() followed by writel(). writel() already has a
>> barrier on some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing
>> the register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> 
> You can update the writel call in fm10k_tx_map as well.
> 
> Of the drivers updated in drivers/net/ethernet/intel/* it looks like
> this is the only one that still requires any additional changes.

will do. thanks for the feedback.

> 
> Thanks.
> 
> - Alex
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
@ 2018-03-16 21:05   ` Steve Wise
  2018-03-16 21:46     ` Sinan Kaya
  2018-03-16 22:13     ` Jason Gunthorpe
  0 siblings, 2 replies; 36+ messages in thread
From: Steve Wise @ 2018-03-16 21:05 UTC (permalink / raw)
  To: 'Sinan Kaya', netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

> Code includes wmb() followed by writel(). writel() already has a barrier
on
> some architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>

NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
writeX().  

I was just looking at this with Chelsio developers, and they said the
writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
get rid of the extra barrier for all architectures.

Also, t4.h:pio_copy() needs to  use __raw_writeq() to enable the write
combining fastpath for ARM and PowerPC.  The code as it stands doesn't
achieve any write combining on PowerPC at least.  

And the writel()s at the end of the ring functions (the non bar2 udb path)
needs a mmiowb() afterwards if you're going to use __raw_writeX() there.
However that path is only used for very old hardware (T4), so I wouldn't
worry about them. 

Steve.


> ---
>  drivers/infiniband/hw/cxgb4/t4.h | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/cxgb4/t4.h
> b/drivers/infiniband/hw/cxgb4/t4.h
> index 8369c7c..7a48c9e 100644
> --- a/drivers/infiniband/hw/cxgb4/t4.h
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
>  				 (u64 *)wqe);
>  		} else {
>  			pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -			writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -			       wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >sq.bar2_qid),
> +				       wq->sq.bar2_va +
> SGE_UDB_KDOORBELL);
>  		}
> 
>  		/* Flush user doorbell area writes. */
>  		wmb();
>  		return;
>  	}
> -	writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +	writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
>  				 (void *)wqe);
>  		} else {
>  			pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -			writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -			       wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +			writel_relaxed(PIDX_T5_V(inc) | QID_V(wq-
> >rq.bar2_qid),
> +				       wq->rq.bar2_va +
> SGE_UDB_KDOORBELL);
>  		}
> 
>  		/* Flush user doorbell area writes. */
>  		wmb();
>  		return;
>  	}
> -	writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +	writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>  }
> 
>  static inline int t4_wq_in_error(struct t4_wq *wq)
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 21:05   ` Steve Wise
@ 2018-03-16 21:46     ` Sinan Kaya
  2018-03-16 23:05       ` Steve Wise
  2018-03-16 22:13     ` Jason Gunthorpe
  1 sibling, 1 reply; 36+ messages in thread
From: Sinan Kaya @ 2018-03-16 21:46 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On 3/16/2018 5:05 PM, Steve Wise wrote:
>> Code includes wmb() followed by writel(). writel() already has a barrier
> on
>> some architectures like arm64.
>>
>> This ends up CPU observing two barriers back to back before executing the
>> register write.
>>
>> Since code already has an explicit barrier call, changing writel() to
>> writel_relaxed().
>>
>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> 
> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> writeX().  
> 
> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

OK. I can do that but isn't the problem at PowerPC adaptation?

/*
 * We don't do relaxed operations yet, at least not with this semantic
 */
#define readb_relaxed(addr)	readb(addr)
#define readw_relaxed(addr)	readw(addr)
#define readl_relaxed(addr)	readl(addr)
#define readq_relaxed(addr)	readq(addr)
#define writeb_relaxed(v, addr)	writeb(v, addr)
#define writew_relaxed(v, addr)	writew(v, addr)
#define writel_relaxed(v, addr)	writel(v, addr)
#define writeq_relaxed(v, addr)	writeq(v, addr)

Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

>From API perspective both __raw_writeX() and writeX_relaxed() are correct.
It is just PowerPC doesn't seem the follow the definition yet.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 21:05   ` Steve Wise
  2018-03-16 21:46     ` Sinan Kaya
@ 2018-03-16 22:13     ` Jason Gunthorpe
  2018-03-16 23:04       ` Steve Wise
  1 sibling, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2018-03-16 22:13 UTC (permalink / raw)
  To: Steve Wise
  Cc: 'Sinan Kaya',
	netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > Code includes wmb() followed by writel(). writel() already has a barrier
> on
> > some architectures like arm64.
> > 
> > This ends up CPU observing two barriers back to back before executing the
> > register write.
> > 
> > Since code already has an explicit barrier call, changing writel() to
> > writel_relaxed().
> > 
> > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> 
> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> writeX().  

?? Why is changing writex() to writeX() a NAK then?

> I was just looking at this with Chelsio developers, and they said the
> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> get rid of the extra barrier for all architectures.

That doesn't seem semanticaly sane.

__raw_writeX() should not appear in driver code, IMHO. Only the arch
code can know what the exact semantics of that accessor are..

If ppc can't use writel_relaxed to optimize then we probably need yet
another io accessor semantic defined :(

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 22:13     ` Jason Gunthorpe
@ 2018-03-16 23:04       ` Steve Wise
  2018-03-17  4:08         ` Timur Tabi
  0 siblings, 1 reply; 36+ messages in thread
From: Steve Wise @ 2018-03-16 23:04 UTC (permalink / raw)
  To: 'Jason Gunthorpe'
  Cc: 'Sinan Kaya',
	netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

> 
> On Fri, Mar 16, 2018 at 04:05:10PM -0500, Steve Wise wrote:
> > > Code includes wmb() followed by writel(). writel() already has a
barrier
> > on
> > > some architectures like arm64.
> > >
> > > This ends up CPU observing two barriers back to back before executing
> the
> > > register write.
> > >
> > > Since code already has an explicit barrier call, changing writel() to
> > > writel_relaxed().
> > >
> > > Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >
> > NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is
just
> > writeX().
> 
> ?? Why is changing writex() to writeX() a NAK then?

Because I want it correct for PPC as well.

> 
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(),
to
> > get rid of the extra barrier for all architectures.
> 
> That doesn't seem semanticaly sane.
> 
> __raw_writeX() should not appear in driver code, IMHO. Only the arch
> code can know what the exact semantics of that accessor are..
> 
> If ppc can't use writel_relaxed to optimize then we probably need yet
> another io accessor semantic defined :(


Anybody understand why the PPC implementation of writeX_relaxed() isn't
relaxed?


Steve.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 21:46     ` Sinan Kaya
@ 2018-03-16 23:05       ` Steve Wise
  2018-03-17  3:40         ` Sinan Kaya
  0 siblings, 1 reply; 36+ messages in thread
From: Steve Wise @ 2018-03-16 23:05 UTC (permalink / raw)
  To: 'Sinan Kaya', netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

> 
> On 3/16/2018 5:05 PM, Steve Wise wrote:
> >> Code includes wmb() followed by writel(). writel() already has a barrier
> > on
> >> some architectures like arm64.
> >>
> >> This ends up CPU observing two barriers back to back before executing
> the
> >> register write.
> >>
> >> Since code already has an explicit barrier call, changing writel() to
> >> writel_relaxed().
> >>
> >> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> >
> > NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
> > writeX().
> >
> > I was just looking at this with Chelsio developers, and they said the
> > writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
> > get rid of the extra barrier for all architectures.
> 
> OK. I can do that but isn't the problem at PowerPC adaptation?
> 
> /*
>  * We don't do relaxed operations yet, at least not with this semantic
>  */
> #define readb_relaxed(addr)	readb(addr)
> #define readw_relaxed(addr)	readw(addr)
> #define readl_relaxed(addr)	readl(addr)
> #define readq_relaxed(addr)	readq(addr)
> #define writeb_relaxed(v, addr)	writeb(v, addr)
> #define writew_relaxed(v, addr)	writew(v, addr)
> #define writel_relaxed(v, addr)	writel(v, addr)
> #define writeq_relaxed(v, addr)	writeq(v, addr)
> 
> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?

I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?


> 
> >From API perspective both __raw_writeX() and writeX_relaxed() are
> correct.
> It is just PowerPC doesn't seem the follow the definition yet.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 23:05       ` Steve Wise
@ 2018-03-17  3:40         ` Sinan Kaya
  2018-03-17  4:03           ` Sinan Kaya
  0 siblings, 1 reply; 36+ messages in thread
From: Sinan Kaya @ 2018-03-17  3:40 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On 3/16/2018 7:05 PM, Steve Wise wrote:
>>
>> On 3/16/2018 5:05 PM, Steve Wise wrote:
>>>> Code includes wmb() followed by writel(). writel() already has a barrier
>>> on
>>>> some architectures like arm64.
>>>>
>>>> This ends up CPU observing two barriers back to back before executing
>> the
>>>> register write.
>>>>
>>>> Since code already has an explicit barrier call, changing writel() to
>>>> writel_relaxed().
>>>>
>>>> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
>>>
>>> NAK - This isn't correct for PowerPC.  For PowerPC, writeX_relaxed() is just
>>> writeX().
>>>
>>> I was just looking at this with Chelsio developers, and they said the
>>> writeX() should be replaced with __raw_writeX(), not writeX_relaxed(), to
>>> get rid of the extra barrier for all architectures.
>>
>> OK. I can do that but isn't the problem at PowerPC adaptation?
>>
>> /*
>>  * We don't do relaxed operations yet, at least not with this semantic
>>  */
>> #define readb_relaxed(addr)	readb(addr)
>> #define readw_relaxed(addr)	readw(addr)
>> #define readl_relaxed(addr)	readl(addr)
>> #define readq_relaxed(addr)	readq(addr)
>> #define writeb_relaxed(v, addr)	writeb(v, addr)
>> #define writew_relaxed(v, addr)	writew(v, addr)
>> #define writel_relaxed(v, addr)	writel(v, addr)
>> #define writeq_relaxed(v, addr)	writeq(v, addr)
>>
>> Why don't we fix the PowerPC's relaxed operators? Is that a bigger task?
> 
> I don't know the answer, but perhaps the proper fix is to correctly implement these for PPC?
> 

I found this article: https://lwn.net/Articles/698014/#PowerPC%20Implementation

Apparently, this issue was discussed at a conference in 2015.

Based on how I read this article, writel() and writel_relaxed() behave exactly
the same on PowerPC because the barrier is enforced by the time code is leaving
a critical section not during MMIO execution.

I also see that __raw_writel() is being heavily used in drivers directory.

https://elixir.bootlin.com/linux/latest/ident/__raw_writel

I'll change writel_relaxed() with __raw_writel() in the series like you suggested
and also look at your other comments.

This seems to be the current norm.

> 
>>
>> >From API perspective both __raw_writeX() and writeX_relaxed() are
>> correct.
>> It is just PowerPC doesn't seem the follow the definition yet.
> 
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  3:40         ` Sinan Kaya
@ 2018-03-17  4:03           ` Sinan Kaya
  2018-03-17  4:25             ` Sinan Kaya
  0 siblings, 1 reply; 36+ messages in thread
From: Sinan Kaya @ 2018-03-17  4:03 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> and also look at your other comments.

I spoke too soon.

Now that I realized, code needs to follow one of the following patterns for correctness

1)
wmb()
writel()/writel_relaxed()

or 

2)
wmb()
__raw_wrltel()
mmiowb()

but definitely not

wmb()
__raw_wrltel()

Since #1 == #2, I'll stick to my current implementation of writel_relaxed()

Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
for correctness. ARM's mmiowb() implementation is empty.

So, there is no one size fits all solution with the current state of affairs.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 23:04       ` Steve Wise
@ 2018-03-17  4:08         ` Timur Tabi
  0 siblings, 0 replies; 36+ messages in thread
From: Timur Tabi @ 2018-03-17  4:08 UTC (permalink / raw)
  To: Steve Wise, 'Jason Gunthorpe', linuxppc-dev
  Cc: 'Sinan Kaya',
	netdev, sulrich, linux-arm-msm, linux-arm-kernel,
	'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On 3/16/18 6:04 PM, Steve Wise wrote:
> Anybody understand why the PPC implementation of writeX_relaxed() isn't
> relaxed?

You probably should ask that on the linuxppc-dev@lists.ozlabs.org 
mailing list.

I've always wondered why PowerPC has non-standard I/O accessors.

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:03           ` Sinan Kaya
@ 2018-03-17  4:25             ` Sinan Kaya
  2018-03-17  4:30               ` Timur Tabi
                                 ` (3 more replies)
  0 siblings, 4 replies; 36+ messages in thread
From: Sinan Kaya @ 2018-03-17  4:25 UTC (permalink / raw)
  To: Steve Wise, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>> and also look at your other comments.
> 
> I spoke too soon.
> 
> Now that I realized, code needs to follow one of the following patterns for correctness
> 
> 1)
> wmb()
> writel()/writel_relaxed()
> 
> or 
> 
> 2)
> wmb()
> __raw_wrltel()
> mmiowb()
> 
> but definitely not
> 
> wmb()
> __raw_wrltel()
> 
> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> 
> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> for correctness. ARM's mmiowb() implementation is empty.
> 
> So, there is no one size fits all solution with the current state of affairs.
> 
> 

I think I finally got what you mean.

Code seems to have

wmb()
writel()/writeq()
wmb()

this can be safely replaced with

wmb()
__raw_writel()/__raw_writeq()
wmb()

This will work on all arches. Below is the new version. Let me know if this is OK.

+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
        int count = 8;

        while (count) {
-               writeq(*src, dst);
+               __raw_writeq(*src, dst);
                src++;
                dst++;
                count--;
@@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
                                 (u64 *)wqe);
                } else {
                        pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
                }

                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       mmiowmb()
 }

 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
                                 (void *)wqe);
                } else {
                        pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
                }

                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       mmiowmb();
 }


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
@ 2018-03-17  4:30               ` Timur Tabi
  2018-03-17 13:23               ` Steve Wise
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Timur Tabi @ 2018-03-17  4:30 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Steve Wise, netdev, Timur Tabi, sulrich, Casey Leedom,
	Steve Wise, linux-rdma, linux-arm-msm, lkml, Jason Gunthorpe,
	Doug Ledford, Michael Werner, linux-arm-kernel

On Fri, Mar 16, 2018 at 11:25 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),

I think Jason might be right that drivers shouldn't be calling the
__raw_write functions.  You definitely need to run this by the PPC
developers, specifically Ben Herrenschmidt.


-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
  2018-03-17  4:30               ` Timur Tabi
@ 2018-03-17 13:23               ` Steve Wise
  2018-03-17 13:27               ` David Miller
  2018-03-17 15:05               ` Jason Gunthorpe
  3 siblings, 0 replies; 36+ messages in thread
From: Steve Wise @ 2018-03-17 13:23 UTC (permalink / raw)
  To: 'Sinan Kaya', netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, 'Steve Wise',
	'Doug Ledford', 'Jason Gunthorpe',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

> 
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you
> suggested
> >> and also look at your other comments.
> >
> > I spoke too soon.
> >
> > Now that I realized, code needs to follow one of the following patterns
> for correctness
> >
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> >
> > or
> >
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> >
> > but definitely not
> >
> > wmb()
> > __raw_wrltel()
> >
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> >
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough.
> PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> >
> > So, there is no one size fits all solution with the current state of affairs.
> >
> >
> 
> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is
> OK.
> 
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64
> *src)
>         int count = 8;
> 
>         while (count) {
> -               writeq(*src, dst);
> +               __raw_writeq(*src, dst);
>                 src++;
>                 dst++;
>                 count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq,
> u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb()
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq,
> u16 inc,
>                                  (void *)wqe);
>                 } else {
>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb();
>  }
> 
> 

Yes, this is what chelsio recommended to me.  

Reviewed-by: Steve Wise <swise@opengridcomputing.com>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
  2018-03-17  4:30               ` Timur Tabi
  2018-03-17 13:23               ` Steve Wise
@ 2018-03-17 13:27               ` David Miller
  2018-03-17 15:05               ` Jason Gunthorpe
  3 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2018-03-17 13:27 UTC (permalink / raw)
  To: okaya
  Cc: swise, netdev, timur, sulrich, linux-arm-msm, linux-arm-kernel,
	swise, dledford, jgg, linux-rdma, linux-kernel, werner, leedom

From: Sinan Kaya <okaya@codeaurora.org>
Date: Sat, 17 Mar 2018 00:25:14 -0400

> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is OK.

Unfortunately, I think this won't work.

At least on sparc, the __raw_*() variants also change the endianness to
native endianness.

PowerPC does this as well, even documented in a comment :-)

/*
 * Non ordered and non-swapping "raw" accessors
 */

Only the non-__raw_ variants do the endianness swap to/from
little-endian.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17  4:25             ` Sinan Kaya
                                 ` (2 preceding siblings ...)
  2018-03-17 13:27               ` David Miller
@ 2018-03-17 15:05               ` Jason Gunthorpe
  2018-03-17 18:30                 ` Sinan Kaya
  3 siblings, 1 reply; 36+ messages in thread
From: Jason Gunthorpe @ 2018-03-17 15:05 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Steve Wise, netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom'

On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
> > On 3/16/2018 11:40 PM, Sinan Kaya wrote:
> >> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
> >> and also look at your other comments.
> > 
> > I spoke too soon.
> > 
> > Now that I realized, code needs to follow one of the following patterns for correctness
> > 
> > 1)
> > wmb()
> > writel()/writel_relaxed()
> > 
> > or 
> > 
> > 2)
> > wmb()
> > __raw_wrltel()
> > mmiowb()
> > 
> > but definitely not
> > 
> > wmb()
> > __raw_wrltel()
> > 
> > Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
> > 
> > Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
> > for correctness. ARM's mmiowb() implementation is empty.
> > 
> > So, there is no one size fits all solution with the current state of affairs.
> > 
> > 
> 
> I think I finally got what you mean.
> 
> Code seems to have
> 
> wmb()
> writel()/writeq()
> wmb()
> 
> this can be safely replaced with
> 
> wmb()
> __raw_writel()/__raw_writeq()
> wmb()
> 
> This will work on all arches. Below is the new version. Let me know if this is OK.
> 
> +++ b/drivers/infiniband/hw/cxgb4/t4.h
> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>         int count = 8;
> 
>         while (count) {
> -               writeq(*src, dst);
> +               __raw_writeq(*src, dst);
>                 src++;
>                 dst++;
>                 count--;
> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>                                  (u64 *)wqe);
>                 } else {
>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb()
>  }
> 
>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>                                  (void *)wqe);
>                 } else {
>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>                 }
> 
>                 /* Flush user doorbell area writes. */
>                 wmb();
>                 return;
>         }
> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
> +       mmiowmb();

No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!

Your first patch was right, replacing
  wmb()
  writel()

With wmb(); writel_relaxed() is fine, and helps ARM.

The problem with PPC has nothing to do with the writel, it is with the
spinlock, and we can't improve it without adding some new
infrastructure. Certainly don't hack around the problem by using
__raw_writel in multi-arch drivers!

In userspace we settled on something like this as a pattern:

 mmio_wc_spin_lock()
 writel_wc_mmio()
 mmio_wc_spin_unlock()

Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
fully contain all MMIO access to WC and UC memory.

Due to our userspace the pattern is specifically used with MMIO writes
to WC BAR memory, but it could be generalized..

This allows all known arches to use the best code in all cases. x86
can even optimize a lot and combine the mmiowmb() into the
spin_unlock, apparently.

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17 15:05               ` Jason Gunthorpe
@ 2018-03-17 18:30                 ` Sinan Kaya
  2018-03-19  1:48                   ` Jason Gunthorpe
  0 siblings, 1 reply; 36+ messages in thread
From: Sinan Kaya @ 2018-03-17 18:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Steve Wise, netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom',
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

+linuxppc-dev@lists.ozlabs.org

On 3/17/2018 11:05 AM, Jason Gunthorpe wrote:
> On Sat, Mar 17, 2018 at 12:25:14AM -0400, Sinan Kaya wrote:
>> On 3/17/2018 12:03 AM, Sinan Kaya wrote:
>>> On 3/16/2018 11:40 PM, Sinan Kaya wrote:
>>>> I'll change writel_relaxed() with __raw_writel() in the series like you suggested
>>>> and also look at your other comments.
>>>
>>> I spoke too soon.
>>>
>>> Now that I realized, code needs to follow one of the following patterns for correctness
>>>
>>> 1)
>>> wmb()
>>> writel()/writel_relaxed()
>>>
>>> or 
>>>
>>> 2)
>>> wmb()
>>> __raw_wrltel()
>>> mmiowb()
>>>
>>> but definitely not
>>>
>>> wmb()
>>> __raw_wrltel()
>>>
>>> Since #1 == #2, I'll stick to my current implementation of writel_relaxed()
>>>
>>> Changing writel() to writel_relaxed() or __raw_writel() isn't enough. PowerPC needs mmiowb()
>>> for correctness. ARM's mmiowb() implementation is empty.
>>>
>>> So, there is no one size fits all solution with the current state of affairs.
>>>
>>>
>>
>> I think I finally got what you mean.
>>
>> Code seems to have
>>
>> wmb()
>> writel()/writeq()
>> wmb()
>>
>> this can be safely replaced with
>>
>> wmb()
>> __raw_writel()/__raw_writeq()
>> wmb()
>>
>> This will work on all arches. Below is the new version. Let me know if this is OK.
>>
>> +++ b/drivers/infiniband/hw/cxgb4/t4.h
>> @@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
>>         int count = 8;
>>
>>         while (count) {
>> -               writeq(*src, dst);
>> +               __raw_writeq(*src, dst);
>>                 src++;
>>                 dst++;
>>                 count--;
>> @@ -477,15 +477,16 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
>>                                  (u64 *)wqe);
>>                 } else {
>>                         pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
>> -                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> -                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
>> +                                    wq->sq.bar2_va + SGE_UDB_KDOORBELL);
>>                 }
>>
>>                 /* Flush user doorbell area writes. */
>>                 wmb();
>>                 return;
>>         }
>> -       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +       __raw_writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
>> +       mmiowmb()
>>  }
>>
>>  static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>> @@ -502,15 +503,16 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
>>                                  (void *)wqe);
>>                 } else {
>>                         pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
>> -                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> -                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>> +                       __raw_writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
>> +                                    wq->rq.bar2_va + SGE_UDB_KDOORBELL);
>>                 }
>>
>>                 /* Flush user doorbell area writes. */
>>                 wmb();
>>                 return;
>>         }
>> -       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +       __raw_writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
>> +       mmiowmb();
> 
> No! NAK! Adding *new* barriers to use the *wrong* accessor is crazy!
> 
> Your first patch was right, replacing
>   wmb()
>   writel()
> 
> With wmb(); writel_relaxed() is fine, and helps ARM.
> 
> The problem with PPC has nothing to do with the writel, it is with the
> spinlock, and we can't improve it without adding some new
> infrastructure. Certainly don't hack around the problem by using
> __raw_writel in multi-arch drivers!
> 
> In userspace we settled on something like this as a pattern:
> 
>  mmio_wc_spin_lock()
>  writel_wc_mmio()
>  mmio_wc_spin_unlock()

> 
> Where mmio_wc_spin_unlock incorporates the mmiowmb and is defined to
> fully contain all MMIO access to WC and UC memory.
> 
> Due to our userspace the pattern is specifically used with MMIO writes
> to WC BAR memory, but it could be generalized..
> 
> This allows all known arches to use the best code in all cases. x86
> can even optimize a lot and combine the mmiowmb() into the
> spin_unlock, apparently.

Given both Dave [1] and Jason's feedback, we have to ask PowerPC developers
to fix writel_relaxed().

Otherwise, PowerPC won't be able to take advantage of the optimizations
introduced in this series.

I don't think we need writel_really_relaxed() API.

Somebody also has to take a task and work very hard to get rid of __raw_writeX()
APIs in drivers/net directory. It looked like a very common practice though
it clearly violates multiarch portability concerns Jason and Deve highlighted.

This will be the next version:

iff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 8369c7c..6e5658a 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -457,7 +457,7 @@ static inline void pio_copy(u64 __iomem *dst, u64 *src)
        int count = 8;
 
        while (count) {
-               writeq(*src, dst);
+               writeq_relaxed(*src, dst);
                src++;
                dst++;
                count--;
@@ -477,15 +477,15 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc, union t4_wr *wqe)
                                 (u64 *)wqe);
                } else {
                        pr_debug("DB wq->sq.pidx = %d\n", wq->sq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
-                              wq->sq.bar2_va + SGE_UDB_KDOORBELL);
+                       writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->sq.bar2_qid),
+                                      wq->sq.bar2_va + SGE_UDB_KDOORBELL);
                }
 
                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
+       writel_relaxed(QID_V(wq->sq.qid) | PIDX_V(inc), wq->db);
 }
 
 static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
@@ -502,15 +502,15 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 inc,
                                 (void *)wqe);
                } else {
                        pr_debug("DB wq->rq.pidx = %d\n", wq->rq.pidx);
-                       writel(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
-                              wq->rq.bar2_va + SGE_UDB_KDOORBELL);
+                       writel_relaxed(PIDX_T5_V(inc) | QID_V(wq->rq.bar2_qid),
+                                      wq->rq.bar2_va + SGE_UDB_KDOORBELL);
                }
 
                /* Flush user doorbell area writes. */
                wmb();
                return;
        }
-       writel(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
+       writel_relaxed(QID_V(wq->rq.qid) | PIDX_V(inc), wq->db);
 }


[1] https://lkml.org/lkml/2018/3/17/100

> 
> Jason
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 18/18] infiniband: cxgb4: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-17 18:30                 ` Sinan Kaya
@ 2018-03-19  1:48                   ` Jason Gunthorpe
  0 siblings, 0 replies; 36+ messages in thread
From: Jason Gunthorpe @ 2018-03-19  1:48 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Steve Wise, netdev, timur, sulrich, linux-arm-msm,
	linux-arm-kernel, 'Steve Wise', 'Doug Ledford',
	linux-rdma, linux-kernel, 'Michael Werner',
	'Casey Leedom',
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Sat, Mar 17, 2018 at 02:30:10PM -0400, Sinan Kaya wrote:

> Somebody also has to take a task and work very hard to get rid of __raw_writeX()
> APIs in drivers/net directory. It looked like a very common practice though
> it clearly violates multiarch portability concerns Jason and Deve highlighted.

When you posted your list I thought most of the hits were in what I'd
think of 'one-arch drivers', eg an IRQ controller or clock driver or
something.. Some might have a reason for it (eg avoiding the swap, for
instance), maybe it is a hold over from before writel_relaxed, or
maybe it is just a cargo-cult behavior..

It is the obviously multi-arch drivers that probably need some
attention..

Jason

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [PATCH v3 11/18] qlcnic: Eliminate duplicate barriers on weakly-ordered archs
  2018-03-16 16:16 ` [PATCH v3 11/18] qlcnic: " Sinan Kaya
@ 2018-03-19 20:10   ` Chopra, Manish
  0 siblings, 0 replies; 36+ messages in thread
From: Chopra, Manish @ 2018-03-19 20:10 UTC (permalink / raw)
  To: Sinan Kaya, netdev, timur, sulrich
  Cc: linux-arm-msm, linux-arm-kernel, Patil, Harish,
	Dept-GE Linux NIC Dev, linux-kernel

> -----Original Message-----
> From: Sinan Kaya [mailto:okaya@codeaurora.org]
> Sent: Friday, March 16, 2018 9:46 PM
> To: netdev@vger.kernel.org; timur@codeaurora.org; sulrich@codeaurora.org
> Cc: linux-arm-msm@vger.kernel.org; linux-arm-kernel@lists.infradead.org;
> Sinan Kaya <okaya@codeaurora.org>; Patil, Harish <Harish.Patil@cavium.com>;
> Chopra, Manish <Manish.Chopra@cavium.com>; Dept-GE Linux NIC Dev <Dept-
> GELinuxNICDev@cavium.com>; linux-kernel@vger.kernel.org
> Subject: [PATCH v3 11/18] qlcnic: Eliminate duplicate barriers on weakly-ordered
> archs
> 
> Code includes wmb() followed by writel(). writel() already has a barrier on some
> architectures like arm64.
> 
> This ends up CPU observing two barriers back to back before executing the
> register write.
> 
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
> 
> Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
> ---
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
> b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
> index 46b0372..97c146e7 100644
> --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
> +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
> @@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct
> qlcnic_adapter *adapter)
>  	wmb();
> 
>  	/* clear the interrupt trigger control register */
> -	writel(0, adapter->isr_int_vec);
> +	writel_relaxed(0, adapter->isr_int_vec);
>  	intr_val = readl(adapter->isr_int_vec);
>  	do {
>  		intr_val = readl(adapter->tgt_status_reg);
> --
> 2.7.4

Acked-by: Manish Chopra <manish.chopra@cavium.com>

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2018-03-19 20:10 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1521216991-28706-1-git-send-email-okaya@codeaurora.org>
2018-03-16 16:16 ` [PATCH v3 01/18] i40e/i40evf: Eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 02/18] ixgbe: eliminate " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 03/18] igbvf: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 04/18] igb: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 05/18] ixgbevf: keep writel() closer to wmb() Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 06/18] ixgbevf: eliminate duplicate barriers on weakly-ordered archs Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 07/18] drivers: net: cxgb: Eliminate " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 08/18] scsi: hpsa: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 09/18] fm10k: " Sinan Kaya
2018-03-16 16:30   ` [Intel-wired-lan] " Alexander Duyck
2018-03-16 16:33     ` Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 10/18] net: qla3xxx: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 11/18] qlcnic: " Sinan Kaya
2018-03-19 20:10   ` Chopra, Manish
2018-03-16 16:16 ` [PATCH v3 12/18] bnx2x: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 13/18] net: cxgb4/cxgb4vf: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 14/18] net: cxgb3: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 15/18] RDMA/bnxt_re: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 16/18] IB/mlx4: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 17/18] RDMA/i40iw: " Sinan Kaya
2018-03-16 16:16 ` [PATCH v3 18/18] infiniband: cxgb4: " Sinan Kaya
2018-03-16 21:05   ` Steve Wise
2018-03-16 21:46     ` Sinan Kaya
2018-03-16 23:05       ` Steve Wise
2018-03-17  3:40         ` Sinan Kaya
2018-03-17  4:03           ` Sinan Kaya
2018-03-17  4:25             ` Sinan Kaya
2018-03-17  4:30               ` Timur Tabi
2018-03-17 13:23               ` Steve Wise
2018-03-17 13:27               ` David Miller
2018-03-17 15:05               ` Jason Gunthorpe
2018-03-17 18:30                 ` Sinan Kaya
2018-03-19  1:48                   ` Jason Gunthorpe
2018-03-16 22:13     ` Jason Gunthorpe
2018-03-16 23:04       ` Steve Wise
2018-03-17  4:08         ` Timur Tabi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).